I'm trying to read an HDFS path (a directory, as is typical) that holds data in a series of files named bucket_NNNNN but also contains a file, _orc_acid_version, that holds no data and should be excluded. Here's a snapshot from the Ambari file browser:
I believe I should be using the "Regular Expression for File Path" feature of the HDFS data store definition screen, but nothing I try works for me. I find the online documentation a bit vague on this feature. My questions:
- Should the "Path:" parameter be left blank when providing a regular expression in the "Regular Expression for File Path:" parameter? Or should the "Path:" parameter contain the fixed portion of the path and the regular expression parameter specify the variable part?
- In my case, the directory path is of the form /some_directory/delta_0000001_0000001_0003/<bucket files>. I expected a regular expression like "/some_directory/[_0-9]*/bucket_[0-9][0-9]*" to work (the test feature says it's correct). Are there limitations on the format of the regular expression I must follow?
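As a sanity check outside the tool, here is a minimal Python sketch of how that pattern behaves under ordinary regex semantics when it must match the whole path. The sample file names (bucket_00000, bucket_00001) and the alternative pattern with a literal "delta" prefix are assumptions for illustration only, and the regex dialect and matching rules used by the data store screen may differ.

```python
import re

# Hypothetical sample paths modeled on the layout described above.
paths = [
    "/some_directory/delta_0000001_0000001_0003/bucket_00000",
    "/some_directory/delta_0000001_0000001_0003/bucket_00001",
    "/some_directory/delta_0000001_0000001_0003/_orc_acid_version",
]

# Pattern from the question: [_0-9]* only allows underscores and digits,
# so it cannot consume the letters in "delta".
original = re.compile(r"/some_directory/[_0-9]*/bucket_[0-9][0-9]*")

# Assumed alternative: spell out the literal "delta" prefix.
adjusted = re.compile(r"/some_directory/delta[_0-9]*/bucket_[0-9]+")


def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [p for p in candidates if pattern.fullmatch(p)]


original_hits = matching(original, paths)
adjusted_hits = matching(adjusted, paths)
print(original_hits)
print(adjusted_hits)
```

With full-path matching, the original pattern selects none of the sample paths, while the adjusted one selects only the two bucket files and skips _orc_acid_version. Whether the tool applies the expression to the full path or only to the part after "Path:" is exactly what the first question above is asking.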
I'm stuck, so any help would be appreciated.