So I need to collect a very large number of directories, themselves containing subdirectories, from HDFS, and I want to be able to use globStatus. My Path pattern essentially looks like this:
"/directory/*/{opt1,opt2}/{opt1,opt2,opt3}*"
Unfortunately, for some of the directories captured by the *, I don't have execute permissions (can't view contents), but the glob attempts to look inside, causing an exception. Is there any way to request that the glob simply skip over directories for which it doesn't have permissions, rather than failing completely?
I am aware that there are other ways to achieve the same goal, but as far as I can tell they would be more complex and, I think, require more requests to HDFS than a simple glob.
Answering this in case anyone else comes across this question...
The filtering behavior for globStatus is done client-side as part of the FileSystem/Globber class. Under the hood it is really just submitting a series of listStatus commands and filtering the return value(s). To get the behavior described will require some custom logic, but won't be any less efficient than the globStatus API.
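As a rough illustration of what that custom logic could look like, here is a minimal sketch of a level-by-level expansion that skips directories it is not allowed to list. It deliberately does not depend on Hadoop, so everything in it is an assumption for illustration: the class name LenientGlob, the Lister interface (a stand-in for a call such as FileSystem.listStatus), and the use of java.nio.file.AccessDeniedException in place of Hadoop's org.apache.hadoop.security.AccessControlException. Each path component of the glob is assumed to have been translated to a regular expression by hand.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.AccessDeniedException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class LenientGlob {

    /** Hypothetical stand-in for a listing call such as FileSystem.listStatus. */
    interface Lister {
        /** Returns the child names of dir; may throw AccessDeniedException. */
        List<String> list(String dir) throws IOException;
    }

    /**
     * Expands one pattern per path level below root. A directory that cannot
     * be listed because of permissions is skipped instead of aborting the
     * whole expansion. Other I/O errors are rethrown unchecked.
     */
    static List<String> expand(Lister lister, String root, List<Pattern> levels) {
        List<String> current = new ArrayList<>();
        current.add(root);
        for (Pattern level : levels) {
            List<String> next = new ArrayList<>();
            for (String dir : current) {
                List<String> children;
                try {
                    children = lister.list(dir); // one listing request per directory
                } catch (AccessDeniedException e) {
                    continue; // no permission: skip this directory
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                for (String child : children) {
                    // Client-side filtering of the returned names
                    if (level.matcher(child).matches()) {
                        next.add(dir + "/" + child);
                    }
                }
            }
            current = next;
        }
        return current;
    }
}
```

For the pattern in the question, the call would take "/directory" as the root and three levels roughly equivalent to `.*`, `opt1|opt2` and `(opt1|opt2|opt3).*`. Against a real cluster, the Lister would wrap the file system's listing call and translate its permission error into the exception caught here; that wiring is left out because it depends on the Hadoop version in use.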