Force HDFS globStatus to skip directories it doesn't have permissions to


So I need to collect a very large number of directories, themselves containing subdirectories, from HDFS, and I want to be able to use globStatus. My Path pattern essentially looks like this:

"/directory/*/{opt1,opt2}/{opt1,opt2,opt3}*"

Unfortunately, for some of the directories captured by the *, I don't have execute permissions (can't view contents), but the glob attempts to look inside, causing an exception. Is there any way to request that the glob simply skip over directories for which it doesn't have permissions, rather than failing completely?

I am aware that there are other ways to achieve the same goal, but as far as I can tell they would be more complex and, I think, require more requests to HDFS than a simple glob.


There is 1 answer

xkrogen On

Answering this in case anyone else comes across this question...

The filtering for globStatus happens client-side, in the FileSystem / Globber classes. Under the hood it just submits a series of listStatus calls and filters the results. Getting the behavior you describe will take some custom logic, but it won't be any less efficient than the globStatus API.
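Here is a rough sketch of what that custom logic could look like. It is not the Globber implementation, just the same level-by-level idea with a skip on permission errors. To keep it self-contained, the names PermissiveGlob, Lister and expand are made up for illustration, and java.nio.file.AccessDeniedException stands in for the permission error. In a real client I would expect the Lister to be backed by FileSystem.listStatus and the skipped exception to be Hadoop's AccessControlException, but treat that mapping as an assumption to verify against your Hadoop version.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.AccessDeniedException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class PermissiveGlob {

    /**
     * Hypothetical stand-in for a listStatus-style call: returns the names of
     * the children of a directory (assumed empty for anything that is not a
     * directory) and throws AccessDeniedException when listing is not permitted.
     */
    public interface Lister {
        List<String> list(String dir) throws IOException;
    }

    /**
     * Expands a pattern one path component at a time. levels.get(i) decides
     * which child names are kept at depth i below root. Directories that
     * cannot be listed are skipped instead of failing the whole expansion.
     */
    public static List<String> expand(Lister fs, String root,
                                      List<Predicate<String>> levels) {
        List<String> current = List.of(root);
        for (Predicate<String> level : levels) {
            List<String> next = new ArrayList<>();
            for (String dir : current) {
                List<String> children;
                try {
                    children = fs.list(dir); // one listing request per directory
                } catch (AccessDeniedException e) {
                    continue; // no permission: skip this directory and carry on
                } catch (IOException e) {
                    throw new UncheckedIOException(e); // anything else is still an error
                }
                for (String child : children) {
                    if (level.test(child)) {
                        next.add(dir + "/" + child);
                    }
                }
            }
            current = next;
        }
        return current;
    }
}
```

For the pattern in the question, levels would hold three predicates: one that accepts every name (the *), one that accepts only opt1 or opt2, and one that accepts names starting with opt1, opt2 or opt3. Each directory is listed at most once, which is the same kind of traffic a glob generates.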