Under Dataproc I set up a PySpark cluster with 1 master node and 2 workers. In a bucket I have directories of sub-directories of files.
In the Datalab notebook I run
import subprocess
all_parent_directory = subprocess.Popen("gsutil ls gs://parent-directories", shell=True, stdout=subprocess.PIPE).stdout.read()
This gives me all the sub-directories with no problem.
Then I want to gsutil ls all the files in the sub-directories, so on the master node I have:
def get_sub_dir(path):
    import subprocess
    p = subprocess.Popen("gsutil ls gs://parent-directories/" + path, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return p.stdout.read(), p.stderr.read()
and when I run get_sub_dir(sub_directory), it gives all the files with no problem.
However,
sub_dir = sc.parallelize([sub_directory])
sub_dir.map(get_sub_dir).collect()
gives me:
Traceback (most recent call last):
  File "/usr/bin/../lib/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 99, in <module>
    main()
  File "/usr/bin/../lib/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 30, in main
    project, account = bootstrapping.GetActiveProjectAndAccount()
  File "/usr/lib/google-cloud-sdk/bin/bootstrapping/bootstrapping.py", line 205, in GetActiveProjectAndAccount
    project_name = properties.VALUES.core.project.Get(validate=False)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1373, in Get
    required)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1661, in _GetProperty
    value = _GetPropertyWithoutDefault(prop, properties_file)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1699, in _GetPropertyWithoutDefault
    value = callback()
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 222, in GetProject
    return c_gce.Metadata().Project()
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 203, in Metadata
    _metadata_lock.lock(function=_CreateMetadata, argument=None)
  File "/usr/lib/python2.7/mutex.py", line 44, in lock
    function(argument)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 202, in _CreateMetadata
    _metadata = _GCEMetadata()
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 59, in __init__
    self.connected = gce_cache.GetOnGCE()
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 141, in GetOnGCE
    return _SINGLETON_ON_GCE_CACHE.GetOnGCE(check_age)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 81, in GetOnGCE
    self._WriteDisk(on_gce)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 113, in _WriteDisk
    with files.OpenForWritingPrivate(gce_cache_path) as gcecache_file:
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/util/files.py", line 715, in OpenForWritingPrivate
    MakeDir(full_parent_dir_path, mode=0700)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/util/files.py", line 115, in MakeDir
    (u'Please verify that you have permissions to write to the parent '
googlecloudsdk.core.util.files.Error: Could not create directory [/home/.config/gcloud]: Permission denied.
Please verify that you have permissions to write to the parent directory.
After checking with whoami on the worker nodes, it shows the user is yarn.
So the question is: how can I authorize yarn to use gsutil, or is there another way to access the bucket from the Dataproc PySpark worker nodes?
The CLI looks at the current homedir for a location to place a cached credential file when it fetches a token from the metadata service. The relevant code in googlecloudsdk/core/config.py resolves that config directory under the current user's home directory.
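Roughly, the logic is as follows (a paraphrased sketch, not the exact SDK source; the helper name here is illustrative):

    import os

    def _get_global_config_dir():
        # An explicit CLOUDSDK_CONFIG environment variable wins; otherwise the
        # gcloud config directory is placed under the current home directory.
        explicit = os.environ.get('CLOUDSDK_CONFIG')
        if explicit:
            return explicit
        return os.path.join(os.path.expanduser('~'), '.config', 'gcloud')

Whatever ~ resolves to inside the process is therefore where gsutil tries to create .config/gcloud, which is exactly the /home/.config/gcloud path in your traceback.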
For things running in YARN containers, despite being run as user yarn (where, if you just run sudo su yarn, you'll see ~ resolve to /var/lib/hadoop-yarn on a Dataproc node), YARN actually propagates yarn.nodemanager.user-home-dir as the container's homedir, and this defaults to /home/. For this reason, even though you can sudo -u yarn gsutil ..., it doesn't behave the same way as gsutil does in a YARN container, and naturally, only root is able to create directories in the base /home/ directory.

Long story short, you have two options:
1. In your code, add HOME=/var/lib/hadoop-yarn right before your gsutil statement.
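For example, adapting the get_sub_dir function from the question (a sketch, not tested on your cluster):

    def get_sub_dir(path):
        import subprocess
        # Prefixing the command with HOME=... makes gsutil place its cached
        # credentials under yarn's real home dir instead of /home/.config/gcloud.
        p = subprocess.Popen("HOME=/var/lib/hadoop-yarn gsutil ls gs://parent-directories/" + path,
                             shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        return p.stdout.read(), p.stderr.read()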
2. Set the YARN property yarn.nodemanager.user-home-dir to /var/lib/hadoop-yarn at the cluster level, so that containers get a writable homedir.
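For example, when creating a new cluster, something along these lines with gcloud's --properties flag (the cluster name and any other flags you need are placeholders):

    gcloud dataproc clusters create my-cluster \
        --properties yarn:yarn.nodemanager.user-home-dir=/var/lib/hadoop-yarn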
For an existing cluster, you could also manually add the config to /etc/hadoop/conf/yarn-site.xml on all your workers and then reboot the worker machines (or just run sudo systemctl restart hadoop-yarn-nodemanager.service), but that can be a hassle to do manually on all worker nodes.
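If you go that route, the yarn-site.xml entry would look something like this (standard Hadoop property syntax, shown as a sketch):

    <property>
      <name>yarn.nodemanager.user-home-dir</name>
      <value>/var/lib/hadoop-yarn</value>
    </property>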