Under Dataproc I set up a PySpark cluster with 1 master node and 2 workers. In a bucket I have directories of sub-directories of files.
In the Datalab notebook I run
import subprocess
all_parent_directory = subprocess.Popen("gsutil ls gs://parent-directories", shell=True, stdout=subprocess.PIPE).stdout.read()
This gives me all the sub-directories with no problem.
Then I want to gsutil ls all the files in the sub-directories, so on the master node I define:
def get_sub_dir(path):
    import subprocess
    p = subprocess.Popen("gsutil ls gs://parent-directories/" + path,
                         shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return p.stdout.read(), p.stderr.read()
Running get_sub_dir(sub_directory) directly on the master gives all the files with no problem.
However,
sub_dir = sc.parallelize([sub_directory])
sub_dir.map(get_sub_dir).collect()
gives me:
Traceback (most recent call last):
File "/usr/bin/../lib/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 99, in <module>
main()
File "/usr/bin/../lib/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 30, in main
project, account = bootstrapping.GetActiveProjectAndAccount()
File "/usr/lib/google-cloud-sdk/bin/bootstrapping/bootstrapping.py", line 205, in GetActiveProjectAndAccount
project_name = properties.VALUES.core.project.Get(validate=False)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1373, in Get
required)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1661, in _GetProperty
value = _GetPropertyWithoutDefault(prop, properties_file)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1699, in _GetPropertyWithoutDefault
value = callback()
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 222, in GetProject
return c_gce.Metadata().Project()
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 203, in Metadata
_metadata_lock.lock(function=_CreateMetadata, argument=None)
File "/usr/lib/python2.7/mutex.py", line 44, in lock
function(argument)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 202, in _CreateMetadata
_metadata = _GCEMetadata()
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 59, in __init__
self.connected = gce_cache.GetOnGCE()
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 141, in GetOnGCE
return _SINGLETON_ON_GCE_CACHE.GetOnGCE(check_age)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 81, in GetOnGCE
self._WriteDisk(on_gce)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 113, in _WriteDisk
with files.OpenForWritingPrivate(gce_cache_path) as gcecache_file:
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/util/files.py", line 715, in OpenForWritingPrivate
MakeDir(full_parent_dir_path, mode=0700)
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/util/files.py", line 115, in MakeDir
(u'Please verify that you have permissions to write to the parent '
googlecloudsdk.core.util.files.Error: Could not create directory [/home/.config/gcloud]: Permission denied.
Please verify that you have permissions to write to the parent directory.
After checking with whoami on the worker nodes, it shows yarn.
So the question is: how do I authorize yarn to use gsutil, or is there another way to access the bucket from the Dataproc PySpark worker nodes?
The CLI looks at the current homedir for a location to place a cached credential file when it fetches a token from the metadata service. The relevant logic lives in googlecloudsdk/core/config.py, which resolves the config directory from the current homedir.
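A paraphrased sketch of that lookup (not the verbatim SDK source, just the shape of the logic; CLOUDSDK_CONFIG is the SDK's documented override variable, and the function name here is hypothetical):

import os

def _get_global_config_dir():
    # Hypothetical paraphrase of the SDK's config-dir resolution.
    # An explicit CLOUDSDK_CONFIG environment variable wins outright.
    explicit_dir = os.environ.get('CLOUDSDK_CONFIG')
    if explicit_dir:
        return explicit_dir
    # Otherwise fall back to ~/.config/gcloud, where ~ is whatever the
    # current HOME resolves to, hence the /home/.config/gcloud path in
    # the traceback above.
    return os.path.join(os.path.expanduser('~'), '.config', 'gcloud')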
For things running in YARN containers, despite being run as user yarn (where, if you just run sudo su yarn, you'll see ~ resolve to /var/lib/hadoop-yarn on a Dataproc node), YARN actually propagates yarn.nodemanager.user-home-dir as the container's homedir, and this defaults to /home/. For this reason, even though you can sudo -u yarn gsutil ..., it doesn't behave the same way as gsutil in a YARN container, and naturally, only root is able to create directories in the base /home/ directory.
Long story short, you have two options:
Option 1: In your code, set HOME=/var/lib/hadoop-yarn right before your gsutil statement.
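For example, adapting the get_sub_dir helper from the question (the only change from the original is the HOME= prefix on the shell command):

def get_sub_dir(path):
    import subprocess
    # Prefixing HOME= points gsutil's credential cache at a directory the
    # yarn user can actually write to (/var/lib/hadoop-yarn/.config/gcloud).
    p = subprocess.Popen("HOME=/var/lib/hadoop-yarn gsutil ls gs://parent-directories/" + path,
                         shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return p.stdout.read(), p.stderr.read()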
Option 2: When creating the cluster, set yarn.nodemanager.user-home-dir as a cluster property so it applies on all nodes.
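If you create the cluster with the gcloud CLI, the property can be passed via the --properties flag with the yarn: prefix; the cluster name and other flags below are placeholders, so check your gcloud version's documentation for the exact syntax:

gcloud dataproc clusters create my-cluster \
    --properties yarn:yarn.nodemanager.user-home-dir=/var/lib/hadoop-yarn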
For an existing cluster, you could also manually add the config to /etc/hadoop/conf/yarn-site.xml on all your workers and then reboot the worker machines (or just run sudo systemctl restart hadoop-yarn-nodemanager.service), but that can be a hassle to run manually on every worker node.
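If you do go the yarn-site.xml route, the entry to add is a standard Hadoop property block; the property name comes straight from the discussion above, and the value mirrors the yarn user's real homedir:

<property>
  <name>yarn.nodemanager.user-home-dir</name>
  <value>/var/lib/hadoop-yarn</value>
</property>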