What is the equivalent of HDFS's DistributedCache when using Google Cloud Storage?


I have deployed a 6-node Hadoop Cluster in Google Compute Engine.

I am using Google Cloud Storage (GCS), via the Google Cloud Storage connector for Hadoop, instead of the Hadoop Distributed File System (HDFS). So, I want to access files in GCS the same way the DistributedCache mechanism accesses files in HDFS.

How can I access files in this way?

1 Answer

Dennis Huo:

When running Hadoop on Google Compute Engine with the Google Cloud Storage connector for Hadoop as the "default filesystem", the GCS connector can be treated exactly like HDFS, including for use with the DistributedCache. So, to access files in Google Cloud Storage, you use it exactly the same way you would use HDFS; no need to change anything. For example, if you had deployed your cluster with the GCS connector's CONFIGBUCKET set to foo-bucket, and you had local files you wanted to place in the DistributedCache, you'd do:

# Copies mylib.jar into gs://foo-bucket/myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar

And in your Hadoop job:

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

JobConf job = new JobConf();

// Retrieves gs://foo-bucket/myapp/mylib.jar as a cached file.
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);

If you want to access files in a bucket other than your CONFIGBUCKET, you just need to specify the full path, using the gs:// scheme instead of hdfs://:

# Copies mylib.jar into gs://other-bucket/myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mylib.jar gs://other-bucket/myapp/mylib.jar

And then, in your Hadoop job:

JobConf job = new JobConf();

// Retrieves gs://other-bucket/myapp/mylib.jar as a cached file.
DistributedCache.addFileToClassPath(new Path("gs://other-bucket/myapp/mylib.jar"), job);