How to check if a file exists in Google Storage from Spark Dataproc?


I was assuming that the Google Storage connector would let me query GS directly from Spark on Dataproc as if it were HDFS, but the following does not work (from the Spark shell):

scala> import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.FileSystem

scala> import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.Path

scala> FileSystem.get(sc.hadoopConfiguration).exists(new Path("gs://samplebucket/file"))
java.lang.IllegalArgumentException: Wrong FS: gs://samplebucket/file, expected: hdfs://dataprocmaster-m

Is there a way to access Google Storage files using just the Hadoop API?


There are 3 answers

Pradeep Gollakota (BEST ANSWER)

That's because FileSystem.get(...) returns the default FileSystem, which according to your configuration is HDFS, and it can only work with paths starting with hdfs://. Use the following to get the correct FileSystem for the path:

// Ask the Path itself for a FileSystem matching its scheme (gs://), not the default HDFS one
Path p = new Path("gs://...");
FileSystem fs = p.getFileSystem(conf); // conf: your Hadoop Configuration, e.g. sc.hadoopConfiguration in the Spark shell
fs.exists(p);
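
If you would rather keep the FileSystem.get style from the question, there is also an overload that takes a URI and resolves the FileSystem from its scheme. A minimal sketch, assuming the Dataproc Spark shell where sc and the GCS connector are already available:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// FileSystem.get(uri, conf) picks the FileSystem matching the URI's scheme (gs://)
val fs = FileSystem.get(new URI("gs://samplebucket/file"), sc.hadoopConfiguration)
fs.exists(new Path("gs://samplebucket/file"))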
aletts54

I translated @Pradeep Gollakota's answer to PySpark, thanks!

def path_exists(spark, path):
    # path is a full URI such as gs://...; returns True if the path exists
    p = spark._jvm.org.apache.hadoop.fs.Path(path)
    fs = p.getFileSystem(spark._jsc.hadoopConfiguration())
    return fs.exists(p)
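
For example, path_exists(spark, "gs://samplebucket/file") should return True if the object from the question exists (spark here being the active SparkSession).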
Rubber Duck
import org.apache.hadoop.fs.{FileSystem, Path}

val p = "gs://<your dir>"

// Resolve the FileSystem from the path itself so the gs:// scheme is honored
val path = new Path(p)
val fs = path.getFileSystem(sc.hadoopConfiguration)

fs.exists(path)      // true if the path exists
fs.isDirectory(path) // true if it exists and is a directory
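
Note that FileSystem.isDirectory(Path) is marked deprecated in newer Hadoop releases; a small sketch of the usual replacement, reusing the fs and path values above (getFileStatus throws FileNotFoundException when the path is missing, hence the exists guard):

// Check the FileStatus explicitly instead of calling the deprecated isDirectory(Path)
val isDir = fs.exists(path) && fs.getFileStatus(path).isDirectory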