Cannot read credential_key.json in Bitnami Spark image on Docker when connecting to Google Cloud Storage


Currently I'm trying to read from and write to Google Cloud Storage (gs://xxx) from a Spark container on Docker. I have followed the steps described in the GCS documentation, installed the gcs-hadoop3-connector jar and the spark-bigquery jar, and set these two properties when creating the Spark session:

google.cloud.auth.service.account.enable: true
google.cloud.auth.service.account.json.keyfile: <path-to-key.json>
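
For context, a minimal sketch of how I create the session (the jar path, app name and key path below are placeholders, not my exact values):

from pyspark.sql import SparkSession

# Sketch of the session setup described above; paths are placeholders.
spark = (
    SparkSession.builder
    .appName("gcs-test")
    .config("spark.jars", "/opt/jars/gcs-connector-hadoop3-2.2.20-shaded.jar")
    .config("google.cloud.auth.service.account.enable", "true")
    .config("google.cloud.auth.service.account.json.keyfile",
            "/path/to/credential_key.json")
    .getOrCreate()
)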

The Spark session is initialized successfully and the jars are loaded successfully as well.

The problem appears when I try to write a parquet file to gs://path/to/output. For example:

spark.createDataFrame([['a'],['b'],['c']]).write.mode("overwrite").parquet('gs://path/to/output')

It throws a java.io.FileNotFoundException saying that my credential_key.json is not there.

I'm using the Bitnami Spark 3.5.1 image and the GCS Hadoop 3 connector 2.2.20.

I'm confused about where Spark in Docker looks for the GCS credentials JSON file: on the local machine or inside the container? I have tried putting the credential key on the local machine, in the master container, and in the worker container, but none of these work.
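
(As a sanity check, I can at least test whether the driver process sees the file inside its own container with something like the following; the path is a placeholder.)

import os

# Does the driver see the key file at this path inside its container?
key_path = "/path/to/credential_key.json"  # placeholder
print(key_path, "exists:", os.path.exists(key_path))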

==================================================================

(Update)

I think I have solved the problem; refer to this site.

In short, I should use hadoopConfiguration to set the credential_key path instead of setting it on the Spark session config:

spark._jsc.hadoopConfiguration().set("google.cloud.auth.service.account.json.keyfile", "<path_to_your_credentials_json>")

The path is inside the container, and everything works now.
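
Roughly, the working setup now looks like the sketch below; the key path is a placeholder and must point to the file inside the container:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-test").getOrCreate()

# Set the GCS credentials on the Hadoop configuration, not the Spark config.
# The path must exist inside the container, not on the host machine.
spark._jsc.hadoopConfiguration().set(
    "google.cloud.auth.service.account.enable", "true")
spark._jsc.hadoopConfiguration().set(
    "google.cloud.auth.service.account.json.keyfile",
    "/path/to/credential_key.json")  # placeholder path inside the container

spark.createDataFrame([["a"], ["b"], ["c"]]).write.mode("overwrite").parquet(
    "gs://path/to/output")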
