How to use Stocator from an IBM Jupyter notebook running PySpark?


I want to use Stocator to access IBM Cloud Object Storage from a Jupyter notebook (on IBM Watson Studio) running PySpark. Can someone please tell me how to go about this?

I understand that Stocator is pre-installed, but do you have to set credentials or other configuration from within the notebook first (if there's a specific bucket on COS I'm trying to access)?

For example, I have a bucket named: my-bucket

How do I access it?

I know I can use ibm_boto3 to access COS directly, but this is for a Spark application, so I need to be able to do it through Stocator.
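In other words, once Stocator is configured I would like to end up being able to do something like the following (the service name myservice and the object path are just placeholders I made up):

# Read an object from my-bucket through Stocator's cos:// scheme
# cos://<bucket>.<service-name>/<object-key>
rdd = sc.textFile("cos://my-bucket.myservice/data/input.txt")
print(rdd.count())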


There are 2 answers

Answer from fwong_pong

Okay, so to get this to work in my case I had to add the access key as well. Also make sure the service name you pick is used consistently; it must be the same in every place you reference it (here, sname).

# Point Stocator at your COS instance; "sname" is the service name you choose
# and must match the service name in the cos:// URI below
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.cos.sname.iam.api.key", "API_KEY")      # IAM API key from the COS service credentials
hconf.set("fs.cos.sname.access.key", "ACCESS_KEY")    # HMAC access key (needed in my case)
hconf.set("fs.cos.sname.endpoint", "ENDPOINT")        # COS endpoint for the bucket's region

# Write an RDD to the bucket: cos://<bucket>.<service-name>/<object-key>
rdd = sc.textFile('file.txt')
rdd.saveAsTextFile('cos://bname.sname/test.txt')
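If you want to check that the configuration actually works, you can read the data back through the same cos:// URI (bname and sname are the same placeholders as above):

# Read the objects written above back from COS to verify the setup
readback = sc.textFile('cos://bname.sname/test.txt')
print(readback.take(5))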
Answer from charles gomes

All you need to do is set the Hadoop configuration parameters for Spark, and then you should be able to write the dataframe as CSV into your COS bucket. Make sure the credentials you use have Writer or higher IAM access to the bucket.

# Tell Spark/Stocator how to reach COS; the service name ("myservice") must
# match between the configuration keys and the cos:// URI
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.cos.myservice.iam.api.key", "**********")      # IAM API key
hconf.set("fs.cos.myservice.endpoint", "<BUCKET_ENDPOINT>")  # regional COS endpoint

# Write the DataFrame as CSV into the bucket
df.write.format("csv").save("cos://<bucket>.myservice/filename.csv")

The above code was referenced from this Medium article: https://medium.com/@rachit1arora/efficient-way-to-connect-to-object-storage-in-ibm-watson-studio-spark-environments-d6c1199f9f97
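Assuming the same Hadoop configuration is already set, reading the file back into a dataframe is just the mirror image (bucket and service name are the placeholders from the code above):

# Read the CSV back from COS into a DataFrame through Stocator
df2 = spark.read.format("csv").load("cos://<bucket>.myservice/filename.csv")
df2.show(5)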