How to access an AWS public dataset using Databricks?

I am new to Databricks. I am looking for a public big data dataset for my school project, and I came across the AWS public dataset at this link: https://registry.opendata.aws/target/

I am using Python on Databricks, and I don't know how to establish a connection to the data. I found the following how-to guide:

https://databricks.com/wp-content/uploads/2015/08/Databricks-how-to-data-import.pdf?_ga=2.25033139.881714623.1602433762-982722630.1598480448

It refers to an access_key, secret_key, AWS_bucket_name and mount_name (shown in a screenshot), and I am not sure how to find these values.

There is 1 answer

Alex Ott (best answer)

That guide is for non-public S3 buckets, which is why it asks for an access key and a secret key.

For this dataset, you can simply read it using the s3://... URL, like this:

df = spark.read.format("text").load("s3://gdc-target-phs000218-2-open/")

I used the text file format only as an example. Because this dataset stores its data as XML, you'll need something like the spark-xml library to extract the necessary data.
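As a rough sketch of what the spark-xml route might look like: the helper below wraps the reader call so the row tag is easy to change. It assumes the spark-xml library is attached to the cluster. The helper names (`xml_reader_options`, `read_xml`) and the row tag `"record"` are made up for illustration; the real row tag depends on the element names in the dataset's XML files, which I have not checked.

```python
from typing import Any, Dict

# Bucket path taken from the answer above.
BUCKET_PATH = "s3://gdc-target-phs000218-2-open/"


def xml_reader_options(row_tag: str) -> Dict[str, str]:
    """Build options for the spark-xml data source.

    rowTag names the XML element that should become one DataFrame row.
    """
    return {"rowTag": row_tag}


def read_xml(spark: Any, path: str, row_tag: str) -> Any:
    """Load XML files under `path` into a DataFrame.

    Assumption: the spark-xml library is installed on the cluster,
    which registers the "xml" data source format.
    """
    return (
        spark.read.format("xml")
        .options(**xml_reader_options(row_tag))
        .load(path)
    )


# Usage in a Databricks notebook, where `spark` is predefined.
# "record" is a hypothetical row tag; replace it with the element
# name actually used in the files you want to parse.
# df = read_xml(spark, BUCKET_PATH, row_tag="record")
# df.printSchema()
```

Calling `printSchema()` on the result is a quick way to see whether the chosen row tag matches the files. If the schema comes back empty or as a single corrupt-record column, the tag probably does not match.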