How to access the AWS public dataset using Databricks?

Question

601 views · Asked by kimhkh on 11 October 2020 at 19:05

I am new to Databricks. I am looking for a large public dataset for my school project, and I came across an AWS public dataset at this link: https://registry.opendata.aws/target/

I am using Python on Databricks, and I don't know how to establish a connection to the data. I found the following how-to guide:

https://databricks.com/wp-content/uploads/2015/08/Databricks-how-to-data-import.pdf?_ga=2.25033139.881714623.1602433762-982722630.1598480448

It mentions an access_key, secret_key, AWS_bucket_name, and mount_name, but I am not sure how to find the respective values.

There is 1 answer

**Alex Ott** · Accepted Answer · 2020-10-13T14:12:44+00:00

This documentation is for non-public S3 buckets.
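
For contrast, here is a rough sketch of what mounting a private bucket usually involves, which is where those four values come from. The access key and secret key are AWS IAM credentials for an account that is allowed to read the bucket, the bucket name is the S3 bucket itself, and the mount name is just a label you choose for the path under /mnt. The helper names below (`s3_mount_source`, `mount_private_bucket`) are made up for illustration; `dbutils` is the object Databricks notebooks provide.

```python
from urllib.parse import quote


def s3_mount_source(access_key, secret_key, bucket_name):
    """Build the s3a:// source URI used when mounting a private bucket.

    The secret key is URL-encoded because it can contain characters
    such as '/' that would otherwise break the URI.
    """
    return f"s3a://{access_key}:{quote(secret_key, safe='')}@{bucket_name}"


def mount_private_bucket(dbutils, access_key, secret_key, bucket_name, mount_name):
    """Mount a private bucket at /mnt/<mount_name> and return the mount point.

    `dbutils` is passed in explicitly so the function can be defined
    outside a notebook; inside Databricks it is already available.
    """
    mount_point = f"/mnt/{mount_name}"
    dbutils.fs.mount(s3_mount_source(access_key, secret_key, bucket_name), mount_point)
    return mount_point


# Hypothetical usage in a notebook (all values are placeholders):
# mount_private_bucket(dbutils, "<access_key>", "<secret_key>", "<bucket>", "my-data")
```

None of this is needed for a public bucket, since there are no credentials to supply.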

For this dataset, you can simply read the data using its s3://... URL, like this:

df = spark.read.format("text").load("s3://gdc-target-phs000218-2-open/")

I used the text file format only as an example. Because this dataset stores its data as XML, you'll need something like the spark-xml library to extract the data you need.
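
As a minimal sketch of the spark-xml route, assuming the spark-xml library is installed on the cluster: the reader is selected with `format("xml")` and needs a `rowTag` option naming the XML element that should become one row. The helper names (`s3_url`, `read_xml`) and the `"record"` row tag are placeholders, not something taken from the dataset; you would have to inspect a file to find the right tag, and possibly point the path at a subfolder or glob rather than the bucket root.

```python
def s3_url(bucket, prefix="", scheme="s3"):
    """Build an S3 URL such as s3://bucket/some/prefix/."""
    prefix = prefix.strip("/")
    return f"{scheme}://{bucket}/{prefix + '/' if prefix else ''}"


def read_xml(spark, path, row_tag):
    """Read XML files into a DataFrame via the spark-xml data source.

    `row_tag` is the name of the XML element treated as one row.
    """
    return spark.read.format("xml").option("rowTag", row_tag).load(path)


BUCKET = "gdc-target-phs000218-2-open"  # bucket name from the read example above

# In a Databricks notebook `spark` is predefined. The row tag here is a
# placeholder and must be replaced with an element name from the actual files:
# df = read_xml(spark, s3_url(BUCKET), row_tag="record")
# df.printSchema()
```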