How to read JSON files faster from AWS S3 using PySpark


I have written PySpark code in an AWS Glue ETL job that needs to read many small (2 KB) JSON files from an AWS S3 path. The S3 layout is bucket, prefix, and then partition keys: accountid, region, and type. So, for the S3 path, I am using a wildcard search as follows:

path = 's3://bucket/prefix/accountid=*/region=*/type=*/*.json'
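For illustration, here is a minimal sketch of how such a wildcard path might be assembled from its parts; the bucket and prefix names below are placeholders, not the real ones.

bucket = "my-bucket"   # placeholder bucket name
prefix = "my-prefix"   # placeholder prefix
# Wildcards over the Hive-style partition directories (accountid, region, type)
path = f"s3://{bucket}/{prefix}/accountid=*/region=*/type=*/*.json"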

And to read:

df = spark.read.json(path).select(selected_columns).dropDuplicates()

But it is taking too long to read all the data, about an hour. How can I read this data within a few minutes, or even seconds? I also tried supplying a custom schema, but it did not help much. Please help.
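For reference, a minimal sketch of what the custom-schema attempt looks like, assuming the JSON records have simple string fields; the field names id and status are hypothetical placeholders, not from the actual data.

from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema; the real field names are not shown in the question.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("status", StringType(), True),
])

# Supplying a schema up front lets Spark skip the schema-inference pass
# that would otherwise scan every small JSON file before the real read.
df = (
    spark.read
    .schema(schema)
    .json(path)
    .select(selected_columns)
    .dropDuplicates()
)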
