How to read JSON files faster from AWS S3 using PySpark


I have written PySpark code in an AWS Glue ETL job that needs to read many small (2 KB) JSON files from an AWS S3 path. The S3 layout is bucket, prefix, and then partition keys: accountid, region, and type. So, for the S3 path, I am using a wildcard search as follows:

path = 's3://bucket/prefix/accountid=*/region=*/type=*/*.json'
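For illustration, here is a minimal sketch of how such a wildcard path might be assembled from its parts; the bucket and prefix names below are placeholders, not the real ones.

bucket = "my-bucket"   # placeholder bucket name
prefix = "my-prefix"   # placeholder prefix
# Wildcards over the Hive-style partition directories (accountid, region, type)
path = f"s3://{bucket}/{prefix}/accountid=*/region=*/type=*/*.json"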

And to read:

df = spark.read.json(path).select(selected_columns).dropDuplicates()

But it is taking too long to read all the data, about an hour. How can I read this data within a few minutes, or even seconds? I also tried supplying a custom schema, but it did not help much. Please help.
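For reference, a minimal sketch of what the custom-schema attempt looks like, assuming the JSON records have simple string fields; the field names id and status are hypothetical placeholders, not from the actual data.

from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema; the real field names are not shown in the question.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("status", StringType(), True),
])

# Supplying a schema up front lets Spark skip the schema-inference pass
# that would otherwise scan every small JSON file before the real read.
df = (
    spark.read
    .schema(schema)
    .json(path)
    .select(selected_columns)
    .dropDuplicates()
)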
