Reading data from S3 with partitions of unequal columns

Question

Reading data from S3 with partitions of unequal columns

754 views Asked by bunnylorr At 05 May 2021 at 22:53

I have some partitioned data in S3 and each partition is having different number of columns like below. When I read the data in pyspark and tru to print schema I can only read columns which are commonly present in all partitions but not all. What is the best way to read all columns and rename few columns.

aws s3 ls s3://my-bkt/test_data/
            PRE occ_dt=20210426/
            PRE occ_dt=20210428/
            PRE occ_dt=20210429/
            PRE occ_dt=20210430/
            PRE occ_dt=20210503/
            PRE occ_dt=20210504/
            

spark.read.parquet("aws s3 ls s3://my-bkt/test_data/").printSchema()
 |-- map_api__450jshb457: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- map_api_592yd749dn: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- map_api_has_join: string (nullable = true)


# When I read partition 20210504
spark.read.parquet("aws s3 ls s3://my-bkt/test_data/occ_dt=20210504/").printSchema()
 |-- map_api__450jshb457: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- map_api_592yd749dn: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- map_api_has_join: string (nullable = true)
 |-- cust_activity: string (nullable = true)
 |-- map_api__592rtddvid: string (nullable = true)



# When I read partition 20210503
spark.read.parquet("aws s3 ls s3://my-bkt/test_data/occ_dt=20210503/").printSchema()
 |-- map_api__450jshb457: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- map_api_592yd749dn: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- map_api_4js3nnju8572d93: string (nullable = true)
 |-- map_api_58943h64u47v: string (nullable = true)
 |-- map_api__58943h6220dh: string (nullable = true)

As shown above, there are more fields in partitions 20210503 & 20210504 than other partitions. When I read the s3 bucket to get the schema only fields that are common in all partitions are displayed. I'd like to get the expected outcome to be as below with all fields returned when I read an s3 loc.

Expected Output : 
spark.read.parquet("aws s3 ls s3://my-bkt/test_data/").printSchema()
|-- map_api__450jshb457: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- first_name: string (nullable = true)
|-- map_api_592yd749dn: string (nullable = true)
|-- last_name: string (nullable = true)
|-- map_api_has_join: string (nullable = true)
|-- map_api_4js3nnju8572d93: string (nullable = true)
|-- map_api_58943h64u47v: string (nullable = true)
|-- map_api__58943h6220dh: string (nullable = true)
|-- cust_activity: string (nullable = true)
|-- map_api__592rtddvid: string (nullable = true)

Thanks in advance!!

Original Q&A

There are 1 answers

**bunnylorr** · Answer 1 · 2021-05-05T23:33:43+00:00

Added mergeSchema in option.

spark.read.option("mergeSchema", "true").parquet("aws s3 ls s3://my-bkt/test_data/").printSchema()
|-- map_api__450jshb457: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- first_name: string (nullable = true)
|-- map_api_592yd749dn: string (nullable = true)
|-- last_name: string (nullable = true)
|-- map_api_has_join: string (nullable = true)
|-- map_api_4js3nnju8572d93: string (nullable = true)
|-- map_api_58943h64u47v: string (nullable = true)
|-- map_api__58943h6220dh: string (nullable = true)
|-- cust_activity: string (nullable = true)
|-- map_api__592rtddvid: string (nullable = true)

TechQA.

Reading data from S3 with partitions of unequal columns

There are 1 answers

Related Questions in APACHE-SPARK

Related Questions in PYSPARK

Related Questions in SPARK-SHELL

Popular Questions

Popular Tags

Trending Questions