Reading data from S3 with partitions of unequal columns

735 views Asked by At

I have some partitioned data in S3 and each partition is having different number of columns like below. When I read the data in pyspark and tru to print schema I can only read columns which are commonly present in all partitions but not all. What is the best way to read all columns and rename few columns.

aws s3 ls s3://my-bkt/test_data/
            PRE occ_dt=20210426/
            PRE occ_dt=20210428/
            PRE occ_dt=20210429/
            PRE occ_dt=20210430/
            PRE occ_dt=20210503/
            PRE occ_dt=20210504/
            

spark.read.parquet("aws s3 ls s3://my-bkt/test_data/").printSchema()
 |-- map_api__450jshb457: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- map_api_592yd749dn: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- map_api_has_join: string (nullable = true)


# When I read partition 20210504
spark.read.parquet("aws s3 ls s3://my-bkt/test_data/occ_dt=20210504/").printSchema()
 |-- map_api__450jshb457: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- map_api_592yd749dn: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- map_api_has_join: string (nullable = true)
 |-- cust_activity: string (nullable = true)
 |-- map_api__592rtddvid: string (nullable = true)



# When I read partition 20210503
spark.read.parquet("aws s3 ls s3://my-bkt/test_data/occ_dt=20210503/").printSchema()
 |-- map_api__450jshb457: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- map_api_592yd749dn: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- map_api_4js3nnju8572d93: string (nullable = true)
 |-- map_api_58943h64u47v: string (nullable = true)
 |-- map_api__58943h6220dh: string (nullable = true)
 

As shown above, there are more fields in partitions 20210503 & 20210504 than other partitions. When I read the s3 bucket to get the schema only fields that are common in all partitions are displayed. I'd like to get the expected outcome to be as below with all fields returned when I read an s3 loc.

Expected Output : 
spark.read.parquet("aws s3 ls s3://my-bkt/test_data/").printSchema()
|-- map_api__450jshb457: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- first_name: string (nullable = true)
|-- map_api_592yd749dn: string (nullable = true)
|-- last_name: string (nullable = true)
|-- map_api_has_join: string (nullable = true)
|-- map_api_4js3nnju8572d93: string (nullable = true)
|-- map_api_58943h64u47v: string (nullable = true)
|-- map_api__58943h6220dh: string (nullable = true)
|-- cust_activity: string (nullable = true)
|-- map_api__592rtddvid: string (nullable = true)

Thanks in advance!!

1

There are 1 answers

0
bunnylorr On

Added mergeSchema in option.

spark.read.option("mergeSchema", "true").parquet("aws s3 ls s3://my-bkt/test_data/").printSchema()
|-- map_api__450jshb457: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- first_name: string (nullable = true)
|-- map_api_592yd749dn: string (nullable = true)
|-- last_name: string (nullable = true)
|-- map_api_has_join: string (nullable = true)
|-- map_api_4js3nnju8572d93: string (nullable = true)
|-- map_api_58943h64u47v: string (nullable = true)
|-- map_api__58943h6220dh: string (nullable = true)
|-- cust_activity: string (nullable = true)
|-- map_api__592rtddvid: string (nullable = true)