xlsx - columns getting mapped incorrectly while reading into spark dataframe using pyspark


I'm trying to read an Excel file using the PySpark code below:

df_data = spark.read.format("com.crealytics.spark.excel") \
            .option("header", "true") \
            .option("dataAddress", f"'{sheet_name}'!A1") \
            .option("treatEmptyValuesAsNulls", "false")\
            .schema(custom_schema) \
            .load(file_path)

The column names are not mapped in the same order as they appear in the file. For example:

File:
col1 col2 col3
12   23   null

DataFrame output:
col2 col3 col1
null 12   23

How can I fix the column mapping so the columns line up correctly? Thanks in advance.

1 Answer

DileeprajnarayanThumula

I tried the approach below:

from pyspark.sql.types import StringType, StructField, StructType
file_path = "/FileStore/tables/exclk.xlsx"
sheet_name = "Sheet1" 
schema = StructType([
    StructField("col1", StringType(), nullable=True),
    StructField("col2", StringType(), nullable=True),
    StructField("col3", StringType(), nullable=True)
])
desired_order = ['col1', 'col2', 'col3']
df_data = spark.read.format("com.crealytics.spark.excel") \
            .option("header", "true") \
            .option("dataAddress", f"'{sheet_name}'!A1") \
            .option("treatEmptyValuesAsNulls", "false") \
            .schema(schema) \
            .load(file_path)
df_data = df_data.select(desired_order)
df_data.show()

Results:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|  12|  23|NULL|
|  34|  45|  56|
+----+----+----+

The code above reads the Excel file, applies the specified schema, and selects the columns in the desired order.
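One caveat worth noting: when you pass a schema to `spark.read`, the field names are applied to the sheet's columns by position, not matched by header name, so a schema declared in a different order than the file silently relabels the columns. As a defensive sketch (the `reorder_columns` helper is hypothetical, not part of spark-excel or PySpark), you can validate the desired order against the columns the reader actually produced before calling `select`:

```python
def reorder_columns(actual_columns, desired_order):
    """Return desired_order if every name exists in actual_columns,
    raising a clear error otherwise (e.g. on a header typo)."""
    missing = [c for c in desired_order if c not in actual_columns]
    if missing:
        raise ValueError(f"columns missing from sheet: {missing}")
    return desired_order

# Usage with the DataFrame from above:
#   df_data = df_data.select(reorder_columns(df_data.columns, desired_order))
print(reorder_columns(["col2", "col3", "col1"], ["col1", "col2", "col3"]))
# prints ['col1', 'col2', 'col3']
```

This fails fast with a readable message if the sheet's headers drift from what the schema expects, instead of producing a misaligned DataFrame.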