Write Spark Dataset to Excel File along with partitioning


I have a Dataset similar to the below structure:

col_A      col_B        date
  1          5       2021-04-14
  2          7       2021-04-14
  3          5       2021-04-14
  4          9       2021-04-14

I am trying to use the code below in Spark Java to write the dataset to a file in HDFS.

Dataset<Row> outputDataset; // This is a valid dataset and works flawlessly when written to csv
/*
   some code which sets the outputDataset
*/
outputDataset
    .repartition(1)
    .write()
    .partitionBy("date")
    .format("com.crealytics.spark.excel")
    .option("header", "true")
    .save("/saveLoc/sales");

Normal Working Case:

When I use .format("csv"), the above code creates a folder named date=2021-04-14 under the path /saveLoc/sales that is passed to .save(), which is exactly as expected. The full path of the resulting file is /saveLoc/sales/date=2021-04-14/someFileName.csv. The date column is also removed from the file, since it was used for partitioning.

What I need to do:

However, when I use .format("com.crealytics.spark.excel"), it just creates a plain file called sales in the folder saveLoc and doesn't remove the partition column (date) from the resulting file. Does that mean it isn't partitioning on the column "date"? The full path of the file created is /saveLoc/sales. Please note that it overwrites the folder "sales" with a file named sales.

The Excel plugin used is described here: https://github.com/crealytics/spark-excel

How can I make it partition when writing to Excel? In other words, how can I make it behave exactly as it does for csv?

Versions used:

spark-excel: com.crealytics.spark-excel_2.11
spark-core: org.apache.spark.spark-core_2.11 (Scala 2.11)

Thanks.
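
One possible workaround, sketched under assumptions rather than confirmed against this spark-excel version: if the Excel writer ignores partitionBy (as the behaviour above suggests), the csv layout can be reproduced by hand by writing one workbook per distinct date into a Hive-style date=<value> folder. The helper below only builds those paths in plain Java. The class name PartitionPaths, the method partitionFile, and the file name sales.xlsx are hypothetical names chosen for illustration. The Spark loop appears only as a comment and is untested.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

class PartitionPaths {

    // Builds a Hive-style path such as /saveLoc/sales/date=2021-04-14/sales.xlsx,
    // mirroring the folder layout that partitionBy("date") produces for csv.
    static String partitionFile(String baseDir, String column, String value, String fileName) {
        // Drop a single trailing slash so the result never contains "//".
        String base = baseDir.endsWith("/")
                ? baseDir.substring(0, baseDir.length() - 1)
                : baseDir;
        return base + "/" + column + "=" + value + "/" + fileName;
    }

    // One target path per distinct partition value, in first-seen order.
    static List<String> partitionFiles(String baseDir, String column,
                                       List<String> values, String fileName) {
        List<String> paths = new ArrayList<>();
        for (String value : new LinkedHashSet<>(values)) {
            paths.add(partitionFile(baseDir, column, value, fileName));
        }
        return paths;
    }

    public static void main(String[] args) {
        // Prints /saveLoc/sales/date=2021-04-14/sales.xlsx
        System.out.println(partitionFile("/saveLoc/sales", "date", "2021-04-14", "sales.xlsx"));

        // Untested sketch of the Spark side (assumes outputDataset from the question
        // and imports of org.apache.spark.sql.Row, SaveMode and functions):
        //
        // for (Row r : outputDataset.select("date").distinct().collectAsList()) {
        //     String d = String.valueOf(r.get(0));
        //     outputDataset
        //         .filter(functions.col("date").equalTo(d))
        //         .drop("date")                       // csv partitioning drops this column too
        //         .write()
        //         .format("com.crealytics.spark.excel")
        //         .option("header", "true")
        //         .mode(SaveMode.Overwrite)
        //         .save(partitionFile("/saveLoc/sales", "date", d, "sales.xlsx"));
        // }
    }
}
```

Collecting the distinct dates to the driver is reasonable only when the number of partitions is small. Each iteration filters the full dataset again, so caching outputDataset beforehand may be worth considering.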
