I have a Dataset similar to the structure below:

col_A  col_B  date
1      5      2021-04-14
2      7      2021-04-14
3      5      2021-04-14
4      9      2021-04-14
I am trying to use the code below in Spark Java to write the dataset to a file in HDFS.
Dataset<Row> outputDataset; // This is a valid dataset and works flawlessly when written to csv
/*
some code which sets the outputDataset
*/
outputDataset
    .repartition(1)                        // single output file per partition
    .write()
    .partitionBy("date")                   // expecting one folder per distinct date
    .format("com.crealytics.spark.excel")
    .option("header", "true")
    .save("/saveLoc/sales");
Normal working case:
When I use .format("csv"), the above code creates a folder named date=2021-04-14 inside the path /saveLoc/sales that is passed to .save(), which is exactly as expected. The full path of the end file is /saveLoc/sales/date=2021-04-14/someFileName.csv. Also, the column date is removed from the file, since it was partitioned on.
What I need to do:
However, when I use .format("com.crealytics.spark.excel"), it just creates a plain file called sales in the folder saveLoc, and doesn't remove the partitioned column (date) from the end file. Does that mean it isn't partitioning on the column "date"? The full path of the file created is /saveLoc/sales. Please note that it overwrites the folder "sales" with a file named sales.
The Excel plugin used is described here: https://github.com/crealytics/spark-excel
How can I make it partition when writing to Excel? In other words, how can I make it behave exactly as it did in the CSV case?
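To make the expected behaviour concrete: what I want is one output directory per distinct date, with the date column dropped from each file. A plain-Java simulation of that grouping (standing in for the Spark calls, since this is just to illustrate the intent, not an actual spark-excel invocation):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ManualPartition {
    // Groups rows by their date column into Hive-style directory names
    // (date=<value>) and drops the date column from each row, mimicking
    // what partitionBy("date") does for the CSV writer.
    static Map<String, List<String>> partitionByDate(List<String[]> rows) {
        // each row is {col_A, col_B, date}
        return rows.stream().collect(Collectors.groupingBy(
                r -> "date=" + r[2],                        // directory name
                Collectors.mapping(r -> r[0] + "," + r[1],  // date column dropped
                        Collectors.toList())));
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"1", "5", "2021-04-14"},
                new String[]{"2", "7", "2021-04-14"});
        System.out.println(partitionByDate(rows));
        // each entry would then be written under /saveLoc/sales/<dir>/
    }
}
```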
Versions used:
spark-excel: com.crealytics:spark-excel_2.11
spark-core: org.apache.spark:spark-core_2.11
Thanks.