I'm trying to save a DataFrame as a CSV file, partitioned by a column:
import org.apache.spark.sql.types._

val schema = new StructType(
  Array(
    StructField("ID", IntegerType, true),
    StructField("State", StringType, true),
    StructField("Age", IntegerType, true)
  )
)

val df = sqlContext.read.format("com.databricks.spark.csv")
  .options(Map("path" -> filePath))
  .schema(schema)
  .load()
df.write.partitionBy("State").format("com.databricks.spark.csv").save(outputPath)
But the output is not saved with any partition info; it looks like partitionBy was completely ignored, and there were no errors. The same code works if I use the parquet format instead:
df.write.partitionBy("State").parquet(outputPath)
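(With parquet, the output directory contains one Hive-style subdirectory per value, e.g. outputPath/State=CA/, outputPath/State=NY/, and so on.)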
What am I missing here?
partitionBy support has to be implemented as part of a given data source, and as of now (v1.3) it is not supported in Spark CSV. See: https://github.com/databricks/spark-csv/issues/123
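Until that is implemented, a possible workaround is to emulate the partition layout manually by writing one directory per distinct value. This is only a sketch, reusing df and outputPath from your question, and it assumes State has a small number of distinct values:

// Collect the distinct partition values, then write each subset
// of rows into a Hive-style State=<value> subdirectory
val states = df.select("State").distinct().collect().map(_.getString(0))
states.foreach { state =>
  df.filter(df("State") === state)
    .write.format("com.databricks.spark.csv")
    .save(s"$outputPath/State=$state")
}

Note that, unlike the built-in partitionBy, this keeps the State column in the output files and runs one Spark job per distinct value, so it is only practical for low-cardinality columns.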