I'm looking for a way to enforce a specific size limit (4 GB) per file when writing a DataFrame out as CSV in PySpark. I have already tried using maxPartitionBytes, but it is not working as expected.
Below is what I used and tested on a 90 GB ORC-formatted Hive table. At the export (write) step it produces files of arbitrary sizes rather than 4 GB files.
Any suggestion on how to cap the file size while writing? I don't want to use repartition or coalesce here, since the DataFrame goes through a lot of wide transformations.
df.write.format("csv").mode("overwrite").option("maxPartitionBytes", 4*1024*1024(1024).save(outputpath)
According to the Spark documentation, spark.sql.files.maxPartitionBytes applies on read: it controls how the input files are split into partitions. If you do shuffles later, the final task sizes change again, which is why the files you get on write end up with arbitrary sizes.
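For illustration, a minimal sketch of where that setting actually takes effect, assuming a Hive-backed session; the table name is a placeholder, not from your question:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("read-side-split-demo")
        # caps how many bytes go into a single *read* partition (~4 GB here);
        # any later shuffle re-partitions the data again before the write
        .config("spark.sql.files.maxPartitionBytes", str(4 * 1024 * 1024 * 1024))
        .enableHiveSupport()
        .getOrCreate()
    )

    df = spark.table("my_orc_table")  # hypothetical Hive table name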
You may instead try spark.sql.files.maxRecordsPerFile, which according to the documentation applies on write. If that does not do the trick, I think the other option is, as you mentioned, to repartition the dataset just before the write.
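A minimal sketch of that write-side approach, assuming you can estimate an average row size to translate 4 GB into a record count; avg_row_bytes and the 200-byte figure are assumptions, not from your data:

    # estimate how many records fit into ~4 GB; the average row size is an assumption
    avg_row_bytes = 200
    max_records = (4 * 1024 * 1024 * 1024) // avg_row_bytes

    # maxRecordsPerFile is enforced at write time, so no repartition/coalesce is needed
    spark.conf.set("spark.sql.files.maxRecordsPerFile", max_records)

    df.write.format("csv").mode("overwrite").save(outputpath)

This only caps the number of records per file, so the resulting file sizes will track 4 GB only as closely as your row-size estimate does.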