I'm looking for a way to enforce a specific size limit (4 GB) per file when writing a DataFrame out as CSV in PySpark. I have already tried using maxPartitionBytes, but it is not working as expected.
Below is what I used and tested on a 90 GB ORC-formatted Hive table. At the export (write) step it produces files of arbitrary sizes rather than 4 GB files.
Any suggestion on how to cap the file size while writing? I don't want to use repartition or coalesce here, since the DataFrame goes through a lot of wide transformations.
df.write.format("csv").mode("overwrite").option("maxPartitionBytes", 4*1024*1024(1024).save(outputpath)
According to the Spark documentation, spark.sql.files.maxPartitionBytes applies on read: it controls how the input files are split into partitions. If you do shuffles later, the final task sizes change again, which is why the files you get on write end up with arbitrary sizes.
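For illustration, a minimal sketch of where that setting actually takes effect, assuming a Hive-backed session; the table name is a placeholder, not from your question:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("read-side-split-demo")
        # caps how many bytes go into a single *read* partition (~4 GB here);
        # any later shuffle re-partitions the data again before the write
        .config("spark.sql.files.maxPartitionBytes", str(4 * 1024 * 1024 * 1024))
        .enableHiveSupport()
        .getOrCreate()
    )

    df = spark.table("my_orc_table")  # hypothetical Hive table name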
You may instead try spark.sql.files.maxRecordsPerFile, which according to the documentation applies on write. If that does not do the trick, I think the other option is, as you mentioned, to repartition the dataset just before the write.
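A minimal sketch of that write-side approach, assuming you can estimate an average row size to translate 4 GB into a record count; avg_row_bytes and the 200-byte figure are assumptions, not from your data:

    # estimate how many records fit into ~4 GB; the average row size is an assumption
    avg_row_bytes = 200
    max_records = (4 * 1024 * 1024 * 1024) // avg_row_bytes

    # maxRecordsPerFile is enforced at write time, so no repartition/coalesce is needed
    spark.conf.set("spark.sql.files.maxRecordsPerFile", max_records)

    df.write.format("csv").mode("overwrite").save(outputpath)

This only caps the number of records per file, so the resulting file sizes will track 4 GB only as closely as your row-size estimate does.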