How can we stop the size of Parquet files from increasing when writing to disk after a repartition?


I am trying to merge small files (each less than 512 MB) in an HDFS directory. After merging, the size on disk is larger than the input size. Is there any way to control the output size efficiently?

df = spark.read.parquet("/./")
# magic_number = total input size in MB divided by the 512 MB target
magic_number = total_input_size_mb // 512

df.repartition(magic_number).write.save("/./")

The repartition causes a lot of shuffling, and the input files are in Parquet format.
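The magic number here is meant to be the total input size divided by a 512 MB target. Below is a minimal Scala sketch of how that could be computed from the actual on-disk size in HDFS; the paths are placeholders, and it assumes `spark` is an existing SparkSession and the input directory holds only the small Parquet files:

import org.apache.hadoop.fs.Path

// Hypothetical paths; substitute the real HDFS directories.
val inputPath  = new Path("/data/small_files")
val outputPath = "/data/merged"

// Total on-disk size (bytes) of everything under the input directory.
val fs = inputPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
val totalBytes = fs.getContentSummary(inputPath).getLength

// Aim for roughly 512 MB per output file, never fewer than one partition.
val targetFileBytes = 512L * 1024 * 1024
val numPartitions = math.max(1L, totalBytes / targetFileBytes).toInt

spark.read.parquet(inputPath.toString)
  .repartition(numPartitions)
  .write
  .parquet(outputPath)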


1 Answer

Answered by Kieran
import org.apache.spark.util.SizeEstimator

// Estimate the DataFrame's in-memory size in bytes
val numBytes = SizeEstimator.estimate(df)

// Target size of each output file in bytes
val desiredBytesPerFile: Long = ???

// coalesce expects an Int partition count
df.coalesce((numBytes / desiredBytesPerFile).toInt).write.save("/./")

This will give you approximately the right number of bytes per file.
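As a rough usage sketch, the `???` placeholder could be filled with the 512 MB target from the question; the value below is only an illustration, and note that SizeEstimator reports an in-memory estimate, which can differ from the compressed size Parquet actually writes to disk:

// Illustrative target only: roughly 512 MB per output file.
val desiredBytesPerFile = 512L * 1024 * 1024

// Never pass 0 to coalesce if the estimate is smaller than the target.
val numFiles = math.max(1L, numBytes / desiredBytesPerFile).toInt

df.coalesce(numFiles).write.save("/./")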