I am trying to merge small files (each less than 512 MB) in an HDFS directory. After merging, the size on disk is larger than the total input size. Is there any way to control the output size efficiently?
df = spark.read.parquet("/./")
# Target ~512 MB per output file; total_input_bytes is measured below
magic_number = max(1, int(total_input_bytes / (512 * 1024 * 1024)))
df.repartition(magic_number).write.parquet("/./")
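The total input size is not available from the DataFrame itself, so it has to be measured on HDFS first. Here is a minimal sketch that does this through Spark's JVM gateway; note that spark._jvm and spark._jsc are internal interfaces, and "/./" is just the placeholder path from the question.

jvm = spark._jvm
# Ask the Hadoop FileSystem for the total byte count under the directory
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
total_input_bytes = fs.getContentSummary(jvm.org.apache.hadoop.fs.Path("/./")).getLength()

The same number is available from the shell with hdfs dfs -du -s <dir>.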
Repartition causes a lot of shuffling, and the input files are in Parquet format.
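If the goal is only to reduce the number of output files, coalesce avoids the full shuffle that repartition triggers: it merges existing partitions in place instead of redistributing every row. A minimal sketch, reusing magic_number from above:

# coalesce narrows the existing partitions without a full shuffle,
# so it is much cheaper than repartition for merging small files
df.coalesce(magic_number).write.parquet("/./")

The trade-off is that coalesce does not rebalance rows, so the output files can be unevenly sized; repartition produces evenly sized files at the cost of the shuffle.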
This will give you approximately the right number of bytes per file, since the data ends up split across roughly total_size / 512 MB partitions of about 512 MB each.
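As for the output being larger than the input: Spark writes Parquet with snappy compression by default, and a repartition shuffles rows more or less randomly, which can destroy any ordering the source files had and make Parquet's dictionary and run-length encodings less effective. If that is the cause, sorting within partitions before the write can recover some of the size. This is only a sketch, and "cluster_col" is a hypothetical stand-in for whatever low-cardinality column the data naturally clusters on.

# Hypothetical: restore row locality so Parquet's encodings compress well again
(df.repartition(magic_number)
   .sortWithinPartitions("cluster_col")
   .write.parquet("/./"))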