How can we stop the size of Parquet files from increasing when writing to disk by doing a repartition?


I am trying to merge small files (each less than 512 MB) in an HDFS directory. After merging, the size of the files on disk is larger than the input size. Is there any way to control the output size efficiently?

df = spark.read.parquet("/./")

# total_input_size_mb: total size of the input files in MB (placeholder)
magic_number = max(1, int(total_input_size_mb / 512))

df.repartition(magic_number).write.save("/./")

Repartition is causing a lot of shuffling, and the input files are in Parquet format.
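For reference, one way to size the output by the actual on-disk bytes of the input (rather than guessing) is to ask the Hadoop FileSystem for the directory size and divide by the target file size. This is only a sketch in Scala: the directory paths are hypothetical placeholders, and the 512 MB target is taken from the question.

import org.apache.hadoop.fs.{FileSystem, Path}

val inputDir = new Path("/input/dir")        // hypothetical input directory
val outputDir = "/output/dir"                // hypothetical output directory
val targetBytesPerFile = 512L * 1024 * 1024  // ~512 MB per output file, from the question

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// Total on-disk size of all files under the input directory, in bytes.
val totalBytes = fs.getContentSummary(inputDir).getLength

// One output file per ~512 MB of input, never fewer than one.
val numFiles = math.max(1, (totalBytes / targetBytesPerFile).toInt)

spark.read.parquet(inputDir.toString)
  .coalesce(numFiles)         // coalesce avoids the full shuffle that repartition triggers
  .write.parquet(outputDir)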


1 Answer

Kieran answered:
import org.apache.spark.util.SizeEstimator

// Estimate the size of the DataFrame in bytes.
val numBytes = SizeEstimator.estimate(df)

// Target size per output file, in bytes.
val desiredBytesPerFile: Long = ???

// Coalesce to roughly one partition per desired file size (at least one),
// avoiding the full shuffle that repartition would cause.
df.coalesce(math.max(1, (numBytes / desiredBytesPerFile).toInt)).write.save("/./")

This will give you approximately the right number of bytes per file.
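For instance, with the 512 MB target from the question filled in, a minimal usage sketch of the above might look like this (the output path is a placeholder):

import org.apache.spark.util.SizeEstimator

val desiredBytesPerFile = 512L * 1024 * 1024  // ~512 MB per file, from the question
val numFiles = math.max(1, (SizeEstimator.estimate(df) / desiredBytesPerFile).toInt)

df.coalesce(numFiles).write.save("/./")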