Spark automatically decides the number of partitions based on the size of the input file. I have two questions:
Can I specify the number of partitions myself, rather than letting Spark decide how many to use?
How bad is the shuffle triggered by a repartition? Is it really expensive for performance? In my case I need to repartition to 1 so I can write out a single Parquet file, and the data currently sits in 31 partitions. How bad is that, and why?
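Roughly what I'm doing (a minimal sketch; `df` and the paths are placeholders for my actual data):

```scala
// Reading the input; Spark chooses the partition count itself (31 in my case).
val df = spark.read.parquet("/path/to/input")
println(df.rdd.getNumPartitions)  // prints 31 for my file

// Repartitioning down to 1 so the output directory contains a single Parquet file.
df.repartition(1)
  .write
  .parquet("/path/to/output")
```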
`repartition` and `coalesce` are the two functions used to change the number of partitions once the data has been read.
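For example (a minimal sketch, assuming `df` is an already-loaded DataFrame and the output path is a placeholder):

```scala
// repartition(n) performs a full shuffle and can either increase or
// decrease the number of partitions; data is redistributed evenly.
val evenlySpread = df.repartition(8)

// coalesce(n) only merges existing partitions, so it avoids a full
// shuffle, but it can only decrease the partition count.
val merged = df.coalesce(1)

// For writing a single Parquet file, coalesce(1) is often cheaper
// than repartition(1) because it skips the shuffle step.
merged.write.parquet("/path/to/single-file-output")
```

One caveat: `coalesce(1)` collapses the preceding stage to a single task, so if the upstream computation is heavy, `repartition(1)` (which keeps the earlier stage parallel and only shuffles at the end) can actually finish faster despite the shuffle.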