I am running some experiments on HDFS block size (dfs.block.size) and Parquet row group size (parquet.block.size).
I have a large set of data in HDFS, and I want to replicate that data with various block sizes and row group sizes for testing. I'm able to copy the data with a different block size using:
hdfs dfs -D dfs.block.size=67108864 -D parquet.block.size=67108864 -cp /new_sample_parquet /new_sample_parquet_64M
But only the dfs.block.size gets changed. I am verifying with hdfs dfs -stat for the block size and parquet-tools meta for the row group size. In fact, if I replace parquet.block.size with blah.blah.blah, it has the same effect. I even went into spark-shell and set the parquet.block.size property manually using sc.hadoopConfiguration.setInt("parquet.block.size", 67108864). I am using Hadoop 3.1.0. I got the property name parquet.block.size from here.
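For reference, the checks I'm running look roughly like this (the file name is just a placeholder for one of the part files Spark wrote):

hdfs dfs -stat "%o" /new_sample_parquet_64M/part-00000.parquet   # %o prints the HDFS block size in bytes
parquet-tools meta /new_sample_parquet_64M/part-00000.parquet    # prints the row groups (RC/TS/OFFSET) of the file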
Here are the first 10 rows of the output of my attempt:
row group 1: RC:4140100 TS:150147503 OFFSET:4
row group 2: RC:3520100 TS:158294646 OFFSET:59176084
row group 3: RC:880100 TS:80122359 OFFSET:119985867
row group 4: RC:583579 TS:197303521 OFFSET:149394540
row group 5: RC:585594 TS:194850776 OFFSET:213638039
row group 6: RC:2620100 TS:130170698 OFFSET:277223867
row group 7: RC:2750100 TS:136761819 OFFSET:332088066
row group 8: RC:1790100 TS:86766854 OFFSET:389772650
row group 9: RC:2620100 TS:125876377 OFFSET:428147454
row group 10: RC:1700100 TS:83791047 OFFSET:483600973
As you can see, the TS (total size) is way larger than 64 MB (67108864 bytes).
My current theory:
I am doing this in spark-shell:
sc.hadoopConfiguration.setInt("parquet.block.size", 67108864)
val a = spark.read.parquet("my_sample_data")
a.rdd.getNumPartitions // 1034
val s = a.coalesce(27)
s.write.format("parquet").mode("Overwrite").options(Map("dfs.block.size" -> "67108864")).save("/my_new_sample_data")
So perhaps it's because my input data already has 1034 partitions. I'm really not sure. My data has about 118 columns per row.
The parquet.block.size property only affects Parquet writers. The hdfs dfs -cp command, on the other hand, copies files regardless of their contents, so the parquet.block.size property is ignored by hdfs dfs -cp.

Imagine that you have an application that takes screenshots in either JPG or PNG format, depending on a config file. You make a copy of those screenshots with the cp command. Naturally, even if you change the desired image format in the config file, the cp command will always create output files in the image format of the original files, regardless of the config file. The config file is only used by the screenshot application and not by cp. This is how the parquet.block.size property works as well.

What you can do to change the block size is to rewrite the file. You mentioned that you have spark-shell. Use that to rewrite the Parquet file.
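For example, something like this (a minimal sketch; the input and output paths are placeholders, and the row group size is set before the write so the Parquet writer picks it up):

sc.hadoopConfiguration.setInt("parquet.block.size", 67108864)
val df = spark.read.parquet("/path/to/input")    // placeholder: the data you want to rewrite
df.write.parquet("/path/to/output")              // placeholder: the copy written with the 64 MB row group setting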
Update: Since you mentioned in the comments below that it does not work for you, I made an experiment and am posting the session transcript below:
By looking at the TS values you can see that the input file had a row group size of 4.5-5M and the output files have row group sizes of 200K and 300K, respectively. This shows that the value set using sc.hadoopConfiguration becomes the "default", while the other method you mentioned in a comment below involving df.options overrides this default.
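That is, roughly (a sketch; the option key is the same parquet.block.size setting, and the output path is a placeholder):

// takes precedence over the sc.hadoopConfiguration default, for this write only
df.write.option("parquet.block.size", "67108864").parquet("/path/to/output")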
Update 2: Now that you have posted your output, I can see what is going on. In your case, compression is taking place, increasing the amount of data that will fit in row groups. The row group size applies to the compressed data, but TS shows the size of the uncompressed data. However, you can deduce the size of the row groups by subtracting their starting offsets. For example, the compressed size of your first row group is 59176084 - 4 = 59176080 bytes or less (since padding can take place as well). I copied your results into /tmp/rowgroups.dat on my computer and calculated your row group sizes by issuing the following command:
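Something along these lines (a sketch: it strips everything up to OFFSET: from each line and feeds the offsets to numinterval, assuming numinterval reads the numbers from standard input and prints the differences between consecutive values):

sed 's/.*OFFSET://' /tmp/rowgroups.dat | numinterval
# differences between consecutive offsets, i.e. upper bounds on the compressed row group sizes:
# 59176080 60809783 29408673 64243499 63585828 54864199 57684584 38374804 55453519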
(The numinterval command is in the num-utils package on Ubuntu.) As you can see, all of your row groups are smaller than the row group size you specified. (The reason why they are not exactly the specified size is PARQUET-1337.)