I wanted to convert one day's Avro data (~2 TB) to Parquet.
I ran a Hive query and the data was successfully converted to Parquet.
But the data size became 6 TB.
What could have happened for the data to become three times the size?
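The original query is not shown, but an Avro-to-Parquet conversion in Hive typically looks something like the sketch below. Table and column names here are hypothetical, purely for illustration.

```sql
-- Hypothetical sketch of an Avro-to-Parquet conversion in Hive;
-- the table and column names are assumptions, not from the original post.
CREATE TABLE events_parquet (
  event_id STRING,
  event_ts BIGINT,
  payload  STRING
)
STORED AS PARQUET;

-- Copy one day's worth of rows from the Avro-backed table
INSERT OVERWRITE TABLE events_parquet
SELECT event_id, event_ts, payload
FROM events_avro;
```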
Typically, Parquet can be more efficient than Avro: because it is a columnar format, columns of the same type are stored adjacently on disk, which lets compression algorithms work more effectively in many cases. We typically use Snappy, which is sufficient, light on CPU, and has several properties that make it well suited to Hadoop compared with other codecs such as zip or gzip. Mainly, Snappy-compressed Parquet remains splittable, and each block retains the information needed to determine the schema. Parquet is a great format and we have been very happy with query performance after moving from Avro (we can also use Impala, which is super fast).
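As a minimal sketch of the point about Snappy, this is one way to request Snappy-compressed Parquet output from Hive when rewriting a table. The table names are hypothetical; `parquet.compression` is the standard property for Parquet output, but verify the behaviour against your Hive version.

```sql
-- Ask Hive to write Snappy-compressed Parquet for this session
SET parquet.compression=SNAPPY;

-- CTAS-style conversion; the table property makes the choice explicit
-- on the table itself as well (table names are hypothetical).
CREATE TABLE events_parquet_snappy
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY')
AS
SELECT * FROM events_avro;
```

If the output came out uncompressed, explicitly setting the codec like this is usually the first thing worth checking when the Parquet copy ends up larger than the Avro source.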