Retroactively add partitions to Parquet files?


I have a Spark job that uses Apache Hudi to write Parquet into our AWS S3 data lake. I have a fairly large dataset (about 20M rows and growing) that I would like to add a new partition to. Is this possible with my existing dataset? Or do I need to rerun my Spark job to recreate all the Parquet files with the new partition configuration?

I am on Spark 3.3.2 and Hudi 0.13.1.


There is 1 answer

parisni

As of current Hudi versions (<= 0.14), yes, you have to rewrite the whole table with the new partition scheme.

The main blocker is that the Parquet files contain the partition path in Hudi's internal columns. You could manually modify some files (such as hoodie.properties), recreate the metadata table from scratch, and so on, but at the end of the day you also need to rewrite the Parquet files to overwrite that column.

Otherwise you will end up with no support for deletes, and possibly other complications.
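A full rewrite could look roughly like the sketch below (PySpark). It is only an illustration under assumptions: the helper names `rewrite_options` and `rewrite_table`, the table name, the field names, and the S3 paths are all hypothetical, and `bulk_insert` is just one reasonable choice of write operation for a one-off reload. The idea is to read the existing table, drop Hudi's internal metadata columns (including `_hoodie_partition_path`) so they get regenerated, and write to a new base path with the new partition field.

```python
# Hudi's per-record metadata columns. The old partition path lives in
# _hoodie_partition_path, which is why the Parquet files must be rewritten.
HUDI_META_COLUMNS = [
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
]


def rewrite_options(table_name, record_key, precombine_field, partition_field):
    """Build Hudi write options for the rewritten table (hypothetical helper)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # The new partition scheme goes here.
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Assumption: bulk_insert for a one-off full reload.
        "hoodie.datasource.write.operation": "bulk_insert",
    }


def rewrite_table(spark, source_path, target_path, options):
    """Read the existing Hudi table and write it to a new path (hypothetical helper).

    `spark` is an existing SparkSession with the Hudi bundle on its classpath.
    """
    df = spark.read.format("hudi").load(source_path)
    # Drop the old metadata columns so Hudi regenerates them on write.
    df = df.drop(*HUDI_META_COLUMNS)
    df.write.format("hudi").options(**options).mode("overwrite").save(target_path)
```

For example, `rewrite_table(spark, "s3://my-bucket/events_v1", "s3://my-bucket/events_v2", rewrite_options("events_v2", "id", "updated_at", "event_date"))`, where the bucket, table, and column names are placeholders. Writing to a separate target path avoids overwriting the table while it is still being read; once the new table is validated, readers and the ingestion job can be pointed at it.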