I have a Hudi table with a composite primary key made from two columns, col1 and col2, with the following configuration:
commonConfig = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': 'tx_commit_time',
    'hoodie.table.name': 'tableName',
    'hoodie.datasource.hive_sync.database': 'dbName',
    'hoodie.datasource.hive_sync.table': 'tableName',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.recordkey.field': 'col1,col2',
    'hoodie.datasource.write.partitionpath.field': 'partitionField',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.partition_fields': 'partitionField'
}
incrementalConfig = {
    'hoodie.upsert.shuffle.parallelism': 20,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.cleaner.commits.retained': 0
}
deleteDataConfig = {
    'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload'
}
For each value of col1 there are two records with different values for col2, something like this:
| col1 | col2 |
| ---- | ---- |
| x | 1 |
| x | 2 |
| y | 1 |
| y | 2 |
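
As a sanity check on the key layout, here is a minimal sketch of the record keys one would expect for these four rows. It assumes that ComplexKeyGenerator joins the configured key fields as `field:value` pairs separated by commas; `expected_record_key` is a hypothetical helper written for illustration and is not part of Hudi.

```python
def expected_record_key(row, key_fields):
    """Build the key string assumed for ComplexKeyGenerator: 'f1:v1,f2:v2'."""
    return ",".join(f"{field}:{row[field]}" for field in key_fields)


# The sample rows from the table above.
rows = [
    {"col1": "x", "col2": 1},
    {"col1": "x", "col2": 2},
    {"col1": "y", "col2": 1},
    {"col1": "y", "col2": 2},
]

# Same field order as 'hoodie.datasource.write.recordkey.field'.
keys = [expected_record_key(row, ["col1", "col2"]) for row in rows]

for key in keys:
    print(key)
```

If the keys are built this way, every row has a distinct key, so a delete for one key should not touch the other row with the same col1. The keys actually stored can be compared against these by selecting the `_hoodie_record_key` metadata column from the table.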
I have a PySpark Glue script that receives col1 and col2 as parameters and deletes the relevant record:
sqlStatement = f"select * from {dbName}.{tableName} where col1='{col1}' and col2={col2}"
rowToBeDeletedDF = spark.sql(sqlStatement)
print(rowToBeDeletedDF.count())
combinedConf = {**commonConfig, **incrementalConfig, **deleteDataConfig}
rowToBeDeletedDF.write.format("hudi").options(**combinedConf).mode("Append").save(targetPath)
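
To make explicit what the write above is configured with, the following sketch rebuilds the merged options from a trimmed copy of the dictionaries and prints the settings that matter for the delete. The `deleteOperationConf` variant is an assumption added only for comparison: it uses Hudi's `delete` write operation instead of the empty payload class, and whether it behaves differently for this table is untested.

```python
# Trimmed copies of the dictionaries above (only the keys relevant to the delete).
commonConfig = {
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.recordkey.field': 'col1,col2',
    'hoodie.datasource.write.partitionpath.field': 'partitionField',
}
incrementalConfig = {'hoodie.datasource.write.operation': 'upsert'}
deleteDataConfig = {
    'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload'
}

# Later dictionaries win on duplicate keys; here there are none, so this is a plain union.
combinedConf = {**commonConfig, **incrementalConfig, **deleteDataConfig}

# Hypothetical alternative for comparison: an explicit delete operation, no payload class.
deleteOperationConf = {**commonConfig, 'hoodie.datasource.write.operation': 'delete'}

for name, conf in [('combinedConf', combinedConf), ('deleteOperationConf', deleteOperationConf)]:
    print(name)
    print('  operation :', conf['hoodie.datasource.write.operation'])
    print('  record key:', conf['hoodie.datasource.write.recordkey.field'])
    print('  payload   :', conf.get('hoodie.datasource.write.payload.class', '(default)'))
```

Printing the merged options confirms that the delete is issued as an upsert with an empty payload and that both key columns are still listed as the record key at write time.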
The problem is that although only one row is returned in the DataFrame (as expected), applying the deletion removes both rows that share the same col1 value.
It seems that Hudi ignores the second part of the composite primary key when applying the deletion.
What am I missing?