Deleting Hudi records with a composite primary key ignores some parts of the key


I have a Hudi table with a composite primary key made up of two columns, col1 and col2, with the following configuration:

commonConfig = {
    'className' : 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc':'false',
    'hoodie.datasource.write.precombine.field': 'tx_commit_time',
    'hoodie.table.name': 'tableName',
    'hoodie.datasource.hive_sync.database': 'dbName',
    'hoodie.datasource.hive_sync.table': 'tableName',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.recordkey.field': 'col1,col2',
    'hoodie.datasource.write.partitionpath.field': 'partitionField',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.partition_fields': 'partitionField'
}

incrementalConfig = {
    'hoodie.upsert.shuffle.parallelism': 20,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.cleaner.commits.retained': 0
}

deleteDataConfig = {
    'hoodie.datasource.write.payload.class': 'org.apache.hudi.common.model.EmptyHoodieRecordPayload'
}
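
For comparison, here is a minimal sketch of an alternative way to request the delete: setting `hoodie.datasource.write.operation` to `delete` instead of relying on the payload class. The names `deleteOperationConfig`, `build_delete_options` and `exampleCommon` are hypothetical and are not part of the original script. Whether this changes the behaviour described in this question is an assumption that would need to be tested.

```python
# Hypothetical alternative to deleteDataConfig: request a key-based delete
# through the write operation instead of through the payload class.
deleteOperationConfig = {
    'hoodie.datasource.write.operation': 'delete',
}


def build_delete_options(common, delete_overrides):
    """Merge option dicts; entries in delete_overrides win on duplicate keys."""
    return {**common, **delete_overrides}


# Stand-in for commonConfig, reduced to the key-related options from the question.
exampleCommon = {
    'hoodie.table.name': 'tableName',
    'hoodie.datasource.write.precombine.field': 'tx_commit_time',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.recordkey.field': 'col1,col2',
    'hoodie.datasource.write.partitionpath.field': 'partitionField',
}

deleteOptions = build_delete_options(exampleCommon, deleteOperationConfig)

# In the Glue job this would be passed to the writer, for example:
# rowToBeDeletedDF.write.format("hudi").options(**deleteOptions).mode("append").save(targetPath)
```

Note that the key generator and record key fields have to be the same as the ones used when the table was written, otherwise the keys computed for the delete will not match the stored ones.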

For each value of col1 there are two records with different values of col2, for example:

| col1 | col2 |
| ---- | ---- |
| x    | 1    |
| x    | 2    |
| y    | 1    |
| y    | 2    |

I have a PySpark Glue script that receives col1 and col2 as parameters and deletes the matching record:

sqlStatement = f"select * from {dbName}.{tableName} where col1='{col1}' and col2={col2}"
rowToBeDeletedDF = spark.sql(sqlStatement)
print(rowToBeDeletedDF.count())
combinedConf = {**commonConfig, **incrementalConfig, **deleteDataConfig}
rowToBeDeletedDF.write.format("hudi").options(**combinedConf).mode("Append").save(targetPath)

The problem is that although the DataFrame contains only one row (as expected), applying the deletion removes both rows that share the same col1 value.

It seems that Hudi ignores the second part of the composite primary key when applying the deletion.
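
One way to check this is to compare the record keys stored in the table with the keys expected from both columns. The sketch below is a diagnostic aid only: `expected_record_key` is a hypothetical helper, and it assumes that `ComplexKeyGenerator` lays out keys as `field:value` pairs joined by commas. The stored value can be read from the `_hoodie_record_key` metadata column.

```python
def expected_record_key(fields, row):
    """Build a key string in the assumed 'field:value,field:value' layout."""
    return ",".join(f"{name}:{row[name]}" for name in fields)


# Same field list as 'hoodie.datasource.write.recordkey.field' in the question.
record_key_fields = "col1,col2".split(",")

# The four sample rows from the table in the question.
sample_rows = [
    {"col1": "x", "col2": 1},
    {"col1": "x", "col2": 2},
    {"col1": "y", "col2": 1},
    {"col1": "y", "col2": 2},
]

expected_keys = [expected_record_key(record_key_fields, r) for r in sample_rows]

# If both columns take part in the key, every row gets a distinct key.
all_distinct = len(set(expected_keys)) == len(sample_rows)

# In the Glue job, the stored keys could be inspected next to the columns:
# rowToBeDeletedDF.select("_hoodie_record_key", "col1", "col2").show(truncate=False)
```

If the stored `_hoodie_record_key` values contain only col1, the table was most likely written with a different key configuration than the one used for the delete.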

What am I missing?
