I am trying an incremental write to a Hudi table with Hive sync enabled, but it is failing with the following error:

23/07/24 11:52:48 INFO org.apache.hudi.hive.HiveSyncTool: Schema difference found for table1
23/07/24 11:52:48 INFO org.apache.hudi.hive.ddl.HMSDDLExecutor: partition table,need cascade
Traceback (most recent call last):
 java.lang.NoSuchMethodError: org.apache.hadoop.hive.metastore.IMetaStoreClient.alter_table_with_environmentContext(Ljava/lang/String;Ljava/lang/String;Lorg/apache/hadoop/hive/metastore/api/Table;Lorg/apache/hadoop/hive/metastore/api/EnvironmentContext;)V

I have examined the logs and found that some additional filter is being applied to the IncrementalRelation. I don't know what this means exactly.

INFO org.apache.hudi.IncrementalRelation: Additional Filters to be applied to incremental source are :[Ljava.lang.String;@5ec488f6

I have also observed that incremental data is reflected in the underlying storage location: when I read that data using the PySpark console, I get the updated versions of the records.
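
For reference, I verify the data at the storage level roughly like this (the base path is a placeholder, not the real one):

    # Snapshot read of the Hudi table directly from its base path (placeholder path)
    df = spark.read.format("hudi").load("s3://my-bucket/warehouse/table1")
    df.filter("col1 = 'some_key'").show()  # 'some_key' is just an illustrative record key value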

A further observation: if I try the same operation again, it fails with a different error, shown below:

java.io.InvalidClassException: com.fasterxml.jackson.core.io.SerializedString; local class incompatible: stream classdesc serialVersionUID = 4312806453773505982, local class serialVersionUID = 1

The whole thing is very confusing for me. Any help would be appreciated. Thanks in advance.

Spark version: 2.4.8, Scala: 2.12. Hudi and other package versions: org.apache.hudi:hudi-spark-bundle_2.12:0.10.1, org.apache.spark:spark-avro_2.12:2.4.8, com.fasterxml.jackson.core:jackson-core:2.6.7, com.fasterxml.jackson.core:jackson-databind:2.6.7.3
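
For completeness, this is roughly how I build the Spark session (the app name is illustrative; the package coordinates are the ones listed above, and the Kryo serializer setting follows the Hudi docs):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hudi_incremental_write")  # illustrative app name
        .config(
            "spark.jars.packages",
            "org.apache.hudi:hudi-spark-bundle_2.12:0.10.1,"
            "org.apache.spark:spark-avro_2.12:2.4.8,"
            "com.fasterxml.jackson.core:jackson-core:2.6.7,"
            "com.fasterxml.jackson.core:jackson-databind:2.6.7.3",
        )
        # Kryo serialization, as recommended in the Hudi docs
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .enableHiveSupport()
        .getOrCreate()
    )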

Hudi Options used while writing:

"hoodie.table.name": "table1",
            "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
            "hoodie.datasource.write.recordkey.field": "col1,col2",
            "hoodie.datasource.write.precombine.field": "col3",
            "hoodie.datasource.write.partitionpath.field": "col4",
            "hoodie.datasource.write.hive_style_partitioning": "true",
            "hoodie.datasource.write.table.name": "table1",
            "hoodie.upsert.shuffle.parallelism": 1,
            "hoodie.insert.shuffle.parallelism": 1,
            "hoodie.consistency.check.enabled": True,
            "hoodie.index.type": "BLOOM",
            "hoodie.index.bloom.num_entries": 60000,
            "hoodie.index.bloom.fpp": 0.000000001,
            "hoodie.cleaner.commits.retained": 2,
            "hoodie.datasource.hive_sync.enable": "true",
            "hoodie.datasource.hive_sync.mode": "hms",
            "hoodie.datasource.hive_sync.metastore.uris": "metastore_uris",
            "hoodie.datasource.hive_sync.database": "db",
            "hoodie.datasource.hive_sync.table": "table1_cow",

I was facing the same java.io.InvalidClassException: com.fasterxml.jackson.core.io.SerializedString error when I first tried the incremental write. I resolved it by explicitly providing the Jackson libraries mentioned above.

I have also made sure that the column names and the order of columns are exactly the same for the Hive table and the incremental DataFrame. The only difference is the Hudi metadata columns (commit_time, commit_seqno, record_key, partition_path).

If I disable Hive sync, the incremental writes go through fine.

1 Answer

Answer from parisni:

partition table,need cascade

This usually happens with the Hive metastore when adding data with new columns to a partitioned table. In that case the new columns have to be added last. In contrast, the Glue catalog does not have this limitation.
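
For example, if the increment introduces a hypothetical new column col5, appending it with withColumn keeps it at the end of the schema, after the existing columns:

    from pyspark.sql import functions as F

    # withColumn appends the (hypothetical) new column at the end of the schema,
    # so the positions of the existing columns are left untouched
    incremental_df = incremental_df.withColumn("col5", F.lit(None).cast("string"))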

Did you verify that the partition fields also come after the other columns?

I cannot see hoodie.datasource.hive_sync.partition_fields in your config, but it is needed.
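
With col4 as the partition path in your write config, something along these lines should work (the extractor class shown is a commonly used one; pick whichever matches your partition layout):

    hudi_options.update({
        "hoodie.datasource.hive_sync.partition_fields": "col4",
        # a commonly paired extractor; adjust to match your partition layout
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    })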

As a last resort, you can drop the table in the metastore, and the next increment will recreate it from scratch.
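
For example, from Spark (database and table names taken from your hive_sync config; assuming the synced table is external, which is the usual hive sync behaviour, this removes only the metastore definition, not the data files):

    # Drop only the metastore entry; the Hudi files stay in place,
    # so the next incremental write with hive sync recreates the table
    spark.sql("DROP TABLE IF EXISTS db.table1_cow")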

I have also observed that incremental data is reflected in the underlying storage location

Depending on the Hudi version and configuration, hive sync might handle the failure silently, and the data would be committed in any case. The related config is hoodie.datasource.hive_sync.ignore_exceptions.
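
As a stop-gap (not a fix for the NoSuchMethodError itself), setting that flag lets the write succeed even when hive sync breaks:

    # When "true", hive sync exceptions are ignored instead of failing the write
    hudi_options["hoodie.datasource.hive_sync.ignore_exceptions"] = "true"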