Rationale behind partition-specific schema in Hive/Glue tables

I'm trying to understand the rationale behind the partition-specific schema that Hive and Glue maintain for tables. I couldn't find any documentation that discusses this directly, but during my search I found a couple of Hive JIRAs (attached in the references) that hint at its purpose. From what I gathered, a partition schema is a snapshot of the table schema taken when the partition is registered, and it lets Hive support schema evolution without invalidating existing partitions and their underlying data. It also lets a partition use a file format different from the table-level one, giving clients more flexibility.
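To make my current understanding concrete, here is a toy model in Python. It is only a sketch of the "snapshot at registration time" idea as I understand it, not Hive's or Glue's actual implementation; the `Table` class and its methods are hypothetical names I made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (column name, column type)


@dataclass
class Table:
    """Hypothetical stand-in for a metastore table entry."""
    columns: List[Column]
    # Maps a partition spec such as "dt=2020-01-01" to that partition's own columns.
    partitions: Dict[str, List[Column]] = field(default_factory=dict)

    def add_partition(self, spec: str) -> None:
        # Assumption: the partition stores a copy of the table schema
        # as it exists at the moment the partition is registered.
        self.partitions[spec] = list(self.columns)

    def add_column(self, name: str, type_: str) -> None:
        # Assumption: evolving the table schema does not touch
        # partitions that were registered earlier.
        self.columns.append((name, type_))


table = Table(columns=[("id", "bigint")])
table.add_partition("dt=2020-01-01")   # snapshot: [id]
table.add_column("name", "string")     # table schema evolves
table.add_partition("dt=2020-01-02")   # snapshot: [id, name]

for spec, cols in table.partitions.items():
    print(spec, cols)
```

In this model the older partition keeps describing its files with the schema they were written under, which is the behaviour I believe the JIRAs are hinting at.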

The exact purpose is still not clear to me, so I'd appreciate input from the experts on the following questions:

  • What is the rationale behind maintaining a partition-specific schema?
  • How do Hive and Glue behave when the partition schema and the table schema disagree? Do the resolution criteria depend on the underlying data file format?
  • What are the repercussions of not maintaining a partition-specific schema in the table metadata?

Experimentation and observations:

I ran an experiment in which I tested a few queries (count, count with partition filters, and schema description) against a Glue table whose partitions have no explicit schema definition in the partition properties (the underlying data files are written in Parquet), using spark-shell, the Hive CLI, and Athena. The results were consistent with those computed from the original table.
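For anyone who wants to inspect the same metadata, below is a minimal sketch of how the table-level and partition-level columns could be compared. It assumes the dictionaries have the shape returned by the Glue `GetTable` and `GetPartitions` APIs (`StorageDescriptor` -> `Columns` -> `Name`/`Type`). The helper names (`columns_of`, `diff_partition_schema`, `fetch_and_diff`) and the sample data are hypothetical, and `fetch_and_diff` is left uncalled here because it needs AWS credentials and a real database/table name.

```python
def columns_of(storage_descriptor):
    """Return {column name: type} from a Glue-style StorageDescriptor dict."""
    return {c["Name"]: c["Type"] for c in (storage_descriptor or {}).get("Columns", [])}


def diff_partition_schema(table, partition):
    """Summarise how a partition's columns differ from the table's columns."""
    t = columns_of(table.get("StorageDescriptor"))
    p = columns_of(partition.get("StorageDescriptor"))
    return {
        "has_explicit_schema": bool(p),
        "missing_in_partition": sorted(set(t) - set(p)),
        "extra_in_partition": sorted(set(p) - set(t)),
        "type_mismatches": {n: (t[n], p[n]) for n in sorted(t.keys() & p.keys()) if t[n] != p[n]},
    }


def fetch_and_diff(database, table_name):
    """Yield (partition values, diff) for every partition of a Glue table."""
    import boto3  # imported lazily so the helpers above work without AWS access

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    paginator = glue.get_paginator("get_partitions")
    for page in paginator.paginate(DatabaseName=database, TableName=table_name):
        for part in page["Partitions"]:
            yield part["Values"], diff_partition_schema(table, part)


# Hand-written sample metadata (not taken from my actual table).
sample_table = {"StorageDescriptor": {"Columns": [
    {"Name": "id", "Type": "bigint"},
    {"Name": "name", "Type": "string"},
]}}
no_schema_partition = {"Values": ["2020-01-01"], "StorageDescriptor": {"Columns": []}}
stale_partition = {"Values": ["2020-01-02"], "StorageDescriptor": {"Columns": [
    {"Name": "id", "Type": "int"},
]}}

no_schema_diff = diff_partition_schema(sample_table, no_schema_partition)
stale_diff = diff_partition_schema(sample_table, stale_partition)
print(no_schema_diff)
print(stale_diff)
```

A partition with `has_explicit_schema` set to false corresponds to the case I tested; the other fields would flag partitions whose schema has drifted from the table's, which is the case my second question is about.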

References:

Thanks!
