Hive View Query Performance: Union tables with different schemas

1.9k views Asked by At

I have a scenario where I have two Hive tables, and the second one is essentially an evolved schema of the first (it has 1 more column in this example).

Table_A
{
business_date String
Name String
Age Number
} partitioned by business_date

Table_B {
business_date String
Name String
Age Number
Address String
} partitioned by business_date

In order to obfuscate downstream users from schema changes, I am creating a Hive view with the following syntax:

Create VIEW customer_info AS 
select * from Table_B 
UNION 
select business_date, name, age, null as address from Table_A

I know the above returns all the data, but from a performance standpoint, if a query run against the view with a valid business_date value, does it take the partition key into account? Or do I lose this benefit when working with views?

Edit: I should mention that business_date is actually a unique value across all partitions. This means, that data provided in Table_A, should not be provided in Table_B. Think of Table_A as being an "older version" of data. Given this, is this the best approach of serving the data if the goal is to abstract schema changes away from the end consumers?

Edit#2: Storing this data in one table is not possible due to tons of other problems.

1

There are 1 answers

2
leftjoin On

You are not using any partition predicates in your query, that is why it will be no partition pruning. Use explain command to check this, it will show partition predicates applied. Partition pruning should work fine with a view.

UNION is the same as UNION ALL+DISTINCT. Use UNION ALL instead if applicable, it will perform much better.

On the other hand, partitioning by something unique will create partitions with single row, this will kill your hive metastore probably. Hope you mean something else saying that

business_date is actually a unique value across all partitions

Remove partitioning in this case and the performance will be significantly better.