I have a scenario where I have two Hive tables, and the second one is essentially an evolved schema of the first (it has 1 more column in this example).
Table_A
{
business_date String
Name String
Age Number
} partitioned by business_date
Table_B {
business_date String
Name String
Age Number
Address String
} partitioned by business_date
In order to obfuscate downstream users from schema changes, I am creating a Hive view with the following syntax:
Create VIEW customer_info AS
select * from Table_B
UNION
select business_date, name, age, null as address from Table_A
I know the above returns all the data, but from a performance standpoint, if a query run against the view with a valid business_date value, does it take the partition key into account? Or do I lose this benefit when working with views?
Edit: I should mention that business_date is actually a unique value across all partitions. This means, that data provided in Table_A, should not be provided in Table_B. Think of Table_A as being an "older version" of data. Given this, is this the best approach of serving the data if the goal is to abstract schema changes away from the end consumers?
Edit#2: Storing this data in one table is not possible due to tons of other problems.
You are not using any partition predicates in your query, that is why it will be no partition pruning. Use explain command to check this, it will show partition predicates applied. Partition pruning should work fine with a view.
UNION
is the same asUNION ALL
+DISTINCT
. UseUNION ALL
instead if applicable, it will perform much better.On the other hand, partitioning by something unique will create partitions with single row, this will kill your hive metastore probably. Hope you mean something else saying that
Remove partitioning in this case and the performance will be significantly better.