We are using two Hive 3.1.x clusters on HDI 4.0: one runs LLAP and the other plain Hive.
We've created a managed table on both clusters, each with a row count of 272409.
Before merge on both clusters
+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date | col_count | col_distinct_count | min_lmd | max_lmd |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615 | 272409 | 272409 | 2020-06-15 00:00:12.0 | 2020-07-26 23:42:17.0 |
+---------------------+------------+---------------------+------------------------+------------------------+
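For readers unfamiliar with the columns, the summary above has the shape of a per-date aggregate. The sketch below uses SQLite from Python as a stand-in for Hive, and the names `orders`, `order_id`, and `last_modified` are hypothetical (the post does not give the real table or column names). It only illustrates how `col_count`, `col_distinct_count`, `min_lmd`, and `max_lmd` relate to each other, not the exact query that was run.

```python
import sqlite3


def summarize(conn):
    """Return (order_created_date, col_count, col_distinct_count, min_lmd, max_lmd) per date."""
    return conn.execute(
        """
        SELECT order_created_date,
               COUNT(order_id)          AS col_count,
               COUNT(DISTINCT order_id) AS col_distinct_count,
               MIN(last_modified)       AS min_lmd,
               MAX(last_modified)       AS max_lmd
        FROM orders
        GROUP BY order_created_date
        ORDER BY order_created_date
        """
    ).fetchall()


# Hypothetical toy data: three rows with unique keys in a single date.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_created_date TEXT, order_id INTEGER, last_modified TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("20200615", 1, "2020-06-15 00:00:12"),
        ("20200615", 2, "2020-07-26 23:42:17"),
        ("20200615", 3, "2020-06-20 10:00:00"),
    ],
)

summary = summarize(conn)
print(summary)
```

When every key is unique, `col_count` and `col_distinct_count` are equal, which is the situation in the table above.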
Based on the delta, we performed a merge operation (which updates 17 rows).
After merging on the hive-llap cluster (before compaction)
+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date | col_count | col_distinct_count | min_lmd | max_lmd |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615 | 272409 | 272392 | 2020-06-15 00:00:12.0 | 2020-07-27 22:52:34.0 |
+---------------------+------------+---------------------+------------------------+------------------------+
After merging on the hive-llap cluster (after compaction)
+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date | col_count | col_distinct_count | min_lmd | max_lmd |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615 | 272409 | 272409 | 2020-06-15 00:00:12.0 | 2020-07-27 22:52:34.0 |
+---------------------+------------+---------------------+------------------------+------------------------+
After merging on just hive cluster (without compacting deltas)
+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date | col_count | col_distinct_count | min_lmd | max_lmd |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615 | 272409 | 272409 | 2020-06-15 00:00:12.0 | 2020-07-27 22:52:34.0 |
+---------------------+------------+---------------------+------------------------+------------------------+
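This is the inconsistency we observed: before compaction, the hive-llap cluster reports a col_distinct_count of 272392, which is 17 fewer than the 272409 reported by the just-hive cluster and matches the number of rows the merge updated.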
However, after compacting the table on hive-llap, the inconsistency disappears and both clusters return the same result.
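One way to see which values account for a gap between `col_count` and `col_distinct_count` is to list the keys that appear more than once. The sketch below again uses SQLite from Python as a stand-in for Hive, with hypothetical names (`orders`, `order_id`, `last_modified`) and made-up rows. It shows only the shape of such a check and makes no claim about what causes the duplicates on LLAP.

```python
import sqlite3


def duplicated_keys(conn):
    """Return (order_id, copies) for every key that occurs more than once."""
    return conn.execute(
        """
        SELECT order_id, COUNT(*) AS copies
        FROM orders
        GROUP BY order_id
        HAVING COUNT(*) > 1
        ORDER BY order_id
        """
    ).fetchall()


# Hypothetical toy data: key 2 appears twice, so the distinct count is one short.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, last_modified TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2020-06-15 00:00:12"),
        (2, "2020-07-26 23:42:17"),
        (2, "2020-07-27 22:52:34"),
        (4, "2020-06-20 10:00:00"),
    ],
)

dups = duplicated_keys(conn)
total, distinct = conn.execute(
    "SELECT COUNT(order_id), COUNT(DISTINCT order_id) FROM orders"
).fetchone()

# The gap between the two counts equals the number of surplus copies.
gap = total - distinct
surplus = sum(copies - 1 for _, copies in dups)
print(dups, gap, surplus)
```

Running the equivalent query on both clusters before compaction would show whether the 17 missing distinct values correspond to the 17 rows touched by the merge.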
We thought it might be due to either caching or an LLAP issue, so we restarted the hive-server2 process, which we expected to clear the cache. The issue persists.
We also created a dummy table with the same schema on the just-hive cluster and pointed its location to that of the LLAP table; querying it produces the expected result.
We even queried the table from Spark using the **Qubole spark-acid reader** (a direct Hive managed table reader), which also produces the expected result.
This is very strange. Can someone help us out here?
Qubole does not support Hive LLAP yet. (However, we at Qubole are evaluating whether to support it in the future.)