Result-set inconsistency between hive and hive-llap


We are running two Hive 3.1.x clusters on HDI 4.0: one with LLAP (hive-llap) and one with plain Hive.

We created a managed table on both clusters; each copy holds 272409 rows.

Before merge on both clusters

+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date  | col_count  | col_distinct_count  |        min_lmd         |        max_lmd         |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615            | 272409     | 272409              | 2020-06-15 00:00:12.0  | 2020-07-26 23:42:17.0  |
+---------------------+------------+---------------------+------------------------+------------------------+
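For readers who want to reproduce the comparison, the summary rows in this post look like the output of an aggregate query along the following lines. This is only a sketch: the table name `orders` and the columns `order_id` and `lmd` are hypothetical stand-ins, since the post shows the output aliases but not the actual query or schema.

```sql
-- Hypothetical table and column names; only the output aliases come from the post.
SELECT order_created_date,
       COUNT(order_id)          AS col_count,
       COUNT(DISTINCT order_id) AS col_distinct_count,
       MIN(lmd)                 AS min_lmd,
       MAX(lmd)                 AS max_lmd
FROM orders
GROUP BY order_created_date;
```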

Based on the delta, we then run a merge operation, which updates 17 rows.
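The merge statement itself is not shown in the post. A minimal sketch of what such a Hive ACID merge could look like, assuming a hypothetical target table `orders`, a hypothetical staging table `orders_delta`, and hypothetical columns `order_id`, `status`, and `lmd`:

```sql
-- All table and column names here are assumptions for illustration.
MERGE INTO orders AS t
USING orders_delta AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET status = s.status, lmd = s.lmd
WHEN NOT MATCHED THEN
  INSERT VALUES (s.order_id, s.status, s.lmd, s.order_created_date);
```

On an ACID table, the matched branch writes delete and insert deltas rather than rewriting the base files, which is why the table state before and after compaction is relevant here.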

After merging on the hive-llap cluster (before compaction)

+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date  | col_count  | col_distinct_count  |        min_lmd         |        max_lmd         |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615            | 272409     | 272392              | 2020-06-15 00:00:12.0  | 2020-07-27 22:52:34.0  |
+---------------------+------------+---------------------+------------------------+------------------------+

After merging on the hive-llap cluster (after compaction)

+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date  | col_count  | col_distinct_count  |        min_lmd         |        max_lmd         |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615            | 272409     | 272409              | 2020-06-15 00:00:12.0  | 2020-07-27 22:52:34.0  |
+---------------------+------------+---------------------+------------------------+------------------------+

After merging on just hive cluster (without compacting deltas)

+---------------------+------------+---------------------+------------------------+------------------------+
| order_created_date  | col_count  | col_distinct_count  |        min_lmd         |        max_lmd         |
+---------------------+------------+---------------------+------------------------+------------------------+
| 20200615            | 272409     | 272409              | 2020-06-15 00:00:12.0  | 2020-07-27 22:52:34.0  |
+---------------------+------------+---------------------+------------------------+------------------------+

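This is the inconsistency: before compaction, hive-llap reports a col_distinct_count of 272392 (17 fewer than the 272409 rows, matching the 17 updated rows), while the plain Hive cluster reports 272409.

After compacting the table on hive-llap, the inconsistency disappears and both clusters return the same result.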
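For reference, a compaction like the one described can be requested manually. A minimal sketch, assuming a hypothetical table name `orders`; the post does not say whether a major or minor compaction was run, or whether the table is partitioned:

```sql
-- Hypothetical table name. For a partitioned table, add a PARTITION (...) clause.
ALTER TABLE orders COMPACT 'major';

-- Compaction is queued and runs asynchronously; check its state before re-running the query.
SHOW COMPACTIONS;
```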

We suspected either caching or an LLAP issue, so we restarted the HiveServer2 process, which we expected to clear the cache. The issue persists.

We also created a dummy table with the same schema on the plain Hive cluster and pointed its location at the LLAP table's location. Queries against that table return the expected result.

We even queried the table from Spark using the **Qubole spark-acid reader** (a direct reader for Hive managed tables), which also returns the expected result.

This is very strange. Can someone help?


There are 2 answers

Answered by Anushan

Qubole does not support Hive LLAP yet. However, we at Qubole are evaluating whether to support it in the future.

Answered by Durga

We faced a similar issue on an HDInsight Hive LLAP cluster. Setting hive.llap.io.enabled to false resolved it.
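One way to try this without changing the cluster configuration is a session-level override, then re-running the check in the same session. This is a sketch only: the table name `orders` and column `order_id` are hypothetical, and whether the property may be overridden per session depends on how the cluster is configured.

```sql
-- Turn off the LLAP IO layer for this session only.
SET hive.llap.io.enabled=false;

-- Hypothetical table and column names; re-run the distinct count that looked wrong.
SELECT COUNT(DISTINCT order_id)
FROM orders
WHERE order_created_date = 20200615;
```

If the distinct count matches the row count with the setting off, that points at the LLAP IO layer rather than at the data written by the merge.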