I am very new to Qubole.We recently migrated Oracle ebiz data to Saleforce.We have both Ebiz and Salesforce data in the Qubole Data Lake.There are some discrepancies between Ebiz and Salesforce.What is the technology I can use on Qubole to find these discrepancies?
This is approach I am using to compare two tables. Aggregate all metrics in two tables group by all dimensions, then compare using FULL JOIN, it will return all joined and not joined records from both tables. In such way you can found absent data in both tables and differences in metrics.
For example like this, using Hive:
Also you can easily compare in Excel instead of filtering in the WHERE.
Metrics are everything that can be aggregated. You can use some dimensions as metrics also like this
count(distinct user) as user_cntandgroup by date, site_namefor example. Query with full join will show differences. If some dimensions used in join condition can be null, use nvl() to match such rows like in my example. Of course do not use too many dimensions in the groupby, you can skip some of them and drill down only after finding discrepancies on aggregated level.After you got discrepancy in aggregations, you can drill down and compare rows not aggregated, filtered by some metrics.
See also: https://stackoverflow.com/a/67382947/2700344