I am having some bizarre issues with two fairly large tables, where observations are lost as part of creating one table from the other, but the number of observations that differ between the tables does not match the number of observations that are lost.
Table A: count(customer||date) yeilds 1004 Table B (which is created inner joining Table A with two other tables): count(customer||date) yields 1002.
However:
with T1 as (
select customer||date as customer_date_a from table A),
T2 as (
select customer||date as customer_date_b from table b)
select * from T1 full outer join T2
on t1.customer_date_a = t2.customer_date_b
where t1.customer_date_a is null or t2.customer_date_b is null
i.e selecting the observations in A not in B, and in B not in A, (which should give me the two lost observations) returns four observations.
What could be the cause for this behavior? What potential quirks could cause this behavior? This is found on Cloudera Hue.