I am trying to get the result in form of a contingency table over a couple of categorical variables in Hive.
Var1 = {val_11, val_12, val_13}
Var2 = {val_21, val_22, val_23}
The result should look like this
val11 val12 val13
------------------------------
val21 x x x
val22 x x x
val23 x x x
where x - count of respective values combinations. My data have much more then 3 categorical values for each variable, about 100.
So, I tried several steps:
CREATE TABLE tmp_1 as SELECT Var1, COUNT (DISTINCT Var2) AS Var2_count FROM my_table GROUP BY Var1;
CREATE TABLE tmp_2 as SELECT Var2, COUNT (DISTINCT Var2) AS Var1_count FROM my_table GROUP BY Var2;
That will give me for tmp_1
val11 num_val21
val11 num_val22
val11 num_val23
val12 num_val21
val12 num_val22
val12 num_val23
val13 num_val21
val13 num_val22
val13 num_val23
and similar for tmp_2.
Now,
SELECT Var1, collect_set(Var2_count) FROM tmp_1;
SELECT Var2, collect_set(Var1_count) FROM tmp_2;
Is there better way, correct way of doing it for many categorical values and more then 2 variables?