Contingency table / crosstabs in Hive

610 views Asked by At

I am trying to get the result in form of a contingency table over a couple of categorical variables in Hive.

Var1 = {val_11, val_12, val_13}  
Var2 = {val_21, val_22, val_23}

The result should look like this

       val11    val12   val13      
------------------------------  
val21    x        x       x   
val22    x        x       x   
val23    x        x       x

where x - count of respective values combinations. My data have much more then 3 categorical values for each variable, about 100.

So, I tried several steps:

CREATE TABLE tmp_1 as SELECT Var1, COUNT (DISTINCT Var2) AS Var2_count FROM my_table GROUP BY Var1;  
CREATE TABLE tmp_2 as SELECT Var2, COUNT (DISTINCT Var2) AS Var1_count FROM my_table GROUP BY Var2;  

That will give me for tmp_1

val11 num_val21  
val11 num_val22  
val11 num_val23  
val12 num_val21  
val12 num_val22  
val12 num_val23  
val13 num_val21  
val13 num_val22  
val13 num_val23  

and similar for tmp_2.

Now,
SELECT Var1, collect_set(Var2_count) FROM tmp_1;
SELECT Var2, collect_set(Var1_count) FROM tmp_2;

Is there better way, correct way of doing it for many categorical values and more then 2 variables?

0

There are 0 answers