How to find the number of unique connections using Hive/Pig


I have a sample table like below:

caller   receiver 
100         200
100         300
400         100
100         200

I need to find the number of unique connections for each number, counting a connection regardless of who called whom. For example, 100 is connected to 200, 300 and 400.

My output should be like:

100      3  
200      1  
300      1  
400      1

I am trying to do this in Hive. If it cannot be done in Hive, a Pig solution is fine too.


There are 2 answers

Aman (best answer)

This will solve your problem.

select q1.caller, count(distinct q1.receiver)
from (
  select caller, receiver from test_1 group by caller, receiver
  union all
  select receiver as caller, caller as receiver from test_1 group by receiver, caller
) q1
group by q1.caller;
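The query works by listing every row in both directions (caller to receiver and receiver to caller) and then counting distinct partners per number. As a rough check of that logic on the sample data from the question, here is a minimal sketch that runs an equivalent query in SQLite through Python's standard library. SQLite is not Hive, so this only confirms the logic, not Hive-specific behavior; the inner `group by` clauses are left out because `count(distinct ...)` already removes duplicates, and the `connections` alias, the `order by`, and the helper name `unique_connections` are additions for illustration.

```python
import sqlite3

# Sample rows from the question; the table name test_1 follows the answer.
ROWS = [(100, 200), (100, 300), (400, 100), (100, 200)]

# Same shape as the Hive query: emit each row in both directions,
# then count distinct partners per number.
QUERY = """
SELECT q1.caller, COUNT(DISTINCT q1.receiver) AS connections
FROM (
    SELECT caller, receiver FROM test_1
    UNION ALL
    SELECT receiver AS caller, caller AS receiver FROM test_1
) q1
GROUP BY q1.caller
ORDER BY q1.caller
"""


def unique_connections(rows):
    """Load rows into an in-memory SQLite table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE test_1 (caller INTEGER, receiver INTEGER)")
        conn.executemany("INSERT INTO test_1 VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for number, count in unique_connections(ROWS):
        print(number, count)
```

On the sample rows this returns 3 for 100 and 1 for each of 200, 300 and 400, matching the output the question asks for.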
o-90

Here is one way to do what you need, though I'm not convinced it is optimal; I'll leave the optimization to you. You'll need the Brickhouse jar, which is straightforward to build.

Query:

add jar ./brickhouse-0.7.1.jar; -- your jar name and path will differ
create temporary function combine_unique as 'brickhouse.udf.collect.CombineUniqueUDAF';

select connection
  , size(combine_unique(arr)) c
from (
  select connection, arr
  from (
    select caller as connection
      , collect_set(receiver) arr
    from some_table
    group by caller ) x
  union all
  select connection, arr
  from (
    select receiver as connection
      , collect_set(caller) arr
    from some_table
    group by receiver ) y ) f
group by connection;

Output:

connection    c
100           3
200           1
300           1
400           1
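The set logic behind this query can be sketched outside Hive: collect each number's partners from both columns into one set and take its size. The Python below is only an illustration of that idea, assuming `combine_unique` merges the collected arrays into a single de-duplicated set; the function name `count_connections` is hypothetical.

```python
from collections import defaultdict


def count_connections(rows):
    """Count distinct partners per number, ignoring call direction."""
    neighbours = defaultdict(set)
    for caller, receiver in rows:
        # Record the link from both sides, like the two collect_set branches.
        neighbours[caller].add(receiver)
        neighbours[receiver].add(caller)
    # The size of each merged set is the number of unique connections.
    return {number: len(peers) for number, peers in sorted(neighbours.items())}


if __name__ == "__main__":
    sample = [(100, 200), (100, 300), (400, 100), (100, 200)]
    for number, count in count_connections(sample).items():
        print(number, count)
```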