Apache Pig: Combine multiple records within a bag

1.2k views Asked by At

Any help in this would be greatly appreciated! Best way is with an example:

Input:

Schema:

Name|phone_type|phone_num

Example data:

Kyle|Cell|555-222-3333

Kyle|Home|453-444-5555

Tom|Home|555-555-5555

Tom|Pager|555-555-4344

Desired output:

Schema:

Name|Home_num|Cell_num|Pager_num

Example:

Kyle|453-444-5555|555-222-3333|null

Tom|555-555-5555|null|555-555-4344

Code:

data=Load 'test.txt' using PigStorage('|');

grpd= Group data by $0;

Foreach grpd{
    ???
}
1

There are 1 answers

4
glefait On

After the comment of @Murali lao, I rewrite the solution.

I now use FILTER, and then the trick to not filter empty bag with FLATTEN is to add an empty string when the bag is empty.

Here are my test data:

tom,home,555
tom,pager,666
tom,cell,777
bob,home,111
bob,cell,222

Here is my solution:

data = LOAD 'phone'  USING PigStorage(',') AS (name:chararray, phone_type: chararray, phone_num: chararray);

user = FOREACH (GROUP data BY name) {
    home = FILTER $1 BY phone_type == 'home';
    -- you add an empty string if the the bag is empty
    homenum = (IsEmpty(home) ? {('')} : home.phone_num);

    pager = FILTER $1 BY phone_type == 'pager';
    pagernum = (IsEmpty(pager) ? {('')} : pager.phone_num);

    cell = FILTER $1 BY phone_type == 'cell';
    cellnum = (IsEmpty(cell) ? {('')} : cell.phone_num);

    GENERATE group as name, FLATTEN(homenum) as home, FLATTEN(pagernum) as pager, FLATTEN(cellnum) as cell;
};

After a dump, I obtain the following result :

(bob,111,,222)
(tom,555,666,777)