Un-nesting nested tuples to single terms


I have written a UDF (extends EvalFunc<Tuple>) whose output is a tuple containing nested inner tuples.

For example, the dump looks like:

(((photo,photos,photo)))
(((wedg,wedge),(audusd,audusd)))
(((quantum,quantum),(mind,mind)))
(((cassi,cassie),(cancion,canciones)))
(((calda,caldas),(nova,novas),(rodada,rodada)))
(((fingerprint,fingerprint),(craft,craft),(easter,easter)))

Now I want to process each of these terms, make them distinct, and give each one an id (RANK). To do this, I need to get rid of the brackets. A simple FLATTEN does not help in this case.

The final output should be like:

1 photo
2 photos
3 wedg
4 wedge
5 audusd
6 quantum
7 mind
....

My code (without the UDF part and the raw parsing):

tags = FOREACH raw GENERATE FLATTEN(tags) AS tag;
tags_distinct = DISTINCT tags;
tags_sorted = RANK tags_distinct BY tag;
DUMP tags_sorted;

1 answer

glefait On

I think your UDF's return type is not optimal for your workflow. Instead of returning a tuple with a variable number of fields (each of which is a tuple), it would be a lot more convenient to return a bag of tuples.

Instead of

(((wedg,wedge),(audusd,audusd)))

you will have

({(wedg,wedge),(audusd,audusd)})

and you will be able to FLATTEN that bag, which produces one row per tuple in the bag. After that you can: 1. apply DISTINCT 2. RANK the tags

To do so, change your UDF to return a DataBag, like this:

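import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class MyUDF extends EvalFunc<DataBag> {

    @Override
    public DataBag exec(Tuple input) throws IOException {
        // Create the bag, add one tuple per result to it, then return it
        DataBag bag = BagFactory.getInstance().newDefaultBag();
        return bag;
    }
}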
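
If the goal is one term per row, as in the expected output of the question, the body of exec could walk the nested structure and collect every leaf term. The following is only a minimal sketch of that un-nesting step, not the asker's actual UDF: it uses plain java.util.List as a stand-in for Pig's Tuple so that it runs without the Pig jar, and the names TermFlattener and flattenTerms are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class TermFlattener {

    // Collects every leaf term from an arbitrarily nested structure, in order.
    // Assumption: java.util.List stands in for Pig's Tuple for illustration.
    public static List<String> flattenTerms(Object node) {
        List<String> out = new ArrayList<>();
        collect(node, out);
        return out;
    }

    private static void collect(Object node, List<String> out) {
        if (node == null) {
            return; // skip null fields
        }
        if (node instanceof List<?>) {
            // A nested level: recurse into each field
            for (Object child : (List<?>) node) {
                collect(child, out);
            }
        } else {
            // A leaf: keep it as a single term
            out.add(node.toString());
        }
    }
}
```

In a real UDF, each collected term would presumably be wrapped in a single-field tuple (for example with TupleFactory.getInstance().newTuple(term)) and added to the DataBag, so that FLATTEN yields one term per row. Duplicates such as audusd,audusd are kept here on purpose, since the DISTINCT step in the Pig script removes them later.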