I have written an udf (extends EvalFunc<Tuple>
) which has as output tuples with inner tuples (nested).
For example the dump looks like:
(((photo,photos,photo)))
(((wedg,wedge),(audusd,audusd)))
(((quantum,quantum),(mind,mind)))
(((cassi,cassie),(cancion,canciones)))
(((calda,caldas),(nova,novas),(rodada,rodada)))
(((fingerprint,fingerprint),(craft,craft),(easter,easter)))
Now I want to process each of this terms, distinct it and give it an id (RANK
). To do this, i need to get rid of the brackets. A simple FLATTEN
does not help in this case.
The final output should be like:
1 photo
2 photos
3 wedg
4 wedge
5 audusd
6 quantum
7 mind
....
My code (not the udf part and not the raw parsing):
tags = FOREACH raw GENERATE FLATTEN(tags) AS tag;
tags_distinct = DISTINCT tags;
tags_sorted = RANK tags_distinct BY tag;
DUMP tags_sorted;
I think your UDF is return is not optimal for your workflow. Instead of returning a tuple with variable number of fields (which are tuples), it would be a lot more convenient to return a bag of tuples.
Instead of
you will have
and you will be able to FLATTEN that bag to: 1. make the DISTINCT 2. RANK the tags
To do so, update your UDF like this :