Does anyone have an idea of how to do stratified sampling in Pig? (wikipedia)
For the moment, I do something like:
relation2 = SAMPLE relation1 0.05;
but my dataset contains a label column with only a few distinct values, some of which are rare (0.5%, for example), and I would like my random downsampling not to drop all of them.
Thanks a lot.
You could implement your own method of sampling by using
RANDOM()
and then filtering out rows with values below, say, 0.95, which keeps roughly 5% of the rows. To stratify this sampling, you could compute what fraction of your rows carries each label value, and then scale the random value (or, equivalently, the threshold) per label so that different values get sampled at different rates, with rare labels kept at a higher rate than common ones.
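Something along these lines might work as a starting point. It is an untested sketch, not a definitive recipe: the column name `label`, the 1% cutoff for what counts as "rare", and the choice to keep every rare row are all assumptions made up for illustration, so adjust them to your data.

```pig
-- Assumes relation1 has a column called `label` (hypothetical name).

-- 1. Count rows per label, and rows overall.
by_label     = GROUP relation1 BY label;
label_counts = FOREACH by_label GENERATE group AS label, COUNT(relation1) AS n;
all_rows     = GROUP relation1 ALL;
total        = FOREACH all_rows GENERATE COUNT(relation1) AS n;

-- 2. Turn each count into a fraction of the dataset, then pick a rate:
--    labels under 1% of the rows are kept entirely (assumed policy),
--    everything else is sampled at 5% like the original SAMPLE call.
label_fracs = FOREACH label_counts GENERATE label,
                  (double)n / (double)total.n AS frac;
rates       = FOREACH label_fracs GENERATE label,
                  (frac < 0.01 ? 1.0 : 0.05) AS rate;

-- 3. Attach the rate to every row and keep a row when its random draw
--    falls below the rate for its label.
joined    = JOIN relation1 BY label, rates BY label USING 'replicated';
relation2 = FILTER joined BY RANDOM() < rates::rate;
```

The replicated join is only reasonable because the table of rates has one row per label and so stays tiny. After the join the original columns are prefixed (`relation1::label` and so on), so a final FOREACH may be needed to project them back to their original names.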