Pig Latin Filter by list of strings

I have a file containing URLs, and three filter files containing URLs that I want to look for in the first file.

An example of the first file:

http://www.url1.com/xxxx/xxx/xxx/xxx/,

http://www/urln.com/zzz/zzz/zzz/zzzz/zzzz/zz

Examples of the filter files:

filter1

url1.com

url2.com

filter2

url5.com

url6.com/ddfd

urlx.org

What I want is to check, in a single pass if possible, whether any URL from filter1, filter2, and so on appears in the first file and, if it does, write the match to a file named after that filter (the filter name itself is irrelevant). Important each filter

The output would be something like this:

filter1.out

http:// www.url1.com/xxx/sss

http:// www.url2.com/xxx/xxxx/xxxx

There are 2 answers

ksh On

Assuming the filter files fit into memory on the compute nodes, use Perl or another language of your choice for the matching and stream the data through that filter, e.g.:

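-- Ship the script and the filter files to the compute nodes and define the streaming command.
DEFINE MY_FILTER `perl $script $filter1 $filter2 $filter3` SHIP('$script', '$filter1', '$filter2', '$filter3');
A = load '$input';
-- Pipe every input record through the external matcher.
B = stream A through MY_FILTER;
store B into '$output';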

This runs in one pass. Call the Pig script from a bash script that defines $script, $filter1, $filter2, $filter3, and the other parameters. Implement the string matching and output in $script: it loads $filter1, $filter2, and $filter3, matches each line read from STDIN against them, and writes the output in the desired format.
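As a rough illustration of what such a $script might look like, here is a minimal sketch in Python rather than Perl. Everything in it is an assumption, not part of the answer above: the function names `load_filters` and `match_line`, plain substring matching (which would also match `myurl1.com` against `url1.com`), the tab-separated `filter<TAB>url` output, and filter paths that are readable on the node where the script runs.

```python
#!/usr/bin/env python3
"""Hypothetical stand-in for $script: tag each URL from STDIN with the filters it matches."""
import os
import sys


def load_filters(paths):
    """Map each filter name (the file's basename) to its list of non-empty patterns."""
    filters = {}
    for path in paths:
        with open(path) as handle:
            patterns = [line.strip() for line in handle if line.strip()]
        filters[os.path.basename(path)] = patterns
    return filters


def match_line(url, filters):
    """Return the names of all filters that have at least one pattern contained in url."""
    return [name for name, patterns in filters.items()
            if any(pattern in url for pattern in patterns)]


def main(argv):
    filters = load_filters(argv[1:])
    for line in sys.stdin:
        url = line.strip()
        for name in match_line(url, filters):
            # Assumed output format: one "filter<TAB>url" record per match.
            print(f"{name}\t{url}")


# Only start streaming when at least one filter file was passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

With this output shape, B would carry the filter name alongside each matching URL, so writing one output per filter could be done on the Pig side after the stream step.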

Eli On

I'll give a high level description of what I'd do in your shoes:

  1. Load all the files as data sets. We can call them urls, filter1, filter2, and filter3.
  2. If I understand correctly, there's no difference between the three filters, so just UNION them together as a new data set we'll call big_filter.
  3. JOIN urls with big_filter, using a regex to extract the base URL from each entry in urls as the join key. REGEX_EXTRACT is a built-in Pig function. The inner join drops every item in urls that is not in a filter.
  4. GENERATE just the url column from the resulting data set.
  5. Run a DISTINCT on the data set that was generated in step 4.
  6. Store the data set generated in step 5 using one of the various pig STORE functions in whatever form you like best.
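
To make the data flow of these steps concrete, here is a small sketch of the same logic in plain Python (not Pig). The regex, the helper names `base_url` and `filter_urls`, and the choice to strip a leading `www.` are all assumptions made for illustration. Matching is by exact equality on the host, so a filter entry that includes a path, such as url6.com/ddfd, would not match under this scheme, and merging the filters means the result no longer records which filter matched.

```python
import re

# Hypothetical pattern: capture the host, without a leading "www.",
# assuming every URL starts with http:// or https://.
HOST_RE = re.compile(r"^https?://(?:www\.)?([^/]+)")


def base_url(url):
    """Return the host part of url, or None if it does not look like an http(s) URL."""
    match = HOST_RE.match(url)
    return match.group(1) if match else None


def filter_urls(urls, *filters):
    """Union the filters, keep URLs whose base URL is in the union, and de-duplicate."""
    big_filter = set().union(*filters)                      # step 2: UNION
    kept = [u for u in urls if base_url(u) in big_filter]   # steps 3-4: join on base URL, keep the url column
    return sorted(set(kept))                                # step 5: DISTINCT
```

Calling `filter_urls(urls, filter1, filter2, filter3)` with lists of strings returns the sorted, de-duplicated URLs whose host appears in any filter.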