I have a file containing URLs, and three filter files that also contain URLs; I want to check whether the filter URLs appear in the first file.
An example of the first file:
http://www.url1.com/xxxx/xxx/xxx/xxx/,
http://www.urln.com/zzz/zzz/zzz/zzzz/zzzz/zz
Examples of the filter files:
filter1
url1.com
url2.com
filter2
url5.com
url6.com/ddfd
urlx.org
What I want to do, in a single pass if possible, is check whether any URL from filter1, filter2, and so on appears in the first file, and if it does, write the match to a file named after that filter (the filter name itself is irrelevant). Importantly, each filter's
output will look something like this:
filter1.out
http://www.url1.com/xxx/sss http://www.url2.com/xxx/xxxx/xxxx
Assuming the filter files fit into memory on the compute nodes, use Perl or another favorite language for the matching and stream the data through that filter, e.g.:
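For illustration, here is a minimal Python sketch of such a streaming filter. The filter file names and the `<filter>.out` naming come from the question; the structure (one substring per line per filter file, matches appended to per-filter output files) is an assumption:

```python
#!/usr/bin/env python3
"""One-pass filter: read URLs from STDIN, write matches to per-filter files.

Sketch only. Assumes each filter file (paths passed as arguments)
holds one URL substring per line, e.g. "url1.com". A URL that
contains any substring from a filter is appended to "<filter>.out".
"""
import sys

def load_filters(paths):
    # Map each filter path to its substrings; filters fit in memory.
    filters = {}
    for path in paths:
        with open(path) as fh:
            filters[path] = [line.strip() for line in fh if line.strip()]
    return filters

def run(filter_paths, lines):
    filters = load_filters(filter_paths)
    # One output file per filter, kept open for the whole pass.
    outs = {p: open(p + ".out", "w") for p in filters}
    try:
        for url in lines:  # single pass over the input stream
            url = url.strip().rstrip(",")  # first file is comma-separated
            if not url:
                continue
            for path, substrings in filters.items():
                if any(s in url for s in substrings):
                    outs[path].write(url + "\n")
    finally:
        for fh in outs.values():
            fh.close()

if __name__ == "__main__":
    run(sys.argv[1:], sys.stdin)
```

All filters are checked against each URL as it streams by, so the input is read only once no matter how many filter files there are.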
This runs in one pass. Call the Pig script from a bash script that defines $filter and the other parameters. Implement the string matching and output in $script, which loads $filter1, $filter2, and $filter3, matches against STDIN, and produces output in the desired format.
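A hedged sketch of the Pig side, assuming a matching script named match_urls.py (a hypothetical name) and illustrative HDFS paths; under Pig streaming the script would emit matches to STDOUT, e.g. as "filtername\turl", rather than writing local files:

```pig
-- Sketch only: script name, filter names, and paths are illustrative.
-- SHIP copies the matching script and filter files to the compute nodes.
DEFINE match `match_urls.py filter1 filter2 filter3`
       SHIP('match_urls.py', 'filter1', 'filter2', 'filter3');
urls    = LOAD '/data/urls' AS (url:chararray);
matched = STREAM urls THROUGH match AS (filtername:chararray, url:chararray);
STORE matched INTO '/data/matched';
```

The bash wrapper can pass the filter names and paths in with `pig -param name=value`, then split the stored output by filtername into the per-filter .out files.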