Search for a word in all Common Crawl WARC files


I want to search for a word (for example, a company name) in all the WARC files (nearly 36K of them) from Common Crawl and get the URLs of every page whose HTML source contains that word.

I want to keep the WARC files in S3 itself; I only need the matching URLs as the result.
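To make the goal concrete, here is a rough sketch of the per-file scan I have in mind, using only the Python standard library. It assumes each file is a gzipped WARC in which every record carries `WARC-Type`, `WARC-Target-URI` and `Content-Length` headers; the function names (`iter_warc_records`, `urls_containing`, `scan_gzipped_bytes`) are my own and not from any package. It does a plain case-insensitive byte match over the whole record (HTTP headers included), so it would miss non-ASCII terms on pages that are not UTF-8 encoded.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Content-Length gives the exact size of the record body in bytes.
        body = stream.read(int(headers.get("content-length", 0)))
        yield headers, body


def urls_containing(stream, term):
    """Yield the target URI of every 'response' record whose bytes contain term."""
    needle = term.lower().encode("utf-8")
    for headers, body in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue  # ignore request, metadata and warcinfo records
        if needle in body.lower():  # bytes.lower() only folds ASCII letters
            yield headers.get("warc-target-uri")


def scan_gzipped_bytes(data, term):
    """Scan one gzipped WARC held in memory (for example, bytes read from S3)."""
    with gzip.open(io.BytesIO(data), "rb") as stream:
        return list(urls_containing(stream, term))
```

Running this over every file would still mean reading all ~36K WARC files once per search term, which is why I am asking whether an existing package or an index would be a better fit.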

Are there any modules or pre-built packages available for this?

Could I use Solr indexing for this? (It may need a lot of memory, though.)

Thanks in advance.


There are 0 answers