Search a word in all Common Crawl WARC files

1.1k views Asked by Vanaja Jayaraman At 23 June 2015 at 11:45

I want to search for a word (for example, a company name) in all the WARC files from Common Crawl (nearly 36K WARC files) and get all the URLs whose HTML source contains that company name.

I want to keep the WARC files in S3 itself; I only need the matching URLs from those WARC files as the result.

Are there any modules or pre-built packages available for this?

Could I use Solr indexing? (It may need more memory, though.)

Thanks in advance.

There are 0 answers