I want to search for a word (for example, a company name) in all the WARC files (nearly 36K of them) from Common Crawl and get every URL whose HTML source contains that word.
I want to keep the WARC files in S3 itself; I only need the matching URLs from those files as the result.
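To make the goal concrete, here is a rough sketch of the per-file scan I have in mind, using only the Python standard library. The function names (`iter_warc_records`, `urls_containing`, `search_warc_file`) are my own placeholders, not an existing package, and the parser assumes well-formed, gzip-compressed WARC records that each carry a `Content-Length` header. It reads from a local path; streaming the file straight from S3 is not shown here.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream.

    Assumption: records are well formed (a "WARC/x.y" version line, header
    lines, a blank line, then exactly Content-Length bytes of body).
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        body = stream.read(int(headers.get("content-length", 0)))
        yield headers, body


def urls_containing(stream, word):
    """Yield the target URI of every response record whose body contains word.

    The match is a plain, ASCII case-insensitive substring test over the whole
    record body, which includes the HTTP headers as well as the HTML.
    """
    needle = word.lower().encode("utf-8")
    for headers, body in iter_warc_records(stream):
        if headers.get("warc-type") == "response" and needle in body.lower():
            yield headers.get("warc-target-uri")


def search_warc_file(path, word):
    """Return matching URLs from one local .warc.gz file (hypothetical helper)."""
    # gzip.open reads through concatenated gzip members, which is how
    # .warc.gz files are normally written (one member per record).
    with gzip.open(path, "rb") as stream:
        return list(urls_containing(stream, word))
```

For example, `search_warc_file("example.warc.gz", "acme")` (with a placeholder file name) would return the `WARC-Target-URI` of each response record containing "acme". Repeating this over roughly 36K files is presumably slow, which is why I am asking about existing tools.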
Are there any modules or pre-built packages available for this?
Could I use Solr indexing? (It may need more memory, though.)
Thanks in advance.