I am using StormCrawler for live crawling. I insert domains into Elasticsearch and the crawler crawls them fine. I have defined a limit on the number of URLs crawled per domain (using Redis in SimpleFetcherBolt).
Scenario: when I insert a domain, StormCrawler starts crawling it. If I then add a new domain to ElasticSeeds, StormCrawler does not fetch it immediately.
It is busy fetching pages of the previous domain. If the limit is high (say 1000 URLs), it takes at least 20 minutes to start crawling the newly inserted domain.
I want results instantly. Is there a priority one can set on a new domain, or a way to make StormCrawler start crawling a new domain as soon as it is inserted? A different queue (DB) for each domain?
Any suggestions would be appreciated.
Could you please explain what you mean by that? You should not have to modify the fetcher bolt; that's what URL filters are for.
What type of spout are you using? AggregationSpouts? How many instances of SimpleFetcherBolt are you using?
SC should start crawling on a new domain pretty quickly. Please set the log level accordingly and check the logs to see whether the spouts have emitted tuples for the new domains and whether the URLs are blocked further down.
EDIT: either specify more than one instance of SimpleFetcherBolt or use FetcherBolt instead. With a single instance of SimpleFetcherBolt (SFB), the URLs of the new domain will be stuck in the queue behind those of the previous domain, whereas FetcherBolt will process them in parallel.
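To make the difference concrete, here is a toy model in plain Java. It uses no Storm or StormCrawler classes, and the class and method names (`FetchOrderSketch`, `fifoOrder`, `perHostOrder`) are made up for illustration. It compares a single consumer working through URLs in arrival order with one queue per host visited in turn, which is only a rough approximation of how a fetcher with internal per-host queues behaves.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: no Storm or StormCrawler classes are used here. */
final class FetchOrderSketch {

    /** Single sequential consumer: URLs are fetched strictly in arrival order. */
    static List<String> fifoOrder(List<String> arrivals) {
        return new ArrayList<>(arrivals);
    }

    /** One queue per host, visited in turn, so every host with pending URLs makes progress. */
    static List<String> perHostOrder(List<String> arrivals) {
        Map<String, Deque<String>> queues = new LinkedHashMap<>();
        for (String url : arrivals) {
            String host = URI.create(url).getHost();
            queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        }
        List<String> order = new ArrayList<>();
        while (order.size() < arrivals.size()) {
            // Take at most one URL from each host per round.
            for (Deque<String> queue : queues.values()) {
                String next = queue.poll();
                if (next != null) {
                    order.add(next);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        List<String> arrivals = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            arrivals.add("http://old.example/page" + i);
        }
        arrivals.add("http://new.example/"); // the newly inserted domain
        System.out.println("FIFO position: "
                + fifoOrder(arrivals).indexOf("http://new.example/"));
        System.out.println("Per-host position: "
                + perHostOrder(arrivals).indexOf("http://new.example/"));
    }
}
```

In the FIFO case the new domain has to wait for every URL queued ahead of it, and with politeness delays between requests to the same host that wait adds up quickly. With per-host queues it is reached in the first round.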
Maybe do that as a separate URL filter; this will be a lot cleaner than hacking the fetcher class, and it should also be more efficient.
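As a rough idea of what the limiting logic could look like once it is moved out of the fetcher, here is a stand-alone sketch in plain Java. It does not implement StormCrawler's actual URLFilter interface, the class name `DomainLimitFilterSketch` is hypothetical, and an in-memory map stands in for the Redis counters from the question. It also counts per host rather than per registered domain, which is a simplification. The only convention borrowed from URL filters is returning the URL to keep it and `null` to drop it.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-alone sketch; it does not implement StormCrawler's URLFilter interface. */
final class DomainLimitFilterSketch {

    private final int maxUrlsPerHost;

    // In-memory stand-in for the Redis counters mentioned in the question.
    private final Map<String, AtomicInteger> seenPerHost = new ConcurrentHashMap<>();

    DomainLimitFilterSketch(int maxUrlsPerHost) {
        this.maxUrlsPerHost = maxUrlsPerHost;
    }

    /**
     * Returns the URL if its host is still under the limit, or null to drop it.
     * URI.create throws IllegalArgumentException on malformed input; a real
     * filter would probably catch that and drop the URL instead.
     */
    String filter(String urlToFilter) {
        String host = URI.create(urlToFilter).getHost();
        if (host == null) {
            return null; // no host to count against
        }
        int count = seenPerHost
                .computeIfAbsent(host, h -> new AtomicInteger())
                .incrementAndGet();
        return count <= maxUrlsPerHost ? urlToFilter : null;
    }

    public static void main(String[] args) {
        DomainLimitFilterSketch filter = new DomainLimitFilterSketch(2);
        System.out.println(filter.filter("http://a.example/1"));
        System.out.println(filter.filter("http://a.example/2"));
        System.out.println(filter.filter("http://a.example/3"));
        System.out.println(filter.filter("http://b.example/1"));
    }
}
```

Because URLs over the limit are dropped before they are ever stored or queued, the fetcher never sees them, which is where the efficiency gain would come from.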
No, see ESCrawlTopology