Nutch: How to re-try transient errors (and none of the other URLs)?


Nutch sometimes gets a SocketTimeout or ConnectionRefused exception for some URLs. How do I ask Nutch to retry only these URLs? If I re-run the "crawl" command, it tells me that there is nothing to re-run. This is understandable, since "db.fetch.interval.default" is set to 30 days. I do not want to change that setting, because it also affects pages that were fetched successfully. What I need is a way to re-fetch only the URLs that failed.

Is there a way to do this?

Added later: I am using Nutch 1.10

There is 1 answer

jgloves (best answer)

If there is a temporary problem fetching a page, Nutch should retry the fetch for you, up to three times by default (db.fetch.retry.max). After that, the page is marked as "gone" (db_gone), and Nutch will not try to fetch it again until the maximum fetch interval (db.fetch.interval.max) has passed. See http://wiki.apache.org/nutch/CrawlDatumStates

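You can increase the number of retries by changing the db.fetch.retry.max property. It is defined in nutch-default.xml, but the usual practice is to override it in conf/nutch-site.xml rather than editing the defaults file.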
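
As a rough sketch, an override in conf/nutch-site.xml (whose settings take precedence over nutch-default.xml) might look like the following. The value 5 is only an illustrative assumption, not a recommendation, and the description text is my own wording:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Overrides the default of 3 from nutch-default.xml.
       The value 5 is an arbitrary example; pick what suits your crawl. -->
  <property>
    <name>db.fetch.retry.max</name>
    <value>5</value>
    <description>Maximum number of times a URL that hit a recoverable
    fetch error is retried before it is marked as gone.</description>
  </property>
</configuration>
```

Note that this only raises the limit on retries for transient failures; it does not change db.fetch.interval.default, so pages that were fetched successfully keep their 30-day schedule.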