Nutch 2.x re-crawls every URL on every run


In Nutch 2.2.1, every time I run a crawl, Nutch fetches all URLs again, including the ones it has already crawled. I want each URL to be crawled only once, no matter how many times Nutch runs. How can I configure this?


There is 1 answer

Best answer (by Do Do)

Once Nutch has fetched a page, it marks the URL as fetched and skips it in later crawl rounds until the page's fetch interval expires. By default that interval is 30 days. You can change it with the db.fetch.interval.default property, which sets the default number of seconds between re-fetches of a page.
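As a rough sketch of what that could look like, the fragment below overrides the property in conf/nutch-site.xml. The value of 31536000 seconds (365 days) is only an illustrative choice, not a recommendation. It also assumes that db.fetch.interval.max, which as far as I know defaults to 90 days and forces a re-fetch once it is exceeded, needs to be raised as well so that it does not undercut the longer interval.

```xml
<!-- conf/nutch-site.xml: local overrides for the defaults in nutch-default.xml -->
<configuration>
  <property>
    <name>db.fetch.interval.default</name>
    <!-- Seconds between re-fetches. The default is 2592000 (30 days);
         31536000 (365 days) is an illustrative value only. -->
    <value>31536000</value>
  </property>
  <property>
    <name>db.fetch.interval.max</name>
    <!-- Assumption: this upper bound must be at least as large as the
         default interval above, otherwise pages are retried sooner. -->
    <value>31536000</value>
  </property>
</configuration>
```

A new interval most likely applies only to pages fetched after the change, since already-stored pages keep the interval recorded for them at fetch time.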

Hope this helps,

Le Quoc Do