In Nutch 2.2.1, every time I run Nutch it crawls all URLs, including the ones I have already crawled. I want each URL to be crawled only once, no matter how many times Nutch runs. How can I configure this?
After fetching a page, Nutch marks its URL as fetched and will not crawl it again in subsequent crawling rounds until the re-fetch interval has elapsed. By default, that interval is 30 days. You can change the default number of seconds between re-fetches of a page by modifying the db.fetch.interval.default property.
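As a rough sketch, the override could go in conf/nutch-site.xml along these lines. The 10-year value is only an illustrative choice, not a recommended setting, and I am assuming you also want to raise db.fetch.interval.max, since that property caps how long a page can go without being re-fetched:

```xml
<!-- conf/nutch-site.xml: illustrative values, adjust to your needs -->
<configuration>
  <property>
    <name>db.fetch.interval.default</name>
    <!-- seconds between re-fetches; the shipped default is 2592000 (30 days) -->
    <!-- 315360000 = 10 * 365 * 24 * 3600, i.e. about 10 years (example value) -->
    <value>315360000</value>
  </property>
  <property>
    <name>db.fetch.interval.max</name>
    <!-- assumption: raised as well so the maximum interval does not force an earlier re-fetch -->
    <value>315360000</value>
  </property>
</configuration>
```

Nutch has no explicit "never re-fetch" switch that I know of, so a very long interval is the usual approximation. Pages already in the database keep the interval stored with them, so the new value should mainly affect URLs fetched after the change.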
Hope this helps,
Le Quoc Do