I have a Nutch 2.3 setup with HBase as the backend, and I run a crawl which includes indexing to Solr and Solr deduplication.
I have recently noticed that the Solr index contains unwanted webpages.
In order to get Nutch to ignore these webpages, I set the following robots meta tag:
<meta name="robots" content="noindex,follow">
I have visited the official Apache Nutch website, which explains the following:
If you do not have permission to edit the /robots.txt file on your server, you can still tell robots not to index your pages or follow your links. The standard mechanism for this is the robots META tag
Searching the web for answers, I found recommendations to set Protocol.CHECK_ROBOTS or to set protocol.plugin.check.robots as a property in nutch-site.xml. Neither of these appears to work.
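For reference, this is the kind of entry I added to nutch-site.xml following those recommendations (nutch-site.xml uses the standard Hadoop-style property format; the value shown is simply what I tried, and it did not change the behaviour):

<property>
  <name>protocol.plugin.check.robots</name>
  <value>true</value>
</property>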
At present Nutch 2.3 ignores the noindex rule and therefore indexes the content to the external datastore, i.e. Solr.
The question is: how do I configure Nutch 2.3 to honour robots meta tags?
Also, if Nutch 2.3 was previously configured to ignore the robots meta tag and indexed a webpage during an earlier crawl cycle, will that page be removed from the Solr index in future crawls, provided the rules for the robots meta tag are now correct?
I've created a plugin to overcome the problem of Apache Nutch 2.3 NOT honouring the robots meta tag rule noindex. The metarobots plugin forces Nutch to discard qualifying documents at indexing time, which prevents them from being indexed to your external datastore, i.e. Solr.
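In outline, the plugin is an IndexingFilter that returns null for qualifying documents, which tells Nutch to drop them before they reach the indexing backend. The sketch below is only a simplified illustration of that approach, not the plugin's actual source: the class and package names are placeholders, and the noindex detection here is a crude scan of the raw page content.

package org.example.nutch.metarobots;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.EnumSet;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.storage.WebPage;

public class MetaRobotsIndexingFilter implements IndexingFilter {

  // Crude check for a robots meta tag that contains "noindex" in the raw HTML.
  private static final Pattern NOINDEX = Pattern.compile(
      "<meta[^>]+name\\s*=\\s*[\"']robots[\"'][^>]*noindex",
      Pattern.CASE_INSENSITIVE);

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, String url, WebPage page)
      throws IndexingException {
    ByteBuffer content = page.getContent();
    if (content != null) {
      byte[] bytes = new byte[content.remaining()];
      content.duplicate().get(bytes);
      String html = new String(bytes, StandardCharsets.UTF_8);
      if (NOINDEX.matcher(html).find()) {
        // Returning null discards the document at indexing time,
        // so it never reaches Solr.
        return null;
      }
    }
    return doc;
  }

  @Override
  public Collection<WebPage.Field> getFields() {
    // Request the raw content so the filter can inspect the meta tags.
    return EnumSet.of(WebPage.Field.CONTENT);
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}

Like any indexing filter, it also has to be registered in the plugin's plugin.xml against the org.apache.nutch.indexer.IndexingFilter extension point and added to the plugin.includes property in nutch-site.xml.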
Please note: this plugin only prevents the indexing of documents that contain the robots meta tag rule noindex; it does NOT remove any documents that were previously indexed to your external datastore. Visit this link for instructions.