I have a Nutch 2.3 setup with HBase as the backend, and I run a crawl which includes indexing to Solr and Solr deduplication.
I have recently noticed that the Solr index contains unwanted webpages.
In order to get Nutch to ignore these webpages, I set the following robots meta tag:
<meta name="robots" content="noindex,follow">
I have visited the official Apache Nutch website, which explains the following:
If you do not have permission to edit the /robots.txt file on your server, you can still tell robots not to index your pages or follow your links. The standard mechanism for this is the robots META tag
Searching the web for answers, I found recommendations to set Protocol.CHECK_ROBOTS or to set protocol.plugin.check.robots as a property in nutch-site.xml. Neither of these appears to work.
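For reference, this is the kind of entry I added to nutch-site.xml following those recommendations (nutch-site.xml uses the standard Hadoop-style property format; the value shown is simply what I tried, and it did not change the behaviour):

<property>
  <name>protocol.plugin.check.robots</name>
  <value>true</value>
</property>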
At present Nutch 2.3 ignores the noindex rule and therefore indexes the content to the external datastore, i.e. Solr.
The question is: how do I configure Nutch 2.3 to honour robots meta tags?
Also, if Nutch 2.3 was previously configured to ignore the robots meta tag and indexed a webpage during an earlier crawl cycle, will that page be removed from the Solr index in future crawls, provided the rules for the robots meta tag are now correct?
I've created a plugin to overcome the problem of Apache Nutch 2.3 NOT honouring the robots meta tag rule noindex. The metarobots plugin forces Nutch to discard qualifying documents at indexing time, which prevents them from being indexed to your external datastore, i.e. Solr.
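In outline, the plugin is an IndexingFilter that returns null for qualifying documents, which tells Nutch to drop them before they reach the indexing backend. The sketch below is only a simplified illustration of that approach, not the plugin's actual source: the class and package names are placeholders, and the noindex detection here is a crude scan of the raw page content.

package org.example.nutch.metarobots;

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.EnumSet;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.storage.WebPage;

public class MetaRobotsIndexingFilter implements IndexingFilter {

  // Crude check for a robots meta tag that contains "noindex" in the raw HTML.
  private static final Pattern NOINDEX = Pattern.compile(
      "<meta[^>]+name\\s*=\\s*[\"']robots[\"'][^>]*noindex",
      Pattern.CASE_INSENSITIVE);

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, String url, WebPage page)
      throws IndexingException {
    ByteBuffer content = page.getContent();
    if (content != null) {
      byte[] bytes = new byte[content.remaining()];
      content.duplicate().get(bytes);
      String html = new String(bytes, StandardCharsets.UTF_8);
      if (NOINDEX.matcher(html).find()) {
        // Returning null discards the document at indexing time,
        // so it never reaches Solr.
        return null;
      }
    }
    return doc;
  }

  @Override
  public Collection<WebPage.Field> getFields() {
    // Request the raw content so the filter can inspect the meta tags.
    return EnumSet.of(WebPage.Field.CONTENT);
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}

Like any indexing filter, it also has to be registered in the plugin's plugin.xml against the org.apache.nutch.indexer.IndexingFilter extension point and added to the plugin.includes property in nutch-site.xml.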
Please note: this plugin only prevents the indexing of documents that contain the robots meta tag rule noindex; it does NOT remove any documents that were previously indexed to your external datastore. Visit this link for instructions.