I am using Heritrix 3.2.0.
I want to grab everything from one site, including pages normally protected by robots.txt.
However, I do not want to ignore robots.txt for other sites. (We don't want Facebook or Google getting angry with us, you know.)
I have tried to set up a sheet overlay closely resembling the one in the 3.0/3.1 manual (see the end of this post).
The job builds without complaints, but the overlay doesn't seem to be triggered: the local robots.txt is still obeyed.
So, what am I doing wrong?
Stig Hemmer
<beans>
    ... all the normal default crawler-beans.cxml stuff ...

    <bean id="sheetOverLayManager" autowire="byType"
          class="org.archive.crawler.spring.SheetOverlaysManager">
    </bean>

    <bean class='org.archive.crawler.spring.SurtPrefixesSheetAssociation'>
        <property name='surtPrefixes'>
            <list>
                <value>
                    http://(no,kommune,trondheim,)/
                    https://(no,kommune,trondheim,)/
                </value>
            </list>
        </property>
        <property name='targetSheetNames'>
            <list>
                <value>noRobots</value>
            </list>
        </property>
    </bean>

    <bean id='noRobots' class='org.archive.spring.Sheet'>
        <property name='map'>
            <map>
                <entry key='metadata.robotsPolicyName' value='ignore'/>
            </map>
        </property>
    </bean>
</beans>
Original Poster here. As always, Problem Exists Between Keyboard And Chair.
It turns out I didn't understand how SURTs work.
New and improved configuration (sketched at the end of this post): the important change was leaving the end of each SURT open, since I actually wanted the rules to cover subsites as well.
I also split the two SURTs into two <value>s. Not sure if that was necessary, but at least it is more readable.
I still have problems, but at least I have new problems!
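Since the updated beans are not reproduced here, the following is only a sketch of what the corrected association could look like, based on the description above (the open-ended prefixes and the separate <value> elements are reconstructed from that description, not copied from the original file); the rest of the job is assumed to stay as in the first listing:

    <bean class='org.archive.crawler.spring.SurtPrefixesSheetAssociation'>
        <property name='surtPrefixes'>
            <list>
                <!-- No trailing ")/": an open-ended SURT prefix also matches
                     subdomains such as www.trondheim.kommune.no -->
                <value>http://(no,kommune,trondheim,</value>
                <value>https://(no,kommune,trondheim,</value>
            </list>
        </property>
        <property name='targetSheetNames'>
            <list>
                <value>noRobots</value>
            </list>
        </property>
    </bean>

With prefixes like these, the noRobots sheet should apply to trondheim.kommune.no and its subdomains, while every other site keeps the default robots policy.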