On the Heritrix Usecases page there is a use case for "Only Store Successful HTML Pages".
My problem: I don't know how to implement it in my cxml file. Specifically, the step "Adding the ContentTypeRegExpFilter to the ARCWriterProcessor => set its regexp setting to text/html.*": there is no ContentTypeRegExpFilter in the sample cxml files.
The use cases you cite are somewhat out of date and refer to Heritrix 1.x. In Heritrix 3, filters have been replaced with decide rules and the configuration framework is very different. Still, the basic concept is the same.
The cxml file is basically a Spring configuration file. You need to configure the property shouldProcessRule on the ARCWriter bean to be the ContentTypeMatchesRegexDecideRule.
A possible ARCWriter configuration:
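A minimal sketch of such a bean, assuming the Heritrix 3 class names org.archive.modules.writer.ARCWriterProcessor and org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule. The bean id arcWriter is a placeholder; use whatever id your writer bean already has in your cxml. It is worth confirming against your Heritrix version that URIs for which the rule returns no decision are actually skipped by the processor.

```xml
<!-- Sketch only: the bean id "arcWriter" is a placeholder, keep your existing id -->
<bean id="arcWriter" class="org.archive.modules.writer.ARCWriterProcessor">
  <!-- shouldProcessRule is inherited from Processor -->
  <property name="shouldProcessRule">
    <bean class="org.archive.modules.deciderules.ContentTypeMatchesRegexDecideRule">
      <!-- Rule things in: URIs whose content type matches get ACCEPT -->
      <property name="decision" value="ACCEPT" />
      <!-- Matches text/html with or without a charset suffix -->
      <property name="regex" value="^text/html.*" />
    </bean>
  </property>
  <!-- other ARCWriterProcessor properties stay as they are in your profile -->
</bean>
```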
This will cause the Processor to only process those items that match the DecideRule, which in turn only passes those whose content type (mime type) matches the provided regular expression.
Be careful about the 'decision' setting. Are you ruling things in or out? (My example rules things in; anything not matching is ruled out.)
As shouldProcessRule is inherited from Processor, this can be applied to any processor.
More information about configuring Heritrix 3 can be found on the Heritrix 3 Wiki (the user guide on crawler.archive.org is about Heritrix 1).