Issue parsing PDF with Apache Nutch - extractor plugin


I am trying to index web pages AND pdf documents from a website. I am using Nutch 1.9.

I downloaded the nutch-custom-search plugin from https://github.com/BayanGroup/nutch-custom-search. The plugin is awesome and does let me map selected divs to Solr fields.

The problem is that my site also contains numerous PDF files. I can see that they are fetched, but they never seem to be parsed: when I query Solr there are no PDFs, just web pages. I am trying to use Tika to parse the PDFs (I hope I have the right idea).

If I run parsechecker under Cygwin (see below), the PDF seems to parse OK:

    $ bin/nutch parsechecker -dumptext -forceAs application/pdf http://www.immunisationscotland.org.uk/uploads/documents/18304-Tuberculosis.pdf 

I am not sure what to do next. My configuration is below.

extractor.xml

    <config xmlns="http://bayan.ir" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://bayan.ir http://raw.github.com/BayanGroup/nutch-custom-search/master/zal.extractor/src/main/resources/extractors.xsd" omitNonMatching="true">
        <fields>
            <field name="pageTitleChris" />
            <field name="contentChris" />
        </fields>
        <documents>
            <document url="^.*\.(?!pdf$)[^.]+$" engine="css">
                <extract-to field="pageTitleChris">
                    <text>
                        <expr value="head > title" />
                    </text>
                </extract-to>
                <extract-to field="contentChris">
                    <text>
                        <expr value="#primary-content" />
                    </text>
                </extract-to>
            </document>
        </documents>
    </config>

Inside my parse-plugins.xml I added:

    <mimeType name="application/pdf">
       <plugin id="parse-tika" />
    </mimeType>

nutch-site.xml

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|extractor|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

    <property>
      <name>http.content.limit</name>
      <value>65536666</value>
      <description></description>
    </property>

    <property>
      <name>extractor.file</name>
      <value>extractor.xml</value>
    </property>

Help would be much appreciated,

Thanks

Chris


There is 1 answer

Answered by tahagh

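I think the problem is the omitNonMatching="true" setting in your extractor.xml file.

omitNonMatching="true" means "don't index pages that match none of the extract-to rules in extractor.xml". The default value is false. Your only document rule has url="^.*\.(?!pdf$)[^.]+$", whose negative lookahead excludes URLs ending in .pdf, so the PDFs presumably match no rule and are omitted from the index even though they are fetched.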
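
As a rough sketch of that change, the extractor.xml from the question could be kept as it is with only the omitNonMatching attribute switched to false (or removed, since false is the default). This assumes the other indexing plugins already listed in plugin.includes, such as index-basic, then supply the fields for the PDF documents; the custom fields pageTitleChris and contentChris would still be filled only for pages matched by the css rule. The closing tags are written out here for completeness.

```xml
<!-- Sketch only: same rules as in the question; the one change is omitNonMatching="false" -->
<config xmlns="http://bayan.ir" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://bayan.ir http://raw.github.com/BayanGroup/nutch-custom-search/master/zal.extractor/src/main/resources/extractors.xsd" omitNonMatching="false">
    <fields>
        <field name="pageTitleChris" />
        <field name="contentChris" />
    </fields>
    <documents>
        <!-- Unchanged: applies to URLs whose final extension is not "pdf" -->
        <document url="^.*\.(?!pdf$)[^.]+$" engine="css">
            <extract-to field="pageTitleChris">
                <text>
                    <expr value="head > title" />
                </text>
            </extract-to>
            <extract-to field="contentChris">
                <text>
                    <expr value="#primary-content" />
                </text>
            </extract-to>
        </document>
    </documents>
</config>
```

After changing the file, the crawl data most likely has to be re-indexed into Solr before the PDFs show up in query results.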