Solr 5.1.0 - Apache TikaEntityProcessor Cannot Find My Files

Solr, or more specifically Tika, cannot find my files, whose paths are retrieved from a database. Whenever I run an import, Solr logs errors saying that it can't find the file.

I'm basically doing what is shown in this forum question, which is taking a file path from a database and using TikaEntityProcessor to analyze the document.

His problem was a Tika version issue, but I'm using a version that is about five years older, so I'm not sure whether this is still a problem with the current version of Tika or whether I'm missing something extremely obvious (which is possible, as I'm extremely new to Solr). This is my data configuration. TextContentURL is the file path.

<dataConfig>
  <dataSource name="ds-db" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/EDMS_Metadata" user="root" password="**************" />
  <dataSource name="ds-file" type="BinFileDataSource"/>

  <document name="doc1">
    <entity name="db-data" dataSource="ds-db" query="select TextContentURL as 'id',ID,Title,AuthorCreator from MasterIndex">
      <field column="TextContentURL" name="id" />
      <field column="Title" name="title" />
    </entity>
    <entity name="file" dataSource="ds-file" processor="TikaEntityProcessor" url="${db-data.TextContentURL}" format="text">
      <field column="text" name="text" />
    </entity>
  </document>
</dataConfig>

I'd like to note that when I delete the second entity and run only the database import, it works fine. I can run the import and query the index, and I get this output when I run a faceted search:

 "response": {
    "numFound": 283,
    "start": 0,
    "docs": [
      {
        "id": "/home/paden/Documents/LWP_Files/BIGDATA/6220106.pdf",
        "title": "ENGINEERING INITIATION",
      },

This means that it is pulling the document file path just fine: the id is the correct file path. But when I re-add the second entity, it logs errors saying it can't find the file. Am I missing something obvious?

Solr is logging these errors:

WARN FileDataSource FileDataSource.basePath is empty. Resolving to: /home/paden/Downloads/solr-5.1.0/server/.

ERROR DocBuilder

Exception while processing: file document : SolrInputDocument(fields: []):org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: (resolved to: /home/paden/Downloads/solr-5.1.0/server/.

ERROR DataImporter

Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: (resolved to: /home/paden/Downloads/solr-5.1.0/server/.

There is 1 answer

Abhijit Bashetti (best answer)

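Try the data-config below. It nests the Tika entity inside the db-data entity, so the file entity runs once per database row, and the query no longer aliases TextContentURL, so ${db-data.TextContentURL} has a value to resolve. It also adds onError="continue" so that a single unreadable file does not abort the whole import.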

<dataConfig>
  <dataSource name="ds-db" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/EDMS_Metadata" user="root" password="**************" />
  <dataSource name="ds-file" type="BinFileDataSource"/>
  <document name="doc1">
    <entity name="db-data" dataSource="ds-db" query="select TextContentURL,ID,Title,AuthorCreator from MasterIndex">
      <field column="TextContentURL" name="TextContentURL" />
      <field column="Title" name="title" />
      <entity name="file" dataSource="ds-file" processor="TikaEntityProcessor" url="${db-data.TextContentURL}" format="text" onError="continue">
        <field column="text" name="text" />
      </entity>
    </entity>
  </document>
</dataConfig>
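
If you still want the file path to end up in the id field, as in the question, a variant along these lines may work. This is only a sketch, not something tested against this setup: it assumes the schema's unique key is named id and that a text field exists, and it trims the query to the two mapped columns for brevity. The key points are the same as in the config above: the column keeps its real name in the query (the mapping to id happens in the field element instead of a SQL alias), and the Tika entity sits inside the database entity.

```xml
<!-- Sketch only: assumes the schema's uniqueKey is "id" and a "text" field exists. -->
<document name="doc1">
  <!-- Outer entity: one row per file. No SQL alias, so the column keeps its real name. -->
  <entity name="db-data" dataSource="ds-db"
          query="select TextContentURL, Title from MasterIndex">
    <!-- Rename the column to the Solr field here rather than in the query. -->
    <field column="TextContentURL" name="id" />
    <field column="Title" name="title" />
    <!-- Inner entity: evaluated per outer row, so the url variable is populated. -->
    <entity name="file" dataSource="ds-file" processor="TikaEntityProcessor"
            url="${db-data.TextContentURL}" format="text" onError="continue">
      <field column="text" name="text" />
    </entity>
  </entity>
</document>
```

The empty path in the logged error ("Could not find file: (resolved to: .../server/.") is consistent with the url variable resolving to nothing, which is what you would expect when the file entity is a sibling of db-data or when the column has been renamed by an alias.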