I am using Solr 4.4 to index millions of documents. I want the file names and modified times to be indexed as well, but I could not find a way to do it. In data-config.xml I fetch the files using FileListEntityProcessor and then parse every line using LineEntityProcessor.
This is my data-config.xml. I can't use TikaEntityProcessor. Is there any way I can do it with LineEntityProcessor?
<dataConfig>
  <dataSource encoding="UTF-8" type="FileDataSource" name="fds" />
  <document>
    <entity
        name="files"
        dataSource="null"
        rootEntity="false"
        processor="FileListEntityProcessor"
        baseDir="C:/Softwares/PlafFiles/"
        fileName=".*\.PLF"
        recursive="true"
        >
      <field column="fileLastModified" name="last_modified" />
      <entity name="na_04"
          processor="LineEntityProcessor"
          dataSource="fds"
          url="${files.fileAbsolutePath}"
          transformer="script:parseRow23">
        <field column="url" name="Plaf_filename"/>
        <field column="source" />
        <field column="pict_id" name="pict_id" />
        <field column="pict_type" name="pict_type" />
        <field column="hierarchy_id" name="hierarchy_id" />
        <field column="book_id" name="book_id" />
        <field column="ciscode" name="ciscode" />
        <field column="plaf_line" />
      </entity>
    </entity>
  </document>
</dataConfig>
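From the documentation of FileListEntityProcessor: for each file it finds, the processor generates the implicit fields fileDir, file, fileAbsolutePath, fileSize and fileLastModified, and these are available for use within the entity.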
You can move these values into differently named fields by referencing them:
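A minimal sketch of such a mapping, placed on the files entity. The target names file_name and last_modified are placeholders for illustration, so substitute whatever your schema defines:

```xml
<!-- Goes inside the "files" FileListEntityProcessor entity. -->
<!-- file_name and last_modified are hypothetical target field names. -->
<field column="file" name="file_name" />
<field column="fileLastModified" name="last_modified" />
```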
This will require that you have a schema.xml that actually allows those two names.
If you need to use them in another string or manipulate them further before inserting, reference them as variables. You're already using ${files.fileAbsolutePath}, so by using ${files.file} and ${files.fileLastModified} you should be able to extract the values you want. You can modify these values and insert them into a specific field by using the TemplateTransformer and referencing the generated fields:
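A sketch of how the inner entity might look with TemplateTransformer chained in front of the existing script transformer. Plaf_filename is the field name from the question; plaf_modified is a hypothetical name and needs a matching field in schema.xml. Depending on the Solr version, the date may be rendered as a string by the template, so check it against the field type you index it into:

```xml
<entity name="na_04"
    processor="LineEntityProcessor"
    dataSource="fds"
    url="${files.fileAbsolutePath}"
    transformer="TemplateTransformer,script:parseRow23">
  <!-- Each template copies a value from the parent "files" row into every line's document. -->
  <field column="Plaf_filename" template="${files.file}" />
  <!-- plaf_modified is a hypothetical field name, not one from the question. -->
  <field column="plaf_modified" template="${files.fileLastModified}" />
  <!-- The remaining fields from the question stay as they are. -->
</entity>
```

The template attribute creates the column itself, so it does not depend on LineEntityProcessor producing a column with that name.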