Solr Cell fails to index image files with EXIF

I just installed Solr 6.6.0 on CentOS and have it working with the provided example 'sample_techproducts_configs'. I am able to index files, but as soon as I feed it an image file I get an exception about an invalid date. Solr Cell extracts a date from the EXIF data and then seems to fail when passing it on to Solr. I used the following image file:

http://www.imagemagick.org/Usage/photos/pagoda_sm.jpg

and the response from Solr is:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">400</int>
    <int name="QTime">114</int>
  </lst>
  <lst name="error">
    <lst name="metadata">
      <str name="error-class">org.apache.solr.common.SolrException</str>
      <str name="root-error-class">org.apache.solr.common.SolrException</str>
    </lst>
    <str name="msg">Invalid Date String:'2005-07-09T14:05:15'</str>
    <int name="code">400</int>
  </lst>
</response>

The date it complains about is formatted as yyyy-MM-dd'T'HH:mm:ss, which should be a default date format according to:

https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

I am looking for a fix, or at least a workaround that skips the dates and indexes the rest of the EXIF information.

There is 1 answer
Answer by TurbuLenz:

A very similar error occurred for me in a production environment that had been running for years. I tracked it down to a change in Solr's schema.xml. A new wildcard field had been added for dynamic date fields:

<dynamicField name="date_*" type="tdate" indexed="true" stored="true" multiValued="true"/>

Tika's EXIF extraction seems to create fields for the EXIF dates whose names match this dynamic field definition. Since the extracted date format does not match the format that Solr's TrieDateField class expects (ISO 8601 in UTC with a trailing 'Z', for example 2005-07-09T14:05:15Z), a parsing error occurs.

Removing this wildcard field and switching to specific field definitions worked for me. The date field values are not indexed in this case, but the rest of the EXIF data is.

An alternative approach that would index those dates as well could be to implement a filter that checks the input date with a regular expression and transforms it into a format Solr accepts.
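As a rough illustration of that regular-expression idea, here is a minimal Python sketch that could run on the client side before documents are sent to Solr. The function name `to_solr_date` is hypothetical, and the sketch assumes that a timestamp without a time zone may be treated as UTC, which may not be true for your camera data. It is not a Solr or Tika API, just the string transformation itself.

```python
import re
from typing import Optional

# Zone-less timestamps such as 2005-07-09T14:05:15, or the raw EXIF
# form 2005:07:09 14:05:15.
_LOCAL_TS = re.compile(
    r"^(\d{4})[-:](\d{2})[-:](\d{2})[T ](\d{2}):(\d{2}):(\d{2})$"
)

# Timestamps that already carry the trailing 'Z' Solr expects.
_SOLR_TS = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$")


def to_solr_date(value: str) -> Optional[str]:
    """Return value as YYYY-MM-DDThh:mm:ssZ, or None if it is not recognized.

    Assumption: a timestamp without a zone is treated as UTC.
    """
    value = value.strip()
    if _SOLR_TS.match(value):
        return value  # already in the expected form
    match = _LOCAL_TS.match(value)
    if match is None:
        return None  # caller can drop the field instead of failing the document
    year, month, day, hour, minute, second = match.groups()
    return f"{year}-{month}-{day}T{hour}:{minute}:{second}Z"
```

Returning None for unrecognized values lets the caller skip the date field and still index the rest of the EXIF data, which matches the workaround asked for in the question.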

Maybe your issue is somewhat related or it helps others debugging similar problems.