I have previously parsed all kinds of files with Tika by calling tika.parseToString() without setting any custom configuration or metadata. I now need to filter the files to parse based on MIME type.
I can find the MIME type with tika.detect(new BufferedInputStream(inputStream), new Metadata()); but when I call tika.parseToString() afterwards, Tika uses EmptyParser and the detected content type is "application/octet-stream". This is the default, meaning that Tika could not determine what type of file it is. I have tried setting the content type in the Metadata before parsing the file, but this leads to org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException. From what I've read, this means that the file is malformed, but the same files are parsed successfully when I skip the MIME-type check beforehand.
Does detect() do something with the InputStream, making the parser unable to parse the files?
I'm using the same Tika instance for both checking the MIME type and parsing, version 1.13.
My issue was caused by passing the InputStream to the parse method directly. detect() marks and resets the stream it is given, which a plain InputStream does not support. Wrapping the InputStream in a TikaInputStream solved the issue:
TikaInputStream stream = TikaInputStream.get(new BufferedInputStream(inputStream));
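The mark/reset behaviour described above can be reproduced with plain java.io, without Tika on the classpath. The following is only a sketch under stated assumptions: the class MarkResetDemo and the method peek() are hypothetical stand-ins for a detector that looks at the first bytes and then rewinds, and nothing here calls the Tika API. It compares peeking through a throwaway BufferedInputStream and then reading the raw stream (the pattern from the question) with peeking and reading through one shared wrapper.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class MarkResetDemo {

    // Hypothetical stand-in for a detector: read the first n bytes, then rewind.
    static void peek(InputStream in, int n) throws IOException {
        in.mark(n);
        try {
            in.read(new byte[n], 0, n);
        } finally {
            in.reset(); // rewinds the wrapper only, not whatever it wraps
        }
    }

    // Pattern from the question: peek through a temporary wrapper, then read the raw stream.
    static int readAfterPeekOnTemporaryWrapper(byte[] data) {
        try {
            InputStream raw = new ByteArrayInputStream(data);
            peek(new BufferedInputStream(raw), 4);
            // The wrapper already pulled bytes out of raw to fill its buffer,
            // so raw no longer starts at the beginning of the data.
            return raw.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Pattern from the answer: one wrapper is used for both peeking and reading.
    static int readAfterPeekOnSharedWrapper(byte[] data) {
        try {
            InputStream shared = new BufferedInputStream(new ByteArrayInputStream(data));
            peek(shared, 4);
            return shared.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readAfterPeekOnTemporaryWrapper(data)); // -1: raw stream already drained
        System.out.println(readAfterPeekOnSharedWrapper(data));    // 37: the '%' at the start
    }
}
```

With Tika, the analogous approach would presumably be to create the TikaInputStream once and pass that same object to both detect() and the parse call, rather than wrapping the stream for detection and handing the raw InputStream to the parser.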