I am using the Apache Tika parser to parse a file and Elasticsearch to index it. Suppose I have a .doc file that needs to be parsed. Here is the code example:
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public String parseToPlainText() throws IOException, SAXException, TikaException {
    // Default BodyContentHandler buffers the whole extracted text in memory
    BodyContentHandler handler = new BodyContentHandler();
    InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc");
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    } finally {
        stream.close();
    }
}
As you can see, test.doc is read in one go, and if the file is too large this may cause an OutOfMemoryError. I would like to read the file in small chunks of input streams and have parser.parse(stream, handler, metadata) accept those chunks. A further complication is that the file can be of any type, so how can I split a file into chunks of streams, and how can the parser accept them? I was thinking of something along the lines of the sketch below.
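From the Tika javadoc, I understand that BodyContentHandler can wrap a Writer (or an OutputStream), so the extracted text is streamed out as it is produced instead of being accumulated in one String. This is only an untested sketch of what I have in mind; the output path test.txt is made up:

import java.io.BufferedWriter;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public void parseToTextFile() throws Exception {
    Path textOut = Paths.get("test.txt"); // made-up output location
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc");
         BufferedWriter writer = Files.newBufferedWriter(textOut, StandardCharsets.UTF_8)) {
        // Extracted text is written to the file as parsing progresses,
        // so the full text never has to fit in memory at once
        BodyContentHandler handler = new BodyContentHandler(writer);
        parser.parse(stream, handler, metadata);
    }
}

But I am not sure this actually bounds memory for every file type, since the parser itself may still buffer the whole input for some formats (e.g. .doc).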
Importantly, even if a file is split into chunks during indexing, it should still end up indexed as a single document in Elasticsearch.
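On the indexing side, my assumption is that I would still send one document per file once the text is extracted, for example with the Elasticsearch high-level REST client; the index name files and the field names below are made up:

import java.util.Map;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public void indexAsSingleDocument(String fileName, String extractedText) throws Exception {
    try (RestHighLevelClient client = new RestHighLevelClient(
            RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
        // One IndexRequest per file, regardless of how the file was read
        IndexRequest request = new IndexRequest("files")  // made-up index name
                .id(fileName)                             // one document per file
                .source(Map.of("name", fileName, "content", extractedText));
        client.index(request, RequestOptions.DEFAULT);
    }
}

Is that the right approach, or is there a way to stream content into a single Elasticsearch document chunk by chunk?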