How can I parse and index a big file in multiple parts so that reading it through an InputStream consumes less memory?


I am using the Apache Tika parser to parse a file and Elasticsearch to index it.

Let's suppose I have a .doc file that needs to be parsed. Here is the code example:

public String parseToPlainText() throws IOException, SAXException, TikaException {
    // The default BodyContentHandler buffers all extracted text in memory
    BodyContentHandler handler = new BodyContentHandler();

    InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc");
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try {
        // Parse the whole document and return the extracted text as one String
        parser.parse(stream, handler, metadata);
        return handler.toString();
    } finally {
        stream.close();
    }
}

As you can see, the test.doc file is read in one go, and if the file is too large this may cause an OutOfMemoryError. I need to read the file in small chunks of input streams, and parser.parse(stream, handler, metadata) should accept those stream chunks. Another issue is that the file can be of any type. So how can I split files into chunks of streams, and how can the parser accept them? Importantly, even if a file is split into chunks while indexing, it should still be indexed as a single document at the end.
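For illustration, here is a rough sketch of the direction I have in mind, assuming Tika's BodyContentHandler(Writer) constructor (which forwards extracted text to the supplied Writer as it is produced instead of buffering it all in memory). The method name parseToTempFile and the temp-file approach are just placeholders for the idea:

import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public Path parseToTempFile(InputStream stream) throws IOException, SAXException, TikaException {
    // Write the extracted text to a temporary file instead of collecting it
    // in a String, so heap usage stays bounded regardless of the file size.
    Path tempFile = Files.createTempFile("tika-", ".txt");
    try (Writer writer = Files.newBufferedWriter(tempFile, StandardCharsets.UTF_8);
         InputStream in = stream) {
        // With a Writer-backed handler, Tika streams the text out as it parses
        // rather than buffering the whole document.
        BodyContentHandler handler = new BodyContentHandler(writer);
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        parser.parse(in, handler, metadata);
    }
    return tempFile;
}

The temporary file could then be read back (in chunks if necessary) when building the index request, so the file is still indexed as a single Elasticsearch document. But I am not sure whether this is the right approach, or how to make the parser itself accept chunked input streams.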

