How to remove unnecessary parsing-related info from Tika parsing output

I am parsing a .docx file with Apache Tika. Parsing works fine, except that the output also contains some unnecessary text at the beginning, like this:

[Content_Types] .xml _rels / .rels word / _rels / document.xml.rels word / document.xml

and at the end, like this:

word / theme / theme1.xml word / settings.xml word / fontTable.xml word / webSettings.xml docProps / app.xml Normal 13 3 460 2627 Microsoft Office Word 0 21 6 false XXXX XXXX false 3081 false false 12.0000 docProps / core. xml XXX XXXX 1 2016- 12-16T14: 57: 00Z 2016-12-16T15: 10: 00Z word / styles.xml

The code is:

public static String extractString(File file)
{
    BodyContentHandler handler = new BodyContentHandler();

    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = new FileInputStream(file))
    {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
    catch (IOException | SAXException | TikaException e)
    {
        e.printStackTrace();
        return null;
    }
}

How can I remove this unnecessary text from the beginning and end of the output?
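
A note on where the stray text may come from: a .docx file is a ZIP package, and names such as [Content_Types].xml and word/document.xml look like the entries inside that package. One possible explanation, which is an assumption and not confirmed anywhere in this post, is that the file is being handled as a generic ZIP archive, so every part inside it gets extracted. The sketch below uses only java.util.zip to list the entries of an archive so they can be compared with the unwanted text. DocxPartLister, listEntries and sampleArchive are hypothetical names, and the sample archive is a stand-in, not a valid Word file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxPartLister {

    // Returns the entry names of a ZIP archive in the order they are stored.
    public static List<String> listEntries(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Builds a tiny in-memory archive whose entry names mimic a .docx package.
    // Hypothetical stand-in only: the entry contents are placeholders.
    public static byte[] sampleArchive() throws IOException {
        String[] partNames = {"[Content_Types].xml", "_rels/.rels", "word/document.xml"};
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (String name : partNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Print one entry name per line for comparison with the parser output.
        for (String name : listEntries(sampleArchive())) {
            System.out.println(name);
        }
    }
}
```

To check a real document, pass Files.readAllBytes(file.toPath()) to listEntries. If the names match the unwanted text, a commonly suggested thing to try is wrapping the stream with TikaInputStream.get(...) so that Tika can inspect the container. Whether that resolves this particular case is untested here.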

There are 0 answers