I am parsing a docx file with Apache Tika. Parsing works fine, except that the output also contains some unnecessary text at the beginning, like this:
[Content_Types] .xml _rels / .rels word / _rels / document.xml.rels word / document.xml
and at the end, like this:
word / theme / theme1.xml word / settings.xml word / fontTable.xml word / webSettings.xml docProps / app.xml Normal 13 3 460 2627 Microsoft Office Word 0 21 6 false XXXX XXXX false 3081 false false 12.0000 docProps / core. xml XXX XXXX 1 2016- 12-16T14: 57: 00Z 2016-12-16T15: 10: 00Z word / styles.xml
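The code is:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.SAXException;

    public static String extractString(File file)
    {
        // Collects the text content of the parsed document body
        BodyContentHandler handler = new BodyContentHandler();
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = new FileInputStream(file))
        {
            parser.parse(stream, handler, metadata);
            return handler.toString();
        }
        catch (IOException | SAXException | TikaException e)
        {
            e.printStackTrace();
            return null;
        }
    }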
How can I remove the unnecessary text from the beginning and end of the output?