Apache Tika and document metadata

5.8k views Asked by At

I'm doing simple processing of variety of documents (ODS, MS office, pdf) using Apache Tika. I have to get at least :

word count, author, title, timestamps, language etc.

which is not so easy. My strategy is using Template method pattern for 6 types of document, where I find the type of document first, and based on that I process it individually.

I know that apache tika should remove the need for this, but the document formats are quite different right ?

For instance

InputStream input = this.getClass().getClassLoader().getResourceAsStream(doc);
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
Parser parser = new OfficeParser();
parser.parse(input, textHandler, metadata, new ParseContext());
input.close();

for(String s : metadata.names()) {
    System.out.println("Metadata name : "  + s);
}

I tried to do this for ODS, MS office, pdf documents, and the metadada differs a lot. There is MSOffice interface that lists metadata keys for MS documents and some Dublic Core metadata list. But how should one implement an application like this ?

Could please anybody who has experience with it share his experience ? Thank you

1

There are 1 answers

0
Gagravarr On BEST ANSWER

Generally the parsers should return the same metadata key for the same kind of thing across all document formats. However, there are some kinds of metadata that only occur in some file types, so you won't get those from others.

You might want to just use the AutoDetectParser, and if you need to do anything special with the metadata handle that afterwards based on the mimetype, eg

Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
ParseContext context = new ParseContext();

Parser parser = new AutoDetectParser();
parser.parse(input, textHandler, metadata, new ParseContext());

if(metadata.get(CONTENT_TYPE).equals("application/pdf")) {
   // Do something special with the PDF metadata here
}