Apache Tika and document metadata

Question

5.8k views · Asked by lisak on 26 February 2011 at 21:47

I'm doing simple processing of a variety of documents (ODS, MS Office, PDF) using Apache Tika. I have to get at least:

word count, author, title, timestamps, language etc.

which is not so easy. My strategy is to use the Template Method pattern for 6 document types: I detect the type of the document first and then process it according to that type.

I know that Apache Tika should remove the need for this, but the document formats are quite different, right?

For instance

// Parse one document with the MS Office parser, then list the metadata keys it reports
Metadata metadata = new Metadata();
ContentHandler textHandler = new BodyContentHandler();
Parser parser = new OfficeParser();
// try-with-resources closes the stream even if parsing throws
try (InputStream input = this.getClass().getClassLoader().getResourceAsStream(doc)) {
    parser.parse(input, textHandler, metadata, new ParseContext());
}

for (String s : metadata.names()) {
    System.out.println("Metadata name : " + s);
}

I tried this for ODS, MS Office and PDF documents, and the metadata differs a lot. There is an MSOffice interface that lists metadata keys for MS documents, and a list of Dublin Core metadata keys. But how should one implement an application like this?

Could anybody who has experience with this please share it? Thank you.

There is 1 answer

**Gagravarr** · Accepted Answer · 2011-03-31T21:56:57+00:00

Generally the parsers should return the same metadata key for the same kind of thing across all document formats. However, there are some kinds of metadata that only occur in some file types, so you won't get those from others.

You might want to just use the AutoDetectParser and, if you need to do anything special with the metadata, handle that afterwards based on the MIME type, e.g.:

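// filename, input and textHandler are set up by the caller, as in the question
Metadata metadata = new Metadata();
// The file name is a hint that helps type detection
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
ParseContext context = new ParseContext();

Parser parser = new AutoDetectParser();
parser.parse(input, textHandler, metadata, context);

// Constant-first equals avoids a NullPointerException if no content type was set
if ("application/pdf".equals(metadata.get(Metadata.CONTENT_TYPE))) {
   // Do something special with the PDF metadata here
}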
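
For the keys that still differ between formats, one possible approach is to copy the parsed metadata into a plain map and resolve each logical field through a list of candidate keys. The following is only a sketch, assuming Java 11 or later. The class name MetadataFields and the candidate key names are illustrative assumptions, not something Tika prescribes; the keys in particular vary with the Tika version and document type, so check them against what metadata.names() prints for your own files.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: resolves logical fields from format-specific metadata keys. */
final class MetadataFields {

    // Candidate keys per logical field, tried in order. These names are
    // assumptions; verify them against the output of metadata.names().
    static final Map<String, List<String>> CANDIDATES = Map.of(
        "title", List.of("dc:title", "title"),
        "author", List.of("dc:creator", "meta:author", "Author", "creator"),
        "wordCount", List.of("meta:word-count", "Word-Count", "nbWord"));

    /** Returns the first non-blank value stored under any candidate key. */
    static Optional<String> first(Map<String, String> metadata, String field) {
        for (String key : CANDIDATES.getOrDefault(field, List.of())) {
            String value = metadata.get(key);
            if (value != null && !value.isBlank()) {
                return Optional.of(value.trim());
            }
        }
        return Optional.empty();
    }

    /** Returns the word count if one is present and numeric, otherwise empty. */
    static Optional<Integer> wordCount(Map<String, String> metadata) {
        Optional<String> raw = first(metadata, "wordCount");
        if (raw.isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Integer.parseInt(raw.get()));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }
}
```

The map can be filled after parsing by looping over metadata.names() and storing metadata.get(name) for each name. Fields that a format does not provide simply come back empty, which keeps the format-specific handling in one place instead of one template per document type.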
