Get Metadata from Open Office Files


I need to modify only the metadata of an Open Office file (file.odt). How can I do that without loading the entire file into memory? I only need to work with the meta.xml file and the tag: ... metadata ...

I'm using Apache ODF Toolkit 0.5-incubating. My code loads the meta.xml file, but I cannot get the metadata:

OdfPackage pkg = OdfPackage.loadPackage(new File("file.odt"));
Node d = pkg.getDom("meta.xml").getElementsByTagName("office:document-meta").item(0);

for(int i =0; i<d.getAttributes().getLength();i++) {
  String nombre = d.getAttributes().item(i).getNodeName();
  String valor = d.getAttributes().item(i).getNodeValue();
  System.out.println("Key: " + nombre + " value: " + valor);
}

There are 2 answers

2
Gagravarr

If you want to work with a range of file formats, Apache Tika is your best bet. Tika provides a common interface for extracting text and metadata from a large number of formats, and hides the complexity of the individual formats from you.

On the command line, to extract the metadata from this sample file, you'd run:

java -jar tika-app-1.4.jar --metadata quick.odt

And you'd get back a huge amount of metadata:

Author: Jesper Steen Møller
Character Count: 43
Content-Length: 7042
Content-Type: application/vnd.oasis.opendocument.text
Creation-Date: 2005-09-06T23:34:00
Edit-Time: PT2M0S
Image-Count: 0
Keywords: Pangram, fox, dog
Last-Modified: 2005-09-06T23:49:00
Last-Save-Date: 2005-09-06T23:49:00
Object-Count: 0
Page-Count: 1
Paragraph-Count: 1
Table-Count: 0
Word-Count: 9
cp:subject: Gym class featuring a brown fox and lazy dog
creator: Jesper Steen Møller
date: 2005-09-06T23:49:00
dc:creator: Jesper Steen Møller
dc:description: Gym class featuring a brown fox and lazy dog
dc:language: en-US
dc:subject: Pangram, fox, dog
dc:title: The quick brown fox jumps over the lazy dog
dcterms:created: 2005-09-06T23:34:00
dcterms:modified: 2005-09-06T23:49:00
description: Gym class featuring a brown fox and lazy dog
editing-cycles: 5
generator: OpenOffice.org/1.9.125$Win32 OpenOffice.org_project/680m125$Build-8947
initial-creator: Nevin Nollop
language: en-US
meta:author: Jesper Steen Møller
meta:character-count: 43
meta:creation-date: 2005-09-06T23:34:00
meta:image-count: 0
meta:initial-author: Nevin Nollop
meta:object-count: 0
meta:page-count: 1
meta:paragraph-count: 1
meta:save-date: 2005-09-06T23:49:00
meta:table-count: 0
meta:word-count: 9
modified: 2005-09-06T23:49:00
nbCharacter: 43
nbImg: 0
nbObject: 0
nbPage: 1
nbPara: 1
nbTab: 0
nbWord: 9
resourceName: quick.odt
subject: Gym class featuring a brown fox and lazy dog
title: The quick brown fox jumps over the lazy dog
xmpTPg:NPages: 1

From Java, you could get the same with something as simple as:

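TikaConfig tika = TikaConfig.getDefaultConfig();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();

// A no-op org.xml.sax.helpers.DefaultHandler discards the text, so this does not rely on parsers accepting a null handler
try (InputStream input = TikaInputStream.get(new File("quick.odt"))) {
  tika.getParser().parse(input, new DefaultHandler(), metadata, context);
}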

The metadata then ends up on the Metadata object: metadata.names() lists the keys and metadata.get(name) returns the value for a key.

0
Keshavram Kuduwa

You can use the OdfDocument class from the ODFDOM library provided by org.odftoolkit. The dependency is available here: https://mvnrepository.com/artifact/org.odftoolkit/odfdom-java

You can parse your document:

OdfDocument odfDocument = OdfDocument.loadDocument(new URL(URLPath).openStream());

And then read the metadata, for example the word count:

Integer wordCount = odfDocument.getOfficeMetadata().getDocumentStatistic().getWordCount();
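
If loading the whole document model is more than you need, here is a rough sketch that uses only JDK classes (no ODF Toolkit, no Tika) to open the .odt as a ZIP archive and parse just the meta.xml entry. It also shows why the loop in the question prints no metadata: the attributes of office:document-meta are mostly namespace declarations, while the metadata itself sits in the child elements of office:meta. The class name OdtMetaReader and the method readMeta are made up for this illustration, and the sketch only reads; it does not write anything back.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of ODF Toolkit or Tika.
class OdtMetaReader {

    // Namespace URI bound to the "office" prefix in ODF files
    static final String OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0";

    // Returns the child elements of <office:meta> as qualified name -> text content.
    // Limitation: repeated elements (e.g. meta:user-defined) keep only the last value.
    static Map<String, String> readMeta(String odtPath) throws Exception {
        Map<String, String> result = new LinkedHashMap<>();
        // ZipFile reads the archive directory; only the requested entry is decompressed
        try (ZipFile zip = new ZipFile(odtPath)) {
            ZipEntry entry = zip.getEntry("meta.xml");
            if (entry == null) {
                return result; // package has no meta.xml
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            try (InputStream in = zip.getInputStream(entry)) {
                Document dom = factory.newDocumentBuilder().parse(in);
                NodeList metas = dom.getElementsByTagNameNS(OFFICE_NS, "meta");
                if (metas.getLength() == 0) {
                    return result; // no <office:meta> element
                }
                NodeList children = metas.item(0).getChildNodes();
                for (int i = 0; i < children.getLength(); i++) {
                    Node child = children.item(i);
                    if (child.getNodeType() == Node.ELEMENT_NODE) {
                        result.put(child.getNodeName(), child.getTextContent());
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java OdtMetaReader file.odt
        for (Map.Entry<String, String> e : readMeta(args[0]).entrySet()) {
            System.out.println("Key: " + e.getKey() + " value: " + e.getValue());
        }
    }
}
```

Note that some elements, such as meta:document-statistic, keep their values in attributes rather than in text, so they show up with an empty value here; you would read those through getAttributes() on that child element instead.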