Best way to parse large XML document in Jython


I need to parse a large (>800MB) XML file from Jython. The XML is not deeply nested, containing about a million relevant elements. I need to convert these elements into real objects.

I've used nu.xom.* successfully before, but now that I've switched from Java to Jython, the library fails with the following message:

The parser has encountered more than "64,000" entity expansions in this document; this is the limit imposed by the application.

I have not found a way to fix this, so I probably have to look for another XML library. It could be either a Java library or a Jython-compatible Python one, and it should be efficient. Something Pythonic would be great; nu.xom.* is simple but not very Pythonic. Do you have any suggestions?


There are 4 answers

1
John Machin On

Does Jython support xml.etree.ElementTree? If so, use its iterparse function to keep memory usage down. Read this and use elem.clear() as described.
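To make the iterparse suggestion concrete, here is a minimal sketch. It assumes your Jython version ships xml.etree.ElementTree; the function name iter_records, the tag name "item" and the sample data are made up for illustration, and you would pass a path to the real file instead of the in-memory sample.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield one dict (attributes plus text) per <tag> element, streaming.

    iter_records and the dict layout are illustrative, not part of any library.
    """
    # Only "end" events: the element is complete when we see it.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            record = dict(elem.attrib)
            record["text"] = elem.text
            yield record
            # Release the element's children, attributes and text
            # once the caller has finished with this record.
            elem.clear()


# Hypothetical sample standing in for the large file.
SAMPLE = b"<items><item id='1'>a</item><item id='2'>b</item></items>"
records = list(iter_records(io.BytesIO(SAMPLE), "item"))
```

Because this is a generator, you can build your real objects one at a time in a for loop instead of collecting a list. Note that elem.clear() empties each element but the root still holds a reference to the empty shell; if that adds up over a million elements, a common variation is to keep a reference to the root and clear it periodically as well.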

0
Steven D. Majewski On

SAX is the best way to parse large documents.

Sounds like you're hitting the default expansion limit. See this note:

https://bugs.java.com/bugdatabase/view_bug?bug_id=4843787

You need to set the system property "entityExpansionLimit" to change the default, for example by passing -DentityExpansionLimit=<value> on the JVM command line.

(added) see also the answer to this question.

0
Valentin Kantor On

There is the lxml Python library, which can parse large files without loading all the data into memory, but I don't know whether it is Jython-compatible.

3
DKIT On

Try using a SAX parser; it is great for streaming large XML files.
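For the SAX route, a rough sketch using Python's standard xml.sax module could look like the following. The class name ItemHandler, the tag name "item", the callback signature and the sample data are all assumptions for illustration. I have not checked which underlying parser Jython uses for xml.sax; if it is the JVM's, the entity expansion limit from the question may still apply.

```python
import io
import xml.sax


class ItemHandler(xml.sax.ContentHandler):
    """Calls callback(attributes, text) for each element named `tag`.

    ItemHandler and the callback signature are illustrative only.
    """

    def __init__(self, tag, callback):
        xml.sax.ContentHandler.__init__(self)
        self.tag = tag
        self.callback = callback
        self.attrs = None   # attributes of the element currently being read
        self.chunks = []    # text pieces collected for that element

    def startElement(self, name, attrs):
        if name == self.tag:
            self.attrs = dict(attrs.items())
            self.chunks = []

    def characters(self, content):
        # The parser may deliver one text node in several calls, so accumulate.
        if self.attrs is not None:
            self.chunks.append(content)

    def endElement(self, name):
        if name == self.tag and self.attrs is not None:
            self.callback(self.attrs, "".join(self.chunks))
            self.attrs = None


# Hypothetical sample standing in for the large file.
SAMPLE = b"<items><item id='1'>a</item><item id='2'>b</item></items>"
found = []
handler = ItemHandler("item", lambda attrs, text: found.append((attrs, text)))
xml.sax.parse(io.BytesIO(SAMPLE), handler)
```

Nothing is kept in memory except the element currently being read, so the callback is the place to construct your real objects. The trade-off compared with iterparse is that you have to track parser state yourself.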