I'm new to xml parsing, and I've been trying to figure out a way to skip over a parent element's contents because there is a nested element that contains a large amount of data in its text attribute (I cannot change how this file is generated). Here's an example of what the xml looks like:
<root>
    <Parent>
        <thing_1>
            <a>I need this</a>
        </thing_1>
        <thing_2>
            <a>I need this</a>
        </thing_2>
        <thing_3>
            <subgroup>
                <huge_thing>enormous string here</huge_thing>
            </subgroup>
        </thing_3>
    </Parent>
    <Parent>
        <thing_1>
            <a>I need this</a>
        </thing_1>
        <thing_2>
            <a>I need this</a>
        </thing_2>
        <thing_3>
            <subgroup>
                <huge_thing>enormous string here</huge_thing>
            </subgroup>
        </thing_3>
    </Parent>
</root>
I've tried lxml.iterparse and xml.sax implementations, but no dice. These are the most common answers I've found in my searches:
Use the tag keyword in iterparse.
This does not work, because, although lxml cleans up the elements in the background, the large text in the element is still parsed into memory, so I'm getting large memory spikes.
Create a flag that you set to True when the start event for that element is found, and then ignore the element during parsing.
This does not work, as the element is still parsed into memory at the end event.
Break before you reach the end event of the specific element.
I cannot just break when I reach the element, because there are multiple such elements and I need specific child data from each of them.
This is not possible as stream parsers still have an end event and generate the full element.
... ok.
I'm currently trying to directly edit the stream data that the GzipFile sends to iterparse, in the hope that the parser never even knows the element exists, but I'm running into issues with that. Any direction would be greatly appreciated.
I don't think you can get a parser to selectively ignore some part of the XML it's parsing. Here are my findings using the SAX parser...
I took your sample XML, blew it up to just under 400MB, created a SAX parser, and ran it against my big.xml file two different ways.
With sax.parse('big.xml', MyHandler()), memory peaked at 12M. Feeding the file to the parser with parser.feed(chunk), memory peaked at 10M. I then doubled the size, for an 800M file, re-ran it both ways, and the peak memory usage didn't change, ~10M. The SAX parser seems very efficient.
I ran this script against your sample XML to create some really big text nodes, 400M each.
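In sketch form (the filler size per node and the file names here are just placeholders):

    # Blow up the sample XML by replacing each "enormous string here"
    # placeholder with a large block of filler text.
    filler_mb = 200                              # size per huge_thing node
    filler = 'x' * (filler_mb * 1024 * 1024)

    with open('sample.xml') as f:
        small = f.read()

    with open('big.xml', 'w') as f:
        f.write(small.replace('enormous string here', filler))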
Here's big.xml's size in MB:
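A quick way to check, in Python:

    import os
    print(os.path.getsize('big.xml') / 1024 / 1024)   # just under 400 for the first test file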
Here's my SAX ContentHandler, which only handles the character data if the path to the data's parent ends in thing_*/a (which, according to your sample, disqualifies huge_thing). BTW, much appreciation to l4mpi for this answer, showing how to buffer the character data you do want:
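A sketch of that handler (the exact path bookkeeping and output format are incidental):

    import xml.sax

    class MyHandler(xml.sax.ContentHandler):
        def __init__(self):
            super().__init__()
            self.path = []     # stack of currently open element names
            self.chars = []    # buffered character data for the current thing_*/a

        def startElement(self, name, attrs):
            self.path.append(name)

        def characters(self, content):
            # Buffer character data only when the path ends in thing_*/a;
            # huge_thing sits under subgroup, so its enormous text is never kept.
            if len(self.path) >= 2 and self.path[-2].startswith('thing_') and self.path[-1] == 'a':
                self.chars.append(content)

        def endElement(self, name):
            # Report each thing_* along with whatever data was buffered for it
            # (thing_3 reports nothing, since its text lives under subgroup).
            if name.startswith('thing_'):
                print(name, ''.join(self.chars))
                self.chars = []
            self.path.pop()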
For both the whole-file parse method and the chunked reader, I get the same output.
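Per Parent, with the handler sketched above, that looks roughly like:

    thing_1 I need this
    thing_2 I need this
    thing_3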
It's printing thing_3 because of my simple logic, but the data in subgroup/huge_thing is ignored.

Here's how I call the handler with the straightforward parse() method:
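A minimal version of that call:

    import xml.sax

    # Let the SAX parser read the whole file itself; the handler decides what to keep.
    xml.sax.parse('big.xml', MyHandler())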
When I run that with Unix/BSD time, the peak memory is the ~12M figure quoted above.

Here's how I call the handler with the more complex chunked reader, using a 4K chunk size:
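A sketch of that reader (same MyHandler, 4096-byte chunks):

    import xml.sax

    # Feed the file to the parser in 4K chunks instead of handing it the whole file.
    parser = xml.sax.make_parser()
    parser.setContentHandler(MyHandler())

    with open('big.xml', 'rb') as f:
        while True:
            chunk = f.read(4096)
            if not chunk:
                break
            parser.feed(chunk)

    parser.close()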
Even with a 512B chunk size, it doesn't get below 10M, but the runtime doubled.
I'm curious to see what kind of performance you're getting.