Possible Memory Leak in Parsing an XML File?

I have a long-running script which parses a large XML file (~9 GB) and inserts data into a database in chunks. Here is what that looks like:

import gc
import logging

import lxml.etree as ET

def __get_elements1(self):
    # tag= makes iterparse yield events only for <Elements1> elements;
    # "start" events are requested but only "end" is used below.
    context = ET.iterparse(tmp_folder + file_path, events=("start", "end"), tag='Elements1')

    elements = []

    try:
        for event, element in context:
            if event == "end":
                elements_result = tuple(child.text for child in element)
                elements.append(elements_result)

                # Discard the processed element and any fully parsed
                # preceding siblings so the in-memory tree stays small.
                element.clear()
                while element.getprevious() is not None:
                    del element.getparent()[0]

                if len(elements) == self.chunk_size:
                    # insert data to database
                    elements = []  # then reset the chunk

        if len(elements) > 0:
            # insert data to database
            pass
    except Exception as e:
        logging.error(e)

    del context
    del elements

    gc.collect()

Here, I am looking for elements with a specific tag and parsing them. I have a few other functions that parse different tags in the same file; the code is pretty much the same, with very slight modifications (a parameterized sketch follows below). I actually want to run these in parallel, but even when I run them sequentially, I am faced with a problem.
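
For reference, a minimal sketch of how those near-identical functions could collapse into one helper parameterized by tag; __get_elements_by_tag is a hypothetical name, and the body simply mirrors the code above:

def __get_elements_by_tag(self, tag):
    # Hypothetical generalization of __get_elements1, __get_elements2, ...:
    # the parsing logic is identical, only the iterparse tag filter varies.
    context = ET.iterparse(tmp_folder + file_path, events=("end",), tag=tag)
    elements = []
    for event, element in context:
        elements.append(tuple(child.text for child in element))
        element.clear()
        while element.getprevious() is not None:
            del element.getparent()[0]
        if len(elements) == self.chunk_size:
            # insert data to database
            elements = []  # then reset the chunk
    if len(elements) > 0:
        # insert data to database
        pass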

My first function runs fine without consuming many resources; however, when I get to my second function, memory consumption shoots up and my application fails. This does not make sense, especially because the elements with the most data are parsed by my first function. My second function looks something like this:

def __get_elements2(self):
    context = ET.iterparse(tmp_folder + file_path, events=("start", "end"), tag='Elements2')

    elements = []

    try:
        for event, element in context:
            if event == "end":
                elements_result = tuple(child.text for child in element)
                elements.append(elements_result)

                element.clear()
                while element.getprevious() is not None:
                    del element.getparent()[0]

                if len(elements) == self.chunk_size:
                    # insert data to database
                    elements = []  # then reset the chunk

        if len(elements) > 0:
            # insert data to database
            pass
    except Exception as e:
        logging.error(e)

    del context
    del elements

    gc.collect()

Notice that the tag given to the iterparse call above is different.

What could be the cause of this? Is it a possible memory leak? I cannot think of anything else. How do I resolve this?

1 answer

Answer by Smopuim_81:

I had the same problem myself. I studied the write-up here: https://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree/.

Apparently your problem is the uneven distribution of the search tag in the XML file, or its absence. Because iterparse is filtered by tag, the loop body only runs when a matching element ends; until then, element.clear() is never called, so the tree that iterparse builds in the background keeps growing and eventually exhausts memory.
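
As a concrete illustration of that claim (the file layout below is hypothetical, not from the original post): suppose the matching elements only appear near the end of the file.

# Hypothetical layout: <Elements1> records fill most of the 9 GB file,
# while <Elements2> appears only near the end (or not at all):
#
#   <Root>
#     <Elements1>...</Elements1>   <- millions of these come first
#     ...
#     <Elements2>...</Elements2>   <- only near the very end
#   </Root>
#
# With a tag filter, the loop body is the only place cleanup happens:
for _, element in ET.iterparse(tmp_folder + file_path, events=("end",), tag='Elements2'):
    # Not reached until the first <Elements2> closes, so every preceding
    # <Elements1> subtree accumulates in the partially built tree.
    element.clear()
    while element.getprevious() is not None:
        del element.getparent()[0]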

This is what the article says about iterparse: "The loop iterates over iterparse events, detecting "end" events for the location tag, looking for the desired value. The call to elem.clear() is key here - iterparse still builds a tree, doing it on the fly. Clearing the element effectively discards the tree [7], freeing the allocated memory."
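
Building on that, a minimal sketch of one possible workaround, assuming the target elements can sit anywhere below the root (iter_chunks, path, wanted_tag, and chunk_size are illustrative names, not from the original post): iterate over all "end" events so that cleanup also runs in stretches of the file where the search tag never appears, and discard every completed top-level subtree whether it matched or not.

import lxml.etree as ET

def iter_chunks(path, wanted_tag, chunk_size):
    # Yield tuples of child texts for each <wanted_tag> element, in chunks,
    # while keeping the partially built tree bounded.
    chunk = []
    for _, element in ET.iterparse(path, events=("end",)):
        if element.tag == wanted_tag:
            chunk.append(tuple(child.text for child in element))
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        parent = element.getparent()
        if parent is not None and parent.getparent() is None:
            # element is a direct child of the root, so any <wanted_tag>
            # inside it has already been handled above; discard it and any
            # earlier siblings, matching or not.
            element.clear()
            while element.getprevious() is not None:
                del parent[0]
    if chunk:
        yield chunk

Each per-tag function then reduces to looping over iter_chunks(...) and inserting every yielded chunk into the database, which should keep memory flat regardless of where (or whether) the tag occurs in the file.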