I am trying to extract some data from quite a large XML file (the dblp records, for those familiar with it). For this purpose I am using a slightly modified version of a fast parser I found on GitHub, which works like this:


from lxml import etree


def fast_iter(context, func, *args, **kwargs):
    collaborations = [...]  # top-level element tags I need
    tag = None  # initialised so the comparison below never sees an unbound name
    #Bunch of other local variables

    #read the file incrementally, element by element
    for event, elem in context:

        if elem.tag in collaborations and event == "start":
            #Print to console that we are entering a top level element I might need
            tag = elem.tag
            print(elem.tag, event)

        #Additional processing only on the tags I need,
        #e.g. func(elem, *args, **kwargs)

        if elem.tag == tag and event == "end":
            #Reset local variables
            #Print to console that the element is done
            print(elem.tag, event)

        #Regardless of the outcome of the processing,
        #free the memory from the last used element
        if event == "end":
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

    #clear chunks
    del context

#@func: process_element
#@param elem : parsed element
#@param fout : file object to write to
#@desc: handler that writes the element's content to file
def process_element(elem, fout):
    #Write to file if I hit a top level element
    #That I actually needed
    print("writing ...", elem.tag)  # elem is an Element, so don't concatenate it to a string
    print(etree.tostring(elem), file=fout)  # serialise; print(elem, ...) would only write its repr

if __name__ == "__main__":
    with open('parsed_data.txt', 'w') as fout:
        context = etree.iterparse('dblp.xml', load_dtd=True, html=True,
                                  events=["start", "end"])
        fast_iter(context, process_element, fout)

As I mentioned, this is a big XML file (2.3 GB, if I remember correctly), so I expect execution to take a while. However, I noticed that after the script starts working, execution gets progressively slower and slower (as indicated by the rate at which the start/end prints reach the console), even though the function is supposed to free memory on every cycle rather than hog it.

What am I missing here? Is this expected behaviour because I'm printing so much to the console? If so, how can I get some indication that the script is making progress without slowing down the actual execution?
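For what it's worth, one thing I considered is throttling the progress output so that only every Nth record produces a line on the console. A minimal sketch of what I mean (using the stdlib parser so it's self-contained; the tag name `"article"` and the threshold are just placeholders, not my real values):

```python
import io
import xml.etree.ElementTree as ET

def count_records(source, report_every=100000):
    """Count top-level records, reporting progress only every
    `report_every` records instead of once per element."""
    seen = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "article":  # placeholder record tag
            seen += 1
            if seen % report_every == 0:
                print(f"processed {seen} records")
            elem.clear()
    return seen

# tiny in-memory example
xml = b"<dblp>" + b"<article/>" * 5 + b"</dblp>"
print(count_records(io.BytesIO(xml), report_every=2))  # → 5
```

With a large `report_every` this would cut console I/O to a negligible amount, but I don't know whether that alone explains the slowdown.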

As an additional note, I want to point out that the tree built from the XML file is quite unusual; as the authors state:

"In general, our XML is a shallow but very long list of XML records. The root element has several million child elements, but usually no element is deeper than level three."

Is this my bottleneck? If so, how can I speed up execution? As far as I can see, every child element of the root node is removed from memory as soon as its closing tag is encountered, so there should be no problem during incremental processing of the file.
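For context, the memory-bounded iterparse pattern I believe I'm relying on boils down to something like this sketch (written against the stdlib parser so it runs anywhere; `getprevious()`/`getparent()` are lxml-only, so the stdlib equivalent is clearing the root, as noted in the comments):

```python
import io
import xml.etree.ElementTree as ET

def iter_records(source, record_tag):
    """Yield each completed record, then drop it from the tree so
    memory stays flat regardless of file size."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield elem
            # With lxml the idiom would be to delete preceding siblings:
            #   while elem.getprevious() is not None:
            #       del elem.getparent()[0]
            root.clear()  # stdlib equivalent: drop the root's children

xml = b"<dblp><article key='a'/><article key='b'/></dblp>"
keys = [e.get("key") for e in iter_records(io.BytesIO(xml), "article")]
print(keys)  # → ['a', 'b']
```

If that pattern is correct, memory usage should stay constant, which is why the progressive slowdown surprises me.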
