Parsing invalid XML using xmltodict


I am reading an XML file and converting it to a DataFrame using xmltodict and pandas.

This is what one of the elements in the file looks like:

<net>
    <ref>https://whois.arin.net/rest/v1/net/NET-66-125-37-120-1</ref>
    <endAddress>66.125.37.127</endAddress>
    <handle>NET-66-125-37-120-1</handle>
    <name>SBC066125037120020307</name>
    <netBlocks>
        <netBlock>
            <cidrLenth>29</cidrLenth>
            <endAddress>066.125.037.127</endAddress>
            <type>S</type>
            <startAddress>066.125.037.120</startAddress>
        </netBlock>
    </netBlocks>
    <pocLinks/>
    <orgHandle>C00285134</orgHandle>
    <parentNetHandle>NET-66-120-0-0-1</parentNetHandle>
    <registrationDate>2002-03-08T00:00:00-05:00</registrationDate>
    <startAddress>66.125.37.120</startAddress>
    <updateDate>2002-03-08T07:56:59-05:00</updateDate>
    <version>4</version>
</net>

Since a large number of records like this are pulled in through an API, some <net> elements at the end of the file can be only partially downloaded, e.g. a tag that is missing its closing tag.

This is what I wrote to parse the XML:

import xmltodict

with open('/Users/dgoswami/Downloads/net.xml', 'r') as f:  # Read data
    xml_data = f.read()
xml_data = xmltodict.parse(xml_data,
                           process_namespaces=True,
                           namespaces={'http://www.arin.net/bulkwhois/core/v1': None})

When that happens, I get an error like this:

no element found: line 30574438, column 37

I want to be able to parse up to the last valid <net> element. How can that be done?

1 answer

Patrick Artner (best answer)

You may need to fix your XML beforehand; xmltodict cannot do that for you.
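If the damage is always a cut-off tail, one plain way to do that fix is to drop everything after the last closing </net> tag and close the root element again. This is only a sketch under assumptions: the helper name truncate_to_last_complete is made up for illustration, the root tag name "start" is a placeholder (the real root element is not shown in the question), the <net> elements are assumed to be direct children of the root, and the tags are assumed to carry no namespace prefix.

```python
def truncate_to_last_complete(xml_text, record_tag="net", root_tag="start"):
    """Cut the text after the last closing record tag and re-close the root.

    record_tag and root_tag are assumptions; adjust them to the real file.
    """
    closing = f"</{record_tag}>"
    end = xml_text.rfind(closing)
    if end == -1:
        raise ValueError(f"no complete <{record_tag}> element found")
    # Keep everything up to and including the last complete record,
    # then append the closing root tag that the truncation removed.
    return xml_text[: end + len(closing)] + f"</{root_tag}>"


broken = "<start><net><handle>A</handle></net><net><handle>B</hand"
print(truncate_to_last_complete(broken))
```

The result should be well-formed and can be passed to xmltodict.parse as usual. Note that this holds the whole file in memory as one string, just like the original read() call.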

You can use lxml, as described in the question "Python xml - handle unclosed token", to fix your XML:

from lxml import etree

def fixme(x):
    # recover=True lets lxml parse through broken XML and close any open tags
    p = etree.fromstring(x, parser=etree.XMLParser(recover=True))
    return etree.tostring(p).decode("utf8")


fixed = fixme("""<start><net>
<endAddress>66.125.37.127</endAddress>
<handle>NET-66-125-37-120-1</handle>
</net><net>
<endAddress>66.125.37.227</endAddress>
<handle>NET-66-125-37-220-1</handle>
""")

Then parse the fixed XML:

import xmltodict

print(xmltodict.parse(fixed))

which gives:

OrderedDict([('start', 
    OrderedDict([('net', [
        OrderedDict([('endAddress', '66.125.37.127'), ('handle', 'NET-66-125-37-120-1')]), 
        OrderedDict([('endAddress', '66.125.37.227'), ('handle', 'NET-66-125-37-220-1')])
        ])
    ]))
])
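Note that the recovering parser keeps whatever it could read of the last, partially downloaded <net> element. If the goal is to stop at the last complete <net> and discard the rest, a streaming pass with the standard library's xml.etree.ElementTree.iterparse may be an alternative. The following is a minimal sketch, not a drop-in replacement: the function name complete_records is hypothetical, it only collects the text of direct child elements (nested blocks such as <netBlocks> would need extra handling), and it strips any "{namespace}" prefix from tag names.

```python
import io
import xml.etree.ElementTree as ET


def complete_records(source, record_tag="net"):
    """Collect every fully closed record element; stop at the first parse error."""
    records = []
    try:
        # "end" events fire only once an element's closing tag has been read.
        for _event, elem in ET.iterparse(source, events=("end",)):
            # Drop a "{namespace}" prefix, if present, before comparing names.
            if elem.tag.rsplit("}", 1)[-1] == record_tag:
                records.append(
                    {child.tag.rsplit("}", 1)[-1]: child.text for child in elem}
                )
                elem.clear()  # free the children of records already handled
    except ET.ParseError:
        # Truncated or malformed tail: keep what was collected so far.
        pass
    return records


truncated = io.BytesIO(
    b"<start><net>"
    b"<endAddress>66.125.37.127</endAddress>"
    b"<handle>NET-66-125-37-120-1</handle>"
    b"</net><net>"
    b"<endAddress>66.125.37.227</endAddress>"
    b"<handle>NET-66-125"
)
print(complete_records(truncated))
```

Because the file is read in chunks, this avoids loading all of it into one string. To keep the xmltodict output shape, each complete element could instead be serialized with ET.tostring(elem) and passed to xmltodict.parse before it is cleared.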