I am reading an XML file and converting it to a DataFrame using xmltodict and pandas.
This is what one of the elements in the file looks like:
<net>
<ref>https://whois.arin.net/rest/v1/net/NET-66-125-37-120-1</ref>
<endAddress>66.125.37.127</endAddress>
<handle>NET-66-125-37-120-1</handle>
<name>SBC066125037120020307</name>
<netBlocks>
<netBlock>
<cidrLenth>29</cidrLenth>
<endAddress>066.125.037.127</endAddress>
<type>S</type>
<startAddress>066.125.037.120</startAddress>
</netBlock>
</netBlocks>
<pocLinks/>
<orgHandle>C00285134</orgHandle>
<parentNetHandle>NET-66-120-0-0-1</parentNetHandle>
<registrationDate>2002-03-08T00:00:00-05:00</registrationDate>
<startAddress>66.125.37.120</startAddress>
<updateDate>2002-03-08T07:56:59-05:00</updateDate>
<version>4</version>
</net>
Since a large number of records like this are pulled in through an API, the last few <net> objects in the file are sometimes only partially downloaded.
For example, a tag may be missing its closing tag.
This is what I wrote to parse the XML:
import xmltodict
with open('/Users/dgoswami/Downloads/net.xml', 'r') as f:
    xml_data = f.read()  # Read data
# Map the ARIN namespace to None so the parsed keys carry no prefix
xml_data = xmltodict.parse(xml_data,
                           process_namespaces=True,
                           namespaces={'http://www.arin.net/bulkwhois/core/v1': None})
When that happens, I get an error like this:
no element found: line 30574438, column 37
I want to be able to parse up to the last valid <net> element.
How can that be done?
You may need to fix your XML beforehand; xmltodict has no ability to do that for you. You can leverage lxml, as described in "Python xml - handle unclosed token", to repair the XML, and then pass the repaired XML to xmltodict.
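If the goal is only to keep the records that arrived completely, one possible alternative is to stream the file with the standard library and stop at the first parse error. This is a minimal sketch, not the lxml recovery approach referenced above: it uses `xml.etree.ElementTree.iterparse`, and the helper names (`parse_complete_nets`, `_to_dict`, `_local`) as well as the `<root>` wrapper in the sample are hypothetical. It strips the namespace from tag names, which roughly mirrors mapping the ARIN namespace to `None` in xmltodict, and it assumes element attributes can be ignored.

```python
import io
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit('}', 1)[-1]


def _to_dict(elem):
    """Convert an element to nested dicts; repeated child tags become lists."""
    children = list(elem)
    if not children:
        return elem.text  # leaf: text content, or None for an empty element
    out = {}
    for child in children:
        key, value = _local(child.tag), _to_dict(child)
        if key in out:
            if not isinstance(out[key], list):
                out[key] = [out[key]]
            out[key].append(value)
        else:
            out[key] = value
    return out


def parse_complete_nets(source, record_tag='net'):
    """Return every fully closed record seen before the parser fails.

    `source` is a file path or a binary file object.
    """
    records = []
    try:
        # An 'end' event fires only once an element's closing tag was read,
        # so a truncated trailing record never reaches this loop body.
        for _event, elem in ET.iterparse(source, events=('end',)):
            if _local(elem.tag) == record_tag:
                records.append(_to_dict(elem))
                elem.clear()  # drop the finished record's children
    except ET.ParseError:
        pass  # truncated or malformed tail: keep what was collected so far
    return records


# Hypothetical sample: the third record is cut off in the middle of a tag.
sample = (b"<root><net><handle>A</handle></net>"
          b"<net><handle>B</handle></net><net><handle>C</han")
print(parse_complete_nets(io.BytesIO(sample)))
```

Each returned record is a plain dict, so something like `pandas.json_normalize(records)` should turn the list into a DataFrame. Unlike a recovering parser, this discards the incomplete last record instead of closing its tags automatically, which seems closer to "parse up to the last valid <net> element".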