Convert XML to dataframe in Python; stumbling on why xml is not parsed. Encoding issue?

309 views Asked by At

Would really appreciate some help. I have been busy more than two days, surfing around to understand why I can't access this xml file to put its contents in a df. My goal is to put the worksheets in the xml file in pandas dataframes. I know there are several posts adressing this topic, but I seem to be facing some errors that make it complex.

The data is a download from a well-known ETF provider. It is downloaded as a ".xls" but it actually is a 'xml' format; clearly an Excel xlm. So an easy pd.read_excel won't work. That's where I was forced to get into xml formats and libraries like LXML and xml.etree.ElementTree. I have worked som time with BS4 though.

The xml download doesn't specify any encoding and when I try to parse it, it returns errors. So I have dabbled around with chardet and et.XMLParser to discover it's encoding and have it 'hard set' in the parser. But to no avail. When parsing it returns:

'lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1'

Instead of parsing it directly (see xml_tree1 below), I tried to read the xml with a fromstring, and I noticed some gibberish. So I replaced it with nothing:

xml_str = xml_file.read().replace('', '')

Now I have clean xml code, but still can't find any childeren in my root. In fact it seems to be empty and not parsed at all. My knowledge is letting me down. Could somebody push me to the right direction? My problem is in the early stage; I can't seem to parse the file and the underlying format. A second problem is that I need to parse the ss:table on the respective worksheets in the document. A little further in the code I have jotted down some examples for me to work with. Any comments are very welcom.

These are the posts that helped me the most;

How do I pick up text values of child nodes when parsing XML with ElementTree

Reading a spreadsheet like .xml with ElementTree

The source of the xml can be found here (Dutch version). You can download in the top-right corner.

https://www.ishares.com/nl/professionele-belegger/nl/producten/251882/ishares-msci-world-ucits-etf-acc-fund

Snippet of the xml:

<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<ss:Styles>
<ss:Style ss:ID="Default">
<ss:Alignment ss:Horizontal="Left"/>
</ss:Style>
<ss:Style ss:ID="wraptext">
<ss:Alignment ss:Horizontal="Left" ss:WrapText="1"/>
<ss:Font ss:Italic="1"/>
</ss:Style>
<ss:Style ss:ID="disclaimer">
<ss:Alignment ss:Vertical="Top" ss:WrapText="1"/>
</ss:Style>
<ss:Style ss:ID="DefaultHyperlink">
<ss:Alignment ss:Vertical="Center" ss:WrapText="1"/>
<ss:Font ss:Color="#0000FF" ss:Underline="Single" />
</ss:Style>
<ss:Style ss:ID="headerstyle">
<ss:Font ss:Bold="1" />
</ss:Style>
<ss:Style ss:ID="Date">
<ss:NumberFormat ss:Format="dd\-mmm\-yyyy"/>
</ss:Style>
<ss:Style ss:ID="Left">
<ss:Alignment ss:Horizontal="Left"/>
<ss:NumberFormat ss:Format="Standard"/>
</ss:Style>
<ss:Style ss:ID="Right">
<ss:Alignment ss:Horizontal="Right"/>
<ss:NumberFormat ss:Format="Standard"/>
</ss:Style>
</ss:Styles>
<ss:Worksheet ss:Name="Overzicht">
<ss:Table>
<ss:Row >
<ss:Cell ss:StyleID="headerstyle">
<ss:Data ss:Type="String">iShares Core MSCI World UCITS ETF</ss:Data>
</ss:Cell>
</ss:Row>''

My code so far:

import lxml.etree as et
from lxml import objectify
import io
import chardet

with open('C:\\MSCI.xml') as xml_file:
    parser = et.XMLParser(encoding="iso-8859-5", recover=True)
    xml_str = xml_file.read().replace('', '')  # !!! IShares xml has error in first row !!!

    xml_tree = et.ElementTree(et.fromstring(xml_str, parser=parser))
    root = xml_tree.getroot()
    xml_tree0 = et.iterparse(xml_file, encoding='iso-8859-1')  # Nothing
    xml_tree1 = et.parse(xml_file, parser=parser)  # File seems empty, but is not
    xml_tree2 = objectify.parse(io.StringIO(xml_str))  # This is the same as fromstring

    #################################################
    ### Trying to capture encoding and replace it ###
    #################################################
    detector = chardet.UniversalDetector()
    for line in xml_file.readlines():
        detector.feed(line)  # This doesn't seem to work
        if detector.done: break
    detector.close()
    print(detector.result)

    xml_enc = detector.result['encoding']  # The result seems always to be None
    if xml_enc != 'utf-8':
        # content = xml_str(xml_enc, 'replace').encode('utf-8')  # Don't know how to replace encoding
        pass
    xml_clean = et.fromstring(xml_str, parser=parser)

    # The detector function above and Encryption replacer does not work :(

    #############################################################################
    ### Some code below is how I'd guess to proceed, after I have a good tree ###
    #############################################################################

    # ns = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}
    # https://stackoverflow.com/questions/59945728/how-do-i-pick-up-text-values-of-child-nodes-when-parsing-xml-with-elementtree
    # https://stackoverflow.com/questions/54107550/reading-a-spreadsheet-like-xml-with-elementtree

    ### Something like this to iterate through the children
    # for appt in xml_tree.getchildren():
    #     for elem in appt.getchildren():
    #         if not elem.text:
    #             text = "None"
    #         else:
    #             text = elem.text
    #         print(elem.tag + " => " + text)

    ### Or something like this to iterate to take into account namespaces
    # for ws in xml.findall('ss:Worksheet', namespaces):
    #     for table in ws.findall('ss:Row', namespaces):
    #         for c in table.findall('ss:Cell', namespaces):
    #             data = c.find('ss:Data', namespaces)
    #             if data.text is None:
    #                 text = []
    #                 data = data.findall('html:Font', namespaces)
    #                 for element in data:
    #                     text.append(element.text)
    #
    #                 data_text = ''.join(text)
    #                 print(data_text)
    #             else:
    #                 print(data.text)

    ### Or something like this to iterate to take into account xpaths and namespaces
    # L = []
    # ws = xml.xpath('/ss:Workbook/ss:Worksheet', namespaces=namespaces)
    # if len(ws) > 0:
    #     tables = ws[0].xpath('./ss:Table', namespaces=namespaces)
    #     if len(tables) > 0:
    #         rows = tables[0].xpath('./ss:Row', namespaces=namespaces)
    #         for row in rows:
    #             tmp = []
    #             cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)
    #             for cell in cells:
    #                 #                print(cell.text);
    #                 tmp.append(cell.text)
    #             L.append(tmp)
    # print(L)
1

There are 1 answers

0
Josh77 On

Well I eventually ended up with the code below.

Works for me, but still don't get it why I can't parse the file directly and needed to replace the gibberish in the string.

Thoughts?

Perhaps I can make someone else happy with the below. Took me wayyy too much time ;) :S

Cheers!

import lxml.etree as et
import io
import chardet
import pandas as pd

filepath = 'C:\\MSCI.xml'
namespace = '{urn:schemas-microsoft-com:office:spreadsheet}'
find_elem = 'Worksheet'
ws_name = 'Posities'

# Capture encoding
with open(filepath, 'rb') as f:
    data = f.read()
xml_enc = chardet.detect(data).get('encoding')
if xml_enc == 'UTF-8-SIG':
    xml_enc = xml_enc.replace('-SIG', '')

'''
##########################################################################
### Parse the xml file, iterate through it, append and build dataframe ###
##########################################################################
# https://stackoverflow.com/questions/10242237/lxml-etree-iterparse-error-typeerror-reading-file-objects-must-return-plain-st
# https://stackoverflow.com/questions/36804794/iterparse-large-xml-using-python
# https://riptutorial.com/python/example/25995/opening-and-reading-large-xml-files-using-iterparse--incremental-parsing-
# https://stackoverflow.com/questions/28253006/python-element-tree-iterparse-filter-nodes-and-children
# https://stackoverflow.com/questions/12792998/elementtree-iterparse-strategy
# https://stackoverflow.com/questions/7018326/lxml-iterparse-in-python-cant-handle-namespaces
# https://stackoverflow.com/questions/38790012/how-to-get-all-the-tags-in-an-xml-using-python
'''

with open(filepath) as xml_file:

    xml_str = xml_file.read().replace('', '')  # !!! IShares xml has error in first row !!!
    xml_byte = io.BytesIO(xml_str.encode(xml_enc))

    worksheet = []
    for event, elem in et.iterparse(xml_byte, recover=True, events=('start', 'end')):
        if elem.tag == et.QName(namespace + find_elem) and event == 'start':
            for name, value in elem.items():
                if value == ws_name:
                    for table in elem:
                        row_values = []
                        for row in table:
                            cell_values = []
                            for cells in row:
                                for data in cells:
                                    content = data.text
                                    cell_values.append(content)
                            row_values.append(cell_values)
                    worksheet.append(row_values)
    xml_df_concat = pd.concat([pd.DataFrame(worksheet[i]) for i in range(len(worksheet))], ignore_index=True)