I'm parsing this XML into a pandas DataFrame:

<list>
    <item>
        <name>name1</name>
        <category>cat1</category>
    </item>
    <item>
        <name>name2</name>
        <category>cat2</category>
    </item>
    <item>
        <name>name3</name>
        <category>cat1</category>
        <category>cat2</category>
    </item>
</list>

But I need every category to become its own column holding a boolean flag, like this:

name     cat1   cat2
name1    1      0
name2    0      1
name3    1      1

I searched for an example of this, but I only found examples that parse the XML into a DataFrame whose columns are the nodes.

I have loaded the XML into a dict with this code:

import xml.etree.ElementTree as ET
import xmltodict
import pandas as pd

tree = ET.parse('example.xml')
xml_data = tree.getroot()

xmlstr = ET.tostring(xml_data, method='xml')

data_dict = dict(xmltodict.parse(xmlstr))

data_dict

This produces the following dict:

{'list': OrderedDict([('item',
               [OrderedDict([('name', 'name1'), ('category', 'cat1')]),
                OrderedDict([('name', 'name2'), ('category', 'cat2')]),
                OrderedDict([('name', 'name3'), ('category', ['cat1', 'cat2'])])])])}

How can I build the pandas DataFrame from the XML or from the dict?

1 Answer

Vivek Kalyanarangan

Loop over the item elements and set a flag for each category they contain:

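# Reuses `tree` and `pd` from the question's code
data = []
for elem in tree.findall('./item'):
    # One row (dict) per <item>, starting with its name
    tag = {}
    name = elem.find('./name')
    tag["name"] = name.text

    # Flag each <category> of this item with 1
    cats = elem.findall('./category')
    for cat in cats:
        tag[cat.text] = 1
    data.append(tag)
# Categories an item lacks come out as NaN; replace them with 0
print(pd.DataFrame(data).fillna(0))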

Output (the flags print as floats because the missing entries were NaN before fillna):

   cat1  cat2   name
0   1.0   0.0  name1
1   0.0   1.0  name2
2   1.0   1.0  name3
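
If you want integer flags with name as the first column, as in the table the question asks for, the following is a minimal self-contained sketch rather than a definitive implementation. The names XML_TEXT, _to_frame, flags_from_xml and flags_from_dict are made up for this illustration. It embeds the sample XML as a string instead of reading example.xml, and the dict variant assumes the nested structure shown in the question, where category is a plain string for one category and a list for several.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Sample document from the question, embedded so the sketch runs on its own.
XML_TEXT = """<list>
    <item><name>name1</name><category>cat1</category></item>
    <item><name>name2</name><category>cat2</category></item>
    <item><name>name3</name><category>cat1</category><category>cat2</category></item>
</list>"""


def _to_frame(rows):
    """Turn row dicts into a frame with 'name' first and integer 0/1 flags."""
    df = pd.DataFrame(rows).fillna(0)
    flag_cols = sorted(c for c in df.columns if c != "name")
    # fillna leaves floats behind, so cast the flag columns back to int.
    df = df.astype({c: int for c in flag_cols})
    return df[["name"] + flag_cols]


def flags_from_xml(root):
    """Build the frame from an ElementTree root element."""
    rows = []
    for item in root.findall("item"):
        row = {"name": item.findtext("name")}
        for cat in item.findall("category"):
            row[cat.text] = 1
        rows.append(row)
    return _to_frame(rows)


def flags_from_dict(data_dict):
    """Build the frame from a dict shaped like the one in the question."""
    items = data_dict["list"]["item"]
    if not isinstance(items, list):  # a single <item> is not wrapped in a list
        items = [items]
    rows = []
    for item in items:
        cats = item.get("category", [])
        if isinstance(cats, str):  # a single <category> is a bare string
            cats = [cats]
        row = {"name": item["name"]}
        row.update({cat: 1 for cat in cats})
        rows.append(row)
    return _to_frame(rows)


df = flags_from_xml(ET.fromstring(XML_TEXT))
print(df)
```

Both functions should give the same frame for the sample data. For a file on disk, pass ET.parse('example.xml').getroot() to flags_from_xml instead of the embedded string.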