I have an XML file that is not well formed. I get an error when I try to read it:

import xml.etree.ElementTree as ET
ET.parse(r'my.xml')

I get the following error:

ParseError: not well-formed (invalid token): line 2034, column 317

So I used BeautifulSoup to read the XML with the code below:

from bs4 import BeautifulSoup

with open(r'my.xml') as fp:
    soup = BeautifulSoup(fp, 'xml')

If I print soup, it looks like this:

        <Placemark> 
<name>India </name> 
    <description>Country</description> 
    <styleUrl>#icon-962-B29189</styleUrl> 
    </Placemark>
        <Placemark> 
<name>USA</name>   
    <styleUrl>#icon-962-B29189</styleUrl> 
    </Placemark>            
    <Placemark>   
    <description>City</description> 
    <styleUrl>#icon-962-B29189</styleUrl> 
    </Placemark>

I have more than 100 Placemark tags in total. I want to capture the name and description of each Placemark and build a DataFrame with the corresponding columns.

My code for this is:

name_tag=[x.text.strip() for x in soup.findAll('name')]
description_tag =[x.text.strip() for x in soup.findAll('description')]

The problem is that some Placemark tags have no name tag or no description tag at all. Because the two lists are built independently, I cannot tell which name goes with which description, and the missing tags cause them to be misaligned.

Expected Output Dataframe:

Name      Description
India     Country
USA
           City

Is there any way I can achieve this?

1 Answer

DeepSpace On Best Solutions

Since you are searching for name and description tags separately, you are losing track of which name belongs to which description.

Instead, you should parse each placemark tag on its own, and handle the case of missing name and description tags for each placemark tag.

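import pandas as pd

data = []

# The 'xml' parser keeps tag names case-sensitive, so match 'Placemark' exactly.
for placemark in soup.find_all('Placemark'):
    try:
        name = placemark.find('name').text.strip()
    except AttributeError:  # find() returned None: no <name> tag here
        name = None
    try:
        description = placemark.find('description').text.strip()
    except AttributeError:  # find() returned None: no <description> tag here
        description = None

    # One row per Placemark keeps each name paired with its own description.
    data.append((name, description))

df = pd.DataFrame(data, columns=['Name', 'Description'])
print(df)
#       Name    Description
#  0   India        Country
#  1     USA           None
#  2    None           City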
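
The same per-Placemark pairing can also be sketched with only the standard library, which makes the logic easy to check in isolation. This is an illustration under assumptions, not a replacement for the BeautifulSoup code above: `SAMPLE` is a made-up, well-formed fragment (the `Document` wrapper and the helper names `clean` and `placemark_rows` are hypothetical), whereas the file in the question is not well formed and would still need a lenient parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample mirroring the Placemark layout in the question.
SAMPLE = """<Document>
  <Placemark>
    <name>India </name>
    <description>Country</description>
  </Placemark>
  <Placemark>
    <name>USA</name>
  </Placemark>
  <Placemark>
    <description>City</description>
  </Placemark>
</Document>"""


def clean(text):
    """Strip surrounding whitespace; pass None through for a missing tag."""
    return text.strip() if text is not None else None


def placemark_rows(xml_text):
    """Return one (name, description) tuple per Placemark element."""
    root = ET.fromstring(xml_text)
    rows = []
    for placemark in root.iter("Placemark"):
        # findtext() returns None when the child tag is absent,
        # so no try/except is needed for missing tags.
        name = clean(placemark.findtext("name"))
        description = clean(placemark.findtext("description"))
        rows.append((name, description))
    return rows


rows = placemark_rows(SAMPLE)
print(rows)
# [('India', 'Country'), ('USA', None), (None, 'City')]
```

The resulting list of tuples has the same shape as `data` above, so it can be passed to `pd.DataFrame(rows, columns=['Name', 'Description'])` in the same way.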