Retrieve text from an XML-formatted string in Python


I have a list of strings that follow a relatively similar format. Here are two examples:

text_1 = '<abstract lang="en" source="my_source" format="org"><p id="A-0001" num="none">My text is here </p><img file="Uxx.md" /></abstract>'

text_2 = '<abstract lang="db" source="abs" format="hrw" abstract-source="my_source"><p>Another text.</p></abstract>'

I can't vouch for other variations since it's an extensive collection of strings, but the format is clearly XML, and my sole objective is to retrieve the text from each of these strings. What do you suggest?


There are 3 answers

Michael Ruth

Use the xml package (xml.etree.ElementTree). It's part of the standard library and easy to use, and its documentation includes a nice tutorial.

import xml.etree.ElementTree as ET
text_1 = '<abstract lang="en" source="my_source" format="org"><p id="A-0001" num="none">My text is here </p><img file="Uxx.md" /></abstract>'
root = ET.fromstring(text_1)

You can access the data:

print(root.tag, root.attrib)
for child in root:
    print(child.tag, child.attrib)
abstract {'lang': 'en', 'source': 'my_source', 'format': 'org'}
p {'id': 'A-0001', 'num': 'none'}
img {'file': 'Uxx.md'}

Edit: to view the text of the <p> element:

root[0].text
'My text is here '
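Since the question says the strings may vary in structure, indexing root[0] could miss text held elsewhere in the tree. A minimal sketch, assuming the goal is all character data in each string regardless of which element holds it; the helper name extract_text is hypothetical:

```python
import xml.etree.ElementTree as ET

def extract_text(xml_string):
    """Return all inner text of the XML string, with whitespace collapsed."""
    root = ET.fromstring(xml_string)
    # itertext() walks the element and its descendants in document order
    # and yields every piece of inner text it finds
    joined = "".join(root.itertext())
    # split()/join() collapses runs of whitespace and trims both ends
    return " ".join(joined.split())

text_1 = '<abstract lang="en" source="my_source" format="org"><p id="A-0001" num="none">My text is here </p><img file="Uxx.md" /></abstract>'
text_2 = '<abstract lang="db" source="abs" format="hrw" abstract-source="my_source"><p>Another text.</p></abstract>'

print([extract_text(t) for t in (text_1, text_2)])
# ['My text is here', 'Another text.']
```

Collapsing whitespace is a design choice for readability; drop the split()/join() step if the original spacing matters.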

You can also get information about the members of root and child (both are Elements) with help().

help(root)
class Element(builtins.object)
 |  Methods defined here:
 |
 |  __copy__(self, /)
 |
 |  __deepcopy__(self, memo, /)
 |
 ...
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  attrib
 |      A dictionary containing the element's attributes
 |
 |  tag
 |      A string identifying what kind of data this element represents
 |
 |  tail
 |      A string of text directly after the end tag, or None
 |
 |  text
 |      A string of text directly after the start tag, or None
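
The difference between text and tail shows up when an element mixes text with child elements. A small illustration, using a made-up string that is not one of the question's examples:

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content string, for illustration only
p = ET.fromstring("<p>Hello <b>bold</b> world</p>")
b = p[0]  # the <b> child

print(repr(p.text))  # 'Hello '  (text directly after the <p> start tag)
print(repr(b.text))  # 'bold'    (text directly after the <b> start tag)
print(repr(b.tail))  # ' world'  (text directly after the </b> end tag)
```

Reading only p.text here would return 'Hello ' and silently lose the rest of the sentence.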

Timeless

Your expected output isn't clear, but in any case you might need findtext from xml.etree.ElementTree:

import xml.etree.ElementTree as ET

xmls = [text_1, text_2]

texts = [ET.fromstring(x).findtext("p").strip() for x in xmls]

Alternatively, using BeautifulSoup:

# pip install beautifulsoup4 lxml  (the "lxml" parser used below needs the lxml package)
from bs4 import BeautifulSoup

texts = [BeautifulSoup(x, "lxml").text.strip() for x in xmls]

Output:

print(texts) # ['My text is here', 'Another text.']
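
Note that findtext("p") returns None when the root has no direct <p> child, so calling .strip() on the result would raise an AttributeError for such a string. A hedged sketch that tolerates a missing or nested <p> and malformed input; safe_text is a hypothetical helper, the extra sample strings are made up, and returning an empty string in the failure cases is an assumption about the desired behavior:

```python
import xml.etree.ElementTree as ET

def safe_text(xml_string):
    """Return the stripped text of the first <p> in the tree, or ''."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError:
        # Malformed XML: mapped to an empty result here (an assumption)
        return ""
    # ".//p" searches all descendants; default="" avoids None if no <p> exists
    return root.findtext(".//p", default="").strip()

samples = [
    '<abstract><p>Another text.</p></abstract>',
    '<abstract><section><p> Nested </p></section></abstract>',
    '<abstract><img file="Uxx.md" /></abstract>',
    '<abstract><p>broken',
]
print([safe_text(x) for x in samples])
# ['Another text.', 'Nested', '', '']
```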

Karree

You can use the xmltodict module:

pip install xmltodict

Then use it to convert each XML-formatted string to a dictionary (parse takes one string at a time, not a list):

import xmltodict
data = xmltodict.parse(text_1)