Retrieve text from an XML-formatted string in Python

Question

64 views Asked by Eghbal At 30 October 2023 at 13:33

I have a list of strings that follow a relatively similar format. Here are two examples:

text_1 = '<abstract lang="en" source="my_source" format="org"><p id="A-0001" num="none">My text is here </p><img file="Uxx.md" /></abstract>'

text_2 = '<abstract lang="db" source="abs" format="hrw" abstract-source="my_source"><p>Another text.</p></abstract>'

I can't vouch for other variations since it's an extensive collection of strings, but the format is evidently XML, and my sole objective is to retrieve the text from each of these strings. What do you suggest?

There are 3 answers

**Michael Ruth** · Answer 1 · 2023-10-30T13:46:46+00:00

Use the xml package, specifically xml.etree.ElementTree. It's part of the standard library and easy to use, and its documentation includes a nice tutorial.

import xml.etree.ElementTree as ET
text_1 = '<abstract lang="en" source="my_source" format="org"><p id="A-0001" num="none">My text is here </p><img file="Uxx.md" /></abstract>'
root = ET.fromstring(text_1)

You can access the data:

print(root.tag, root.attrib)
for child in root:
    print(child.tag, child.attrib)

abstract {'lang': 'en', 'source': 'my_source', 'format': 'org'}
p {'id': 'A-0001', 'num': 'none'}
img {'file': 'Uxx.md'}

edit: To view text of the <p> element:

root[0].text

'My text is here '

You can also get information about the members of root and child (both are Elements) with help().

help(root)

class Element(builtins.object)
 |  Methods defined here:
 |
 |  __copy__(self, /)
 |
 |  __deepcopy__(self, memo, /)
 |
 ...
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  attrib
 |      A dictionary containing the element's attributes
 |
 |  tag
 |      A string identifying what kind of data this element represents
 |
 |  tail
 |      A string of text directly after the end tag, or None
 |
 |  text
 |      A string of text directly after the start tag, or None
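
Since the strings may vary in structure, collecting every text node can be more robust than indexing a specific child with root[0]. The following is a minimal sketch, assuming the goal is all text inside each string with surrounding whitespace trimmed. `extract_text` is a hypothetical helper name, and returning None for input that does not parse is an assumption about the desired behaviour, not something the question specifies.

```python
import xml.etree.ElementTree as ET


def extract_text(xml_string):
    """Return all text inside the XML string, or None if it does not parse."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError:
        # Assumption: malformed strings are reported as None rather than raising
        return None
    # itertext() walks the element and its descendants, yielding their text
    return "".join(root.itertext()).strip()


text_1 = '<abstract lang="en" source="my_source" format="org"><p id="A-0001" num="none">My text is here </p><img file="Uxx.md" /></abstract>'
text_2 = '<abstract lang="db" source="abs" format="hrw" abstract-source="my_source"><p>Another text.</p></abstract>'
broken = "<abstract><p>broken"  # hypothetical malformed input

texts = [extract_text(t) for t in (text_1, text_2, broken)]
print(texts)  # ['My text is here', 'Another text.', None]
```

Because itertext() does not depend on tag names, this should keep working if some strings wrap the text in elements other than <p>.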

**Timeless** · Answer 2 · 2023-10-30T14:31:50+00:00

Your expected output isn't clear, but in any case you could use findtext from ElementTree:

import xml.etree.ElementTree as ET

xmls = [text_1, text_2]

texts = [ET.fromstring(x).findtext("p").strip() for x in xmls]

Alternatively, using BeautifulSoup:

# pip install beautifulsoup4 lxml  (the "lxml" parser needs the lxml package)
from bs4 import BeautifulSoup

texts = [BeautifulSoup(x, "lxml").text.strip() for x in xmls]

Output:

print(texts) # ['My text is here', 'Another text.']
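
One caveat with the findtext approach: findtext("p") only looks at direct children and returns None when there is no match, so calling .strip() on the result would raise an AttributeError for a string without a <p> child. A hedged sketch of a more tolerant variant follows. The second and third sample strings are hypothetical variations invented for illustration, and returning an empty string for a missing <p> is an assumption.

```python
import xml.etree.ElementTree as ET

xmls = [
    '<abstract lang="db" source="abs"><p>Another text.</p></abstract>',
    '<abstract><section><p> Nested text </p></section></abstract>',  # hypothetical: nested <p>
    '<abstract><img file="Uxx.md" /></abstract>',                    # hypothetical: no <p> at all
]

# ".//p" searches all descendants instead of only direct children;
# default="" means a missing <p> yields an empty string instead of None
texts = [ET.fromstring(x).findtext(".//p", default="").strip() for x in xmls]
print(texts)  # ['Another text.', 'Nested text', '']
```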

**Karree** · Answer 3 · 2023-10-30T14:42:58+00:00

You can use the xmltodict module:

pip install xmltodict

Then use it to convert each XML-formatted string to a dictionary:

import xmltodict
dicts = [xmltodict.parse(s) for s in xml_strings]  # parse() takes one string at a time
