Select XML node by XPath with attribute value containing an apostrophe

I'm trying to extract some data from a given XML file. To do that, I have to select specific nodes by their attribute values. My XML looks like this:

<?xml version="1.0" encoding="UTF-8" ?>
<svg ....>
    ....
    <g font-family="'BentonSans Medium'" font-size="12">
        <text>bla bla bla</text>
        ....
    </g>
    ....
</svg>

I've tried to escape the apostrophes in the value, but I couldn't get it working.

from lxml import etree as ET

tree = ET.parse("file.svg")
root = tree.getroot()

xPath = ".//g[@font-family='&apos;BentonSans Medium&apos;]"
print(root.findall(xPath))

I always get errors of this kind:

File "C:\Python34\lib\site-packages\lxml\_elementpath.py", line 214, in prepare_predicate
raise SyntaxError("invalid predicate")

Anyone got ideas how to select these nodes with XPath?

There is 1 answer

Ruslan Osmanov (best answer)

Try this:

xPath = ".//g[@font-family=\"'BentonSans Medium'\"]"

Your code raises the error because the closing single quote of the string literal is missing:

xPath = ".//g[@font-family='&apos;BentonSans Medium&apos;]"

It should be after the last &apos;:

xPath = ".//g[@font-family='&apos;BentonSans Medium&apos;']"

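But that alone doesn't make the XPath expression correct: &apos; is an XML entity, and the path engine doesn't decode it, so the expression looks for the literal text &apos;BentonSans Medium&apos; and matches nothing.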
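
A minimal sketch of the difference, assuming a small inline document (a hypothetical stand-in for file.svg, with no default namespace declared) in place of the real file. The findall() path syntax is shared with the standard library's xml.etree.ElementTree, so the sketch falls back to that module if lxml isn't installed.

```python
try:
    from lxml import etree as ET  # the library used in the question
except ImportError:
    import xml.etree.ElementTree as ET  # same path syntax for findall()

# Hypothetical stand-in for file.svg; no default namespace is declared.
SVG = """<svg>
    <g font-family="'BentonSans Medium'" font-size="12">
        <text>bla bla bla</text>
    </g>
    <g font-family="Arial" font-size="10"/>
</svg>"""

root = ET.fromstring(SVG)

# Double-quoted literal: the apostrophes are ordinary characters inside it.
quoted = root.findall(".//g[@font-family=\"'BentonSans Medium'\"]")

# &apos; is not decoded, so this searches for the text
# "&apos;BentonSans Medium&apos;" and matches no element.
entity = root.findall(".//g[@font-family='&apos;BentonSans Medium&apos;']")

print(len(quoted), len(entity))
```

If the real file declares the SVG namespace on its root element, the g in the path would need to be namespace-qualified as well.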


By the way, if you only want to check whether the font-family value contains a given string, use the contains() XPath function with the xpath method:

xPath = '//g[contains(@font-family, "BentonSans Medium")]'
print(root.xpath(xPath))

Output

[<Element g at 0x7f2093612108>]

The sample code fetches all g elements whose font-family attribute value contains the string BentonSans Medium.

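The findall method doesn't work with contains() because it implements only ElementPath, a restricted subset of XPath that has no function calls. The xpath method supports full XPath 1.0, so it is more flexible, and I would recommend using it instead.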
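
A short sketch of both kinds of match with the xpath method, again assuming a hypothetical inline document without a default namespace. For the exact match it passes the value as an XPath variable (the name family is made up for this example), which avoids nesting quotes altogether.

```python
from lxml import etree as ET

# Hypothetical stand-in for file.svg; no default namespace is declared.
SVG = """<svg>
    <g font-family="'BentonSans Medium'" font-size="12">
        <text>bla bla bla</text>
    </g>
    <g font-family="Arial" font-size="10"/>
</svg>"""

root = ET.fromstring(SVG)

# Exact match: keyword arguments to xpath() become XPath variables,
# so the apostrophes never appear inside the expression itself.
exact = root.xpath("//g[@font-family=$family]", family="'BentonSans Medium'")

# Substring match with contains(), as in the answer above.
partial = root.xpath('//g[contains(@font-family, "BentonSans Medium")]')

print(len(exact), len(partial))
```

The variable form is also the safer choice when the value comes from user input, since it can contain both kinds of quote characters.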