xpath regex doesn't search tail in lxml.etree

I'm working with lxml.etree and I'm trying to let users search a DocBook document for text. When a user provides the search text, I use the EXSLT match function to find the text within the document. The match works just fine if the text is in element.text, but not if the text is in element.tail.

Here's an example:

>>> import lxml.etree
>>>
>>> # XML as lxml.etree element
>>> root = lxml.etree.fromstring('''
...   <root>
...     <foo>Sample text
...       <bar>and more sample text</bar> and important text.
...     </foo>
...   </root>
... ''')
>>>
>>> # User provides search text    
>>> search_term = 'important'
>>>
>>> # Find nodes with matching text
>>> matches = root.xpath('//*[re:match(text(), $search, "i")]', search=search_term, namespaces={'re':'http://exslt.org/regular-expressions'})
>>> print(matches)
[]
>>>
>>> # But I know it's there...
>>> bar = root.xpath('//bar')[0]
>>> print(bar.tail)
 and important text.

I'm confused because the text() function by itself returns all the text – including the tail:

>>> # text() results
>>> text = root.xpath('//foo/text()')
>>> print(text)
['Sample text\n      ', ' and important text.\n    ']

How come the tail isn't being included when I use the match function?

There is 1 answer

Answered by har07

How come the tail isn't being included when I use the match function?

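That's because in XPath 1.0, when a string function such as re:match(), contains(), or starts-with() is given a node-set, it converts the node-set to a string using only the first node in document order. Here text() selects two text nodes under foo, so only the first one ('Sample text') is tested; the tail text after bar is never looked at.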
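
The first-node behaviour can be seen without any regex at all. The snippet below is a minimal sketch that assumes the question's document with the whitespace stripped out for readability; the variable names are made up for illustration.

```python
from lxml import etree

# Same structure as the question's XML, without the indentation.
root = etree.fromstring(
    "<root><foo>Sample text"
    "<bar>and more sample text</bar> and important text."
    "</foo></root>"
)
foo = root.find("foo")

# text() selects both text-node children of <foo> ...
text_nodes = foo.xpath("text()")

# ... but converting that node-set to a string keeps only the first node.
first_only = foo.xpath("string(text())")

# contains() performs the same conversion, so it never sees the tail text.
found_in_first = foo.xpath('contains(text(), "important")')

# Testing each text node separately inside a predicate does find it.
found_in_any = foo.xpath('boolean(text()[contains(., "important")])')

print(text_nodes)      # ['Sample text', ' and important text.']
print(first_only)      # Sample text
print(found_in_first)  # False
print(found_in_any)    # True
```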

Instead, select every text node with //text(), apply the regex filter to each text node individually, and then return each matching text node's parent element, like so:

xpath = '//text()[re:match(., $search, "i")]/parent::*'
matches = root.xpath(xpath, search=search_term, namespaces={'re':'http://exslt.org/regular-expressions'})
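
For reference, here is a self-contained sketch of that approach run against the question's document (whitespace stripped for readability). The helper names find_elements and find_text_nodes are hypothetical, not part of lxml. The second helper relies on lxml's "smart string" results, which expose getparent() and is_tail, in case you need to know whether a hit came from an element's text or its tail.

```python
from lxml import etree

RE_NS = {"re": "http://exslt.org/regular-expressions"}

root = etree.fromstring(
    "<root><foo>Sample text"
    "<bar>and more sample text</bar> and important text."
    "</foo></root>"
)


def find_elements(tree, search_term):
    """Return the elements that have a matching text node as a direct child."""
    return tree.xpath(
        '//text()[re:match(., $search, "i")]/parent::*',
        search=search_term,
        namespaces=RE_NS,
    )


def find_text_nodes(tree, search_term):
    """Return (owner tag, 'text' or 'tail', string) for each matching text node."""
    hits = tree.xpath(
        '//text()[re:match(., $search, "i")]',
        search=search_term,
        namespaces=RE_NS,
    )
    # Each hit is an lxml smart string: for tail text, getparent() is the
    # element the tail belongs to, and is_tail is True.
    return [
        (hit.getparent().tag, "tail" if hit.is_tail else "text", str(hit))
        for hit in hits
    ]


important_tags = [el.tag for el in find_elements(root, "important")]
sample_tags = [el.tag for el in find_elements(root, "sample")]
important_nodes = find_text_nodes(root, "important")

print(important_tags)   # ['foo']
print(sample_tags)      # ['foo', 'bar']
print(important_nodes)  # [('bar', 'tail', ' and important text.')]
```

Note the difference between the two views: in XPath the tail text after bar is a child of foo, so parent::* returns foo, whereas lxml's getparent() on the same string returns bar, the element whose tail it is. Also keep in mind that the search term is interpreted as a regular expression, so user input containing characters such as ( or . may need escaping.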