I'm working with lxml.etree and I'm trying to allow users to search a docbook for text. When a user provides the search text, I use the exslt match function to find the text within the docbook. The match works just fine if the text shows up within the element.text but not if the text is in element.tail.
Here's an example:
>>> # XML as lxml.etree element
>>> root = lxml.etree.fromstring('''
... <root>
... <foo>Sample text
... <bar>and more sample text</bar> and important text.
... </foo>
... </root>
... ''')
>>>
>>> # User provides search text
>>> search_term = 'important'
>>>
>>> # Find nodes with matching text
>>> matches = root.xpath('//*[re:match(text(), $search, "i")]', search=search_term, namespaces={'re':'http://exslt.org/regular-expressions'})
>>> print(matches)
[]
>>>
>>> # But I know it's there...
>>> bar = root.xpath('//bar')[0]
>>> print(bar.tail)
and important text.
I'm confused because the text() function by itself returns all the text, including the tail:
>>> # text() results
>>> text = root.xpath('//foo/text()')
>>> print(text)
['Sample text\n', ' and important text.\n']
How come the tail isn't being included when I use the match function?
That's because in XPath 1.0, when a string function such as match() (or contains(), starts-with(), etc.) is given a node-set, it only takes the first node into account. Here, text() yields two text nodes for foo, and only the first one ("Sample text") is ever tested. Instead of what you did, you can use //text(), apply the regex match filter to the individual text nodes, and then return each matching text node's parent element, like so:
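A minimal sketch of that query, reusing the sample document and search term from the question. The names RE_NS and first_only are only for illustration. It also checks what string() makes of the same node-set, to show why the original predicate never saw the tail:

```python
from lxml import etree

# EXSLT regular-expressions namespace, as used in the question
RE_NS = {'re': 'http://exslt.org/regular-expressions'}

root = etree.fromstring('''
<root>
<foo>Sample text
<bar>and more sample text</bar> and important text.
</foo>
</root>
''')

search_term = 'important'

# Converting a node-set to a string keeps only its first node,
# so the text after <bar> is dropped before any matching happens.
first_only = root.xpath('string(//foo/text())')
print(repr(first_only))

# Test each text node on its own, then step up to its parent element.
matches = root.xpath(
    '//text()[re:match(., $search, "i")]/parent::*',
    search=search_term,
    namespaces=RE_NS,
)
print([el.tag for el in matches])  # ['foo']
```

Note that the result is foo rather than bar: in the XPath data model the text following </bar> is a child text node of foo, even though lxml exposes it as bar.tail.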