Improve XPath-query to distinguish text-nodes correctly

Question

Asked by NetWurst on 02 September 2018 at 19:06

I have used XPath extensively in the past. Currently I am facing a problem that I am unable to solve.

Constraints

pure XPath 1.0
no aux-functions (e.g. no "concat()")

HTML-Markup

<span class="container">
    Peter: Lorem Impsum
    <i class="divider" role="img" aria-label="|"></i>
    Paul Smith: Foo Bar BAZ
    <i class="divider" role="img" aria-label="|"></i>
    Mary: One Two Three
</span>

Challenge

I want to extract the three coherent strings:

Peter: Lorem Impsum
Paul Smith: Foo Bar BAZ
Mary: One Two Three

XPath

The following XPath queries are the best I've come up with after HOURS of research:

XPath-query 1

//span[contains(@class, "container")]

=> Peter: Lorem ImpsumPaul Smith: Foo Bar BAZMary: One Two Three

XPath-query 2

//span[contains(@class, "container")]//text()

=> Peter: Lorem Impsum Paul Smith: Foo Bar BAZ Mary: One Two Three

Problem

Although it is possible to post-process the resulting string with (PHP) string functions, I am not able to split it into the correct three chunks, because the boundaries between the text nodes are lost. I need an XPath query that lets me distinguish the text nodes correctly.

Is it possible to integrate some "artificial separators" between the text-nodes?

Original Q&A

There is 1 answer

**kjhughes** · Accepted Answer · 2018-09-02T22:10:40+00:00

You're expecting too much from XPath 1.0. On its own, XPath 1.0 can only help you here to select either

a string, or
a set of text nodes

Then, you'll have to complete your processing outside of XPath (as Mads suggests in the comments).
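As a rough illustration of finishing the job outside XPath, here is a minimal Python sketch using only the standard library. It rests on several assumptions that are not part of the question: the markup is well-formed enough to parse as XML, a hypothetical wrapping `<div>` stands in for the rest of the page, and the helper names `child_text_nodes` and `normalize_space` are made up for this example. Note that `xml.etree.ElementTree` supports only a small XPath subset (no `text()`, no `contains()`), so the span is matched on an exact class value and its text nodes are collected through `.text` and `.tail` instead.

```python
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in a hypothetical <div> so that the
# span has to be located first, as it would be in a full page.
MARKUP = """<div>
<span class="container">
    Peter: Lorem Impsum
    <i class="divider" role="img" aria-label="|"></i>
    Paul Smith: Foo Bar BAZ
    <i class="divider" role="img" aria-label="|"></i>
    Mary: One Two Three
</span>
</div>"""


def child_text_nodes(element):
    """Collect the direct text-node children of an element.

    This plays the role of element/text() in XPath: ElementTree stores the
    text before the first child in .text and the text after each child in
    that child's .tail.
    """
    nodes = []
    if element.text is not None:
        nodes.append(element.text)
    for child in element:
        if child.tail is not None:
            nodes.append(child.tail)
    return nodes


def normalize_space(value):
    """Trim and collapse whitespace, similar to XPath's normalize-space()."""
    return " ".join(value.split())


root = ET.fromstring(MARKUP)
# Exact match on the class attribute; ElementTree has no contains().
span = root.find(".//span[@class='container']")

# One cleaned string per text node, dropping whitespace-only nodes.
chunks = [normalize_space(node) for node in child_text_nodes(span)]
chunks = [chunk for chunk in chunks if chunk]

print(chunks)
```

The same idea should carry over to whatever host language runs the XPath: select the text nodes, then loop over the returned node list and trim each one, rather than trying to split a single concatenated string.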

To understand the limits you're running up against, consider your first XPath,

//span[contains(@class, "container")]

selects a nodeset of span elements. The environment in which XPath 1.0 is operating is showing you (some variation of) the string value of the single such node in your document:

Peter: Lorem ImpsumPaul Smith: Foo Bar BAZMary: One Two Three

But be clear: Your XPath is selecting a nodeset of span elements, not strings here.

Your second XPath,

//span[contains(@class, "container")]//text()

selects a nodeset of text() nodes. The environment in which XPath 1.0 is operating is showing the string value of each selected text() node.
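The difference between the two results can be sketched in Python with the standard library. This is only an approximation of what an XPath 1.0 host does, under the assumption that the markup parses as XML: `itertext()` walks the descendant text in document order, so joining it approximates the string value of the span, while keeping the pieces separate approximates the text-node set.

```python
import xml.etree.ElementTree as ET

MARKUP = """<span class="container">
    Peter: Lorem Impsum
    <i class="divider" role="img" aria-label="|"></i>
    Paul Smith: Foo Bar BAZ
    <i class="divider" role="img" aria-label="|"></i>
    Mary: One Two Three
</span>"""

span = ET.fromstring(MARKUP)

# Roughly what the second query selects: one entry per descendant text node.
text_nodes = list(span.itertext())

# Roughly what the first query ends up displaying: the string value of the
# span, i.e. all descendant text concatenated with nothing in between.
string_value = "".join(text_nodes)

print(len(text_nodes))     # number of separate text nodes
print(repr(string_value))  # one string, node boundaries no longer visible
```

Once the nodes are flattened into one string, nothing marks where one text node ended and the next began (the `|` lives in an attribute, not in the text), which is why splitting has to happen on the node list rather than on the string.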

If you could use XPath 2.0, you could directly, within XPath, select a sequence of strings,

//span[contains(@class, "container")]/text()/string()

or you could join them,

string-join(//span[contains(@class, "container")]/text(), "|")

and directly get

Peter: Lorem Impsum
|
Paul Smith: Foo Bar BAZ
|
Mary: One Two Three

or

string-join(//span[contains(@class, "container")]/text()/normalize-space(), "|")

to get

Peter: Lorem Impsum|Paul Smith: Foo Bar BAZ|Mary: One Two Three
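If XPath 2.0 is not available, the effect of the last expression can be approximated in the host language. The following Python sketch is an assumption-laden stand-in, not an XPath engine: it reads the span's child text nodes through ElementTree's `.text` and `.tail`, collapses whitespace in each, and joins them with the same `|` separator.

```python
import xml.etree.ElementTree as ET

MARKUP = """<span class="container">
    Peter: Lorem Impsum
    <i class="divider" role="img" aria-label="|"></i>
    Paul Smith: Foo Bar BAZ
    <i class="divider" role="img" aria-label="|"></i>
    Mary: One Two Three
</span>"""

span = ET.fromstring(MARKUP)

# Direct text-node children of the span, in document order.
text_nodes = [span.text or ""] + [child.tail or "" for child in span]

# Counterpart of string-join(.../text()/normalize-space(), "|"):
# collapse whitespace in every node, then join with the separator.
joined = "|".join(" ".join(node.split()) for node in text_nodes)

print(joined)  # Peter: Lorem Impsum|Paul Smith: Foo Bar BAZ|Mary: One Two Three
```

Calling `joined.split("|")` afterwards gives back the three separate strings, provided the separator does not occur in the text itself.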
