I have used XPath extensively in the past. Currently I am facing a problem which I am unable to solve.
Constraints
- pure XPath 1.0
- no aux-functions (e.g. no "concat()")
HTML-Markup
<span class="container">
Peter: Lorem Impsum
<i class="divider" role="img" aria-label="|"></i>
Paul Smith: Foo Bar BAZ
<i class="divider" role="img" aria-label="|"></i>
Mary: One Two Three
</span>
Challenge
I want to extract the three coherent strings:
- Peter: Lorem Impsum
- Paul Smith: Foo Bar BAZ
- Mary: One Two Three
XPath
The following XPath queries are the best I've come up with after HOURS of research:
XPath-query 1
//span[contains(@class, "container")]
=> Peter: Lorem ImpsumPaul Smith: Foo Bar BAZMary: One Two Three
XPath-query 2
//span[contains(@class, "container")]//text()
=> Peter: Lorem Impsum Paul Smith: Foo Bar BAZ Mary: One Two Three
Problem
Although it is possible to post-process the resulting string using (PHP) string functions afterwards, I am not able to split it into the correct three chunks: I need an XPath-query which enables me to distinguish the text-nodes correctly.
Is it possible to integrate some "artificial separators" between the text-nodes?
You're expecting too much from XPath 1.0. XPath 1.0, itself, can help you here to select
Then, you'll have to complete your processing outside of XPath (as Mads suggests in the comments).
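As a rough illustration of finishing the job outside of XPath, here is a minimal sketch. It is written in Python rather than the asker's PHP, and it makes several assumptions: the `<body>` wrapper, the name `container_chunks`, and the `" | "` separator are all hypothetical, and the standard library's ElementTree only supports a subset of XPath (no `contains()`, no `text()` node test), so the class is matched exactly and the text nodes are read through `.text` and `.tail`.

```python
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in a hypothetical <body> so that the
# span is a descendant of the parsed root element.
PAGE = """<body><span class="container">
Peter: Lorem Impsum
<i class="divider" role="img" aria-label="|"></i>
Paul Smith: Foo Bar BAZ
<i class="divider" role="img" aria-label="|"></i>
Mary: One Two Three
</span></body>"""


def container_chunks(page):
    """Return the direct text nodes of each container span as clean strings."""
    root = ET.fromstring(page)
    chunks = []
    # ElementTree's XPath subset has no contains(), so match the class exactly.
    for span in root.findall(".//span[@class='container']"):
        # Direct text nodes: the text before the first child element, then
        # the tail text that follows each child element.
        pieces = [span.text] + [child.tail for child in span]
        # Drop empty pieces and trim the surrounding whitespace and newlines.
        chunks.extend(p.strip() for p in pieces if p and p.strip())
    return chunks


chunks = container_chunks(PAGE)
print(chunks)               # three separate strings
print(" | ".join(chunks))   # or joined with an artificial separator
```

The point is only that the selection step hands back separate text nodes; trimming them and inserting a separator happens in the host language, not in XPath 1.0.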
To understand the limits you're hitting against, your first XPath,
//span[contains(@class, "container")]
selects a nodeset of span elements. The environment in which XPath 1.0 is operating is showing you (some variation of) the string value of the single such node in your document:
=> Peter: Lorem ImpsumPaul Smith: Foo Bar BAZMary: One Two Three
But be clear: Your XPath is selecting a nodeset of span elements, not strings here.
Your second XPath,
//span[contains(@class, "container")]//text()
selects a nodeset of text() nodes. The environment in which XPath 1.0 is operating is showing the string value of each selected text() node.
If you could use XPath 2.0, you could directly, within XPath, select a sequence of strings,
or you could join them,
and directly get
or
to get