Select element value from html via XPath

I've got an HTML element that looks like this:

<p>
<strong>Popular resonses: </strong>
bat, butterfly, moth
</p>

The HTML consists almost entirely of <p> elements.

I need to extract the text that belongs to the <p> itself (bat, butterfly, moth), without the <strong> label.

Thanks.

P.S.

I've tried to parse it with Matcher and Pattern, but it didn't work. I'm using jsoup as the parsing library.

There are 2 answers

TDG

You can get the desired text with:

Elements el = doc.select("p:has(strong)");
for (Element e : el) {
    System.out.println(e.ownText());
}

This finds every p element in the HTML that also contains a strong element, and prints the text that belongs to the p itself but not to the strong:

bat, butterfly, moth

If you use e.text() instead, you get all the text in the p element, including the strong:

Popular resonses: bat, butterfly, moth

If there is only one such element, you can also use:

Element e = doc.select("p:has(strong)").first();
System.out.println(e.ownText());

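This saves you the loop. Note that first() returns null when nothing matches, so check for that if the element may be missing.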
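For completeness, here is a self-contained sketch of the single-element variant with its imports and a null check around first(). It assumes jsoup is on the classpath; the class name OwnTextExample and the helper firstOwnText are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextExample {
    // Hypothetical helper: returns the text owned directly by the first <p>
    // that contains a <strong>, or an empty string when no such element exists.
    static String firstOwnText(String html) {
        Document doc = Jsoup.parse(html);
        Element e = doc.select("p:has(strong)").first(); // null when nothing matches
        return e == null ? "" : e.ownText();
    }

    public static void main(String[] args) {
        String html = "<p>\n<strong>Popular responses: </strong>\nbat, butterfly, moth\n</p>";
        System.out.println(firstOwnText(html));
    }
}
```

Returning an empty string for the no-match case is just one choice; returning null or an Optional would work equally well depending on the caller.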

Jonathan Hedley

There are a couple of ways, depending on just what you want:

If you want the TextNode objects:

String html = """
    <p>
    <strong>Popular responses: </strong>
    bat, butterfly, moth
    </p>
    """;
Document doc = Jsoup.parse(html);
List<TextNode> textNodes = doc.selectXpath("//p/text()", TextNode.class);
for (TextNode textNode : textNodes) {
    System.out.println(textNode.siblingIndex() + ": " + textNode.text());
}

Gives:

0:  
2:  bat, butterfly, moth 

Those are the two TextNodes that sit directly inside that p: the whitespace before the strong, and then the content after it.

Another way, similar to TDG's answer, is to fetch Elements, but using XPath:

String html = """
    <p>
    <strong>Popular responses: </strong>
    bat, butterfly, moth
    </p>
    """;
Document doc = Jsoup.parse(html);
Elements elements = doc.selectXpath("//p");
for (Element element : elements) {
    System.out.println(element.ownText());
}

Giving:

bat, butterfly, moth

See the XPath selector guide, TextNode, and Element.ownText() for more details.

If you want an array of values like {"bat", "butterfly", "moth"}, you could then call element.ownText().split(",") and trim each value.
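As a rough sketch of that last step: a plain split(",") keeps the space that follows each comma, so the version below splits on a comma with optional surrounding whitespace instead. The class name SplitValues and the helper splitValues are invented for this example, and the input string stands in for whatever ownText() returned.

```java
import java.util.Arrays;

public class SplitValues {
    // Hypothetical helper: splits a comma-separated string into values,
    // dropping whitespace around each comma and at both ends of the input.
    static String[] splitValues(String ownText) {
        return ownText.trim().split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        String ownText = "bat, butterfly, moth"; // stand-in for element.ownText()
        System.out.println(Arrays.toString(splitValues(ownText)));
        // prints [bat, butterfly, moth]
    }
}
```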