Parsing HTML elements in order using Jsoup


Using Jsoup, I've been trying to parse articles and display them in an Android app by creating TextView and ImageView instances programmatically. I'm trying to avoid WebView since it doesn't offer much customization. With TextView, I can make the app behave the way I want.

The problem is that I need to get the article's elements exactly in document order and display them in that order.

The article might look like this (simplified for the sake of the question):

<h2>Lorem ipsum</h2>
<p>Lorem ipsum 2</p>
<p>Lorem ipsum 3</p>
<p><img src="blabla.jpg"/></p>
<p>Lorem ipsum Lorem ipsum Lorem ipsum</p>
<strong>Dolor si amette</strong>
<p><img src="abc.png"/><br/>Source : ABC Pte. Ltd.</p>

The structure won't be the same for each article. Another article might look like this:

<p><img src="blabla.jpg"/></p>
<p>Lorem ipsum 2</p>
<p>Lorem ipsum 3</p>
<h2>Lorem ipsum</h2>
<p><img src="abc.png"/><br/>Source : ABC Pte. Ltd.</p>
<strong>Dolor si amette</strong>

What's important is that whenever there's an image, I should get the image URL, and whenever there's text, I should get the text.

I've tried iterating over each p tag and checking whether it contains an image or text.

    Document jsoupParse = Jsoup.parse(html);

    Elements paragraph = jsoupParse.getElementsByTag("p");
    int sizeJsoup = jsoupParse.getElementsByTag("p").size();
    System.out.println("Size of P tag = "+sizeJsoup);

    for(Element element:paragraph){
        if(element.hasText()){
            System.out.println("Text:"+element.text());
        }else{
            Elements image = element.getElementsByTag("img");
            for(Element imageElement:image){
                System.out.println("Image URL : "+imageElement.absUrl("src"));
            }
        }
    }

Unfortunately, it doesn't handle heading tags, and it only grabs the text in a case like

    <p><img src="abc.png"/><br/>Source : ABC Pte. Ltd.</p>

It only gets the text below, not the image URL:

    Source : ABC Pte. Ltd.

There is 1 answer

Abhay Gupta

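Instead of imageElement.absUrl("src"), you can try imageElement.attr("src"). absUrl returns an empty string when the src is relative and the document was parsed without a base URI, as with Jsoup.parse(html) here, whereas attr returns the raw attribute value.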
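
For the ordering part of the question, one option is to walk the parsed body node by node instead of selecting only p tags. The sketch below is just one way to do it, not a definitive implementation. It visits every node in document order, so headings, strong tags and paragraphs are all covered, and an image and the caption text inside the same p are both reported. The class name OrderedArticleParser, the method parseInOrder, the "TEXT: "/"IMAGE: " prefixes and the base URI https://example.com/ are made up for illustration; in the Android app you would create a TextView or ImageView where the sketch adds a string to the list.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class OrderedArticleParser {

    // Hypothetical helper: returns "TEXT: ..." and "IMAGE: ..." entries
    // in the order the content appears in the HTML.
    public static List<String> parseInOrder(String html, String baseUri) {
        // The base URI lets absUrl("src") resolve relative image paths.
        Document doc = Jsoup.parse(html, baseUri);
        final List<String> items = new ArrayList<>();

        // traverse() performs a depth-first walk, calling head() on each node
        // in document order.
        doc.body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    TextNode textNode = (TextNode) node;
                    // Skip whitespace-only nodes between tags.
                    if (!textNode.isBlank()) {
                        items.add("TEXT: " + textNode.text().trim());
                    }
                } else if (node instanceof Element
                        && "img".equals(((Element) node).tagName())) {
                    String url = node.absUrl("src");
                    if (url.isEmpty()) {
                        // No usable base URI: fall back to the raw attribute value.
                        url = node.attr("src");
                    }
                    items.add("IMAGE: " + url);
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return items;
    }

    public static void main(String[] args) {
        String html = "<h2>Lorem ipsum</h2>"
                + "<p>Lorem ipsum 2</p>"
                + "<p><img src=\"blabla.jpg\"/></p>"
                + "<strong>Dolor si amette</strong>"
                + "<p><img src=\"abc.png\"/><br/>Source : ABC Pte. Ltd.</p>";
        // Assumed base URI, used only for this example.
        for (String item : parseInOrder(html, "https://example.com/")) {
            System.out.println(item);
        }
    }
}
```

With this approach the last paragraph yields the image entry first and the caption text second, matching the order in the markup. One caveat: text that contains inline tags (for example a p with a nested b) is split into several TEXT entries, so you may want to merge consecutive text entries before creating views.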