Java JSoup HTML API: How do I get JSoup to ignore custom tags?


I've run into an issue where JSoup inserts HTML into the text where it's neither wanted nor needed. This is supposedly a feature, but unfortunately it's ruining my entire implementation.

I'm running this code to fetch paragraphs from HTML and render them onto a textbox:

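// Requires: import org.jsoup.select.Elements;
// `document` is an org.jsoup.nodes.Document field that was parsed elsewhere.
public String[] getParagraphHTML() {
    // Select every <p> element in the parsed document.
    Elements paragraphs = document.select("p");
    String[] processed = new String[paragraphs.size()];

    for (int i = 0; i < paragraphs.size(); i++) {
        // html() serializes the parsed DOM of each paragraph, not the original source text.
        processed[i] = paragraphs.get(i).html();
    }

    return processed;
}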

Here is an example of the raw text that would be passed in (within the proper html/body/p blocks):

<u>Oh, hello! You must be the person I've been waiting on all morning. </u><strong><u>You wouldn't happen to be <player> would you?</u></strong>

It's all standard HTML except for the custom <player> tag, which has specific functionality in my program. By the time it's stored in processed[i] after html() has been called on it, the markup is basically destroyed:

Oh, hello! You must be the person I've been waiting on all morning. You wouldn't happen to be <player> would you? </player></u></strong> <player>

For some reason, it has added a random line break as well as a closing tag for the first <player>, and then, confusingly, it adds another <player> tag onto the end, which completely confuses my system.

Is there a way to toggle this functionality off entirely, and have JSoup stick to tags it specifically recognizes? I don't want to have to manage another tagging format just because JSoup won't play ball with this last thing.

So, to summarize: I simply want JSoup to ignore my custom tag entirely. No added tags, nothing. Just ignore anything that's not basic html.

EDIT: To clarify even further: the desired outcome is that when JSoup doesn't recognize a tag, it does not attempt to "correct" it in the text output. In the case above, <player> is not a real HTML tag, but rather one exclusive to my program. JSoup should therefore ignore it and leave it in the text, rather than arbitrarily adding a bunch of tags at the end of the text.

Thank you for your time.

EDIT: I've read about Jsoup.clean() and whitelists, but unfortunately the way that's set up seems flimsy. For example, Element gives me no way to get the raw, untouched text anymore. I have either text() or html(), and both ruin the output: text() removes all tags, including my custom one, while html() "corrects" the markup and gives me the mess above.

I'm not sure what to do unless I can get a Cleaner/Whitelist to apply to html().
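
One workaround that might apply here, sketched under assumptions rather than taken from JSoup's documentation: escape the custom tag before the HTML is handed to Jsoup.parse(), so the parser sees it as plain text instead of an unknown element, then undo the escaping on the string returned by html(). The class CustomTagGuard and its methods protect() and restore() are hypothetical names made up for this illustration, and the sketch assumes the serializer writes the angle brackets of text nodes back out as entities. It uses only the standard library, so the parse and html() calls would sit between the two helper calls.

```java
import java.util.regex.Pattern;

public class CustomTagGuard {
    // Only "player" appears in the question; extend the alternation
    // (e.g. "player|npc") if the program has more custom tags (assumption).
    private static final Pattern RAW =
            Pattern.compile("<(/?)(player)>");
    // Accept both "&gt;" and a bare ">" in case the serializer leaves ">" unescaped.
    private static final Pattern ESCAPED =
            Pattern.compile("&lt;(/?)(player)(?:&gt;|>)");

    /** Escapes custom tags so an HTML parser treats them as plain text. */
    public static String protect(String html) {
        return RAW.matcher(html).replaceAll("&lt;$1$2&gt;");
    }

    /** Converts the escaped form back into the literal custom tag. */
    public static String restore(String html) {
        return ESCAPED.matcher(html).replaceAll("<$1$2>");
    }

    public static void main(String[] args) {
        String raw = "<u>You wouldn't happen to be <player> would you?</u>";
        String guarded = protect(raw);   // pass this to the HTML parser
        System.out.println(guarded);
        System.out.println(restore(guarded)); // apply this to the html() result
    }
}
```

In getParagraphHTML() this would amount to parsing protect(source) instead of the raw source and storing restore(paragraphs.get(i).html()). One caveat: text that legitimately contains the literal characters &lt;player&gt; would also be turned back into a tag by restore().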
