
prevent Jsoup from converting utf-8 to utf-16


I would like the following to return the encoded bullet point the same way it was supplied:

Jsoup.parse("<div>&#8226;</div>").text()

but instead I get a utf-16 string out with a bullet which displays as a black circle. This is causing rendering problems in Chrome as the rest of the page is utf-8. From the many other SO questions I found I thought this might work

Parser.unescapeEntities(Jsoup.parse("<div>&#8226;</div>").text(), true)

but in retrospect I see that it does the opposite: it turns escaped content into unescaped content.

I found some suggestions that the html needed to claim utf-8 encoding in the head for it to be parsed the way I am hoping, but this valid html is still turning into utf-16

<!DOCTYPE html>
  <html lang='en'><head><title>foo</title><meta charset='UTF-8'></head>
  <body>&#8226;</body>
</html>

In particular, I am using Jsoup to parse out an element from previously generated html and return the original html text, such as

Jsoup.parse(myHtml).getElementsByClass("myClass").first().toString()

Question: how can I use Jsoup to parse out a fragment of html which contains utf-8 representation of utf-16 characters and not have that content converted to utf-16?

There is 1 answer

Jonathan Hedley

I don't see that this has anything to do with UTF-16. That bullet character is representable in UTF-8. If you are using the decoded form in a page that is served as UTF-8, it will display correctly. I guess there must be another issue with how you are serving it that is dropping to ASCII or another encoding.
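As a quick check of that claim, here is a minimal sketch using only the Java standard library (no jsoup). The class name BulletEncoding and the helper utf8Hex are made up for illustration. It shows that the decoded bullet is a single char inside the JVM and encodes to three bytes in UTF-8:

```java
import java.nio.charset.StandardCharsets;

public class BulletEncoding {
    // Hypothetical helper: space-separated hex dump of a string's UTF-8 bytes.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bullet = "\u2022"; // the character that &#8226; decodes to
        System.out.println(bullet.length()); // prints 1 (one UTF-16 code unit in memory)
        System.out.println(utf8Hex(bullet)); // prints e2 80 a2
    }
}
```

Java strings are always UTF-16 in memory; the encoding that matters for the browser is the one used when the bytes are written to the response.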

To directly answer your question: the Element.text() method always returns decoded text.

To get encoded HTML, you should use the Element.html() method. Now, that will default to outputting in UTF-8, and jsoup only escapes a character into an entity when the output character set does not support that character, so with UTF-8 the bullet comes out in the same literal form as from text(). As you seemingly don't want UTF-8 output either, you can configure the document's OutputSettings (charset and escape mode) as required.
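The "escape only when the charset cannot represent the character" rule can be illustrated with the standard library alone. This is a rough sketch of the idea, not jsoup's actual implementation: EscapeDecision and render are hypothetical names, and the sketch ignores HTML-special characters such as < and &.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EscapeDecision {
    // Hypothetical helper: keep the character if the target charset can encode it,
    // otherwise fall back to a hexadecimal numeric character reference.
    static String render(char c, Charset charset) {
        if (charset.newEncoder().canEncode(c)) {
            return String.valueOf(c);
        }
        return "&#x" + Integer.toHexString(c) + ";";
    }

    public static void main(String[] args) {
        char bullet = '\u2022';
        // UTF-8 can encode the bullet, so it is kept as the literal character.
        System.out.println(render(bullet, StandardCharsets.UTF_8));
        // ASCII cannot, so the numeric reference &#x2022; is printed instead.
        System.out.println(render(bullet, StandardCharsets.US_ASCII));
    }
}
```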

Here's a worked example:

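import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Entities;

// print(label, value) is assumed to be a small helper that writes "label: value"
Document doc = Jsoup.parse("<div>&#8226;</div>");
Element div = doc.expectFirst("div");
print("Text", div.text());
print("HTML Default", div.html());

doc.outputSettings().charset("UTF-8");
print("HTML UTF8", div.html());

doc.outputSettings().charset("ascii"); // ASCII cannot represent the bullet, so it is escaped
print("HTML ascii", div.html());

doc.outputSettings().escapeMode(Entities.EscapeMode.extended); // prefer named entities
print("HTML extended", div.html());

print("Text", div.text()); // shows that text() is always decoded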

Gives:

Text: •
HTML Default: •
HTML UTF8: •
HTML ascii: &#x2022;
HTML extended: &bull;
Text: •

&#x2022; is the same as &#8226;, in hex vs decimal encoding.
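That the hexadecimal and decimal references name the same code point can be confirmed with a small standard-library sketch. NumericReference and codePoint are hypothetical names, and the parser only handles well-formed references of the form &#...; or &#x...;.

```java
public class NumericReference {
    // Hypothetical helper: extract the code point from a numeric character reference.
    static int codePoint(String ref) {
        String body = ref.substring(2, ref.length() - 1); // strip "&#" and ";"
        if (body.startsWith("x") || body.startsWith("X")) {
            return Integer.parseInt(body.substring(1), 16); // hexadecimal form
        }
        return Integer.parseInt(body, 10); // decimal form
    }

    public static void main(String[] args) {
        System.out.println(codePoint("&#x2022;")); // prints 8226
        System.out.println(codePoint("&#8226;"));  // prints 8226
    }
}
```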

(BTW as proof that • can be represented directly in a UTF-8 page, hit View Source for this Stack Overflow page. And Inspect Network to see Content-Type = text/html; charset=utf-8)