"org.apache.commons.lang.StringEscapeUtils" and "en dash"

Question

2.8k views Asked by Zalivaka At 16 February 2011 at 14:30

I am using org.apache.commons.lang.StringEscapeUtils.unescapeHtml(myHtmlString) to convert HTML entity escapes to a string containing the actual Unicode characters corresponding to the escapes. However, it doesn't parse the "em dash" and "en dash" symbols properly. StringEscapeUtils replaces "–" with "\u0096", while the correct replacement is "\u2013". And as I have read, "\u0096" is the cp1252 equivalent of "–". So how can I make it work correctly? I know that I can replace it manually, but I wonder if I can do it with StringEscapeUtils or with any other utility.


There are 2 answers

**Gugussee** · Answer 1 · 2011-02-16T15:20:48+00:00

> And as I have read "\u0096" is cp1252 equivalent for "–".

I don't think so. 0x0096 in Unicode is a C1 control code:

http://en.wikipedia.org/wiki/C0_and_C1_control_codes

and is unlikely to be the replacement for "-" (as you wrote).

Well, if StringEscapeUtils really messes this up (en dash should indeed be \u2013), if it's the only escape it is messing up, and if there's no reason to have any other 0x0096 in your String, then a replaceAll after calling StringEscapeUtils should work.

The following does the replace you expect:

System.out.println("Broken\u0096stuff".replaceAll("\u0096", "\u2013"));
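If more than one cp1252 character turns out to be affected (em dash is byte 0x97 in cp1252, for example), the one-off replacement can be generalised. The following is only a sketch under that assumption: the class `C1Repair` and the method `remapC1ToCp1252` are hypothetical names, not part of Commons Lang, and the approach is only safe if the text contains no intentional C1 control characters.

```java
import java.nio.charset.Charset;

class C1Repair {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Hypothetical helper: treats every char in the C1 range (U+0080..U+009F)
    // as if it were the cp1252 byte with the same value and decodes it again.
    static String remapC1ToCp1252(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                char decoded = new String(new byte[] {(byte) c}, CP1252).charAt(0);
                // Bytes that cp1252 leaves undefined are kept unchanged.
                out.append(decoded == '\uFFFD' ? c : decoded);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String repaired = remapC1ToCp1252("Broken\u0096stuff");
        System.out.println(repaired.equals("Broken\u2013stuff")); // true
    }
}
```

Such a helper could be applied to the string returned by unescapeHtml, but it treats the symptom rather than the cause.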

However you should first make sure that StringEscapeUtils really messes things up and really, really, understand why/how you get that 0x0096 in a Java String.

Then, also, it should probably be pointed out to you that sadly Java's Unicode support is a major SNAFU because Java was conceived before Unicode 3.1 came out.

Hence it seemed a smart idea to use 16 bits for the char primitive, it seemed a smart idea to use a 4-hexdigits '\uxxxx' escape sequence, it seemed a smart idea to represent the length of the char[] in String's length() method, etc.

These were actually all very, very stupid ideas, leading to one of the major Java SNAFUs, where the char primitive cannot actually hold every Unicode code point anymore and where String's length() method does not actually return the number of characters in a String.

I like the following:

final char brokenCharCannotRepresentUnicode31Codepoints = '\uFFFF'; // How do I store a Unicode 3.1 codepoint here!?

Why this rant? Well, because I don't know how the regexp replacement in String's replaceAll is implemented, but I really wouldn't be surprised if there were cases (i.e. certain code points) where String's replaceAll was, like char and like length and like \uxxxx, well... hmmm, totally broken.
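To make the concern concrete, here is a small sketch showing how a code point above U+FFFF is stored in a Java String, how length() and codePointCount() differ for it, and what replaceAll does with it. The class name `CodePointDemo` and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) are purely illustrative.

```java
class CodePointDemo {
    // Builds a String from a single code point; a code point above U+FFFF
    // needs an int and ends up as two chars (a surrogate pair).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String clef = fromCodePoint(0x1D11E);
        String text = "a" + clef + "b";

        System.out.println(text.length());                         // 4 (UTF-16 code units)
        System.out.println(text.codePointCount(0, text.length())); // 3 (code points)

        // The regex engine matches the surrogate pair as one code point.
        String replaced = text.replaceAll(clef, "\u2013");
        System.out.println(replaced.equals("a\u2013b"));           // true
    }
}
```

So at least for this simple case, replaceAll copes with supplementary characters, and it certainly copes with \u0096, which fits in a single char.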

**Stephen C** · Answer 2 · 2011-02-16T15:55:28+00:00

I suspect that the problem is not in the StringEscapeUtils.unescapeHtml(...) call.

Instead, I suspect that the character has been turned into '\u0096' before the call. More specifically, I suspect that your code has used the wrong character set when reading the HTML as characters.

As you say, an en dash is byte 0x96 in cp1252. So one way to get an en dash mistranslated to the Unicode code point \u0096 would be to start with a byte stream that was encoded using cp1252 and read / decode it as Latin-1, e.g. using an InputStreamReader(is, "ISO-8859-1").
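The mistranslation is easy to reproduce. The sketch below decodes the single byte 0x96 with both charsets; the class name `DecodeDemo` and the helper `decode` are illustrative only, and it assumes the JVM provides the windows-1252 charset (standard JDKs do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Decodes raw bytes with an explicit charset instead of a platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0x96}; // an en dash as encoded in cp1252

        String wrong = decode(raw, StandardCharsets.ISO_8859_1);
        String right = decode(raw, Charset.forName("windows-1252"));

        System.out.println(String.format("U+%04X", (int) wrong.charAt(0))); // U+0096
        System.out.println(String.format("U+%04X", (int) right.charAt(0))); // U+2013
    }
}
```

If this matches what happens in your code, the fix is to pass the correct charset to the reader that loads the HTML rather than to patch the string afterwards.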
