I am using "org.apache.commons.lang.StringEscapeUtils.unescapeHtml(myHtmlString)" to convert HTML entity escapes to a string containing the actual Unicode characters corresponding to the escapes. However, it doesn't handle the "em dash" and "en dash" symbols properly. StringEscapeUtils replaces the en dash escape with "\u0096", while the correct replacement is "\u2013". As I have read, "\u0096" corresponds to the cp1252 byte for the en dash. So how can I make it work correctly? I know that I can replace it manually, but I wonder if I can do it with StringEscapeUtils or with any other utility.
"org.apache.commons.lang.StringEscapeUtils" and "en dash"
Asked by Zalivaka
There are 2 answers
Stephen C
I suspect that the problem is not in the StringEscapeUtils.unescapeHtml(...) call.
Instead, I suspect that the character has been turned into '\u0096' before the call. More specifically, I suspect that your code has used the wrong character set when reading the HTML as characters.
As you say, an en-dash is code point 0x96 in cp1252. So one way to get an en-dash mistranslated to the Unicode code point \u0096 would be to start with a byte stream that was encoded using cp1252 and read / decode it using an InputStreamReader(is, "ISO-8859-1"), i.e. as Latin-1.
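A minimal sketch of that failure mode, assuming the input really is a cp1252 byte stream. The class and method names (Cp1252Demo, decode) are made up for illustration, and new String(bytes, charset) stands in for the InputStreamReader:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp1252Demo {
    // Decode raw bytes with the given charset (hypothetical helper).
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // The single byte that cp1252 uses for an en dash.
        byte[] raw = { (byte) 0x96 };

        // Wrong charset: Latin-1 maps byte 0x96 straight to the C1 control U+0096.
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);
        // Right charset: cp1252 maps byte 0x96 to U+2013 (en dash).
        String right = decode(raw, Charset.forName("windows-1252"));

        System.out.printf("ISO-8859-1:   U+%04X%n", (int) wrong.charAt(0));
        System.out.printf("windows-1252: U+%04X%n", (int) right.charAt(0));
    }
}
```

If the first result matches what you see in your string, the fix probably belongs where the HTML is read, not after the unescape call.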
I don't think so. 0x0096 in Unicode is a C1 control code:
http://en.wikipedia.org/wiki/C0_and_C1_control_codes
and is unlikely to be the replacement for "-" (as you wrote).
Well, if StringEscapeUtils really messes this up (en dash should indeed be \u2013), and if it's the only escape it is messing up, and if there's no reason to have any other 0x0096 in your String, then a replaceAll after calling StringEscapeUtils should work.
The following does the replace you expect:
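A minimal sketch of such a replacement, assuming the stray character really is U+0096 and that nothing else in the string legitimately uses it. The class and method names (DashFix, fixEnDash) are illustrative, and it uses the literal String.replace(char, char) instead of replaceAll, so no regular expression is involved:

```java
final class DashFix {
    // Swap the C1 control U+0096 for a real en dash, U+2013.
    // String.replace(char, char) is a plain character substitution.
    static String fixEnDash(String s) {
        return s.replace('\u0096', '\u2013');
    }

    public static void main(String[] args) {
        // Sample input standing in for the output of unescapeHtml(...).
        String unescaped = "1990\u00962000";
        String fixed = fixEnDash(unescaped);
        System.out.printf("U+%04X%n", (int) fixed.charAt(4));
    }
}
```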
However you should first make sure that StringEscapeUtils really messes things up and really, really, understand why/how you get that 0x0096 in a Java String.
Then, also, it should probably be pointed out to you that sadly Java's Unicode support is a major SNAFU because Java was conceived before Unicode 3.1 came out.
Hence it seemed a smart idea to use 16 bits for the char primitive, it seemed a smart idea to use a 4-hexdigits '\uxxxx' escape sequence, it seemed a smart idea to represent the length of the char[] in String's length() method, etc.
These were actually all very, very stupid ideas, leading to one of the major Java SNAFUs, where the char primitive cannot actually hold a Unicode character anymore and where String's length() method does not actually return a String's real length (it counts UTF-16 code units, not code points).
I like the following:
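For instance, a sketch of the length mismatch, assuming a string that holds the single supplementary character U+1D11E (MUSICAL SYMBOL G CLEF). The class and method names (LengthDemo, gClef) are illustrative:

```java
final class LengthDemo {
    // Build a String containing exactly one code point outside the BMP.
    static String gClef() {
        return new String(Character.toChars(0x1D11E));
    }

    public static void main(String[] args) {
        String s = gClef();
        // length() counts UTF-16 code units: a surrogate pair, so 2.
        System.out.println("length():         " + s.length());
        // codePointCount() counts Unicode code points: 1.
        System.out.println("codePointCount(): " + s.codePointCount(0, s.length()));
    }
}
```

The code point APIs (codePointAt, codePointCount, Character.toChars) are the usual way to work around this when supplementary characters matter.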
Why this rant? Well, because I don't know how the regexp replacement in String's replaceAll is implemented, but I really wouldn't be surprised if there were cases (i.e. certain code points) where String's replaceAll was, like char and like length and like \uxxxx, well... hmmm, totally broken.