How can I convert characters in Java from Extended ASCII or Unicode to their 7-bit ASCII equivalents, including special characters? For example, open (“ 0x93) and close (” 0x94) quotes should become a simple double quote (" 0x22), and a dash (– 0x96) should become a hyphen-minus (- 0x2D). I have found Stack Overflow questions similar to this, but the answers only seem to deal with accents and ignore special characters.
For example, I would like “Caffè – Peña” to be transformed to "Caffe - Pena".
However, when I use java.text.Normalizer:
String sample = "“Caffè – Peña”";
System.out.println(Normalizer.normalize(sample, Normalizer.Form.NFD)
.replaceAll("\\p{InCombiningDiacriticalMarks}", ""));
The output is:
“Caffe – Pena”
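The quotes and the dash survive because NFD only splits characters that have a canonical decomposition: è becomes e plus a combining grave accent, which the regex then strips, whereas “, ” and – have no decomposition at all. A minimal sketch that makes this visible follows; the class and method names are hypothetical and only there for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only.
final class NfdDemo {

    // Returns the code points of s after NFD normalization,
    // formatted like "U+0065 U+0300".
    static String nfdCodePoints(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        nfd.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }
}
```

Calling nfdCodePoints on è (U+00E8) yields two code points, the base letter and the combining mark, while calling it on “ (U+201C) or – (U+2013) returns the same single code point unchanged, so there is nothing for the diacritics regex to remove.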
To clarify my need: I am interacting with an IBM i Db2 database that uses EBCDIC encoding. If a user pastes a string copied from Word or Outlook, for example, characters like the ones I specified are translated to SUB (0x3F in EBCDIC, 0x1A in ASCII). This causes a lot of unnecessary headaches. I am looking for a way to sanitize the string so that as little information as possible is lost.
After some digging I was able to find a solution based on this answer, using org.apache.lucene.analysis.ASCIIFoldingFilter.
All the examples I was able to find used the static version of the method, foldToASCII, as in this project:
However, that static method has a note on it saying
So after some trial and error I came up with this version that avoids using the static method:
Similar to an answer I provided here.
This does exactly what I was looking for and translates characters to their 7-bit ASCII equivalents.
However, through further research I have found that, because I am mostly dealing with Windows-1252 encoding and because of the way jt400 handles ASCII <-> EBCDIC (CCSID 37) translation, if an ASCII string is translated to EBCDIC and back to ASCII, the only characters that are lost are 0x80 through 0x9f. So, inspired by the way Lucene's foldToASCII handles it, I put together the following method that handles these cases only:
Since it turns out that my real problem was Windows-1252 to Latin-1 (ISO-8859-1) translation, here is some supporting material that shows the Windows-1252 to Unicode translation used in the method above to ultimately get Latin-1 encoding.
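As a rough illustration of both the mapping and the idea behind such a method, here is a minimal sketch; it is not the original code. The class name Cp1252Folder, the method names, and every ASCII replacement string (for example "EUR" for the euro sign) are assumptions made for this example. The Unicode code points are the standard Windows-1252 assignments for bytes 0x80 through 0x9F; bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in Windows-1252 and therefore have no entry.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names and replacement strings are assumptions.
final class Cp1252Folder {

    // Returns an ASCII stand-in for the Unicode character that a Windows-1252
    // byte in the range 0x80-0x9F decodes to, or null for any other character.
    static String replacement(char c) {
        switch (c) {
            case '\u20AC': return "EUR";   // 0x80 euro sign
            case '\u201A': return "'";     // 0x82 single low-9 quotation mark
            case '\u0192': return "f";     // 0x83 f with hook
            case '\u201E': return "\"";    // 0x84 double low-9 quotation mark
            case '\u2026': return "...";   // 0x85 horizontal ellipsis
            case '\u2020': return "+";     // 0x86 dagger
            case '\u2021': return "++";    // 0x87 double dagger
            case '\u02C6': return "^";     // 0x88 modifier circumflex
            case '\u2030': return "0/00";  // 0x89 per mille sign
            case '\u0160': return "S";     // 0x8A S with caron
            case '\u2039': return "<";     // 0x8B single left angle quote
            case '\u0152': return "OE";    // 0x8C ligature OE
            case '\u017D': return "Z";     // 0x8E Z with caron
            case '\u2018': return "'";     // 0x91 left single quotation mark
            case '\u2019': return "'";     // 0x92 right single quotation mark
            case '\u201C': return "\"";    // 0x93 left double quotation mark
            case '\u201D': return "\"";    // 0x94 right double quotation mark
            case '\u2022': return "*";     // 0x95 bullet
            case '\u2013': return "-";     // 0x96 en dash
            case '\u2014': return "-";     // 0x97 em dash
            case '\u02DC': return "~";     // 0x98 small tilde
            case '\u2122': return "(TM)";  // 0x99 trade mark sign
            case '\u0161': return "s";     // 0x9A s with caron
            case '\u203A': return ">";     // 0x9B single right angle quote
            case '\u0153': return "oe";    // 0x9C ligature oe
            case '\u017E': return "z";     // 0x9E z with caron
            case '\u0178': return "Y";     // 0x9F Y with diaeresis
            default: return null;
        }
    }

    // Replaces only the characters listed above; everything else,
    // including Latin-1 accented letters, is copied through unchanged.
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String r = replacement(c);
            if (r != null) {
                out.append(r);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // True if every character of s can be encoded in ISO-8859-1.
    static boolean isLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }
}
```

With this sketch, fold applied to “Caffè – Peña” gives "Caffè - Peña": the quotes and dash become ASCII, while è and ñ are left alone because Latin-1 can already represent them. The isLatin1 check is only a convenience for confirming that the folded string no longer contains any of the characters that would be lost.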