How can I convert characters in Java from Extended ASCII or Unicode to their 7-bit ASCII equivalents, including special characters? For example, open (“ 0x93) and close (” 0x94) quotes should become a simple double quote (" 0x22), and a dash (– 0x96) should become a hyphen-minus (- 0x2D). I have found Stack Overflow questions similar to this, but the answers only seem to deal with accents and ignore special characters.
For example, I would like “Caffè – Peña” to be transformed to "Caffe - Pena".
However, when I use java.text.Normalizer:
String sample = "“Caffè – Peña”";
System.out.println(Normalizer.normalize(sample, Normalizer.Form.NFD)
.replaceAll("\\p{InCombiningDiacriticalMarks}", ""));
The output is:
“Caffe – Pena”
To clarify my need, I am interacting with an IBM i Db2 database that uses EBCDIC encoding. If a user pastes a string copied from Word or Outlook for example, characters like the ones I specified are translated to SUB (0x3F in EBCDIC, 0x1A in ASCII). This causes a lot of unnecessary headache. I am looking for a way to sanitize the string so as little information as possible is lost.
After some digging I was able to find a solution based on this answer using org.apache.lucene.analysis.ASCIIFoldingFilter.
All the examples I was able to find were using the static version of the method foldToASCII as in this project:
However that static method has a note on it saying
So after some trial and error I came up with this version that avoids using the static method:
Similar to an answer I provided here.
This does exactly what I was looking for and translates characters to their ASCII 7-bit equivalent version.
However, through further research I found that, because I am mostly dealing with Windows-1252 encoding and because of the way jt400 handles ASCII <-> EBCDIC (CCSID 37) translation, if an ASCII string is translated to EBCDIC and back to ASCII, the only characters that are lost are 0x80 through 0x9f. So, inspired by the way Lucene's foldToASCII handles it, I put together the following method that handles these cases only:

Since it turns out that my real problem was Windows-1252 to Latin-1 (ISO-8859-1) translation, here is supporting material that shows the Windows-1252 to Unicode translation used in the method above to ultimately get Latin-1 encoding.
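
As a minimal sketch of that 0x80 through 0x9f handling (not the exact method referred to above), the code below replaces the Unicode characters that Windows-1252 assigns to those bytes with ASCII stand-ins and leaves everything else, including Latin-1 letters such as è and ñ, untouched. The class name `Cp1252Folding`, both method names, and several of the replacement choices (for example "EUR" for the euro sign or "(TM)" for the trademark sign) are assumptions made for illustration. The five byte values that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are not handled.

```java
import java.text.Normalizer;

// Hypothetical helper; class and method names are illustrative only.
final class Cp1252Folding {

    private Cp1252Folding() {
    }

    /**
     * Replaces the Unicode characters that Windows-1252 stores in bytes
     * 0x80 through 0x9F with ASCII stand-ins. All other characters pass through.
     * The replacement strings are assumptions; adjust them to taste.
     */
    static String foldCp1252Specials(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\u2018': // 0x91 left single quotation mark
                case '\u2019': // 0x92 right single quotation mark
                case '\u201A': // 0x82 single low-9 quotation mark
                    out.append('\'');
                    break;
                case '\u201C': // 0x93 left double quotation mark
                case '\u201D': // 0x94 right double quotation mark
                case '\u201E': // 0x84 double low-9 quotation mark
                    out.append('"');
                    break;
                case '\u2013': // 0x96 en dash
                case '\u2014': // 0x97 em dash
                    out.append('-');
                    break;
                case '\u20AC': out.append("EUR"); break;   // 0x80 euro sign
                case '\u0192': out.append('f'); break;     // 0x83 f with hook
                case '\u2026': out.append("..."); break;   // 0x85 ellipsis
                case '\u2020': out.append('+'); break;     // 0x86 dagger
                case '\u2021': out.append("++"); break;    // 0x87 double dagger
                case '\u02C6': out.append('^'); break;     // 0x88 circumflex
                case '\u2030': out.append("o/oo"); break;  // 0x89 per mille
                case '\u0160': out.append('S'); break;     // 0x8A S with caron
                case '\u2039': out.append('<'); break;     // 0x8B single left angle quote
                case '\u0152': out.append("OE"); break;    // 0x8C ligature OE
                case '\u017D': out.append('Z'); break;     // 0x8E Z with caron
                case '\u2022': out.append('*'); break;     // 0x95 bullet
                case '\u02DC': out.append('~'); break;     // 0x98 small tilde
                case '\u2122': out.append("(TM)"); break;  // 0x99 trademark sign
                case '\u0161': out.append('s'); break;     // 0x9A s with caron
                case '\u203A': out.append('>'); break;     // 0x9B single right angle quote
                case '\u0153': out.append("oe"); break;    // 0x9C ligature oe
                case '\u017E': out.append('z'); break;     // 0x9E z with caron
                case '\u0178': out.append('Y'); break;     // 0x9F Y with diaeresis
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    /**
     * Optional extra step: also strips combining accents using the
     * Normalizer approach from the question. This does not guarantee pure
     * 7-bit output for every input (letters without a decomposition remain).
     */
    static String foldAndStripAccents(String input) {
        if (input == null) {
            return null;
        }
        String decomposed = Normalizer.normalize(foldCp1252Specials(input), Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}
```

With this sketch, `Cp1252Folding.foldCp1252Specials("“Caffè – Peña”")` yields "Caffè - Peña" wrapped in plain double quotes, which survives the Latin-1 round trip, while `foldAndStripAccents` on the same input gives the "Caffe - Pena" form asked for in the question.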