I want to match the lower case of the English "I" (i) with the lower case of the Turkish "İ" (i). They are the same glyph, but they don't match. When I do System.out.println("İ".toLowerCase());
the character i followed by a dot is printed (this site does not display it properly).
Is there a way to match those, preferably without hard-coding it? I want the program to match the same glyphs regardless of the language and the UTF code. Is this possible?
I've tested normalization with no success.
import java.text.Normalizer;

public class Main {
    public static void main(String... a) {
        String iTurkish = "\u0130"; // "İ"
        String iEnglish = "I";
        prin(iTurkish);
        prin(iEnglish);
    }

    // Prints the string, its NFD form, its lower case, and the NFD form of its lower case.
    private static void prin(String s) {
        System.out.print(s);
        System.out.print(" - Normalized : " + Normalizer.normalize(s, Normalizer.Form.NFD));
        System.out.print(" - lower case: " + s.toLowerCase());
        System.out.print(" - Lower case Normalized : " + Normalizer.normalize(s.toLowerCase(), Normalizer.Form.NFD));
        System.out.println();
    }
}
The result is not shown properly on the site, but the first line (iTurkish) still has the combining dot (̇) next to the lowercase i.
Purpose and Problem
This will be a multilingual dictionary. I want the program to be able to recognize that "İFEL" starts with "if". To make the comparison case-insensitive, I first convert both texts to lower case. İFEL becomes i(dot)fel, and "if" is not recognized as a part of it.
If you print out the hex values of the characters you're seeing, the difference is clear:
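A minimal sketch of such a hex dump follows. The class name HexDump and the helper hex are hypothetical names chosen for illustration, and Locale.ROOT is an assumption added so that the output does not depend on the machine's default locale (a Turkish default locale would lowercase İ differently).

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.stream.Collectors;

final class HexDump {
    // Renders each code point of s as a hex value, e.g. "0x49 0x307".
    static String hex(String s) {
        return s.codePoints()
                .mapToObj(cp -> "0x" + Integer.toHexString(cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String iTurkish = "\u0130"; // "İ"
        String iEnglish = "I";

        System.out.println("English I        : " + hex(iEnglish));  // 0x49
        System.out.println("Turkish I        : " + hex(iTurkish));  // 0x130
        // NFD splits the Turkish letter into a base letter plus a combining dot above.
        System.out.println("Turkish I, NFD   : "
                + hex(Normalizer.normalize(iTurkish, Normalizer.Form.NFD)));  // 0x49 0x307
        // Lower-casing outside a Turkish locale keeps the dot as a separate code point.
        System.out.println("Turkish I, lower : "
                + hex(iTurkish.toLowerCase(Locale.ROOT)));  // 0x69 0x307
    }
}
```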
Normalizing the Turkish İ doesn't give you an English I; instead it gives you an English I followed by a diacritic, 0x307. This is correct and to be expected from the normalization process. Normalization is not a "Convert to ASCII" operation. As the documentation for Normalizer mentions, the process it follows is a very rigorously defined standard, the Unicode Standard Annex #15: Unicode Normalization Forms.
There are numerous ways to strip diacritics, either before or after normalizing. The right choice depends on the specifics, but for your use case I would suggest using Guava's CharMatcher class to strip non-ASCII characters after normalizing.
This answer goes into more depth about what \p{InCombiningDiacriticalMarks} does, and why it's not ideal. My CharMatcher solution isn't ideal either (the linked answer offers more robust solutions), but for a quick fix you may find retaining only ASCII characters "good enough". This is both closer to "correct" and faster than the Pattern-based approach.
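The "normalize, then keep only ASCII" idea described above can be sketched as follows. This is not the original snippet from the answer: to stay dependency-free it uses a plain JDK loop in place of Guava's CharMatcher (with Guava on the classpath, CharMatcher.ascii().retainFrom(...) should do the same filtering). The names AsciiFold, fold and startsWithFolded are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

final class AsciiFold {
    // Decomposes s (NFD), drops every non-ASCII char, then lower-cases the rest.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder ascii = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (c < 0x80) {          // keep ASCII only; combining marks are dropped
                ascii.append(c);
            }
        }
        // Locale.ROOT avoids Turkish-locale casing rules (assumption: locale-independent matching is wanted).
        return ascii.toString().toLowerCase(Locale.ROOT);
    }

    // Case- and diacritic-insensitive prefix check.
    static boolean startsWithFolded(String word, String prefix) {
        return fold(word).startsWith(fold(prefix));
    }

    public static void main(String[] args) {
        String word = "\u0130FEL"; // "İFEL"
        System.out.println(fold(word));                    // ifel
        System.out.println(startsWithFolded(word, "if"));  // true
    }
}
```

Note that this is lossy: letters with no decomposition, such as the Turkish dotless ı (U+0131), are removed entirely rather than mapped to a lookalike, so it is only a quick fix.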