How do I match "i" with the Turkish i in Java?


I want to match the lowercase form of the English "I" (i) with the lowercase form of the Turkish "İ" (i). They look like the same glyph, but they don't match. When I run System.out.println("İ".toLowerCase()); the character i followed by a dot is printed (this site does not display it properly).

Is there a way to match those (preferably without hard-coding it)? I want the program to match identical glyphs regardless of the language and the Unicode code point. Is this possible?

I've tested normalization with no success.

import java.text.Normalizer;

public class Main {
    public static void main(String... a) {
        String iTurkish = "\u0130"; // "İ"
        String iEnglish = "I";
        prin(iTurkish);
        prin(iEnglish);
    }

    // Prints s, its NFD form, its lowercase form, and the NFD form of the lowercase.
    private static void prin(String s) {
        System.out.print(s);
        System.out.print(" -  Normalized : " + Normalizer.normalize(s, Normalizer.Form.NFD));
        System.out.print(" - lower case: " + s.toLowerCase());
        System.out.print(" -  Lower case Normalized : " + Normalizer.normalize(s.toLowerCase(), Normalizer.Form.NFD));
        System.out.println();
    }
}

The result is not displayed properly on this site, but the first line (iTurkish) still has the ̇ next to the lowercase i.

Purpose and Problem

This will be a multilingual dictionary. I want the program to recognize that "İFEL" starts with "if". To make the comparison case-insensitive, I first convert both texts to lower case. "İFEL" becomes i(dot)fel, so "if" is not recognized as its prefix.


There are 2 answers

dimo414 On BEST ANSWER

If you print out the hex values of the characters you're seeing, the difference is clear:

İ 0x130 - Normalized : İ 0x49 0x307 - Lower case: i̇ 0x69 0x307 - Lower case Normalized : i̇ 0x69 0x307
I 0x49 - Normalized : I 0x49 - Lower case: i 0x69 - Lower case Normalized : i 0x69
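
A minimal sketch of how such a hex dump could be produced. The class name CodePointDump and the helper hex are made up for illustration, and Locale.ROOT is my assumption to keep the lowercase result independent of the JVM's default locale (the question's code uses the default locale):

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the original question or answer.
class CodePointDump {
    // Formats each code point of s as hex, separated by spaces.
    static String hex(String s) {
        return s.codePoints()
                .mapToObj(cp -> "0x" + Integer.toHexString(cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String iTurkish = "\u0130"; // İ
        String nfd = Normalizer.normalize(iTurkish, Normalizer.Form.NFD);
        // Assumption: Locale.ROOT, so the result does not depend on the default locale.
        String lower = iTurkish.toLowerCase(Locale.ROOT);

        System.out.println("raw:        " + hex(iTurkish));
        System.out.println("normalized: " + hex(nfd));
        System.out.println("lower case: " + hex(lower));
    }
}
```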

Normalizing the Turkish İ doesn't give you an English I. Instead, it gives you an English I followed by a combining diacritic, 0x307 (combining dot above). This is correct and is exactly what the normalization process is expected to do. Normalization is not a "convert to ASCII" operation. As the documentation for Normalizer mentions, the process follows a rigorously defined standard: Unicode Standard Annex #15, Unicode Normalization Forms.

There are numerous ways to strip diacritics, either before or after normalizing. The right one depends on the specifics of your use case, but for what you describe I would suggest using Guava's CharMatcher class to strip non-ASCII characters after normalizing, e.g.:

String asciiString = CharMatcher.ascii().retainFrom(normalizedString);

This answer goes into more depth about what \p{InCombiningDiacriticalMarks} does, and why it's not ideal. My CharMatcher solution isn't ideal either (the linked answer offers more robust solutions), but for a quick fix you may find retaining only ASCII characters "good enough". It is both closer to "correct" and faster than the Pattern-based approach.
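
To tie this back to the "İFEL" starts with "if" case from the question, here is a rough sketch of the retain-ASCII idea in plain Java. It swaps Guava's CharMatcher for a simple loop so that it runs without dependencies. The names AsciiFold, fold and startsWithFolded are hypothetical, and the use of Locale.ROOT is an assumption on my part:

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class; a dependency-free stand-in for
// CharMatcher.ascii().retainFrom(...) applied after normalization.
class AsciiFold {
    // Lowercases, decomposes to NFD, then keeps only ASCII characters.
    static String fold(String s) {
        // Assumption: Locale.ROOT, so the default locale does not change the result.
        String lower = s.toLowerCase(Locale.ROOT);
        String nfd = Normalizer.normalize(lower, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(nfd.length());
        for (int i = 0; i < nfd.length(); i++) {
            char c = nfd.charAt(i);
            if (c < 128) {
                out.append(c); // drop combining marks and every other non-ASCII char
            }
        }
        return out.toString();
    }

    // Prefix test on the folded forms of both strings.
    static boolean startsWithFolded(String word, String prefix) {
        return fold(word).startsWith(fold(prefix));
    }

    public static void main(String[] args) {
        System.out.println(fold("\u0130FEL"));                   // ifel
        System.out.println(startsWithFolded("\u0130FEL", "if")); // true
    }
}
```

Note that this drops any letter that has no ASCII decomposition at all. The Turkish dotless ı (U+0131), for instance, disappears from the folded string instead of turning into i.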

Rafiq On

You can use the code below:

import java.text.Normalizer;
import java.util.regex.Pattern;

public class Main {
    // Compiled once; matches runs of combining diacritical marks.
    private static final Pattern DIACRITICS = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    public static void main(String... a) {
        String iTurkish = "\u0130"; // "İ"
        String iEnglish = "I";
        prin(iTurkish);
        prin(iEnglish);
    }

    private static void prin(String s) {
        // Decompose first so the dot becomes a separate combining mark, then strip it.
        String nfdNormalizedString = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped = DIACRITICS.matcher(nfdNormalizedString).replaceAll("");
        System.out.print(s);
        System.out.print(" -  Normalized : " + stripped);
        System.out.print(" - lower case: " + s.toLowerCase());
        System.out.print(" -  Lower case Normalized : " + Normalizer.normalize(stripped.toLowerCase(), Normalizer.Form.NFD));
        System.out.println();
    }
}
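
The same pattern can be applied to the prefix check from the question. The following is only a sketch: the names DiacriticStrip, strip and startsWithStripped are invented for illustration, and Locale.ROOT is an assumption that keeps the lowercase step independent of the default locale:

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper class built around the pattern used above.
class DiacriticStrip {
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Decomposes to NFD, removes combining marks, then lowercases.
    static String strip(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        String noMarks = MARKS.matcher(nfd).replaceAll("");
        // Assumption: Locale.ROOT rather than the JVM's default locale.
        return noMarks.toLowerCase(Locale.ROOT);
    }

    // Prefix test on the stripped forms of both strings.
    static boolean startsWithStripped(String word, String prefix) {
        return strip(word).startsWith(strip(prefix));
    }

    public static void main(String[] args) {
        System.out.println(strip("\u0130FEL"));                    // ifel
        System.out.println(startsWithStripped("\u0130FEL", "if")); // true
    }
}
```

Unlike an ASCII-only filter, this keeps base letters that have no decomposition (the dotless ı, U+0131, stays as it is), and it only removes marks from the U+0300 to U+036F block.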

Or see Converting Symbols, Accent Letters to English Alphabet