I am using BreakIterator to count the number of visible characters in a String. This works perfectly for English, but for Hindi it doesn't work as expected.
The String below has a length of 3, but is displayed as a single visual character.
ज्य
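To make the length of 3 concrete, here is a minimal sketch that lists the code points in that String. The class name CodePointDump and the helper describe are hypothetical, introduced only for illustration; the String is written with Unicode escapes so the example does not depend on source-file encoding.

```java
final class CodePointDump {
    // Hypothetical helper: formats every code point of the text as U+XXXX.
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // JA (U+091C) + VIRAMA (U+094D) + YA (U+092F): three chars, one conjunct on screen.
        System.out.println(describe("\u091C\u094D\u092F")); // prints U+091C U+094D U+092F
    }
}
```

The middle code point is the Devanagari virama, which joins the two consonants into one conjunct when rendered.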
When I use BreakIterator, I expect it to treat this String as a single unit, but it treats it as 2 units. Here is my code:
final String text = "ज्य";
final Locale locale = new Locale("hi", "IN");
final BreakIterator breaker = BreakIterator.getCharacterInstance(locale);
breaker.setText(text);
int start = breaker.first();
for (int end = breaker.next();
     end != BreakIterator.DONE;
     start = end, end = breaker.next()) {
    final String substring = text.substring(start, end);
}
Ideally, the for loop should execute once, with start=0 and end=3. But for the String above it executes twice (start=0, end=2 and start=2, end=3).
How can I get BreakIterator to work correctly?
UPDATE:
The code above works as expected when run as a plain Java program. It misbehaves only when run on Android.
Since this happens only on Android, I have reported it as an Android bug: https://code.google.com/p/android/issues/detail?id=230832
I think you need to work with the Unicode characters directly.
See the Oracle documentation on character boundaries.
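One way to act on this, sketched under assumptions rather than offered as a confirmed fix: keep using BreakIterator, but ignore any boundary that falls right after a Devanagari virama (U+094D) and right before a letter, since that boundary would split a conjunct. The class name VisibleCharCounter and the method countVisibleCharacters are hypothetical, and the virama rule is a heuristic for Devanagari only. Locale.forLanguageTag("hi-IN") is used here in place of the Locale constructor from the question.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class VisibleCharCounter {
    private static final char DEVANAGARI_VIRAMA = '\u094D';

    // Hypothetical helper: counts character boundaries, skipping those
    // that fall between a virama and a following letter (heuristic).
    static int countVisibleCharacters(String text, Locale locale) {
        BreakIterator breaker = BreakIterator.getCharacterInstance(locale);
        breaker.setText(text);
        int count = 0;
        breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE; end = breaker.next()) {
            boolean splitsConjunct = end < text.length()
                    && text.charAt(end - 1) == DEVANAGARI_VIRAMA
                    && Character.isLetter(text.codePointAt(end));
            if (!splitsConjunct) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Locale hindi = Locale.forLanguageTag("hi-IN");
        // The conjunct from the question: JA + VIRAMA + YA.
        System.out.println(countVisibleCharacters("\u091C\u094D\u092F", hindi));
    }
}
```

If the platform's iterator already reports the conjunct as one unit, the extra check never fires and the count is unchanged. The heuristic does not cover other Indic scripts, whose viramas have different code points.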