I'm working on a program that deals with Korean sentences and I need a way to break down a syllable, or block, into its letters. For those who don't know Hangul, a syllable is composed of 2-4 letters (jamo), creating thousands of different combinations. What I'd like to do is break those syllables down into the letters that form them.
I was able to get the first letter by comparing the syllable's Unicode value against the range associated with each letter, i.e. a syllable that starts with letter x falls in range y. However, I'm at a loss for finding the rest of the letters.
This is a table containing the Unicode values for Hangul syllables: http://jrgraphix.net/r/Unicode/AC00-D7AF
Hangul syllable decomposition (e.g. 퓛 → ᄑ + ᅱ + ᆶ) is done in Java through the java.text.Normalizer class, by normalising the text to a decomposed form such as NFD.

The algorithm for Hangul decomposition is also given in Section 3.12 of the Unicode Standard (from page 142); and since normalisation also affects other, non-Hangul characters, you should familiarise yourself with the general principles and forms of Unicode normalisation in UAX #15.
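As a minimal sketch of both routes, the class below first decomposes through java.text.Normalizer with Normalizer.Form.NFD, then does the same thing by hand using the index arithmetic from Section 3.12. The class name HangulDecomposer and its method names are made up for illustration; only Normalizer.normalize and the Unicode constants come from the platform and the standard. Note that both routes yield conjoining jamo (the U+1100 block), not the compatibility jamo you would type on a keyboard.

```java
import java.text.Normalizer;

// Hypothetical helper class for illustration; not part of any library.
class HangulDecomposer {
    // Constants from the Hangul decomposition algorithm (Unicode Standard, Section 3.12).
    private static final int S_BASE = 0xAC00; // first precomposed syllable
    private static final int L_BASE = 0x1100; // first leading consonant
    private static final int V_BASE = 0x1161; // first vowel
    private static final int T_BASE = 0x11A7; // one before the first trailing consonant
    private static final int L_COUNT = 19;
    private static final int V_COUNT = 21;
    private static final int T_COUNT = 28;    // 27 trailing consonants plus "none"
    private static final int N_COUNT = V_COUNT * T_COUNT; // 588 syllables per leading consonant
    private static final int S_COUNT = L_COUNT * N_COUNT; // 11172 syllables in total

    /** Library route: NFD splits every precomposed syllable into conjoining jamo. */
    static String decomposeWithNormalizer(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    /** Manual route: decomposes one syllable; any other character is returned unchanged. */
    static String decomposeArithmetically(char syllable) {
        int sIndex = syllable - S_BASE;
        if (sIndex < 0 || sIndex >= S_COUNT) {
            return String.valueOf(syllable); // not a precomposed Hangul syllable
        }
        char leading = (char) (L_BASE + sIndex / N_COUNT);
        char vowel = (char) (V_BASE + (sIndex % N_COUNT) / T_COUNT);
        int tIndex = sIndex % T_COUNT; // 0 means the syllable has no trailing consonant

        StringBuilder jamo = new StringBuilder().append(leading).append(vowel);
        if (tIndex != 0) {
            jamo.append((char) (T_BASE + tIndex));
        }
        return jamo.toString();
    }

    /** Formats a string as space-separated code points, e.g. "U+1111 U+1171". */
    static String toCodePoints(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", (int) c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char syllable = '\uD4DB'; // 퓛
        // Both lines print: U+1111 U+1171 U+11B6
        System.out.println(toCodePoints(decomposeWithNormalizer(String.valueOf(syllable))));
        System.out.println(toCodePoints(decomposeArithmetically(syllable)));
    }
}
```

The arithmetic version only touches the precomposed syllable range U+AC00 to U+D7A3, whereas NFD will also decompose accented Latin letters and other characters, which is why UAX #15 is worth reading before applying the Normalizer to mixed text.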