I currently face some problems with different Unicode representations of special characters, especially ones with accents, diaereses, and so on. I wrote a Python script which parses multiple database dumps and compares values between them. The problem is that these special characters are stored differently in different files: in some files they are composed, in others decomposed. Since I want the strings extracted from the dumps to always be in the composed representation, I tried adding the following line:
value = unicodedata.normalize("NFC", value)
However, this solves my problem only in some cases. For umlauts, for example, it works as expected. Characters like ë, however, remain in the decomposed form (e͏̈).
I figured out that there is a COMBINING GRAPHEME JOINER character (U+034F) between the e and the diaeresis. Is that normal, or could this be the cause of my problem?
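A minimal reproduction of what I am seeing (the literal strings below are hypothetical stand-ins for values from my dumps):

import unicodedata

# Plain base letter + combining diaeresis: NFC composes this as expected.
print(unicodedata.normalize("NFC", "u\u0308"))  # 'ü' (U+00FC)

# Base letter + COMBINING GRAPHEME JOINER (U+034F) + combining diaeresis:
# the CGJ sits between the starter and the mark and blocks canonical
# composition, so the string stays decomposed even under NFC.
blocked = unicodedata.normalize("NFC", "e\u034f\u0308")
print([unicodedata.name(c) for c in blocked])
# ['LATIN SMALL LETTER E', 'COMBINING GRAPHEME JOINER', 'COMBINING DIAERESIS']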
Does anybody know how to handle this issue?
The purpose of U+034F COMBINING GRAPHEME JOINER is to ensure that certain sequences remain distinct under searching, sorting, and normalisation, which is required for the correct handling of characters and combining marks as they are used in some languages with Unicode algorithms. From section 23.2 of the Unicode Standard (page 805): "In general, you should not remove a CGJ without some special knowledge about why it was inserted in the first place."
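That said, if you have inspected your data and know the CGJ in these dumps carries no meaning (for example, it was injected by a faulty export step), one possible workaround is to strip U+034F before normalising. The helper below is only a sketch under that assumption, and to_nfc is a name I made up:

import unicodedata

def to_nfc(value: str) -> str:
    # Removing the CGJ can change searching/sorting behaviour for text
    # that deliberately relies on it, so only do this for data where you
    # know the joiner is spurious.
    return unicodedata.normalize("NFC", value.replace("\u034f", ""))

print(to_nfc("e\u034f\u0308") == "\u00eb")  # True: the sequence now composes to ë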