Unicode normalization does not work as expected


I currently face problems with different Unicode representations of special characters, especially ones with accents or diaereses. I wrote a Python script that parses multiple database dumps and compares values between them. The problem is that these special characters are stored differently across files: in some files they are composed, in others decomposed. As I always want the string extracted from the dump in the composed representation, I tried adding the following line:

import unicodedata

value = unicodedata.normalize("NFC", value)

However, this solves my problem only in some cases. For umlauts, for example, it works as expected. Nevertheless, characters like ë remain in the decomposed form (e͏̈).

I figured out that there is a COMBINING GRAPHEME JOINER character (U+034F) between the e and the diaeresis. Is that normal, or could this be the cause of my problem?
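
For reference, here is a minimal snippet that reproduces what I am seeing (the string literals are my own reconstruction of the dump data, not the actual dump contents):

import unicodedata

# Plain e + combining diaeresis composes to a single code point under NFC.
plain = unicodedata.normalize("NFC", "e\u0308")
print(len(plain), plain)    # 1 ë

# With a COMBINING GRAPHEME JOINER (U+034F) in between,
# NFC leaves the sequence decomposed.
joined = unicodedata.normalize("NFC", "e\u034f\u0308")
print(len(joined), joined)  # 3 e͏̈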

Does anybody know how to handle this issue?


There is 1 answer

Answered by 一二三 (best answer):

The purpose of U+034F COMBINING GRAPHEME JOINER is to ensure that certain sequences remain distinct under searching, sorting, and normalisation. This is required for the correct handling of characters and combining marks by Unicode algorithms in some languages. From section 23.2 of the Unicode Standard (page 805):

U+034F combining grapheme joiner (CGJ) is used to affect the collation of adjacent characters for purposes of language-sensitive collation and searching. It is also used to distinguish sequences that would otherwise be canonically equivalent.

...

In turn, this means that insertion of a combining grapheme joiner between two combining marks will prevent normalization from switching the positions of those two combining marks, regardless of their own combining classes.
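
To illustrate that second point, here is a small sketch (my own example, not taken from the standard) showing that a CGJ keeps two combining marks in their original order under normalisation:

import unicodedata

# Dot below (U+0323, ccc 220) and dot above (U+0307, ccc 230):
# canonical ordering normally sorts them, so both inputs become equal.
print(unicodedata.normalize("NFD", "q\u0323\u0307") ==
      unicodedata.normalize("NFD", "q\u0307\u0323"))        # True

# A CGJ between the marks blocks that reordering, so the
# two sequences stay distinct after normalisation.
print(unicodedata.normalize("NFD", "q\u0323\u034f\u0307") ==
      unicodedata.normalize("NFD", "q\u0307\u034f\u0323"))  # False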

In general, you should not remove a CGJ without some special knowledge about why it was inserted in the first place.
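
That said, if you inspect your dumps and conclude that the CGJs are artifacts of whatever produced them rather than deliberate markers, you could strip them before normalising. A minimal sketch, assuming the CGJs carry no meaning you need to keep (to_composed is just an illustrative name):

import unicodedata

def to_composed(value):
    # Assumption: the CGJs in these dumps are artifacts and safe to drop.
    value = value.replace("\u034f", "")  # remove COMBINING GRAPHEME JOINER
    return unicodedata.normalize("NFC", value)

print(to_composed("e\u034f\u0308"))  # ë as a single composed code point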