What is the theory behind Unicode collation sorting?


What is the theory behind Unicode sorting? I understand how it works, but I don't understand why this standard was chosen for collation.

Suppose you have two strings to compare, using ucol_strcollIter() for example:

ucol_strcollIter(collator, &stringIter1, &stringIter2, &Status)

Then, say the two strings are:

string string1 = "hello"
string string2 = "héllo"

Under the "Secondary" collation strength, string1 is ordered before string2, and the two strings first differ at the secondary level.

<1 hello
<2 héllo

BUT

If you have trailing spaces, like:

string string1 = "hello  "
string string2 = "héllo "

then the accented string (string2) is placed before string1, and the two strings already differ at the primary level.

<1 héllo  
<1 hello 

Why does the Unicode collation algorithm take trailing spaces into account?

Is there some reason behind this?


There are 3 answers

0
Scott Russell On BEST ANSWER

Probably the best TP would be this.

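You can try various option combinations with the ICU Collation Demo (give "alternate=shifted" a try). With that setting, spaces and punctuation are ignored at the first three levels, so the trailing spaces no longer decide the order.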
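To make the effect of "alternate=shifted" concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the weight table, the `VARIABLE` set, and the `sort_key` helper are all invented for illustration and are not ICU's API or real DUCET values. It also stops at two levels, which is enough for a comparison at secondary strength.

```python
import unicodedata

# Hypothetical primary weights (NOT real DUCET values).
# SPACE is the only "variable" character in this toy table.
PRIMARY = {" ": 1, "e": 10, "h": 11, "l": 12, "o": 13}
VARIABLE = {" "}


def sort_key(text, shifted=False):
    """Two-level toy sort key: (primary weights, secondary weights)."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Accent: no primary weight, heavier secondary weight.
            secondary.append(2)
        elif shifted and ch in VARIABLE:
            # Shifted: variable characters are ignored at levels 1 and 2.
            continue
        else:
            primary.append(PRIMARY[ch])
            secondary.append(1)  # plain character: default secondary weight
    return (primary, secondary)


a, b = "hello  ", "h\u00e9llo "  # the two strings from the question
print(sort_key(a) < sort_key(b))                              # spaces count
print(sort_key(a, shifted=True) < sort_key(b, shifted=True))  # spaces ignored
```

With spaces treated as ordinary characters, the string with the longer run of spaces sorts last, so the first comparison is false. With `shifted=True` the spaces drop out of both keys and the accent decides the order at level 2, so the second comparison is true.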

4
Random832 On

Because the space character has a nonzero primary collation weight, 0x0209. (See the Default Unicode Collation Element Table and search for "# SPACE".)

Spaces, trailing or not, are part of the string.
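A rough way to see this is to write out the primary weights of each string. The sketch below takes the 0x0209 value for SPACE from this answer; the letter weights and the `primary_key` helper are hypothetical and only their relative order matters.

```python
import unicodedata

SPACE_PRIMARY = 0x0209  # value quoted in the answer above
# Hypothetical primary weights for the letters (NOT real DUCET values).
LETTER_PRIMARY = {"e": 0x1000, "h": 0x1001, "l": 0x1002, "o": 0x1003}


def primary_key(text):
    """Return the list of nonzero primary weights for text."""
    weights = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            continue  # combining accents carry no primary weight
        weights.append(SPACE_PRIMARY if ch == " " else LETTER_PRIMARY[ch])
    return weights


# Accents are invisible at level 1, so these two keys are equal.
print(primary_key("hello") == primary_key("h\u00e9llo"))
# Each space adds one more primary weight, so the keys differ at level 1.
print(primary_key("h\u00e9llo ") < primary_key("hello  "))
```

Both lines print True. Without spaces the primary keys are identical and the accent is left to break the tie at level 2. With spaces, "hello  " carries seven primary weights and "héllo " only six, so the comparison is settled at level 1 before the accent is ever looked at.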

0
Trent Wood On

This is an old question but I'll answer for others in the future.

The original 'they' is the International Organization for Standardization, which published ISO-14651, a standard for collating text in any encoding scheme, with the goal of supporting Unicode. This standard was largely implementation-independent.

Then the Unicode Consortium published the Unicode Collation Algorithm, which is compatible with ISO-14651 but goes much further in terms of implementation details.

Collation depends on language-specific sorting rules, so collation classes usually take a locale as a parameter. The default sort order is defined in DUCET (the Default Unicode Collation Element Table), as mentioned in another answer. If you use the ICU4J library, its default order is kept in sync with DUCET.

The comparison algorithm is based on a minimum of 3 levels for compliance with ISO-14651. The levels are defined as follows.

  1. Base characters (e.g. a, b, c, d)
  2. Accents
  3. Case / Variants
  4. Punctuation
  5. Identical

Strings are normalized (to NFD) before comparison, so an accented 'á' is decomposed into 'a' plus a combining acute accent. The accent carries no level-1 weight, so 'á' and 'a' compare equal at level 1. Level 2 is then used as a tie-breaker.
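The normalization step can be inspected with Python's standard unicodedata module. This is only an illustration of canonical decomposition (NFD). It does not compute collation weights, and the "combining" value it prints is the canonical combining class, not a collation weight.

```python
import unicodedata

# Decompose the precomposed character U+00E1 into base letter + accent.
decomposed = unicodedata.normalize("NFD", "\u00e1")

for ch in decomposed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"combining={unicodedata.combining(ch)}")
```

This prints two lines, one for U+0061 LATIN SMALL LETTER A and one for U+0301 COMBINING ACUTE ACCENT. A collator can then give the base letter its level-1 weight and leave the accent to level 2.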

The default rules are there for a reason but can be customized for individual use cases. Note that languages sort differently and sort order does not typically match the order in which characters appear in Unicode. Language sort order does not equal binary sort order.
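A small example of the gap between binary order and language order, using plain code point comparison in Python (the word list is made up for illustration):

```python
# Python's default string comparison goes by code point, not by language rules.
words = ["apple", "Zebra", "\u00e9clair", "zoo"]
by_code_point = sorted(words)
print(by_code_point)
```

The result puts "Zebra" first, because uppercase 'Z' (U+005A) precedes lowercase 'a' (U+0061), and "éclair" last, because 'é' (U+00E9) follows 'z' (U+007A). A dictionary-style order would normally place "éclair" between "apple" and "Zebra", which is the kind of result a collation algorithm is meant to produce.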

Refer to the Unicode Collation Algorithm for a very detailed explanation.