What is the theory behind unicode collation sorting

Question

2.2k views Asked by user3404884 At 04 December 2014 at 19:00

What is the theory behind Unicode sorting? I understand how it works, but I don't understand why they decided on this standard for collation sorting.

It seems that when you have two strings to compare, using ucol_strcollIter() for example:

ucol_strcollIter(collator, &stringIter1, &stringIter2, &Status)

Then, say the two strings are:

string string1 = "hello"
string string2 = "héllo"

Under the "Secondary" collation strength, string1 should be ordered before string2, and the two strings are distinguished at the secondary level.

<1 hello
<2 héllo

BUT

If you have trailing spaces, like:

string string1 = "hello  "
string string2 = "héllo "

then the accented hello (string2) will be placed before string1, and the two strings are distinguished by their primary weights.

<1 héllo  
<1 hello

Why does the Unicode collation algorithm take trailing spaces into account?

Is there some reason behind this?

There are 3 answers

Random832 On 04 December 2014 at 19:29

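Because the space character has a primary collation weight of 0x0209. (Reference: Default Unicode Collation Element Table; search for "# SPACE".)

Spaces, trailing or not, are part of the string. In your example, "hello  " ends in two spaces and "héllo " in one, so at the primary level the second string runs out of weights first and sorts before the first. The difference is found at level 1, so the accent is never consulted.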
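A minimal sketch of that primary-level comparison, in plain Python rather than ICU. The letter weights and the `BASE_LETTER` table are made up for illustration (only the space weight 0x0209 comes from the answer above), and the helper names `primary_key` and `compare_primary` are hypothetical. It only models level 1, not the full algorithm.

```python
# Toy primary weights. SPACE uses the 0x0209 quoted above; the letter
# values are made up for illustration and are NOT real DUCET entries.
PRIMARY = {" ": 0x0209, "e": 0x1001, "h": 0x1002, "l": 0x1003, "o": 0x1004}

# Assumption: an accented letter shares the primary weight of its base letter.
BASE_LETTER = {"\u00e9": "e"}  # é -> e


def primary_key(text):
    """Return the level-1 key: one primary weight per character."""
    return [PRIMARY[BASE_LETTER.get(ch, ch)] for ch in text]


def compare_primary(a, b):
    """Return -1, 0, or 1, comparing only primary weights."""
    ka, kb = primary_key(a), primary_key(b)
    return (ka > kb) - (ka < kb)


# Without spaces the two strings tie at level 1 (the accent is a level-2 matter).
print(compare_primary("hello", "h\u00e9llo"))

# Two trailing spaces vs. one: the shorter primary key sorts first,
# so the accented string is ordered before the unaccented one.
print(compare_primary("hello  ", "h\u00e9llo "))
```

With the same number of trailing spaces on both sides, the primary keys tie again and the accent would decide the order at level 2.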

Trent Wood On 01 October 2019 at 22:15

This is an old question but I'll answer for others in the future.

The original 'they' is the International Organization for Standardization, which published ISO/IEC 14651, a standard for collation of text in any encoding scheme but with a goal of supporting Unicode. This standard was largely implementation independent.

Then the Unicode Consortium published the Unicode Collation Algorithm, which is compatible with ISO/IEC 14651 but goes much further in terms of implementation details.

Collation depends on language sorting rules and collation classes usually take locale as a parameter. The default sort order is defined in DUCET, as mentioned previously. If you use the ICU4J library it will be synchronized with DUCET.

The comparison algorithm is based on a minimum of 3 levels for compliance with ISO/IEC 14651. The levels are defined as follows.

1. Base characters (e.g. a, b, c, d)
2. Accents
3. Case / Variants
4. Punctuation
5. Identical

Accented characters are treated as their base letters at the first level. So an accented 'á' compares equal to 'a' at level 1, and the accent only counts at level 2, which is used as a tie-breaker when level 1 finds no difference.
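The level-by-level tie-breaking can be sketched with a toy sort key. This is a simplified illustration, not the real algorithm: the `WEIGHTS` table and its values are invented, the helper name `sort_key` is hypothetical, and only the first three levels are modelled. The key lists all level-1 weights first, then all level-2 weights, then all level-3 weights, so a later level only matters when every earlier level ties.

```python
# (primary, secondary, tertiary) weight per character.
# Illustrative values only; these are NOT taken from DUCET.
WEIGHTS = {
    "a": (1, 0, 0),
    "\u00e1": (1, 1, 0),  # á: same primary as "a", heavier secondary (accent)
    "A": (1, 0, 1),       # same primary and secondary, heavier tertiary (case)
    "b": (2, 0, 0),
}


def sort_key(text, strength=3):
    """Build a key of `strength` levels: all primaries, then secondaries, ..."""
    elements = [WEIGHTS[ch] for ch in text]
    return tuple(tuple(e[level] for e in elements) for level in range(strength))


# At strength 1 the accent is invisible; at strength 2 it breaks the tie.
print(sort_key("ab", strength=1) == sort_key("\u00e1b", strength=1))
print(sort_key("ab", strength=2) < sort_key("\u00e1b", strength=2))

# A primary difference later in the string outranks an earlier accent.
print(sort_key("\u00e1a") < sort_key("ab"))

# Full three-level ordering of single characters.
print(sorted(["b", "A", "\u00e1", "a"], key=sort_key))
```

The last comparison is the reason the levels are not interleaved character by character: "áa" still sorts before "ab" because the base letters are compared across the whole string before any accent is considered.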

The default rules are there for a reason but can be customized for individual use cases. Note that languages sort differently and sort order does not typically match the order in which characters appear in Unicode. Language sort order does not equal binary sort order.

Refer to the Unicode Collation Algorithm for a very detailed explanation.

Scott Russell On 05 December 2014 at 14:52 BEST ANSWER

Probably the best TP would be this.

You can try various option combinations with the ICU Collation Demo. (give "alternate=shifted" a try)
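As a rough picture of what "alternate=shifted" changes: variable characters such as spaces stop contributing to the first three levels, so they no longer decide the order at primary or secondary strength. The sketch below is a toy model in plain Python, not ICU. The weights are invented (apart from the space weight 0x0209 quoted in another answer), the names `VARIABLE` and `sort_key` are hypothetical, and the fourth level that the real option adds for shifted characters is left out.

```python
# (primary, secondary) weight per character. Letter values are made up
# for illustration; only the SPACE primary (0x0209) comes from the thread.
WEIGHTS = {
    " ": (0x0209, 0),
    "e": (0x1001, 0),
    "\u00e9": (0x1001, 1),  # é: same primary as "e", heavier secondary
    "h": (0x1002, 0),
    "l": (0x1003, 0),
    "o": (0x1004, 0),
}

# Assumption: only the space is treated as a variable character here.
VARIABLE = {" "}


def sort_key(text, shifted=False):
    """Two-level key; with shifted=True variable characters are ignored."""
    chars = [ch for ch in text if not (shifted and ch in VARIABLE)]
    primary = tuple(WEIGHTS[ch][0] for ch in chars)
    secondary = tuple(WEIGHTS[ch][1] for ch in chars)
    return (primary, secondary)


s1, s2 = "hello  ", "h\u00e9llo "

# Default handling: spaces carry primary weights, so the string with
# fewer trailing spaces (the accented one) sorts first.
print(sort_key(s2) < sort_key(s1))

# Shifted handling: spaces are skipped, and the accent decides at level 2.
print(sort_key(s1, shifted=True) < sort_key(s2, shifted=True))
```

In ICU4C the corresponding setting is the `UCOL_ALTERNATE_HANDLING` attribute with the value `UCOL_SHIFTED`, set through `ucol_setAttribute`.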
