Invert Unicode String Collation Keys

492 views Asked by At

I'm have an index which stores text strings for search, both in their original form and their collated form (Collated form is used for searching the index, Original is displayed in the results).

The collation is done via the ICU4C implementation, which works as defined in the Unicode Collation Algorithm. I use Sort Keys, and typically only store the Primary strength (no accents, lower/upper-case, code pages etc).

For debugging purposes, is there any way to invert a collation sort key to retrieve a human-readable string similar to the original? Obviously this is a lossy process, but converting the sort key of 'a' to display the ASCII 'a' character is good enough. Hopefully there is a standard way of doing this, without having to implement the translation from binary sort key to printable unicode characters myself. Optimally, the solution will be implemented in C/C++.

Thanks in advance.

1

There are 1 answers

2
AndreyS Scherbakov On

You don't need an universal reverse collation algorithm. You only want to find string collation keys you ever dealt with.

So, simply create a global map(dictionary) where you store every UTF string you putting into your main index. An UTF string will be the dictionary value while its collation key will be your dictionary key:

allStrings[collation_key] = utf_string

Acting in such a way, you create a global sorted index of all your known strings. Then, simply retrieve a string by its collation key whenever you need it for the debugging output.

The string for a given collation key is not guaranteed to be unique, but, as you've mentioned, a string more or less similar to the original one works well in your application.

If even you haven't access to all insertions to the index, you probably still can retrieve all the records first by string: object -> str, then by collation: object -> coll, and join them : for (objects) your_dictionary[collations[object]]=strings[object] .

If you still want to design your own algorithm, note that a collation string contains of sequence of primary elements for all chars followed by a sequence of secondary elems, followed by tertiary elems and so on. I think you may focus on learning primary elements.

  • To get a primary element for a char, convert a string of form "given_character long_constant_sequence" to its collation sequence. A prefix subsequence of the result that variates when you change you given_character will be the primary element(s) of given_character.
  • Now, write a procedure collecting primary elements for all characters. Then, store a map : primary elements -> character (and maybe primary_elements -> primary_elements_size; (but maybe the size is constant in the collation algorithm you used - check it))
  • Develop a tokenizer splitting your collation into elements one by one. Probably it should stop once it fails to find more known primary elements - you don't want to deal with secondary etc. elements in your simple solution.
  • Use tokenizer on your index collations, then lookup primary elements in the primary_elements -> character map previously built. Compose an approximate UTF string to display during the debugging session.