I'm have an index which stores text strings for search, both in their original form and their collated form (Collated form is used for searching the index, Original is displayed in the results).
The collation is done via the ICU4C implementation, which works as defined in the Unicode Collation Algorithm. I use Sort Keys, and typically only store the Primary strength (no accents, lower/upper-case, code pages etc).
For debugging purposes, is there any way to invert a collation sort key to retrieve a human-readable string similar to the original? Obviously this is a lossy process, but converting the sort key of 'a' to display the ASCII 'a' character is good enough. Hopefully there is a standard way of doing this, without having to implement the translation from binary sort key to printable unicode characters myself. Optimally, the solution will be implemented in C/C++.
Thanks in advance.
You don't need an universal reverse collation algorithm. You only want to find string collation keys you ever dealt with.
So, simply create a global map(dictionary) where you store every UTF string you putting into your main index. An UTF string will be the dictionary value while its collation key will be your dictionary key:
Acting in such a way, you create a global sorted index of all your known strings. Then, simply retrieve a string by its collation key whenever you need it for the debugging output.
The string for a given collation key is not guaranteed to be unique, but, as you've mentioned, a string more or less similar to the original one works well in your application.
If even you haven't access to all insertions to the index, you probably still can retrieve all the records first by string: object -> str, then by collation: object -> coll, and join them : for (objects) your_dictionary[collations[object]]=strings[object] .
If you still want to design your own algorithm, note that a collation string contains of sequence of primary elements for all chars followed by a sequence of secondary elems, followed by tertiary elems and so on. I think you may focus on learning primary elements.