One of the things I often need to do when handling a multibyte string is deleting its last character. How do I locate this last character so I can chop it off using normal byte operations, preferably with as few reads as possible?
Note that this question is intended to work for most, if not all, multibyte encodings. The answer for self-synchronizing encodings like UTF-8 is trivial, as you can just scan right-to-left through the bytestring for a start marker.
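For reference, the UTF-8 case can be sketched in a few lines. This is only an illustration: the helper name `utf8_last_char` is made up, and it assumes the buffer is well-formed UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: offset of the last character in s[0..len).
   Assumes well-formed UTF-8; returns 0 for an empty buffer. */
size_t utf8_last_char(const char *s, size_t len)
{
    size_t pos;

    if (len == 0)
        return 0;
    pos = len - 1;
    /* UTF-8 continuation bytes have the bit pattern 10xxxxxx,
       so step back until a byte that is not one of them. */
    while (pos > 0 && ((unsigned char)s[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
```

Chopping the last character is then just `len = utf8_last_char(buf, len);`.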
The answer will be written in C, with POSIX multibyte functions. These functions are also available on Windows. Assume that the bytestring ends at `len` and is well-formed up to that point; assume appropriate `setlocale` calls. Porting to `mbrlen` is left as an exercise for the reader.
The naive solution
The obviously correct solution involves parsing the encoding "as intended", going from left-to-right.
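A minimal sketch of such a left-to-right pass, assuming the locale has already been set with `setlocale` and using the hypothetical name `last_char_naive`: it walks the string with `mblen` and remembers where the most recent character started.

```c
#include <stddef.h>
#include <stdlib.h>   /* mblen */

/* Hypothetical helper: offset of the last character in s[0..len),
   found by decoding from the start of the string.
   Returns 0 for an empty string and (size_t)-1 on a decoding error. */
size_t last_char_naive(const char *s, size_t len)
{
    size_t pos = 0, last = 0;

    (void)mblen(NULL, 0);               /* reset the shift state */
    while (pos < len) {
        int next = mblen(s + pos, len - pos);
        if (next < 0)
            return (size_t)-1;          /* invalid or truncated sequence */
        if (next == 0)
            next = 1;                   /* an embedded NUL is one byte long */
        last = pos;
        pos += (size_t)next;
    }
    return last;
}
```

To chop the last character, set `len` to the returned offset after checking for the error value.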
Deleting multiple characters this way leads to an "accidentally quadratic" situation, since every deletion rescans the string from the start; memoizing intermediate positions helps, but requires additional bookkeeping.
The right-to-left solution
As I mentioned in the question, for self-synchronizing encodings the only thing to do is to look for a start marker. But what breaks with the ones that don't self-synchronize?
In some of them, a continuation byte can be at or below 0x7f, and there's almost no differentiating between start and continuation bytes. For that we can check for `mblen(pos) == bytes_left`, since we know the string is well-formed.
With that cleared up (and assuming the bytestring up to `len` is well-formed), we can probe candidate start positions near the end of the string and accept the one where that check passes.
(You should be able to find the last good char of a malformed string by checking `(next > 0) && (next <= len - pos - 1)`, where `next` is the `mblen` result at `pos`. But don't return that when the last byte is okay!)
What's the point of this?
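To make the bounded scan from the previous section concrete, here is one way it might look. This is a sketch under the same assumptions (well-formed input, locale already set, a stateless encoding), not necessarily the exact code the answer had in mind; `last_char_rtl` is a hypothetical name. It probes at most `MB_CUR_MAX` candidate start positions, longest candidate first, and accepts the first one whose `mblen` result equals the number of bytes left. One caveat: in encodings such as Shift-JIS, where the same byte value can serve as a lead byte or a trail byte, a bounded window cannot always tell the two readings apart, so the naive pass remains the fallback when in doubt.

```c
#include <stddef.h>
#include <stdlib.h>   /* mblen, MB_CUR_MAX */

/* Hypothetical helper: offset of the last character in s[0..len),
   reading only the last MB_CUR_MAX bytes.
   Assumes s[0..len) is well-formed in a stateless encoding.
   Returns (size_t)-1 if len is 0 or no candidate fits. */
size_t last_char_rtl(const char *s, size_t len)
{
    size_t max = MB_CUR_MAX;
    size_t pos = len > max ? len - max : 0;

    for (; pos < len; pos++) {
        int next;

        (void)mblen(NULL, 0);           /* reset the shift state */
        next = mblen(s + pos, len - pos);
        /* Accept pos only if the character starting there
           consumes exactly the bytes that are left. */
        if (next > 0 && (size_t)next == len - pos)
            return pos;
    }
    return (size_t)-1;
}
```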
The code sample above is for the idealist who wants to write not just "UTF-8 support" but "locale support" based on the C library. There might not be a point to this at all in 2021 :)