Last character of a multibyte string

93 views Asked by At

One of the things I often need to do when handling a multibyte string is deleting its last character. How do I locate this last character so I can chop it off using normal byte operations, preferably with as few reads as possible?

Note that this question is intended to work for most, if not all, multibyte encodings. The answer for self-synchonizing encodings like UTF-8 is trivial, as you can just go right-to-left in the bytestring for a start marker.

1

There are 1 answers

0
Mingye Wang On

The answer will be written in C, with POSIX multibyte functions. The said functions are also found on Windows. Assume that the bytestring ends at len and is well-formed up to the point; assume appropriate setlocale calls. Porting to mbrlen is left as an exercise for the reader.

The naive solution

The obviously correct solution involves parsing the encoding "as intended", going from left-to-right.

ssize_t index_of_last_char_left(const char *c, size_t len) {
  size_t pos = 0;
  size_t next = 1;
  mblen(NULL, 0);

  while (pos < len - 1) {
    next = mblen(c + pos, len - pos);
    if (next == -1)  // Invalid input
      return pos;
    pos += next;
  }

  return pos - next;
}

Deleting multiple characters like this will cause an "accidentally quadratic" situation; memorizing of intermediate positions will help, but additional management is required.

The right-to-left solution

As I mentioned in the question, for self-synchonizing encodings the only thing to do is to look for a start marker. But what breaks with the ones that don't self-synchonize?

  • The one-or-two-byte EUC encodings have both bytes of the two-byte sequence higher than 0x7f, and there's almost no differentiating between start and continuation bytes. For that we can check for mblen(pos) == bytes_left since we know the string is well-formed.
  • The Big5, GBK, and GB10830 encodings also allow a continuation byte in the ASCII range, so a lookbehind is mandatory.

With that cleared out (and assuming the bytestring up to len is well-formed), we can have:

// As much as CJK encodings do. I don't have time to see if it works for UTF-1.
#define MAX_MB_LEN 4

ssize_t index_of_last_char_right(const char *c, size_t len) {
  ssize_t pos = len - 1;
  bool last = true;
  bool last_is_okay = false;
  assert(!mblen(NULL, 0));   // No, we really cannot handle shift states.

  for (; pos >= 0 && pos >= len - 2 - MAX_MB_LEN; pos--) {
    int next = mblen(c + pos, len - pos);
    bool okay = (next > 0) && (next == len - pos - 1);
    if (last) {
      last_is_okay = okay;
      last = false;
    } else if (okay)
      return pos;
  }

  return last_is_okay ? len - 1 : -1;
}

(You should be able to find the last good char of a malformed string by (next > 0) && (next <= len - pos - 1). But don't return that when the last byte is okay!)

What's the point of this?

The code sample above is for the idealist who does not want to write just a "UTF-8 support" but a "locale support" based on the C library. There might not have a point for this at all in 2021 :)