How to efficiently convert only the next character from a UTF-8 byte array?

I have this code which works:

    QString qs = QString::fromUtf8(bp,ut).at(0);
    QChar c(qs[0]);

Where bp is a QByteArray::const_pointer, and ut is the maximum expected length of the UTF-8 encoded Unicode code-point. I then grab the first QChar c from the QString qs. It seems that there should be a more efficient way to simply get only the next QChar from the UTF-8 byte array without having to convert an arbitrary amount of the QByteArray into a QString and then getting only the first QChar.

EDIT: From the comments below, it is clear that my question has not been understood, so I will start with some basics. UTF-8 and UTF-16 are two different encodings of the Unicode standard. UTF-8 is the most common and most encouraged encoding for transfer over the Internet and for Unicode text files; it encodes every Unicode code-point in 1 to 4 bytes. UTF-16, on the other hand, is more convenient for handling characters inside a program. The vast majority of software is therefore converting between these two encodings all the time.

A QChar holds the UTF-16 encoding of any Unicode code-point from 0x00 to 0xffff, which covers the majority of the languages and symbols defined so far and in common use. Surrogate pairs are used for the higher Unicode code-point values. At present surrogate pairs seem to have limited support, and they are not of interest to me for the present question.

When you read a text file into a QPlainTextEdit, the conversion is done automatically behind the scenes. Reading a QString from a QByteArray can also be done automatically (provided your locale and codec settings are set for UTF-8), or it can be done explicitly using toUtf8() or fromUtf8(), as in my code above.

The conversion in the other direction can efficiently be done implicitly (behind the scenes) or explicitly with the following code:

    ba += *si; // Depends on the UTF-8 codec

or

    ba += QString(*si).toUtf8(); // UTF-8 explicitly

where ba is a QByteArray and si is a QString::const_iterator. These do exactly the same thing (assuming the codec is set to UTF-8). Both convert the one QChar that si points to within a QString and append the resulting one or more bytes to ba.
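To make "one QChar in, one or more bytes out" concrete, here is a minimal sketch in plain C++ of what such a conversion does for a single code unit. The function name appendUtf8 is hypothetical and this is not Qt's implementation; it assumes the input is a code-point in the 0x0000 to 0xffff range and, like the question, ignores surrogate pairs.

```cpp
#include <string>

// Hypothetical helper (not a Qt API): append the UTF-8 encoding of one
// 16-bit code unit to `out`. Surrogates (0xD800..0xDFFF) are NOT handled.
inline void appendUtf8(std::string &out, char16_t u)
{
    if (u < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(u);
    } else if (u < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (u >> 6));
        out += static_cast<char>(0x80 | (u & 0x3F));
    } else {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (u >> 12));
        out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (u & 0x3F));
    }
}
```

So a single 16-bit value never needs more than 3 bytes, which is where the "3 (or 4 with surrogate pairs)" upper bound comes from.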

All I am trying to do is the inverse conversion for only one character at a time, efficiently. Internally this is being done for every character being converted, and I'm sure it is being done very efficiently.

The problem with QString::fromUtf8(p,n) is that n is the number of bytes to process rather than the number of characters to convert. Therefore you must allow for the largest number of bytes which could be 3 (or 4 if it actually handled surrogate pairs). So if all you want is the next character, you must be prepared to process several bytes, and they do get converted and then are discarded if the result is a QString with more than one character.
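One way around guessing n is to read it off the first byte, since the lead byte of a UTF-8 sequence encodes the sequence length. The sketch below is plain C++ with a hypothetical name (utf8SequenceLength); it is not a Qt function, and it only classifies the lead byte without validating the continuation bytes that follow.

```cpp
// Hypothetical helper (not a Qt API): number of bytes in the UTF-8 sequence
// that starts with `lead`, or 0 if `lead` cannot start a valid sequence
// (a continuation byte, an overlong lead 0xC0/0xC1, or a value above 0xF4).
inline int utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;                  // 0xxxxxxx
    if (lead >= 0xC2 && lead <= 0xDF) return 2; // 110xxxxx
    if (lead >= 0xE0 && lead <= 0xEF) return 3; // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4; // 11110xxx
    return 0;
}
```

Under these assumptions, the result could be passed as n to QString::fromUtf8(p,n) so that exactly one character's worth of bytes is converted. The caller still has to handle a return value of 0 and make sure that many bytes actually remain in the buffer.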

Q: Is there a conversion function that does this one character at a time?

1 answer

Answer by SirDarius:

You want to use QTextDecoder.

It is, according to the documentation:

The QTextDecoder class provides a state-based decoder. A text decoder converts text from an encoded text format into Unicode using a specific codec. The decoder converts text in this format into Unicode, remembering any state that is required between calls.

The important thing here is state. QString and QTextCodec are stateless, so they work on entire strings, start to end.

QTextDecoder, on the other hand, allows you to work on byte buffers one byte at a time, maintaining state between calls so the caller knows whether a UTF-8 sequence has been only partially decoded.

For example:

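    #include <QTextCodec>
    #include <QTextDecoder>

    // bytearray is assumed to be a QByteArray holding UTF-8 data.
    QTextDecoder decoder(QTextCodec::codecForName("UTF-8"));
    QString result;
    int i = 0; // declared outside the loop so the position survives it
    for (; i < bytearray.size(); ++i) {
        // Feed one byte; the decoder buffers incomplete sequences.
        result = decoder.toUnicode(bytearray.constData() + i, 1);
        if (!result.isEmpty()) {
            ++i;   // i now indexes the first byte of the next character
            break; // we got our character!
        }
    }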

The rationale behind this loop is that as long as the decoder is not able to decode a complete UTF-8 character, it will return an empty string.

As soon as it is able to, the result string will contain the one decoded Unicode character (two QChars forming a surrogate pair if the code-point is above 0xffff).

This loop never consumes more bytes than the character needs, and by remembering the loop index, the following characters can be obtained in the same way.
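To illustrate what "maintaining state between calls" means, here is a minimal sketch of a byte-at-a-time UTF-8 decoder in plain C++. The class name Utf8Stepper is hypothetical and this is not how QTextDecoder is implemented. It is simplified on purpose: it does not reject every overlong or surrogate encoding, and a byte that breaks a sequence is dropped rather than reprocessed.

```cpp
#include <cstdint>

// Hypothetical illustration of a state-based decoder (not QTextDecoder).
class Utf8Stepper {
public:
    // Feed one byte. Returns true when a complete code point is available
    // from codePoint(); returns false while a sequence is still in progress
    // or when the byte was invalid (invalid input resets the state).
    bool feed(unsigned char b)
    {
        if (pending_ == 0) {
            if (b < 0x80)               { cp_ = b; return true; }
            if (b >= 0xC2 && b <= 0xDF) { cp_ = b & 0x1F; pending_ = 1; return false; }
            if (b >= 0xE0 && b <= 0xEF) { cp_ = b & 0x0F; pending_ = 2; return false; }
            if (b >= 0xF0 && b <= 0xF4) { cp_ = b & 0x07; pending_ = 3; return false; }
            return false; // stray continuation or invalid lead byte: skipped
        }
        if ((b & 0xC0) != 0x80) {
            pending_ = 0; // broken sequence: give up on it
            return false;
        }
        cp_ = (cp_ << 6) | (b & 0x3F); // append 6 payload bits
        return --pending_ == 0;
    }

    std::uint32_t codePoint() const { return cp_; }

private:
    std::uint32_t cp_ = 0; // code point accumulated so far
    int pending_ = 0;      // continuation bytes still expected
};
```

The two member variables are the whole "state": a caller can feed bytes as they arrive and stop as soon as feed() returns true, which mirrors the empty-string check in the loop above.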