Substring of a std::string in utf-8? C++11


I need to get a substring of the first N characters in a std::string assumed to be utf8. I learned the hard way that .substr does not work... as... expected.

Reference: My strings probably look like this: mission:\n\n1億2千万匹
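
For context, here is a minimal sketch of why plain .substr gives surprising results on this kind of string: it counts bytes (char units), not characters, and in UTF-8 each of the kanji above takes three bytes. The variable names are made up for illustration, and the sample is written as explicit byte escapes so that it does not depend on the source file encoding.

```cpp
#include <string>

// "1億2千万匹" as explicit UTF-8 bytes: '1' and '2' take 1 byte each,
// every kanji takes 3 bytes, so the whole string is 14 bytes long.
const std::string sample = "1\xE5\x84\x84" "2\xE5\x8D\x83\xE4\xB8\x87\xE5\x8C\xB9";

// substr counts bytes: this keeps '1' plus only the first two of the three
// bytes of U+5104, so the result is no longer valid UTF-8.
const std::string broken = sample.substr(0, 3);

// Keeping the first 3 characters ("1億2") takes the first 5 bytes.
const std::string intact = sample.substr(0, 5);
```

So any character-based substring has to translate character positions into byte positions first, which is what the answers below do in different ways.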


There are 4 answers

5
Gunnar Klämke

You could use the boost/locale library to convert the UTF-8 string into a UTF-32 string (std::u32string), in which every element is exactly one code point, and then use the normal .substr() approach:

#include <iostream>
#include <string>
#include <boost/locale.hpp>

std::string ucs4_to_utf8(std::u32string const& in)
{
    return boost::locale::conv::utf_to_utf<char>(in);
}

std::u32string utf8_to_ucs4(std::string const& in)
{
    return boost::locale::conv::utf_to_utf<char32_t>(in);
}

int main(){

  std::string utf8 = u8"1億2千万匹";

  std::u32string part = utf8_to_ucs4(utf8).substr(0,3);

  std::cout<<ucs4_to_utf8(part)<<std::endl;
  // prints : 1億2
  return 0;
}
8
Jonny

I found this code and am just about to try it out.

#include <string>

// Returns `leng` characters (code points) of the UTF-8 string `str`, starting at
// character `start`. Pass std::string::npos as `leng` to take everything up to the end.
std::string utf8_substr(const std::string& str, std::size_t start, std::size_t leng)
{
    if (leng==0) { return ""; }
    const std::size_t npos = std::string::npos;
    // std::size_t instead of unsigned int, so comparisons with npos are not truncated
    std::size_t c, i, ix, q, min=npos, max=npos;
    for (q=0, i=0, ix=str.length(); i < ix; i++, q++)
    {
        if (q==start){ min=i; }                     // byte offset of the first character
        if (leng==npos || q<=start+leng){ max=i; }  // byte offset one past the last character

        c = (unsigned char) str[i];
        if      (c<=127) i+=0;                  // 0xxxxxxx: 1-byte sequence
        else if ((c & 0xE0) == 0xC0) i+=1;      // 110xxxxx: 2-byte sequence
        else if ((c & 0xF0) == 0xE0) i+=2;      // 1110xxxx: 3-byte sequence
        else if ((c & 0xF8) == 0xF0) i+=3;      // 11110xxx: 4-byte sequence
        else return "";//invalid utf8 (5- and 6-byte forms are not part of 4-byte UTF-8)
    }
    if (leng==npos || q<=start+leng){ max=i; }
    if (min==npos || max==npos) { return ""; }
    return str.substr(min,max-min); // second argument of substr is a byte count, not an end offset
}

Update: This worked well for my current issue. I had to combine it with a function that returns the length, in characters, of a UTF-8 encoded std::string.
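
The length function itself is not shown in this answer. A minimal sketch of what such a helper could look like, assuming the input is valid UTF-8, is below; the name utf8_length is hypothetical. It counts code points rather than user-perceived characters, so combining marks are counted separately.

```cpp
#include <string>

// Hypothetical helper: number of code points in `str`, assuming valid UTF-8.
// Every code point has exactly one byte that is not a continuation byte
// (10xxxxxx), so counting those bytes counts the code points.
std::size_t utf8_length(const std::string& str)
{
    std::size_t count = 0;
    for (unsigned char c : str) {
        if ((c & 0xC0) != 0x80) { ++count; }
    }
    return count;
}
```

With this, a call such as utf8_substr(s, 0, n) can be guarded by checking n against utf8_length(s) first.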

My compiler did emit some warnings for this solution.

0
Atul

Based on this answer I've written my utf8 substring function:

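#include <string>

// Copies the first SubStrLength characters (code points) of the UTF-8 string
// originalString into csSubstring.
void utf8substr(const std::string& originalString, int SubStrLength, std::string& csSubstring)
{
    int len = 0;
    std::size_t byteIndex = 0;
    const std::size_t origSize = originalString.size();

    for (byteIndex=0; byteIndex < origSize; byteIndex++)
    {
        // Every byte that is not a continuation byte (10xxxxxx) starts a new character.
        if((originalString[byteIndex] & 0xc0) != 0x80)
            len += 1;

        // Stop at the first byte of character number SubStrLength+1, so that
        // byteIndex is the byte length of the first SubStrLength characters.
        if(len > SubStrLength)
            break;
    }

    csSubstring = originalString.substr(0, byteIndex);
}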
0
Huberti

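You could use the standard library to convert the UTF-8 string into a UTF-32 string (std::u32string) and then use the normal .substr() approach. Note that std::wstring_convert and <codecvt> are available in C++11 but were deprecated in C++17: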

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

std::string ucs4ToUtf8(const std::u32string& in)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(in);
}

std::u32string utf8ToUcs4(const std::string& in)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(in);
}

int main(){

  std::string utf8 = u8"4ą5źćęł";

  std::u32string part = utf8ToUcs4(utf8).substr(0,3);

  std::cout<<ucs4ToUtf8(part)<<std::endl;
  // prints : 4ą5
  return 0;
}