C++: How to remove surrogate Unicode values from a string?


How do you remove surrogate values from a std::string in C++? I am looking for a regular expression like this:

string pattern = u8"[\uD800-\uDFFF]";
regex regx(pattern);
name = regex_replace(name, regx, "_");

How do you do this in a multiplatform C++ project (e.g. one built with CMake)?

1 Answer

Remy Lebeau (best answer)

First off, you can't store UTF-16 surrogates in a char-based std::string. You need a std::u16string (char16_t-based) or, on Windows only, a std::wstring (wchar_t-based). JavaScript strings are UTF-16 strings, which is why a pattern like yours would work there.
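To make this concrete, here is a minimal sketch of how a character outside the Basic Multilingual Plane shows up as two surrogate code units in a std::u16string. The helper names is_surrogate and count_surrogates are illustrative only and are not part of the standard library.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: true for any UTF-16 surrogate code unit (high or low).
inline bool is_surrogate(char16_t ch) {
    return ch >= 0xD800 && ch <= 0xDFFF;
}

// Hypothetical helper: counts the surrogate code units in a UTF-16 string.
inline std::size_t count_surrogates(const std::u16string& s) {
    return static_cast<std::size_t>(
        std::count_if(s.begin(), s.end(), is_surrogate));
}
```

For instance, the literal u"a\U0001F600b" holds four code units, because U+1F600 is encoded as the pair 0xD83D 0xDE00, so count_surrogates reports 2 for it.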

For those string types, you can use either:

  • std::remove_if() + std::basic_string::erase():

    #include <string>
    #include <algorithm>
    
    std::u16string str; // or std::wstring on Windows
    ...
    str.erase(
        std::remove_if(str.begin(), str.end(),
            [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); }
        ),
        str.end()
    );
    
  • std::erase_if() (C++20 and later only):

    #include <string>
    
    std::u16string str; // or std::wstring on Windows
    ...
    std::erase_if(str,
        [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); }
    );
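As a self-contained sketch, the erase-remove variant can be wrapped in a small function. The name remove_surrogates is an assumption made for illustration; the body is the same idiom as above and needs only C++11.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: returns a copy of `in` with every surrogate code unit
// (0xD800-0xDFFF) removed, using the erase-remove idiom.
std::u16string remove_surrogates(std::u16string in) {
    in.erase(
        std::remove_if(in.begin(), in.end(),
            [](char16_t ch) { return ch >= 0xD800 && ch <= 0xDFFF; }),
        in.end());
    return in;
}
```

Keep in mind that this drops paired surrogates as well as lone ones, so any character outside the BMP (most emoji, for example) disappears entirely.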
    

UPDATE: You edited your question to change its semantics. Originally, you asked how to remove surrogates; now you are asking how to replace them instead. You can use std::replace_if() for that task, e.g.:

#include <string>
#include <algorithm>

std::u16string str; // or std::wstring on Windows
...
std::replace_if(str.begin(), str.end(),
    [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); },
    u'_'
);
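Note that std::replace_if() works per code unit, so a valid surrogate pair (one character outside the BMP) becomes two replacement characters. If one replacement per character is preferred, something along these lines might do. The function name replace_surrogates and the decision to treat a high surrogate followed by a low surrogate as a single unit are assumptions for this sketch, not part of the original answer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replaces each valid surrogate pair, and each lone
// surrogate, with a single `repl` code unit.
std::u16string replace_surrogates(const std::u16string& in,
                                  char16_t repl = u'_') {
    std::u16string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t ch = in[i];
        const bool high = ch >= 0xD800 && ch <= 0xDBFF;
        const bool low  = ch >= 0xDC00 && ch <= 0xDFFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            ++i;                 // skip the low half of a valid pair
            out.push_back(repl);
        } else if (high || low) {
            out.push_back(repl); // lone (unpaired) surrogate
        } else {
            out.push_back(ch);   // ordinary BMP code unit
        }
    }
    return out;
}
```

With this version, u"a\U0001F600b" becomes u"a_b", whereas the std::replace_if() call above would produce u"a__b".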

Or, if you really want a regex-based approach, you can use std::regex_replace(), e.g.:

#include <string>
#include <regex>

std::wstring str; // std::basic_regex does not support char16_t strings!
...
std::wstring newstr = std::regex_replace(
    str,
    std::wregex(L"[\\uD800-\\uDFFF]"),
    L"_"
);