UTF-16 codecvt facet


Extending from this question about locales, and as described in this question: what I really wanted to do was install a codecvt facet into the locale that understands UTF-16 files.

I could write my own. But I am not a UTF expert, so I am sure I would get it nearly correct and it would then break at the most inconvenient time. So I was wondering if there are any resources (on the web) of pre-built codecvt (or other) facets that can be used from C++ and that are peer reviewed and tested?

The reason is that the default locale (on my system, Mac OS X 10.6) just widens each byte to one wchar_t when reading a file, with no conversion. Thus UTF-16 encoded files are read into wstrings that contain lots of null ('\0') characters.

There are 2 answers

2
seh On BEST ANSWER

I'm not sure if by "resources on the Web" you meant available free of cost, but there is the Dinkumware Conversions Library, which sounds like it will fit your needs, provided that the library can be integrated into your compiler suite.

The codecvt types are described in the section Code Conversions.

0
Justin Time - Reinstate Monica On

As of C++11, there are additional standard codecvt specialisations and types, intended for converting between various UTF-x and UCSx character sequences; one of these may suit your needs.

In <locale>:

  • std::codecvt<char16_t, char, std::mbstate_t>: Converts between UTF-16 and UTF-8.
  • std::codecvt<char32_t, char, std::mbstate_t>: Converts between UTF-32 and UTF-8.

In <codecvt>:

  • std::codecvt_utf8_utf16<typename Elem>: Converts between UTF-8 and UTF-16, where the UTF-16 code units are stored as the specified Elem (note that if char32_t is specified, only one code unit will be stored per char32_t).
    • Has two additional, defaulted template parameters (unsigned long MaxCode = 0x10ffff, and std::codecvt_mode Mode = (std::codecvt_mode)0), and inherits from std::codecvt<Elem, char, std::mbstate_t>.
  • std::codecvt_utf8<typename Elem>: Converts between UTF-8 and either UCS2 or UCS4, depending on Elem (UCS2 for char16_t, UCS4 for char32_t, platform-dependent for wchar_t).
    • Has two additional, defaulted template parameters (unsigned long MaxCode = 0x10ffff, and std::codecvt_mode Mode = (std::codecvt_mode)0), and inherits from std::codecvt<Elem, char, std::mbstate_t>.
  • std::codecvt_utf16<typename Elem>: Converts between UTF-16 and either UCS2 or UCS4, depending on Elem (UCS2 for char16_t, UCS4 for char32_t, platform-dependent for wchar_t).
    • Has two additional, defaulted template parameters (unsigned long MaxCode = 0x10ffff, and std::codecvt_mode Mode = (std::codecvt_mode)0), and inherits from std::codecvt<Elem, char, std::mbstate_t>.
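As a rough illustration of how one of these facets might be used for in-memory conversion, here is a minimal sketch built on std::wstring_convert from <locale>. The helper names utf8_to_utf16 and utf16_to_utf8 are made up for this example, and both wstring_convert and the <codecvt> header are deprecated as of C++17, so treat this as a sketch rather than a recommendation.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-8 bytes -> UTF-16 code units.
// from_bytes() throws std::range_error on malformed input.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

// Hypothetical helper: UTF-16 code units -> UTF-8 bytes.
// to_bytes() throws std::range_error on malformed input.
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```

Because the facet works on UTF-16 rather than UCS2, a code point outside the Basic Multilingual Plane should come back as a surrogate pair, i.e. two char16_t elements.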

codecvt_utf8 and codecvt_utf16 will convert between the specified UTF and either UCS2 or UCS4, depending on the size of Elem. Therefore, wchar_t will specify UCS2 on systems where it's 16- to 31-bit (such as Windows, where it's 16-bit), or UCS4 on systems where it's at least 32-bit (such as Linux, where it's 32-bit), regardless of whether wchar_t strings actually use that encoding. On platforms that use a different encoding for wchar_t strings, this will understandably cause problems if you aren't careful.
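For the original problem of reading a UTF-16 file into a wstring, something along these lines may work. It is only a sketch: read_utf16_file is a hypothetical helper, and it assumes the file is little-endian (codecvt_utf16 defaults to big-endian when std::little_endian is not set). Per the paragraph above, the result is UCS2 where wchar_t is 16-bit, so characters outside the BMP would not survive there.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Hypothetical helper: reads a whole UTF-16LE file into a std::wstring.
// Returns an empty string if the file cannot be opened.
std::wstring read_utf16_file(const char* path) {
    // Assumption: little-endian input; consume_header skips a leading BOM.
    using facet = std::codecvt_utf16<
        wchar_t, 0x10ffff,
        static_cast<std::codecvt_mode>(std::consume_header |
                                       std::little_endian)>;

    std::wifstream in;
    // Install the facet before opening; the locale takes ownership of it.
    in.imbue(std::locale(in.getloc(), new facet));
    // Binary mode keeps the stream from translating line endings.
    in.open(path, std::ios::binary);

    return std::wstring(std::istreambuf_iterator<wchar_t>(in),
                        std::istreambuf_iterator<wchar_t>());
}
```

The endianness is fixed in the template argument rather than detected from the byte order mark alone, on the assumption that a hard-coded byte order is less surprising for files larger than the stream buffer.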

For more information, see CPP Reference.

Note that support for the <codecvt> header was only added to libstdc++ relatively recently. If you are using an older version of Clang or GCC, you may have to use libc++ in order to use it.
Note that versions of Visual Studio prior to 2015 don't actually support char16_t and char32_t; if these types exist on earlier versions, they will be typedefs for unsigned short and unsigned int, respectively. Also note that older versions of Visual Studio can sometimes have trouble converting strings between UTF encodings, and that Visual Studio 2015 has a glitch that prevents codecvt from working properly with char16_t and char32_t, requiring the use of same-sized integral types instead.