When should we prefer wide-character strings?


I am modernizing a large, legacy MFC codebase which contains a veritable medley of string types:

  • CString
  • std::string
  • std::wstring
  • char*
  • wchar_t*
  • _bstr_t

I'd like to standardize on a single string type internally, and convert to other types only when absolutely required by a third-party API (i.e. COM or MFC functions). The question my coworkers and I are debating is: which string type should we standardize on?

I would prefer one of the C++ standard strings: std::string or std::wstring. I'm personally leaning toward std::string, because we do not have any need for wide characters - it is an internal codebase with no customer-facing UI (i.e. no need for multiple-language support). "Plain" strings allow us to use simple, unadorned string literals ("Hello world" vs L"Hello world" or _T("Hello world")).
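To make the literal-syntax point concrete, here is a minimal sketch (the helper names are illustrative only; note that `wchar_t` is 16-bit on Windows but typically 32-bit on POSIX systems):

```cpp
#include <string>

// Plain vs. wide string literals: the same text, two character widths.
// _T("...") from <tchar.h> would expand to one form or the other
// depending on whether _UNICODE is defined (Windows-specific, not shown).
std::string  narrow_hello() { return "Hello world"; }   // unadorned literal
std::wstring wide_hello()   { return L"Hello world"; }  // L-prefixed wide literal
```

Both hold the same eleven characters; only the element type differs.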

Is there an official stance from the programming community? When faced with multiple string types, what is typically used as the standard 'internal' storage format?


There are 2 answers

Simon Mourier (accepted answer)

If we're talking about Windows, then I'd use std::wstring (because we usually need real string manipulation, not just raw buffers), or wchar_t* if you just pass strings around.

Note that Microsoft recommends exactly this here: Working with Strings

Windows natively supports Unicode strings for UI elements, file names, and so forth. Unicode is the preferred character encoding, because it supports all character sets and languages. Windows represents Unicode characters using UTF-16 encoding, in which each character is encoded as a 16-bit value. UTF-16 characters are called wide characters, to distinguish them from 8-bit ANSI characters. The Visual C++ compiler supports the built-in data type wchar_t for wide characters.

Also:

When Microsoft introduced Unicode support to Windows, it eased the transition by providing two parallel sets of APIs, one for ANSI strings and the other for Unicode strings. [...] Internally, the ANSI version translates the string to Unicode.

Also:

New applications should always call the Unicode versions. Many world languages require Unicode. If you use ANSI strings, it will be impossible to localize your application. The ANSI versions are also less efficient, because the operating system must convert the ANSI strings to Unicode at run time. [...] Most newer APIs in Windows have just a Unicode version, with no corresponding ANSI version.

rustyx

It depends.

When programming for Windows, I recommend using std::wstring at least for:

  • Resources (Strings, Dialogs, etc.)
  • Filesystem access (Windows allows non-ASCII characters in file and directory names, including all the "wrong kinds of apostrophes"; such names are impossible to open through the ANSI API)
  • COM (a BSTR is always wide character)
  • Other user-facing interfaces (clipboard, system error reporting, etc)

However, it is easier to handle internal ASCII data files and UTF-8-encoded data using narrow-character strings (std::string). It's fast, efficient and straightforward.
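If you store UTF-8 in std::string internally, you only need a conversion at the API boundary. On Windows the real conversion would use MultiByteToWideChar; the following is a portable, hypothetical sketch of the idea (BMP code points only, no input validation), just to show that the boundary conversion is mechanical:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into UTF-16 code units.
// Sketch only: handles 1- to 3-byte sequences (the Basic Multilingual
// Plane) and performs no error checking on malformed input.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (c < 0x80)      { cp = c;        len = 1; }  // ASCII
        else if (c < 0xE0) { cp = c & 0x1F; len = 2; }  // 2-byte sequence
        else               { cp = c & 0x0F; len = 3; }  // 3-byte sequence
        for (std::size_t j = 1; j < len; ++j)
            cp = (cp << 6) | (in[i + j] & 0x3F);        // fold in continuation bytes
        out.push_back(static_cast<char16_t>(cp));
        i += len;
    }
    return out;
}
```

For example, the UTF-8 bytes for "héllo" decode to the same five UTF-16 code units.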

There may also be other aspects that are not mentioned in the question, such as databases or APIs used, input/output files, etc. and their charsets - all of those play a role when deciding on the best data structures for the job.

"UTF-8 everywhere" is a sound idea in general. But there is not a single Windows API that takes UTF-8. Even the std::experimental::filesystem API uses std::wstring on Windows and std::string on POSIX.
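That platform split is visible directly in the (since-standardized) std::filesystem: path::value_type is wchar_t on Windows and char on POSIX, while path::string() always yields a narrow std::string. A small sketch (the helper name is illustrative):

```cpp
#include <filesystem>
#include <string>

// std::filesystem::path stores wchar_t on Windows and char on POSIX,
// but string() converts to a narrow std::string on either platform,
// so portable code can stay narrow at the edges.
std::string portable_filename(const std::filesystem::path& p) {
    return p.filename().string();
}
```

This is why code that standardizes on std::string internally can still interoperate with the wide-character path type cleanly.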