Stumped with Unicode, Boost, C++, codecvts

11.7k views Asked by At

In C++, I want to use Unicode to do things. So after falling down the rabbit hole of Unicode, I've managed to end up in a train wreck of confusion, headaches and locales.

But in Boost I've had the unfortunate problem of trying to use Unicode file paths and trying to use the Boost program options library with Unicode input. I've read whatever I could find on the subjects of locales, codecvts, Unicode encodings and Boost.

My current attempt to get things to work is to have a codecvt that takes a UTF-8 string and converts it to the platform's encoding (UTF-8 on POSIX, UTF-16 on Windows), I've been trying to avoid wchar_t.

The closest I've actually gotten is trying to do this with Boost.Locale, to convert from a UTF-8 string to a UTF-32 string on output.

#include <string>
#include <boost/locale.hpp>
#include <locale>

int main(void)
{
  std::string data("Testing, 㤹");

  std::locale fromLoc = boost::locale::generator().generate("en_US.UTF-8");
  std::locale toLoc   = boost::locale::generator().generate("en_US.UTF-32");

  typedef std::codecvt<wchar_t, char, mbstate_t> cvtType;
  cvtType const* toCvt = &std::use_facet<cvtType>(toLoc);

  std::locale convLoc = std::locale(fromLoc, toCvt);

  std::cout.imbue(convLoc);
  std::cout << data << std::endl;

  // Output is unconverted -- what?

  return 0;
}

I think I had some other kind of conversion working using wide characters, but I really don't know what I'm even doing. I don't know what the right tool for the job is at this point. Help?

3

There are 3 answers

3
Jookia On BEST ANSWER

Okay, after a long few months I've figured it out, and I'd like to help people in the future.

First of all, the codecvt thing was the wrong way of doing it. Boost.Locale provides a simple way of converting between character sets in its boost::locale::conv namespace. Here's one example (there's others not based on locales).

#include <boost/locale.hpp>
namespace loc = boost::locale;

int main(void)
{
  loc::generator gen;
  std::locale blah = gen.generate("en_US.utf-32");

  std::string UTF8String = "Tésting!";
  // from_utf will also work with wide strings as it uses the character size
  // to detect the encoding.
  std::string converted = loc::conv::from_utf(UTF8String, blah);

  // Outputs a UTF-32 string.
  std::cout << converted << std::endl;

  return 0;
}

As you can see, if you replace the "en_US.utf-32" with "" it'll output in the user's locale.

I still don't know how to make std::cout do this all the time, but the translate() function of Boost.Locale outputs in the user's locale.

As for the filesystem using UTF-8 strings cross platform, it seems that that's possible, here's a link to how to do it.

8
Yakov Galka On
  std::cout.imbue(convLoc);
  std::cout << data << std::endl;

This does no conversion, since it uses codecvt<char, char, mbstate_t> which is a no-op. The only standard streams that use codecvt are file-streams. std::cout is not required to perform any conversion at all.

To force Boost.Filesystem to interpret narrow-strings as UTF-8 on windows, use boost::filesystem::imbue with a locale with a UTF-8 ↔ UTF-16 codecvt facet. Boost.Locale has an implementation of the latter.

11
Cheers and hth. - Alf On

The Boost filesystem iostream replacement classes work fine with UTF-16 when used with Visual C++.

However, they do not work (in the sense of supporting arbitrary filenames) when used with g++ in Windows - at least as of Boost version 1.47. There is a code comment explaining that; essentially, the Visual C++ standard library provides non-standard wchar_t based constructors that Boost filesystem classes make use of, but g++ does not support these extensions.

A workaround is to use 8.3 short filenames, but this solution is a bit brittle since with old Windows versions the user can turn off automatic generation of short filenames.


Example code for using Boost filesystem in Windows:

#include "CmdLineArgs.h"        // CmdLineArgs
#include "throwx.h"             // throwX, hopefully
#include "string_conversions.h" // ansiOrFillerFrom( wstring )

#include <boost/filesystem/fstream.hpp>     // boost::filesystem::ifstream
#include <iostream>             // std::cout, std::cerr, std::endl
#include <stdexcept>            // std::runtime_error, std::exception
#include <string>               // std::string
#include <stdlib.h>             // EXIT_SUCCESS, EXIT_FAILURE
using namespace std;
namespace bfs = boost::filesystem;

inline string ansi( wstring const& ws ) { return ansiWithFillersFrom( ws ); }

int main()
{
    try
    {
        CmdLineArgs const   args;
        wstring const       programPath     = args.at( 0 );

        hopefully( args.nArgs() == 2 )
            || throwX( "Usage: " + ansi( programPath ) + " FILENAME" );

        wstring const       filePath        = args.at( 1 );
        bfs::ifstream       stream( filePath );     // Nice Boost ifstream subclass.
        hopefully( !stream.fail() )
            || throwX( "Failed to open file '" + ansi( filePath ) + "'" );

        string line;
        while( getline( stream, line ) )
        {
            cout << line << endl;
        }
        hopefully( stream.eof() )
            || throwX( "Failed to list contents of file '" + ansi( filePath ) + "'" );

        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        cerr << "!" << x.what() << endl;
    }
    return EXIT_FAILURE;
}