How to get the name of a Unicode character?

572 views Asked by At

I think I saw this a long time ago; a way to get a string containing the name of a unicode character by using Win32 API calls. I'm using C++ Builder so if there is support for it in the VCL library that would work fine too.

For example:
GetUnicodeName(U+0021) would return a string (or fill in a struct or similar), such as "EXCLAMATION MARK".

Or if there are some other way to get the same result from Windows with C or C++.

The worst case scenario would be to have a HUGE lookup table with the names of interest (mainly Latin characters).

1

There are 1 answers

0
DJm00n On

You can use undocumented GetUName method from getuname.dll:

std::string GetUnicodeCharacterName(wchar_t character)
{
    // https://github.com/reactos/reactos/tree/master/dll/win32/getuname
    typedef int(WINAPI* GetUNameFunc)(WORD wCharCode, LPWSTR lpBuf);
    static GetUNameFunc pfnGetUName = reinterpret_cast<GetUNameFunc>(::GetProcAddress(::LoadLibraryA("getuname.dll"), "GetUName"));

    if (!pfnGetUName)
        return {};

    std::array<WCHAR, 256> buffer;
    int length = pfnGetUName(character, buffer.data());

    return utf8::narrow(buffer.data(), length);
}

// Replace invisible code point with code point that is visible
wchar_t ReplaceInvisible(wchar_t character)
{
    if (!std::iswgraph(character))
    {
        if (character <= 0x21)
            character += 0x2400; // U+2400 Control Pictures https://www.unicode.org/charts/PDF/U2400.pdf
        else
            character = 0xFFFD; // REPLACEMENT CHARACTER
    }

    return character;
}

// Accepts in UTF-8.
// Returns UTF-8 string like this:
// q <U+71 Latin Small Letter Q>
// п <U+43F Cyrillic Small Letter Pe>
// ␈ <U+8 Backspace>
//  <U+10338 Supplementary Multilingual Plane>
//  <U+1F692 Supplementary Multilingual Plane>
std::string GetUnicodeCharacterNames(std::string string)
{
    // UTF-8 <=> UTF-32 converter
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf32conv;

    // UTF-8 to UTF-32
    std::u32string utf32string = utf32conv.from_bytes(string);

    std::string characterNames;
    characterNames.reserve(35 * utf32string.size());

    for (const char32_t& codePoint : utf32string)
    {
        if (!characterNames.empty())
            characterNames.append(", ");

        char32_t visibleCodePoint = (codePoint < 0xFFFF) ? ReplaceInvisible(static_cast<wchar_t>(codePoint)) : codePoint;
        std::string charName = (codePoint < 0xFFFF) ? GetUnicodeCharacterName(static_cast<wchar_t>(codePoint)) : "Supplementary Multilingual Plane";

        // UTF-32 to UTF-8
        std::string utf8codePoint = utf32conv.to_bytes(&visibleCodePoint, &visibleCodePoint + 1);
        characterNames.append(fmt::format("{} <U+{:X} {}>", utf8codePoint, static_cast<uint32_t>(codePoint), charName));
    }

    return characterNames;
}

The downside is that it only contains characters from Unicode Basic Multilingual Plane (BMP).

Update: You can use u_charName() ICU API that comes with Windows since Fall Creators Update (Version 1709 Build 16299):

std::string GetUCharNameWrapper(char32_t codePoint)
{
    typedef int32_t(*u_charNameFunc)(char32_t code, int nameChoice, char* buffer, int32_t bufferLength, int* pErrorCode);
    static u_charNameFunc pfnU_charName = reinterpret_cast<u_charNameFunc>(::GetProcAddress(::LoadLibraryA("icuuc.dll"), "u_charName"));

    if (!pfnU_charName)
        return {};

    int errorCode = 0;
    std::array<char, 512> buffer;
    int32_t length = pfnU_charName(codePoint, 0/*U_UNICODE_CHAR_NAME*/ , buffer.data(), static_cast<int32_t>(buffer.size() - 1), &errorCode);

    if (errorCode != 0)
        return {};

    return std::string(buffer.data(), length);
}