Accept non-ASCII characters


Consider this program:

#include <stdio.h>

int main(int argc, char* argv[]) {
   printf("%s\n", argv[1]);  
   return 0;
}

I compile it like this:

x86_64-w64-mingw32-gcc -o alpha alpha.c

The problem appears when I give it a non-ASCII argument:

$ ./alpha róisín
r�is�n

How can I write and/or compile this program so that it accepts non-ASCII characters?

To respond to alk: no, the program itself is printing the wrong bytes. See this example:

$ echo Ω | od -t x1c
0000000  ce  a9  0a
        316 251  \n
0000003

$ ./alpha Ω | od -t x1c
0000000  4f  0d  0a
          O  \r  \n
0000003

There are 3 answers

Zombo (best answer)

The easiest way to do this is with wmain (with MinGW-w64, add -municode to the compile command so that wmain is used as the entry point):

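#include <fcntl.h>
#include <io.h>
#include <stdio.h>
#include <wchar.h>

int wmain (int argc, wchar_t** argv) {
  if (argc < 2) return 1;
  /* Switch stdout to UTF-16 mode so wide output reaches the console intact. */
  _setmode(_fileno(stdout), _O_WTEXT);
  /* %ls is the standard conversion for a wide string. */
  wprintf(L"%ls\n", argv[1]);
  return 0;
}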

It can also be done with GetCommandLineW; here is a simple version of the code found at the HandBrake repo:

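#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

/* Replace argc/argv with UTF-8 copies built from the UTF-16 command line. */
int get_argv_utf8(int* argc_ptr, char*** argv_ptr) {
  int argc, i, offset, size;
  char** argv;
  wchar_t** argv_utf16 = CommandLineToArgvW(GetCommandLineW(), &argc);
  if (argv_utf16 == NULL) return -1;
  /* One block: the pointer table (argc + 1 slots) followed by the strings. */
  offset = (argc + 1) * sizeof(char*);
  size = offset;
  /* First pass: a zero-length destination makes the call return the size needed. */
  for (i = 0; i < argc; i++)
    size += WideCharToMultiByte(CP_UTF8, 0, argv_utf16[i], -1, 0, 0, 0, 0);
  argv = malloc(size);
  if (argv == NULL) { LocalFree(argv_utf16); return -1; }
  /* Second pass: convert each argument into the block. */
  for (i = 0; i < argc; i++) {
    argv[i] = (char*) argv + offset;
    offset += WideCharToMultiByte(CP_UTF8, 0, argv_utf16[i], -1,
      argv[i], size-offset, 0, 0);
  }
  argv[argc] = NULL;
  LocalFree(argv_utf16);
  *argc_ptr = argc;
  *argv_ptr = argv;
  return 0;
}

int main(int argc, char** argv) {
  if (get_argv_utf8(&argc, &argv) != 0 || argc < 2) return 1;
  printf("%s\n", argv[1]);
  free(argv);
  return 0;
}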
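
To make the conversion step less of a black box, here is a rough, portable sketch of what a UTF-16 to UTF-8 conversion does. It is not how WideCharToMultiByte is implemented; the function name utf16_to_utf8 and its return convention are made up for illustration, and in real Windows code you would keep calling the API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode a NUL-terminated UTF-16 string as UTF-8.
 * Returns the number of bytes written, including the terminating NUL,
 * or 0 if dst is too small or src contains an unpaired surrogate. */
size_t utf16_to_utf8(const uint16_t *src, char *dst, size_t dst_size)
{
    size_t n = 0;
    while (*src) {
        uint32_t cp = *src++;
        size_t len;
        if (cp >= 0xD800 && cp <= 0xDBFF) {          /* high surrogate */
            if (*src < 0xDC00 || *src > 0xDFFF)
                return 0;                            /* no low surrogate follows */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*src++ - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return 0;                                /* stray low surrogate */
        }
        len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + len + 1 > dst_size)                  /* keep room for the NUL */
            return 0;
        if (len == 1) {
            dst[n++] = (char)cp;
        } else if (len == 2) {
            dst[n++] = (char)(0xC0 | (cp >> 6));
            dst[n++] = (char)(0x80 | (cp & 0x3F));
        } else if (len == 3) {
            dst[n++] = (char)(0xE0 | (cp >> 12));
            dst[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[n++] = (char)(0x80 | (cp & 0x3F));
        } else {
            dst[n++] = (char)(0xF0 | (cp >> 18));
            dst[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            dst[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            dst[n++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    if (n + 1 > dst_size)
        return 0;
    dst[n++] = '\0';
    return n;
}
```

For the Ω from the question (U+03A9), this yields the bytes ce a9, the same pair that echo Ω | od -t x1c prints.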

AudioBubble

Since you're using MinGW (actually MinGW-w64, but that shouldn't matter in this case), you have access to the Windows API, so the following should work for you. It could probably be cleaner and actually tested properly, but it should provide a good idea at the least:

#define _WIN32_WINNT 0x0600
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#include <windows.h>

int main (void)
{
    int       argc;
    int       i;
    LPWSTR    *argv;
    LPSTR     error = NULL;   /* buffer allocated by FormatMessageA */

    argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (argv == NULL)
    {
        FormatMessageA(
            (
                FORMAT_MESSAGE_ALLOCATE_BUFFER |
                FORMAT_MESSAGE_FROM_SYSTEM |
                FORMAT_MESSAGE_IGNORE_INSERTS),
            NULL,
            GetLastError(),
            0,
            (LPSTR)&error, 0,
            NULL);

        /* Never pass the message as the format string itself. */
        fprintf(stderr, "%s\n", error ? error : "CommandLineToArgvW failed");
        LocalFree(error);
        return EXIT_FAILURE;
    }

    for (i = 0; i < argc; ++i)
        wprintf(L"argv[%d]: %ls\n", i, argv[i]);

    // You must free argv using LocalFree!
    LocalFree(argv);

    return 0;
}

Bear in mind this one issue with it: Windows will not compose your strings for you. I use my own Windows keyboard layout that uses combining characters (I'm weird), so when I type

example -o àlf

in my Windows Command Prompt, I get the following output:

argv[0]: example
argv[1]: -o
argv[2]: a\u0300lf

The a\u0300 is U+0061 (LATIN SMALL LETTER A) followed by a representation of the Unicode code point U+0300 (COMBINING GRAVE ACCENT). If I instead use

example -o àlf

which uses the precomposed character U+00E0 (LATIN SMALL LETTER A WITH GRAVE), the output differs:

argv[0]: example
argv[1]: -o
argv[2]: \u00E0lf

where \u00E0 is a representation of the precomposed character à, Unicode code point U+00E0. I may be an odd person for typing this way, but Vietnamese code page 1258 actually includes combining characters, so decomposed input is not purely hypothetical. This shouldn't ordinarily affect filename handling, but you may run into some difficulty.

For arguments that are just strings, you may want to look into normalization with the NormalizeString function. The documentation and examples linked in it should help you to understand how the function works. Normalization and a few other things in Unicode can be a long journey, but if this sort of thing excites you, it's also a fun journey.
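
To make the composed/decomposed difference concrete, here is a toy sketch of canonical composition over an array of code points. It is only an illustration: the table covers six pairs, and the names PAIRS and compose_pairs are made up. Real code should call NormalizeString (with NormalizationC) or a Unicode library instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative table only: base letter, combining mark, precomposed form. */
static const uint32_t PAIRS[][3] = {
    {0x0061, 0x0300, 0x00E0},  /* a + combining grave -> U+00E0 */
    {0x0065, 0x0300, 0x00E8},  /* e + combining grave -> U+00E8 */
    {0x0061, 0x0301, 0x00E1},  /* a + combining acute -> U+00E1 */
    {0x0065, 0x0301, 0x00E9},  /* e + combining acute -> U+00E9 */
    {0x0069, 0x0301, 0x00ED},  /* i + combining acute -> U+00ED */
    {0x006F, 0x0301, 0x00F3},  /* o + combining acute -> U+00F3 */
};

/* Hypothetical helper: compose the known pairs in place.
 * cp holds n code points; the return value is the new length. */
size_t compose_pairs(uint32_t *cp, size_t n)
{
    size_t in = 0, out = 0, k;
    while (in < n) {
        uint32_t c = cp[in++];
        if (in < n) {
            for (k = 0; k < sizeof PAIRS / sizeof PAIRS[0]; k++) {
                if (PAIRS[k][0] == c && PAIRS[k][1] == cp[in]) {
                    c = PAIRS[k][2];
                    in++;      /* consume the combining mark */
                    break;
                }
            }
        }
        cp[out++] = c;         /* out never overtakes in, so this is safe */
    }
    return out;
}
```

Applied to the decomposed argument above (U+0061 U+0300 followed by l and f), the sketch produces the same three code points as the precomposed form, so the two spellings compare equal afterwards.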

Frank

Try compiling and running the following program:

#include <stdio.h>

int main(void)
{
    int i;

    /* Values 128-255 are outside ASCII; how they display depends on the terminal's encoding. */
    for (i = 0; i < 256; i++) {
        printf("\nASCII Character #%d:%c ", i, i);
    }

    printf("\n");

    return 0;
}

In your output you should see those little question marks from number 128 onward. FYI, I am using Ubuntu, and when I compile and run this program (with GNOME Terminal) this happens to me as well.

However, if I go to Terminal > Set character encoding... and select Western (WINDOWS-1252) as opposed to Unicode (UTF-8), and rerun the program, the extended ASCII characters display properly.

I don't know the exact steps for Windows/MinGW, but, in short, changing the character encoding should fix your problem.
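
One way to see why the bytes from 128 onward turn into question marks on a UTF-8 terminal: in UTF-8 such bytes are only valid as part of a multi-byte sequence, so a lone one is malformed. The sketch below classifies a lead byte and checks that the first character in a buffer is structurally complete. The function names are made up for illustration, and the check does not reject overlong 3- and 4-byte forms or encoded surrogates.

```c
#include <stddef.h>

/* Hypothetical helper: length of the UTF-8 sequence that starts with
 * byte b, or 0 if b cannot start a sequence. */
size_t utf8_lead_length(unsigned char b)
{
    if (b < 0x80) return 1;                 /* plain ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;                               /* continuation or invalid byte */
}

/* Hypothetical helper: 1 if the buffer starts with a complete sequence
 * (lead byte plus the right number of 10xxxxxx bytes), 0 otherwise. */
int utf8_first_char_ok(const unsigned char *s, size_t n)
{
    size_t len, i;
    if (n == 0) return 0;
    len = utf8_lead_length(s[0]);
    if (len == 0 || len > n) return 0;      /* bad lead byte or truncated */
    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;
    return 1;
}
```

For example, the single byte 0xE9 (é in Windows-1252) is rejected because it announces a three-byte sequence that never arrives, while the two bytes 0xC3 0xA9 (é in UTF-8) are accepted.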