C character encoding on the Windows console

I am having trouble understanding which character set is used when a C program on Windows prints to the console. I have not found a question that answers this directly (if there is one, a link would be appreciated).

When looking through several character sets (UCS-2, ISO 8859-1, Unicode), I always find the character 'ý' directly after the character 'ü'. When I wrote a C program to print the characters on a console, however, the character "superscript 2" followed 'ü' (sorry, I don't know how to type the superscript character here). In the Visual Studio debugger, 'ý' is still shown as following 'ü'.

My question is therefore: What character set is used by C to write on the console?

There is 1 answer

Luis Colorado

Those characters are what you get when some of the extended ISO Latin-1 (ISO 8859-1) characters are encoded as UTF-8 and the resulting bytes are then displayed as ISO Latin-1. This can have two causes:

  • You are using UTF-8 in your program output (so a single character with a code point in the range U+0080...U+07FF is written as two bytes) and your terminal doesn't support UTF-8 output, so each byte is shown as a separate character.
  • You have read those characters from a UTF-8 keyboard in a program that doesn't support the UTF-8 encoding of Unicode characters. The characters have then been read as pairs of bytes, processed as such, and output later on as pairs.

My question is therefore: What character set is used by C to write on the console?

It depends. To support multibyte characters you need to do several things in C. I assume you have done nothing special and are just using the normal C functions, which by default run in the C locale (effectively no locale at all), where only 7-bit ASCII characters are guaranteed to be handled as you expect:

  • You need to make the input/output routines use a locale (the one you are working in, which is selected through environment variables) so they know which character set multibyte sequences are encoded in. In main, you need to initialize the locale with a call to setlocale(3).
  • You need to use the wide-character versions of all routines that handle the type wchar_t (this type supports character sets of more than 256 characters, like Unicode).

From that point on you need to be careful: strlen(), for example, no longer tells you how many characters a string contains, as it just counts the bytes of the string passed to it (it works on char, not on wchar_t). Use wcslen(3) for wide strings, or step through a multibyte string with mblen(3), which returns the number of bytes in the next multibyte character. Pay close attention to the function prototypes, as some functions take a wchar_t * string while others take a char * string.

Check the manual pages for routines like: wcscoll(3), strcoll(3), strxfrm(3), wcsxfrm(3), wprintf(3), fwprintf(3), swprintf(3), vfwprintf(3), fwide(3),...

I wrote a small version of the cal(1) command and internationalized it to support foreign locales, with complete international support (this includes the use of wide chars). You can get it here to see everything that is needed for a program to show its output in the language you have configured for your session.

See also the manual page for the locale(1) command, to check the locale you have configured for your account.