The snippet below is supposed to convert a multibyte string, read from stdin, to a wide string. Having read the mbsrtowcs and mbstate_t documentation on cppreference, I thought it was valid:

#include <stdio.h>
#include <wchar.h>
#include <errno.h>
#include <stdlib.h>
#include <error.h>

int main()
{
        char *s = NULL; size_t n = 0; errno = 0;
        ssize_t sn = getline(&s, &n, stdin);
        if(sn == -1 && errno != 0)
                error(EXIT_FAILURE, errno, "getline");
        if(sn == -1 && errno == 0) // EOF
                return EXIT_SUCCESS;

        // determine how big should be the allocated buffer
        const char* cs = s; mbstate_t st = {0}; // cs to avoid comp. warnings
        size_t wn = mbsrtowcs(NULL, &cs, 0, &st);
        if(wn == (size_t)-1)
                error(EXIT_FAILURE, errno, "first mbsrtowcs");

        wchar_t* ws = malloc((wn+1) * sizeof(wchar_t));
        if(ws == NULL)
                error(EXIT_FAILURE, errno, "malloc");

        // finally convert the multibyte string to wide string
        st = (mbstate_t){0};
        if(mbsrtowcs(ws, &cs, wn+1, &st) == (size_t)-1)
                error(EXIT_FAILURE, errno, "second mbsrtowcs");

        if(printf("%ls", ws) < 0)
                error(EXIT_FAILURE, errno, "printf");

        return EXIT_SUCCESS;
}

This works for ASCII strings. However, the whole reason I'm dealing with non-ASCII strings is that I want to support characters with diacritics, which lie outside ASCII, and for those it fails. The first call to mbsrtowcs fails with EILSEQ, which would indicate that the multibyte string is invalid. Oddly enough, when I inspect the string in gdb it looks valid (insofar as gdb displays it correctly). Below is what happens when I feed the program a non-ASCII string, followed by a gdb session:

m@m-X555LJ:~/wtfdir$ gcc -g -o wtf wtf.c
m@m-X555LJ:~/wtfdir$ ./wtf
asa
asa
m@m-X555LJ:~/wtfdir$ ./wtf
ąsa
./wtf: first mbsrtowcs: Invalid or incomplete multibyte or wide character
m@m-X555LJ:~/wtfdir$ gdb ./wtf
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./wtf...done.
(gdb) break 18
Breakpoint 1 at 0x93b: file wtf.c, line 18.
(gdb) r
Starting program: /home/m/wtfdir/wtf 
ąsa

Breakpoint 1, main () at wtf.c:18
18          size_t wn = mbsrtowcs(NULL, &cs, 0, &st);
(gdb) p cs
$1 = 0x555555756260 "ąsa\n"
(gdb) c
Continuing.
/home/m/wtfdir/wtf: first mbsrtowcs: Invalid or incomplete multibyte or wide character
[Inferior 1 (process 5612) exited with code 01]
(gdb) quit

In case it matters, I'm on Linux, and the locale encoding appears to be UTF-8:

m@m-X555LJ:~$ locale charmap
UTF-8

(This is why I expected it to work: trivial programs like printf("ąsa\n"); tend to work for me on Linux but not on Windows.)

What am I missing? What am I doing wrong?
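One thing worth checking, as a hedged sketch and not a confirmed diagnosis: a C program starts in the "C" locale regardless of what `locale charmap` reports in the shell, and the snippet above never calls setlocale. In the "C" locale, an implementation may reject bytes outside ASCII, which would match the EILSEQ from the first mbsrtowcs call. The two helpers below, `use_environment_locale` and `mb_to_wide`, are hypothetical names introduced only for illustration. They assume that adopting the environment's locale before converting is what is missing.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: adopt the locale named by the environment
 * (LANG, LC_ALL, ...). Until this is called, the program runs in the
 * "C" locale no matter what the shell's locale settings are.
 * Returns 1 on success, 0 if the requested locale is unavailable. */
int use_environment_locale(void)
{
        return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: convert a NUL-terminated multibyte string to a
 * newly allocated wide string, interpreting the bytes according to the
 * current locale. Returns NULL on failure; the caller frees the result. */
wchar_t *mb_to_wide(const char *s)
{
        /* First pass: with a NULL destination, mbsrtowcs only counts the
         * wide characters needed and leaves src unchanged. */
        const char *src = s;
        mbstate_t st = {0};
        size_t wn = mbsrtowcs(NULL, &src, 0, &st);
        if (wn == (size_t)-1)
                return NULL; /* invalid sequence for the current locale */

        wchar_t *ws = malloc((wn + 1) * sizeof *ws);
        if (ws == NULL)
                return NULL;

        /* Second pass: convert for real, starting from a fresh state. */
        src = s;
        st = (mbstate_t){0};
        if (mbsrtowcs(ws, &src, wn + 1, &st) == (size_t)-1) {
                free(ws);
                return NULL;
        }
        return ws;
}
```

In the original program, this would amount to adding `#include <locale.h>` and calling `setlocale(LC_ALL, "")` once at the start of main, before getline and before either mbsrtowcs call. gdb showing the string correctly is consistent with this: gdb decodes the bytes using its own character set settings, independently of the locale the debugged program has selected.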
