A few questions about character sets and mapping (translation phase 1)


The questions below are about Character sets (C11, 5.2.1 Character sets) and mapping (C11, 5.1.1.2 Translation phases, 1).

The list:

  1. Can a source character set, as an extension, include control characters other than those representing horizontal tab, vertical tab, and form feed? If so, must a diagnostic be produced when such a control character appears in, for example, a string literal?

    Example: GCC, LLVM, and MSVC accept many control characters in a string literal without issuing a diagnostic, and they keep those control characters in the string literal after the mapping in translation phase 1. (That is, GCC, LLVM, and MSVC treat these control characters as members of the source character set.) Is it acceptable that no diagnostic is produced?

Demo:

# GCC
# test \x00
$ echo "char x[] = \"xxx\"; int s = sizeof x;" > t999.c ;\
printf '\x00' | dd of=t999.c bs=1 seek=13 count=1 conv=notrunc ;\
gcc t999.c -c -std=c11 -pedantic -Wall -Wextra -S ;\
grep 's:' t999.S -A1
t999.c:1:12: warning: null character(s) preserved in literal
    1 | char x[] = "x x"; int s = sizeof x;
      |            ^
s:
        .long   4
# here we see that a diagnostic is produced, sizeof x is 4

# test \x01
$ echo "char x[] = \"xxx\"; int s = sizeof x;" > t999.c ;\
printf '\x01' | dd of=t999.c bs=1 seek=13 count=1 conv=notrunc ;\
gcc t999.c -c -std=c11 -pedantic -Wall -Wextra -S ;\
grep 's:' t999.S -A1
s:
        .long   4
# here we see that no diagnostic is produced, sizeof x is 4

# MSVC
# test \x00
# see below

# test \x01
$ echo "char x[] = \"xxx\"; int s = sizeof x;" > t999.c ;\
printf '\x01' | dd of=t999.c bs=1 seek=13 count=1 conv=notrunc ;\
cl t999.c /c /std:c11 /FA /nologo ;\
grep -P '^s' t999.asm
s       DD      04H
# here we see that no diagnostic is produced, sizeof x is 4

  2. C11, 5.1.1.2 Translation phases, 1:

Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.

A simple question: is "mapping to nothing" still a mapping, e.g. X => <nothing>? Or is it not a "mapping" at all, but rather "skipping" (or "removal")? Example: in "x<null>y" (bytes 22 78 00 79 22), MSVC skips/removes the null character without producing a diagnostic, so sizeof yields 3 instead of 4. Is that acceptable?

Demo:

# MSVC
# test \x00
$ echo "char x[] = \"xxx\"; int s = sizeof x;" > t999.c ;\
printf '\x00' | dd of=t999.c bs=1 seek=13 count=1 conv=notrunc ;\
cl t999.c /c /std:c11 /FA /nologo ;\
grep -P '^s' t999.asm
s       DD      03H
# here we see that no diagnostic is produced, sizeof x is 3