Actual usages for C multibyte character constants


Can anybody help me understand the actual usages of multibyte character constants in C?

I have seen the following code work just fine, and I want to understand what this language feature is actually used for. (I know that defining them is standard C; accessing them, however, is not standard conformant.) Someone pointed out to me that these multibyte character constants are useful on platforms like Classic MacOS, but they could not provide an example.

#include <stdio.h>

int main() {
    (void) 'this'; // this seems to be standard conformant

    // but what can we do with this "feature"?
    // This compiles and runs just fine, but is a crude hack:
    long i = 'this';
    const char* u = (const char*) &i;
    const unsigned z = sizeof(long)/sizeof(char);
    printf("%u\n", z);

    for(unsigned v = 0; v < z; v++)
    {
        printf("%c\n", (char)u[v]);
    }

    return 0;
}

Code output was requested (see here: https://godbolt.org/z/ebsTj4a9E):

8
s
i
h
t

There are 3 answers

0
John Bollinger On BEST ANSWER

Can anybody help me understand the actual usages of multibyte character constants in C?

Your wording is a bit of a melange as far as the language spec's terminology goes, but then the spec uses a confusing set of similar terms for similar but distinct concepts. Among them:

  • "character" in the sense of abstract character: "member of a set of elements used for the organization, control, or representation of data";

  • "character" / "single-byte character", in the sense of a collection of bits that fits in one byte;

  • "multi-byte character": a sequence of one or more bytes representing a member of the [source or execution] extended character set; as opposed to

  • "wide character": a value representable by an object of type wchar_t;

  • "character constant" / "integer character constant": a lexical element of C source code, consisting of a sequence of one or more multibyte characters enclosed in single quotes.

That's a bit of a mess, I think you'll agree, but to its credit, at least the spec avoids throwing the term "multibyte character constant" into the mix as well.

I think what you're talking about is what the spec describes as "an integer character constant containing more than one character [...] or containing a character or escape sequence that does not map to a single-byte execution character". The values of such constants have type int, as do all character constants, but their values are implementation defined.

I know that defining them is standard C; accessing them, however, is not standard conformant

No, that's not a good description. Character constants containing multiple single-byte characters are lexically valid, and, supposing that the implementation accepts them, their values are implementation defined. The spec does not actually bind implementations to accept such constants, however, neither in general nor any particular ones. That's a bit of a problem for "defining them is standard C". On the other hand, in implementations that do accept them, they serve as ordinary constants with whatever int values the implementation attributes to them. In that sense, there is no inherent issue with accessing them.

The main issue with these is that they are not portable, in the sense that different implementations may attribute different numeric values to the same lexical constant. In truth, however, this is a difference of magnitude, not kind, for exactly the same is true of character constants formed of individual single-byte characters.

The thing that distinguishes character constants formed of individual single-byte characters is exactly that they do map to individual members of the execution character set, in a predictable way. If you need a portable program then you need to avoid character constants of the kind you ask about. However, if you are content with code that you can rely upon to work correctly only on certain implementations, then "implementation defined" means that the values of such constants are defined, and conforming implementations must each document their definitions. For example, since at least version 4.0, GCC has used this definition:

The compiler evaluates a multi-character character constant a character at a time, shifting the previous value left by the number of bits per target character, and then or-ing in the bit-pattern of the new character truncated to the width of a target character. The final bit-pattern is given type int, and is therefore signed [...]. If there are more characters in the constant than would fit in the target int [...] the excess leading characters are ignored.

(GCC Manual)

You can rely on that as long as you stick to GCC and any other implementation that guarantees compatibility with GCC in this area. And you might not even need specific values to match across implementations, as long as different constants of interest to you (not exceeding some maximum length, say) can be relied upon to have different values.
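As a rough illustration of the rule quoted above, the following sketch folds a sequence of characters the way the GCC manual describes. The helper name fold_multichar is hypothetical (it is not part of GCC or the standard library), and the sketch assumes that CHAR_BIT is smaller than the width of unsigned int and that the result fits in an int.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper, not a GCC API: mimics the documented evaluation of a
   multi-character constant. Assumes CHAR_BIT < width of unsigned int. */
int fold_multichar(const char *chars, size_t count)
{
    unsigned int value = 0;

    for (size_t i = 0; i < count; i++) {
        /* Shift the previous value left by one target character, then OR in
           the new character truncated to the width of a character. Leading
           characters that no longer fit are shifted out, i.e. ignored. */
        value = (value << CHAR_BIT) | (unsigned char)chars[i];
    }

    /* GCC gives the final bit pattern type int; values above INT_MAX are
       converted in an implementation-defined way. */
    return (int)value;
}
```

With 8-bit ASCII characters and a 32-bit int, fold_multichar("this", 4) yields 0x74686973, which is the value the documented rule assigns to 'this'. On a little-endian machine those bytes are stored lowest first, which is consistent with the s, i, h, t order in the question's output.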

But what can you actually do with them?

There's not much that I would actually do with them, myself, but I can imagine them being used

  • mnemonically, somewhat like enumeration constants (but I would actually use enumeration constants here)

  • as a direct mapping to keywords of a language featuring keywords that are universally short enough. That is, one might read in a keyword, convert it to a number, and compare that to various multi-character character constants (but I would probably choose a different approach, maybe a hash table).

  • in other, non-specific ways I have yet to consider.
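The keyword-mapping idea from the second bullet can be sketched roughly as follows. To keep the numbers independent of any one compiler, this sketch builds them with a macro instead of writing multi-character constants directly. The macro KW2, the enum, the function classify_keyword, and the two-letter keywords themselves are all hypothetical names chosen for illustration.

```c
#include <limits.h>
#include <string.h>

/* Hypothetical macro: packs two characters into one int, first character in
   the higher bits. Assumes int is wider than 2 * CHAR_BIT bits. */
#define KW2(a, b) \
    ((int)(((unsigned)(unsigned char)(a) << CHAR_BIT) | (unsigned char)(b)))

enum keyword { KW_UNKNOWN, KW_GO, KW_UP, KW_DOWN };

/* Converts a two-letter keyword read at run time to a number and compares
   that number against the packed constants. */
enum keyword classify_keyword(const char *word)
{
    if (strlen(word) != 2) {
        return KW_UNKNOWN;
    }

    switch (KW2(word[0], word[1])) {
        case KW2('g', 'o'): return KW_GO;
        case KW2('u', 'p'): return KW_UP;
        case KW2('d', 'n'): return KW_DOWN;
        default:            return KW_UNKNOWN;
    }
}
```

Because the macro uses only casts, shifts, and character constants, its result is an integer constant expression and can appear in case labels. On GCC, KW2('g', 'o') should match the value of 'go' under the rule quoted earlier, but the code does not depend on that.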

8
Ulrich Eckhardt On

I can give you an example where I used to use them (until things broke):

switch (code) {
    // Pen up
    case 'pu':
        ...
    // Pen down
    case 'pd':
        ...
    // Plot absolute
    case 'pa':
        ...
    // Plot relative
    case 'pr':
        ...
    ...
}

This was part of a parser for HPGL (Hewlett Packard Graphics Language) used by plotters.

The next time we upgraded the compiler, the new compiler rejected that code or the compiled binary failed to work as desired, I don't remember which, so we had to rewrite it.

Conclusion: Don't use them, they are useless. :)

0
chux - Reinstate Monica On

Can anybody help me understand the actual usages of multibyte character constants in C?

'this' is called an integer character constant. It has type int.

An integer character constant is a sequence of one or more multibyte characters enclosed in single quotes, as in 'x'. ... C23dr § 6.4.4.4 2

An integer character constant has type int. ... § 6.4.4.4 11

Many attributes of integer character constants of more than 1 character are implementation-specific, which limits portability. When portability is important, such constants are rarely used.

When portability is not important, integer character constants of more than 1 character such as 'ab' (when sizeof(int) >= 2), 'this' (when sizeof(int) >= 4) may be used like integer character constants of 1 character like 'a'.

#include <stdio.h>
int parse_text_to_dir(const char *s);
void report_invalid_movement(void);
extern int room;
extern int secret_room;

int main(void) {
  char s[80];
  switch (parse_text_to_dir(fgets(s, sizeof s, stdin))) {
    case 'n' :
    case 's' : room = room; break; // loop back
    case 'e' : room++; break;
    case 'w' : room--; break;
    case 'ne' : room = secret_room; break;
    default: report_invalid_movement();
  }
}

  • Advantage: Code does not need an enum direction, as the enumeration name is implied by the constant value itself. This is much like code that says if (answer == 'y') {, but is extended to 2 characters to handle a direction like 'ne'.

  • Disadvantage: The above reasoning has generally fallen out of favor in modern coding styles, and an enum direction { dir_n, dir_ne, dir_e, ...}; is preferred.

  • Disadvantage: Forming the equivalent value of 'this' from single characters is tricky due to endianness, sizeof(int), char signedness, and so on, and tends to be error-prone.
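To illustrate the last point, here is a minimal sketch of two ways to build an integer from the four characters of "this". Both function names are hypothetical, and the sketch assumes 8-bit characters and uses uint32_t to sidestep the sizeof(int) question. Copying the bytes gives a result that depends on byte order, while explicit shifts reproduce a fixed most-significant-first packing.

```c
#include <stdint.h>
#include <string.h>

/* Copies four bytes straight into an integer: the numeric result depends on
   the byte order of the machine. */
uint32_t pack_by_memcpy(const char bytes[4])
{
    uint32_t value;
    memcpy(&value, bytes, sizeof value);
    return value;
}

/* Builds the value with shifts: the first character ends up in the highest
   byte regardless of byte order. Casting through unsigned char avoids sign
   extension when plain char is signed. */
uint32_t pack_by_shift(const char bytes[4])
{
    return ((uint32_t)(unsigned char)bytes[0] << 24) |
           ((uint32_t)(unsigned char)bytes[1] << 16) |
           ((uint32_t)(unsigned char)bytes[2] << 8)  |
            (uint32_t)(unsigned char)bytes[3];
}
```

With ASCII input, pack_by_shift("this") is 0x74686973 everywhere, whereas pack_by_memcpy("this") is 0x73696874 on a little-endian machine and 0x74686973 on a big-endian one.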

#include <stdio.h>

enum dir { dir_n, dir_ne, dir_e, dir_se, dir_s, dir_sw, dir_w, dir_nw, dir_up, dir_dn };
enum dir parse_text_to_dir(const char *s);
void report_invalid_movement(void);
extern int room;
extern int secret_room;

int main(void) {
  char s[80];
  switch (parse_text_to_dir(fgets(s, sizeof s, stdin))) {
    case dir_n :
    case dir_s : room = room; break; // loop back
    case dir_e : room++; break;
    case dir_w : room--; break;
    case dir_ne : room = secret_room; break;
    default: report_invalid_movement();
  }
}