Undefined behavior on reading object using non-character type when last written using character type


Assuming unsigned int has no trap representations, do either or both of the statements marked (A) and (B) below provoke undefined behavior? Why or why not? And, especially if you think one of them is well-defined but the other isn't, do you consider that a defect in the standard? I am primarily interested in the current version of the C standard (i.e. C2011), but if the answer is different in older versions of the standard, or in C++, I would also like to know about that.

(_Alignas is used in this program to eliminate any question of UB due to inadequate alignment. The rules I discuss in my interpretation, though, say nothing about alignment.)

#include <stdlib.h>
#include <string.h>

int main(void)
{
    unsigned int v1, v2;
    unsigned char _Alignas(unsigned int) b1[sizeof(unsigned int)];
    unsigned char *b2 = malloc(sizeof(unsigned int));

    if (!b2) return 1;

    memset(b1, 0x55, sizeof(unsigned int));
    memset(b2, 0x55, sizeof(unsigned int));

    v1 = *(unsigned int *)b1; /* (A) */
    v2 = *(unsigned int *)b2; /* (B) */

    return !(v1 == v2);
}

My interpretation of C2011 is that (A) provokes undefined behavior but (B) is well-defined (to store an unspecified value into v2), because:

  • memset is defined (§7.24.6.1) to write to its first argument as if through an lvalue with character type, which is allowed for both b1 and b2 per the special case at the bottom of §6.5p7.

  • The object b1 has a declared type, unsigned char[n]. Therefore, its effective type for accesses is also unsigned char[n] per 6.5p6. Statement (A) reads b1 via an lvalue expression whose type is unsigned int, which is neither the effective type of b1 nor covered by any of the other cases listed in 6.5p7, so the behavior is undefined.

  • The object pointed to by b2 has no declared type. The value in it was stored (by memset) as if through an lvalue with character type, so the second case of 6.5p6 does not apply. The value was not copied from anywhere, so the third case of 6.5p6 does not apply either. Therefore, the effective type of the object is the type of the lvalue used for the access, which is unsigned int, and the rules of 6.5p7 are satisfied.

  • Finally, per 6.2.6.1, assuming unsigned int has no trap representations, the memset operation has created the representation of some unspecified unsigned int value in each of b1 and b2. Therefore, if neither (A) nor (B) provokes undefined behavior, then the actual values in v1 and v2 are unspecified but they are equal.
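For comparison, here is a minimal sketch of the same experiment done with memcpy into a variable whose declared type is unsigned int, so that neither read depends on how 6.5p6 is interpreted for b1 or b2. The helper names load_uint, all_bytes_0x55 and compare_via_memcpy are hypothetical, not part of the program above, and this is only one way to sidestep the question, not a ruling on (A) or (B).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy the bytes into an object whose declared
   (and therefore effective) type is unsigned int, then read that object. */
static unsigned int load_uint(const unsigned char *bytes)
{
    unsigned int value;
    assert(bytes != NULL);
    memcpy(&value, bytes, sizeof value);
    return value;
}

/* Hypothetical helper: the unsigned int whose every byte is 0x55.
   Writing the bytes of an unsigned int object via memset is a
   character-type access, which 6.5p7 permits. */
static unsigned int all_bytes_0x55(void)
{
    unsigned int expected;
    memset(&expected, 0x55, sizeof expected);
    return expected;
}

/* Returns 0 if both buffers yield the same value, 1 on a mismatch,
   2 if the allocation fails. */
static int compare_via_memcpy(void)
{
    unsigned char b1[sizeof(unsigned int)];
    unsigned char *b2 = malloc(sizeof(unsigned int));

    if (!b2) return 2;

    memset(b1, 0x55, sizeof b1);
    memset(b2, 0x55, sizeof(unsigned int));

    unsigned int v1 = load_uint(b1);
    unsigned int v2 = load_uint(b2);

    free(b2);
    return !(v1 == v2 && v1 == all_bytes_0x55());
}
```

Note that no alignment specifier is needed on b1 here, because the bytes are never accessed in place as an unsigned int.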

Commentary:

The asymmetry of the "type-based aliasing" rules (that is, 6.5p7), permitting an object with any effective type to be accessed by an lvalue with character type, but not vice versa, is a continual source of confusion. The second case of 6.5p6 seems to have been written specifically so that reading a value initialized by memset (or, for that matter, calloc) is not undefined behavior. Because it only applies to objects with no declared type, though, it is itself an additional source of confusion.
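To make the permitted direction of that asymmetry concrete, here is a small sketch that inspects an unsigned int object through an unsigned char lvalue, which is the access 6.5p7 explicitly allows. The names count_bytes_equal and demo_char_view are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count how many bytes of *obj equal `byte`,
   reading the object representation through unsigned char. */
static size_t count_bytes_equal(const unsigned int *obj, unsigned char byte)
{
    assert(obj != NULL);

    const unsigned char *p = (const unsigned char *)obj;
    size_t n = 0;

    for (size_t i = 0; i < sizeof *obj; i++)
        if (p[i] == byte)
            n++;
    return n;
}

/* Returns 1 if every byte of a memset-filled unsigned int reads back
   as 0x55 through the character-type view, 0 otherwise. */
static int demo_char_view(void)
{
    unsigned int v;

    memset(&v, 0x55, sizeof v);
    return count_bytes_equal(&v, 0x55) == sizeof v;
}
```

The reverse cast, from unsigned char * to unsigned int * followed by a dereference, is the access that statements (A) and (B) perform and that the question is about.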

There are 2 answers

Phil Miller On

On a superficial examination, I'd agree with your assessment (A is UB, B is fine), and can offer a concrete rationale for why that should be so (prior to the edit to include _Alignas()): Alignment.

The char[] on the stack can start at any address, whether that's a valid alignment for an unsigned int or not. In contrast, malloc() is required (C11 §7.22.3) to return memory suitably aligned for any object type with a fundamental alignment requirement, which includes unsigned int.

The standard obviously doesn't want to impose alignment requirements on char[] beyond those of char, so it has to leave type-punned access to it as potentially undefined.
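As a rough illustration of the alignment point, the sketch below checks whether a buffer's address is a multiple of _Alignof(unsigned int). It assumes that converting a pointer to uintptr_t yields the address as a plain integer, which holds on mainstream flat-memory platforms but is implementation-defined; is_aligned_for_uint and check_buffers are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper. Assumption: the uintptr_t value of a pointer is
   its address as an ordinary integer (implementation-defined). */
static int is_aligned_for_uint(const void *p)
{
    assert(p != NULL);
    return (uintptr_t)p % _Alignof(unsigned int) == 0;
}

/* Returns 1 if both buffers are suitably aligned for unsigned int,
   0 if either is not, -1 if the allocation fails. */
static int check_buffers(void)
{
    /* Alignment requested explicitly, as in the edited question. */
    _Alignas(unsigned int) unsigned char b1[sizeof(unsigned int)];
    /* Alignment provided by malloc. */
    unsigned char *b2 = malloc(sizeof(unsigned int));

    if (!b2) return -1;

    int ok = is_aligned_for_uint(b1) && is_aligned_for_uint(b2);

    free(b2);
    return ok;
}
```

Without the _Alignas specifier, the check on b1 could fail or succeed depending on where the compiler happens to place the array.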

supercat On

The authors of the Standard acknowledge in the rationale that it would be possible for an implementation to be conforming but useless. Because they expected that implementers would endeavor to make their implementations useful, they didn't think it necessary to mandate every behavior that might be needed to make an implementation suitable for any particular purpose.

The Standard imposes no requirements on the behavior of code that accesses an aligned object of character-array type as some other type. That doesn't mean that they intended that implementations should do something other than treat the array as untyped storage in cases where code takes the address of the array once but never accesses it directly. The fundamental nature of aliasing is that it requires that an item be accessed in two different ways; if an object is only ever accessed one way, there is by definition no aliasing. Any quality implementation which is supposed to be suitable for low-level programming should behave in useful fashion in cases where a char[] is used only as untyped storage, whether the Standard requires it or not, and it's hard to imagine any useful purpose that would be impeded by such treatment. The only purpose that would be served by having the Standard mandate such behavior would be to prevent compiler writers from treating the lack of a mandate as being--in and of itself--a reason not to process such code in the obvious useful fashion.