Given:
#include <stdbool.h>
#include <string.h>

bool test_data(void *data)
{
    return memcmp(data, "abcd", 4) == 0;
}
The compiler can optimize it into:
test_data:
        cmpl    $1684234849, (%rdi)
        sete    %al
        ret
which is nice.
But if I use my own memcmp() (not from <string.h>), the compiler can't optimize it into a single cmpl instruction. Instead, it does this:
#include <stdbool.h>
#include <stddef.h>

static int memcmp(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    size_t i;
    for (i = 0; i < n; i++) {
        int ret = p1[i] - p2[i];
        if (ret)
            return ret;
    }
    return 0;
}

bool test_data(void *data)
{
    return memcmp(data, "abcd", 4) == 0;
}
test_data:
        cmpb    $97, (%rdi)
        jne     .L5
        cmpb    $98, 1(%rdi)
        jne     .L5
        cmpb    $99, 2(%rdi)
        jne     .L5
        cmpb    $100, 3(%rdi)
        sete    %al
        ret
.L5:
        xorl    %eax, %eax
        ret
Link: https://godbolt.org/z/Kfhchr45a
- What prevents the compiler from further optimizing it?
- Did I do something that inhibits the optimization?
Data-dependent branching defeats auto-vectorization in GCC/Clang (but not classic ICC). The trip-count can't be computed before the first iteration (in the abstract machine), so GCC and Clang won't even try to use pcmpeqb/pmovmskb for large sizes. (Which is the efficient way to memcmp on x86-64 for large inputs.)

Apparently this is a hard problem, as it's been unsolved in GCC/Clang for 15 to 20 years (from now back to whenever GCC first started doing auto-vectorization).
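For reference, here's roughly what the pcmpeqb/pmovmskb approach looks like with SSE2 intrinsics, for a single 16-byte chunk (my own sketch, not from the question; it assumes both pointers have at least 16 readable bytes):

```c
#include <emmintrin.h>
#include <stdbool.h>

static bool chunks_equal_16(const void *s1, const void *s2)
{
    __m128i a  = _mm_loadu_si128((const __m128i *)s1);
    __m128i b  = _mm_loadu_si128((const __m128i *)s2);
    __m128i eq = _mm_cmpeq_epi8(a, b);       // pcmpeqb: 0xFF in lanes where bytes match
    return _mm_movemask_epi8(eq) == 0xFFFF;  // pmovmskb: all 16 lanes matched
}
```

A real SIMD memcmp would loop over chunks like this and branch on the movemask result only once per chunk (or per couple of chunks), which is the kind of branching the auto-vectorizer can't derive from a byte-at-a-time early-out loop.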
See also how to auto vectorization array comparison function - writing it as a loop that counts mismatches and always touches all elements can give auto-vectorization. (Or an OR reduction instead of a sum reduction.) But that won't help for a small fixed size like 4 bytes. And removing the early-out entirely is a disaster for a 1 GiB memcmp with a difference in the first byte. (A good SIMD memcmp like glibc's SSE2/AVX2 versions would branch maybe every 64 to 128 bytes on vector-compare results.)
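A minimal sketch of that idea (my own code, not from the question): an early-out-free loop with an OR reduction, whose trip count is known before the first iteration. The name bytes_equal is made up so it can't collide with the builtin memcmp:

```c
#include <stdbool.h>
#include <stddef.h>

// OR reduction: touches every element unconditionally, so the
// vectorizer knows the trip count up front. No early-out.
static bool bytes_equal(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    unsigned char diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= p1[i] ^ p2[i];  // nonzero iff any byte differs
    return diff == 0;
}
```

Note this only answers "equal or not"; unlike memcmp it can't tell you which input is ordered first, which is part of why it's easier to vectorize.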
Apparently there's no idiom-recognition either, at least not for this way of writing it. (There is for memcpy; GCC and Clang can turn simple copy loops into call memcpy or call memset, or inline expansions of those functions. So if you're writing your own, you need -fno-builtin-memcpy or else your function turns into a call to itself... Pick a different name for experiments.)

Writing it as a loop that unconditionally touches all bytes could give auto-vectorization, but probably won't get recognized as a
memcmp idiom in GCC's internals. I think that would be necessary for good code with small problems, like this case where we want a single dword cmp.

Compilers must avoid introducing segfaults by inventing reads past where the abstract machine stops. If void *data points to a 1-byte buffer holding 'z', your manual loop has well-defined behaviour in the abstract machine. Reading all 4 bytes would be past the end of the buffer. But memcmp is UB if any part of the array is inaccessible, so the compiler can touch all 4 bytes without checking for an early-out or for pointer alignment. (Can std::memcmp read any bytes past the first difference? Yes, unlike your loop.)

(In x86-64 asm, going past the end could go into an unmapped page, resulting in a segfault, violating the as-if rule. Is it safe to read past the end of a buffer within the same page on x86 and x64? - yes, but only within the same page. This is something you can work around with aligned loads for
strlen and strchr, but a bigger obstacle for vectorizing strcmp with differently-aligned pointers.)

Instead of comparing two unknown pointers from function args, I changed your test_data caller to pass pointers to two global arrays, char foo[4], bar[4];, which the compiler knows for certain are both readable. (Godbolt.) But that didn't help, so still no.
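That experiment looked roughly like this (reconstructed from the description above; my_memcmp is the question's loop renamed so it can't shadow the builtin):

```c
#include <stdbool.h>
#include <stddef.h>

// The question's byte-at-a-time loop, under a non-reserved name.
static int my_memcmp(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    for (size_t i = 0; i < n; i++) {
        int ret = p1[i] - p2[i];
        if (ret)
            return ret;
    }
    return 0;
}

// Globals: the compiler can prove all 4 bytes of each are readable,
// yet GCC/Clang still emit the branchy byte-at-a-time compare.
char foo[4], bar[4];

bool test_data(void)
{
    return my_memcmp(foo, bar, 4) == 0;
}
```

So provable safety of the wider load isn't the missing piece here; the optimizers simply don't transform this loop shape.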