The following is a minimal reproducible example of code that I had to generate an 'array' (whose 1-byte elements are packed into the resulting uint_fast64_t) of 3D coordinates within octree branches given x z and y positions:
#include <stdint.h>
void test(uint_fast64_t *const coord, const uint_fast8_t x, const uint_fast8_t z, const uint_fast8_t y) {
static const uint_fast64_t m = 0x2040810204081ULL, a = 0x101010101010101ULL;
*coord = (x * m & a) | (z * m & a) << 1 | (y * m & a) << 2;
}
Looking at the assembly, GCC appears to only generate one 'variant' of the m constant, but three variants of the a constant, including 0x404040404040404 and 0x202020202020202.
test:
movabs rax, 567382630219905 ; 0x2040810204081
movzx edx, dl
movzx esi, sil
movzx ecx, cl
movabs r8, 144680345676153346 ; 0x202020202020202
imul rdx, rax
imul rsi, rax
imul rcx, rax
movabs rax, 289360691352306692 ; 0x404040404040404
add rdx, rdx
and rdx, r8
movabs r8, 72340172838076673 ; 0x101010101010101
and rsi, r8
sal rcx, 2
or rdx, rsi
and rcx, rax
or rdx, rcx
mov QWORD PTR [rdi], rdx
ret
For whatever reason, GCC appears to be 'constant propagating' the << 1 and << 2 to these masks, and storing them separately, when the same mask could just be used by doing the and first and then the bitshift. This is what is confusing.
Clang, on the other hand, fully propagates the bitshifts to the constants, so the assembly contains 6 of the 64 bit constants, but no shift operations corresponding to the << 1 and << 2. This seems to be a speed optimization at the cost of size.
But I am perplexed by the GCC handling. Some of the constants are 'folded' but some are not, and the manner in which they are not providing any perceptible benefit.
My question is:
- Is there, for some obscure reason, some advantage to performing the shift first and the
andmask later, even at the expense of storing extra constants in the code? - If not, is there some hack or compiler flag I can use to circumvent this, and force GCC to simply
andfirst and shift afterwards, to avoid storing those constants?
This is one of those cases when 'The compiler will optimize the code, just forget about it.' does not really work, since this 'optimization' itself is what I find questionable.
I know that 16 bytes of opcode is 'not much' but I am still curious as to why GCC performs this 'optimization' despite seeming like a lose to an untrained eye. And this even happens with aggressive size optimizations.
I can only speculate that the GCC code generator is simply programmed to always compute bit masks relative to the final positions, even when it means that the number of bit masks is growing.
GCC has other heuristics as well, like reduction of immediates by 1, when comparing with inequality.
if (a < 2)is transformed toif (a <= 1), which does not make sense if one also needs to to computeif (a == 2)for some other use.One can anyway prevent GCC and clang from taking some optimisations by an optimisation barrier
asm("" :"+r"(a))-- combined with having the constants as non-const variables.The barrier informs that register containing
ais being somehow modified by theasmstatement, meaning thatano longer contains the constant. Subsequentlya << 1, a << 2are also no longer immediates derivable froma.In this particular case one can apparently also use
for
In this case I would have actually expected
lea rax, [rax + 4*rbx]format to be used, instead of two separateadd rcx, rcxto left-shift by 1 as it accumulates in a longer dependency chain than necessary.