This is not asking about structure padding/packing, which refers to any unnamed bytes inserted into structures for alignment purposes.
I have this function:
#include <stdint.h>
uint8_t get_index(const uint8_t xs, const uint8_t zs, const uint8_t ys, const uint8_t l) {
return (xs >> l & 1) | (zs >> l & 2) | (ys >> l & 4);
}
To my surprise GCC does not seem to use any SWAR for this despite having optimizations enabled due to the fact that multiple and
and sar
instructions were emitted.
But I see that I could trivially implement SWAR like so:
#include <stdint.h>
union Arg {
uint8_t b[3];
uint32_t u;
};
uint8_t get_index(union Arg arg, const uint8_t l) {
static const union Arg mask = {.b = {1, 2, 4}};
/* Using this instead of an integer constant makes the behavior not depend on endianness.
This will be optimized into the appropriate integer constant anyway. */
arg.u = arg.u >> l & mask.u;
return arg.b[0] | arg.b[1] | arg.b[2];
}
As expected the assembly is actually shorter: Version 1 Version 2 Version 3 (all the same)
- Why does GCC fail to optimize the former into the latter? Is there any particular reason or is this just a missed optimization?
- Would an individual byte parameter be accessed any differently from a byte inside a
struct
/union
and if so why? My intuition says they should not be because either way they are at known positions in the current stack frame. - Is there any reason doing so would be slower than passing them individually?
I have seen: Passing many variables vs. passing struct but that question more focuses on large structs that are much bigger than a CPU word size whereas my object is only 4 bytes. Those also do not address the question of accessing the individual bytes within a word.
As pointed out in comments, your versions are not functionally equivalent, since for instance
get_index(0,0,1,7)
would return 0 with the first version but 2 with the second. You say that your value forl
will not exceed 5, but of course the compiler cannot know that, and must emit code that gives correct results for all possible inputs. (In comments you say you tried adding a__builtin_unreachable
to assertl <= 5
, which in principle would make the optimization legal, but still I think it's a pretty tough one for a compiler to be able to find.)But setting this aside...
You are on x86-64 with the SysV ABI, in which the first few integer parameters are not passed on the stack, but in registers. And separate parameters get passed in separate registers even if they are small enough to fit into one. On the other hand, an aggregate (struct or union) of 8 bytes or less gets passed in a single register.
(Even on x86-32, which does pass all parameters on the stack, the SysV ABI calls for arguments narrower than 32 bits to be widened and passed in separate 4-byte stack slots (so as to make it possible to use
push
). So in that case, although in version 1 the bytes are at known offsets on the stack, they are not at adjacent offsets, and so we still have to do more work to pack them into one register. https://godbolt.org/z/xqnndfGPr )So you're not comparing apples to apples here. The union version of the function might look more efficient. But if its caller obtained the values for
x,z,y
from three separate computations, then they probably ended up in three separate registers, and so the caller will have to do more work to pack them into one register or stack slot. You're not really saving computation, just outsourcing it to someone else.And there are certainly cases where getting arguments packed into a single register would be worse than getting them in separate registers. Consider something simple like:
Try on godbolt
In
sum_2
, since x86 doesn't have a good way to add different bytes or bitfields of a single register (other than the low two bytes viaal/ah
etc), we need extra instructions to unpack into more registers.So in answer to your title question, yes, there absolutely can be performance penalties.