Yesterday I was looking at some 32-bit code generated by VC++ 2010 (most probably; I don't know the specific options, sorry) and I was intrigued by a curious recurring detail: in many functions, it zeroed out ebx in the prologue and then used it like a "zero register" (think $zero on MIPS). In particular, it often:
- used it to zero out memory; this is not unusual, as the encoding for mov mem,imm is 1 to 4 bytes bigger than mov mem,reg (the full immediate has to be encoded even for 0), but usually (e.g. with gcc) the necessary register is zeroed out "on demand" and kept for more useful purposes otherwise;
- used it for compares against zero, as in cmp reg,ebx. This is what struck me as really unusual, as it should behave exactly the same as test reg,reg, but it adds a dependency on an extra register. Keep in mind that this happened in non-leaf functions, with ebx often being pushed (by the callee) on and off the stack, so I would not trust this dependency to always be completely free. It also used test reg,reg in the exact same fashion (test/cmp => jg); see the sketch right after this list for the two forms side by side.
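For reference, here is a minimal sketch of the two patterns I mean; the byte counts are from my reading of the opcode tables, and the registers/offsets are just illustrative, not taken from the actual output:

; zeroing memory: the immediate form must encode a full imm32 even for 0
mov dword ptr [ebp-8], 0     ; C7 45 F8 00 00 00 00 - 7 bytes
mov dword ptr [ebp-8], ebx   ; 89 5D F8             - 3 bytes, if ebx already holds 0

; comparing against zero: same size either way, but the ebx form adds a register dependency
cmp  eax, ebx                ; 3B C3 - 2 bytes, only correct while ebx == 0
test eax, eax                ; 85 C0 - 2 bytes, no extra register involved
jg   somewhere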
Most importantly, registers on "classic" x86 are a scarce resource; if you start having to spill registers you waste a lot of time for no good reason. Why tie one up for the whole function just to keep a zero in it? (Still, thinking about it, I don't remember seeing much register spillage in functions that used this "zero-register" pattern.)
So: what am I missing? Is it a compiler blooper, or some incredibly smart optimization that made particular sense back in 2010?
Here's an excerpt:
; standard prologue: ebp/esp, SEH, overflow protection, ... then:
xor ebx, ebx
mov [ebp+4], ebx ; zero out some locals
mov [ebp], ebx
call function_1
xor ecx, ecx ; ebx _not_ used to zero registers
cmp eax, ebx ; ... but used for compares?! why not test eax,eax?
setnz cl ; what? it goes through cl to check if eax is not zero?
cmp ecx, ebx ; still, why not test ecx,ecx?
jnz function_body
push 123456
call throw_something
function_body:
mov edx, [eax]
mov ecx, eax ; it's not like it was interested in ecx anyway...
mov eax, [edx+0Ch]
call eax ; virtual method call; ebx is preserved but possibly pushed/popped
lea esi, [eax+10h]
mov [ebp+0Ch], esi
mov eax, [ebp+10h]
mov ecx, [eax-0Ch]
xor edi, edi ; again, registers are zeroed as usual
mov byte ptr [ebp+4], 1
mov [ebp+8], ecx
cmp ecx, ebx ; why not test ecx,ecx?
jg somewhere
label1:
lea eax, [esi-10h]
mov byte ptr [ebp+4], bl ; ok, uses bl to write a zero to memory
lea ecx, [eax+0Ch]
or edx, 0FFFFFFFFh
lock xadd [ecx], edx
dec edx
test edx, edx ; now it's using the regular test reg,reg!
jg somewhere_else
Notice: an earlier version of this question said that it used mov reg,ebx instead of xor ebx,ebx; this was just me not remembering stuff correctly. Sorry if anybody put too much thought into trying to understand that.
Everything you commented on as odd looks sub-optimal to me.
test eax,eax sets all flags (except AF) the same as cmp against zero, and is preferred for performance and code-size.

On P6 (PPro through Nehalem), reading long-dead registers is bad because it can lead to register-read stalls. P6 cores can only read 2 or 3 not-recently-modified architectural registers from the permanent register file per clock (to fetch operands for the issue stage: the ROB holds operands for uops, unlike on SnB-family where it only holds references to the physical register file).
Since this is from VS2010, Sandybridge hadn't been released yet, so it should have put a lot of weight on tuning for Pentium II/III, Pentium-M, Core 2, and Nehalem, where reading "cold" registers is a possible bottleneck.
IDK if anything like this ever made sense for integer regs, but I don't know much about optimizing for CPUs older than P6.
The cmp / setnz / cmp / jnz sequence looks particularly braindead. Maybe it came from a compiler-internal canned sequence for producing a boolean value from something, and it failed to optimize a test of the boolean back into just using the flags directly? That still doesn't explain the use of ebx as a zero-register, which is also completely useless there.

Is it possible that some of that came from inline asm that returned a boolean integer (using a silly sequence that wanted a zero in a register)?
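For comparison, here's what the excerpt does versus what it boils down to; the second half is my own simplification, not anything MSVC emitted:

; what the compiler emitted (ecx was just xor-zeroed, ebx holds 0):
cmp   eax, ebx           ; effectively eax vs 0
setnz cl                 ; cl = (eax != 0)
cmp   ecx, ebx           ; re-test the boolean against 0
jnz   function_body
; what it is equivalent to:
test  eax, eax
jnz   function_body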
Or maybe the source code was comparing two unknown values, and it was only after inlining and constant-propagation that it turned into a compare against zero, which MSVC failed to optimize fully, so it still kept 0 as a constant in a register instead of using test?

(The rest of this was written before the question included code.)
Sounds weird, or like a case of CSE / constant-hoisting run amok, i.e. treating 0 like any other constant that you might want to load once and then reg-reg copy throughout the function.

Your analysis of the data-dependency behaviour is correct: moving from a register that was zeroed a while ago essentially starts a new dependency chain.
When gcc wants two zeroed registers, it often xor-zeroes one and then uses a mov or movdqa to copy to the other.

This is sub-optimal on Sandybridge, where xor-zeroing doesn't need an execution port, but a possible win on Bulldozer-family, where mov can run on the AGU or ALU but xor-zeroing still needs an ALU port. For vector moves, it's a clear win on Bulldozer: reg-reg moves are handled in register rename with no execution unit, while xor-zeroing of an XMM or YMM register still needs an execution port on Bulldozer-family (or two for ymm, so always use xmm with implicit zero-extension).
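As a sketch of the gcc pattern described above (just the shape of it, not specific compiler output):

; integer case: zero one register, copy it to the other
xor  eax, eax            ; xor-zeroing idiom (needs no execution port on SnB)
mov  edx, eax            ; reg-reg copy of the zero instead of a second xor
; vector case:
pxor   xmm0, xmm0        ; xor-zero one vector register (still needs an ALU port on Bulldozer-family)
movdqa xmm1, xmm0        ; copy to the other (handled in rename on Bulldozer, no execution unit)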
Still, I don't think that justifies tying up a register for the duration of a whole function, especially not if it costs extra saves/restores. And not for P6-family CPUs where register-read stalls are a thing.