The following code works fine in debug mode, since `_BitScanReverse64` is documented to return 0 if no bit is set. Citing MSDN: (The return value is) "Nonzero if Index was set, or 0 if no set bits were found."
If I compile this code in release mode it still works, but if I enable compiler optimizations such as /O1 or /O2, the index is not zero and the assert() fails.
```cpp
#include <iostream>
#include <cassert>
#include <intrin.h>   // for _BitScanReverse64

using namespace std;

int main()
{
    unsigned long index = 0;
    _BitScanReverse64(&index, 0x0ull);
    cout << index << endl;
    assert(index == 0);
    return 0;
}
```
Is this the intended behaviour? I am using Visual Studio Community 2015, Version 14.0.25431.01 Update 3. (I left the `cout` in so that the variable `index` is not optimized away.) Also, is there an efficient workaround, or should I just not use this compiler intrinsic directly?
AFAICT, the intrinsic leaves garbage in `index` when the input is zero, which is weaker than the behaviour of the asm instruction. This is why it has a separate boolean return value and a separate integer output operand. Despite the `index` arg being passed by pointer, the compiler treats it as output-only:

```cpp
unsigned char _BitScanReverse64(unsigned __int32 *index, unsigned __int64 mask);
```
Intel's intrinsics guide documentation for the same intrinsic seems clearer than the Microsoft docs you linked, and sheds some light on what the MS docs are trying to say. But on careful reading, they do both seem to say the same thing, and describe a thin wrapper around the `bsr` instruction.

Intel documents the `bsr` instruction as producing an "undefined value" when the input is 0 (while setting ZF in that case), but AMD documents it as leaving the destination unchanged. On current Intel hardware, the actual behaviour matches AMD's documentation: it leaves the destination register unmodified when the src operand is 0. Perhaps this is why MS describes `Index` as only being set when the input is non-zero (and the intrinsic's return value is non-zero).

On Intel (but maybe not AMD), this goes as far as not even truncating a 64-bit register to 32 bits. e.g.

```asm
mov rax, -1
bsf eax, ecx    ; with ECX = 0
```

leaves RAX = -1 (all 64 bits set), not the 0x00000000ffffffff you'd get from a normal 32-bit write like `xor eax, 0`. But with non-zero ECX, `bsf eax, ecx` has the usual effect of zero-extending into RAX, leaving for example RAX = 3.

IDK why Intel still hasn't documented it. Perhaps a really old x86 CPU (like the original 386?) implements it differently? Intel and AMD frequently go above and beyond what's documented in the x86 manuals in order not to break existing widely-used code (e.g. Windows), which might be how this started.
At this point it seems unlikely that Intel will ever drop that output dependency and leave actual garbage or -1 or 32 for input=0, but the lack of documentation leaves that option open.
Skylake dropped the false dependency for `lzcnt` and `tzcnt` (and a later uarch dropped the false dep for `popcnt`) while still preserving the dependency for `bsr`/`bsf`. (See: Why does breaking the "output dependency" of LZCNT matter?)

Of course, since MSVC optimized away your `index = 0` initialization, presumably it just uses whatever destination register it wants, not necessarily the register that held the previous value of the C variable. So even if you wanted to, I don't think you could take advantage of the dst-unmodified behaviour, even though it's guaranteed on AMD.

So in C++ terms, the intrinsic has no input dependency on `index`. But in asm, the instruction does have an input dependency on the dst register, like an `add dst, src` instruction. This can cause unexpected performance issues if compilers aren't careful.

Unfortunately, on Intel hardware the `popcnt` / `lzcnt` / `tzcnt` asm instructions also have a false dependency on their destination, even though the result never depends on it. Compilers work around this now that it's known, though, so you don't have to worry about it when using intrinsics (unless you have a compiler more than a couple of years old, since it was only recently discovered).

You need to check the return value to make sure `index` is valid, unless you know the input was non-zero.

If you want to avoid this extra check branch, you can use the `lzcnt` instruction via different intrinsics, if you're targeting new enough hardware (e.g. Intel Haswell or AMD Bulldozer, IIRC). It "works" even when the input is all-zero, and it actually counts leading zeros instead of returning the index of the highest set bit.
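A sketch of that branchless idea: since `lzcnt` of a 64-bit value is defined even for 0 (it returns 64), the expression `63 - lzcnt(x)` gives the highest-set-bit index for non-zero inputs and -1 for zero, which can serve as a sentinel without any check branch. This uses MSVC's `__lzcnt64` intrinsic and, on other compilers, a portable fallback added here purely for illustration:

```cpp
#include <cstdint>

#if defined(_MSC_VER)
#include <intrin.h>
static inline uint64_t lzcnt64(uint64_t x) { return __lzcnt64(x); }
#else
// Fallback for illustration: lzcnt itself is defined for x == 0 (-> 64),
// but GCC's __builtin_clzll is not, hence the guard.
static inline uint64_t lzcnt64(uint64_t x)
{
    return x ? (uint64_t)__builtin_clzll(x) : 64;
}
#endif

// Branchless highest-set-bit index: 63 - lzcnt(x).
// For x == 0 this yields -1, a natural sentinel needing no extra check.
static inline int64_t highest_set_bit(uint64_t x)
{
    return 63 - (int64_t)lzcnt64(x);
}
```

Note the usual caveat with `__lzcnt64`: on hardware without the `lzcnt` instruction it silently executes as `bsr` and gives different results, so this only behaves as described on the newer CPUs mentioned above.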