I may confirm by using nanobench. Today I don't feel clever and can't think of an easy way
I have a array, short arr[]={0x1234, 0x5432, 0x9090, 0xFEED};
. I know I can use SIMD to compare all elements at once, using movemask
+tzcnt
to find the index of a match. However since it's only 64 bits I was wondering if there's a faster way?
First I thought maybe I can build a 64-bit int by writing target|(target<<16)|(target<<32)|(target<<48)
but then realized both an AND and SUB isn't the same as a compare since the low 16 can affect the higher 16. Then I thought instead of a plain loop I can write index=tzcnt((target==arr[0]?1:0)... | target==arr[3]?8:0
Can anyone think of something more clever? I suspect using the ternary method would give me best results since it's branchless?
For SWAR compare-for-equality, the operation you want is XOR, which like SUB produces all-zero on equal inputs, but unlike SUB doesn't propagate carry sideways.
But then you need to detect a contiguous 16
0
bits. Unlikepcmpeqw
, you'll have some zero bits in the other elements.So it's probably about the same as https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord but with wider mask patterns to operate on 16-bit instead of 8-bit chunks.
So it could look like this (untested):
Godbolt with GCC and clang, also including a SIMD intrinsics version.
Clang unfortunately adds
0xFFFEFFFEFFFEFFFF
instead of reusing the multiplier constant, so it has three 64-bit immediate constants.AArch64 can do repeating-pattern constants like this as immediates for bitwise ops, and doesn't have as convenient SIMD
movemask
, so this might be more of a win there, especially if you can guarantee alignment of your array of shorts.Match position
If you need to know where the match is, I think that bithack has a
1
in the high bit of each zero byte or u16, and nowhere else. (The lowest-precendence / last operations are bitwise AND involving0x80008000...
).So maybe
tzcnt(any_zeros) >> 4
to go from bit-index to u16-index, rounding down. e.g. if the second one is zero, the tzcnt result will be 31.31 >> 4
= 1.If that doesn't work, then yeah AVX2 or AVX-512
vpbroadcastw xmm0, edi
/vmovq
/vpcmeqw
/vpmovmskb
/tzcnt
will work well, too, with smaller code-size and fewer uops, but maybe higher latency. Or maybe less. (To get a byte offset, right shift if you need an index of whichshort
.)Actually just SSE2
pshuflw
can broadcast a word to the low qword of an XMM register. Same for MMX, which would actually allow a memory-sourcepcmpeqw mm0, [rsi]
since it has no alignment requirement and is only 64-bit, not 128.If you can use SIMD intrinsics, especially if you have efficient word broadcast from AVX2, definitely have a look at it.
AXV-512 gives us
vpbroadcastw xmm0, edi
, replacingvmovd
+vpbroadcastw xmm,xmm
ormovd + pshuflw
, saving a shuffle uop.With AVX2, this is 5 single-uop instructions, vs. 7 (or 9 counting the constants) for the SWAR bithack. Or 6 or 8 not counting the zero-extension of the "needle". So SIMD is better for front-end throughput. (https://agner.org/optimize/ / https://uops.info/)
There are limits to which ports some of these instructions can run on (vs. the bithack instructions mostly being any integer ALU port), but presumably you're not doing this in a loop over many such 4-element arrays. Or else SIMD is an obvious win; checking two 4-element arrays at once in the low and high halves of a
__m128i
. So probably we do need to consider the front-end costs of setting up those constants.I didn't add up the latencies; it's probably a bit higher even on Intel CPUs which generally have good latency between integer and SIMD units.
GCC unfortunately fails to optimize away the
movzx edi, di
from the SIMD version if compiled without AVX2; only clang realizes the upper 16 of_mm_cvtsi32_si128(needle)
is discarded by the later shuffle. Maybe better to make the function argunsigned
, not explicitly a narrow 16-bit type.