Suppose I have two uint16_t[4] arrays, a and b. Each integer in these arrays is in the range [0, 16383], so bits 14 and 15 aren't set. Then I have some code to find the minimum and maximum among a[i] and b[i] for each i:
    uint16_t min[4], max[4];
    for (int i = 0; i < 4; i++) {
        if (a[i] < b[i]) {
            min[i] = a[i];
            max[i] = b[i];
        } else {
            min[i] = b[i];
            max[i] = a[i];
        }
    }
Suppose for some reason I can't/won't use SIMD, but I'd still like to compute this as fast as possible, on a 64-bit platform. Thus, a natural solution is to use the SIMD-within-a-register (SWAR) paradigm on 64-bit registers to compute these 4 values in a single iteration, rather than over 4 iterations with 16-bit arithmetic.
What bit-twiddling hacks could be used to implement either min or max, or ideally both, using the SWAR paradigm, so that the resulting code is faster than the loop above? My target architecture is ARMv8, so feel free to use any ARMv8 instructions that help reduce the instruction count.
C, assembly, or C + inline assembly solutions are all welcome.
You could use code like this, though it's really a lot longer than just doing it with SIMD:
The critical path of this algorithm has 6 instructions. The instructions
are not on the critical path. If executed in a loop, the constant load can likely be hoisted out. The last two instructions can be evaluated independently, yielding minimum and maximum with the same latency.
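As a rough illustration of the general idea in portable C (a sketch under the stated range assumption, not necessarily the instruction sequence counted above): because bit 15 of every lane is clear, it can be set in a as a guard bit before subtracting b. No borrow then crosses a lane boundary, and the guard bit survives exactly in the lanes where a[i] >= b[i]. The names SWAR_HI, swar_load4, swar_store4 and swar_minmax4 are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Bit 15 of each 16-bit lane; free to use as a guard bit because
   every input value is assumed to be at most 16383. */
#define SWAR_HI UINT64_C(0x8000800080008000)

/* Hypothetical helpers: move four uint16_t lanes to/from one 64-bit word.
   memcpy keeps this free of aliasing issues; the lane order does not
   matter because every operation below is lane-wise. */
static uint64_t swar_load4(const uint16_t v[4])
{
    uint64_t x;
    memcpy(&x, v, sizeof x);
    return x;
}

static void swar_store4(uint16_t v[4], uint64_t x)
{
    memcpy(v, &x, sizeof x);
}

/* Lane-wise min and max of a and b, assuming all inputs are <= 16383. */
static void swar_minmax4(const uint16_t a[4], const uint16_t b[4],
                         uint16_t min[4], uint16_t max[4])
{
    uint64_t x = swar_load4(a);
    uint64_t y = swar_load4(b);

    /* Per lane: (a + 0x8000) - b. The minuend is always >= b, so no
       borrow leaves the lane; bit 15 stays set iff a >= b. */
    uint64_t d = (x | SWAR_HI) - y;
    uint64_t h = d & SWAR_HI;

    /* Widen each surviving guard bit to a full 0xFFFF lane mask:
       0x8000 - 1 = 0x7FFF, then OR the guard bit back in. */
    uint64_t mask = h | (h - (h >> 15));

    /* Where a >= b, t holds a ^ b, so the XORs below swap the lanes. */
    uint64_t t = (x ^ y) & mask;

    swar_store4(min, x ^ t); /* b where a >= b, otherwise a */
    swar_store4(max, y ^ t); /* a where a >= b, otherwise b */
}
```

The two final XORs depend only on t, so they can issue in parallel, which matches the observation that minimum and maximum come out with the same latency. Whether a compiler turns this into the shortest possible ARMv8 sequence is not guaranteed; check the generated assembly.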