I copied glibc's implementation of binary search algorithm, then modified it a little bit to suit my needs. I decided to test it and other things I have learned about GCC (attributes and built-ins). The code looks as:
int main() {
uint_fast16_t a[61] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61 };
uint64_t t1 = Time(0);
for(register uint_fast16_t i = 0; i < 10000000; ++i) {
binary_search(rand() % 62, a, 61);
}
printf("%ld\n", Time(0) - t1);
return 0;
}
Now, this program runs just fine. The problem begins when I add more lines of code, for instance:
uint_fast16_t a[61] __attribute__ ((aligned (64) )) = /* ... */
In this case I would expect faster code, yet performance has not changed after multiple tests (tens of tests). I also tested the program with alignment of 8 and 1 - no changes. I even expected gcc to throw an error/warning, because using alignment less than type size (in my case 64bit machine, uint_fast16_t is 8 bytes), but there was none. Then another change, which was adding caching (introduced in GCC 9). I added the following code before the for loop:
caches(a, uint_fast16_t, uint_fast16_t, 61, 0, 3);
// where "caches" is:
#define caches(x, type, data_type, size, rw, l) ({ \
for(type Q2W0 = 0; Q2W0 < size; Q2W0 += 64 / sizeof(data_type)) { \
__builtin_prefetch(x + Q2W0, rw, l); \
} \
})
No change in performance as well. I figured out maybe my CPU is caching the array automatically after first binary_search
so I eliminated the for
loop and measure a few times again with and without the caching line, but I have not noticed any change in performance as well.
More information:
- Using CentOS8 64bit latest kernel
- Using GCC 9.2.1 20191120
- Compiling with
-O3 -Wall -pthread -lanl -Wno-missing-braces -Wmissing-field-initializers
, no errors / warnings during compilation - Things are not optimised away (checked asm output)
I am pretty sure I don't know about something / I am doing something wrong.
Full code available here.
register uint_fast16_t
is pre-mature optimization, let the compiler decide which variables to place in registers. Regardregister
as a mostly obsolete keyword.As noted in comments,
uint_fast16_t i = 0; i < 10000000
is either a bug or bad practice. You should perhaps do something like this instead:In which case you should get compiler errors upon initialization, if the value does not fit. Alternatively, check the value with static assertions.
Better yet, use
size_t
for the loop iterator in this case.Why? What makes you think the array was misaligned to begin with? The compiler will not misalign variables just for the sake of it. Particularly not when the array members are declared as
uint_fastnn
- the whole point of usinguint_fast16_t
is in fact to get correct alignment.In this case, the array results in both gcc and clang for x86/64 to spew out a bunch of
.quad
assembler instructions, resulting in perfectly aligned data.Regarding the cache commands, I know too little of how they work to comment on them. It is however likely that you already have ideal data cache performance in this case - the array should be in data cache.
As for instruction cache, it's unlikely to do much good during binary search, which by its nature comes with a tonne of branches. In some cases a brute force linear search might outperform binary search for this very reason. Benchmark and see. (And make sure to bludgeon your old computer science algorithm teacher with a big O when brute force proves to be much faster than binary search.)
rand() % 62
may or may not be quite a bottleneck. Both the rand function and modulus could mean a lot of overhead depending on system.