No performance difference between different variations of the same program


I copied glibc's implementation of the binary search algorithm, then modified it a little to suit my needs. I decided to test it together with other things I have learned about GCC (attributes and built-ins). The code looks like this:

int main() {
  uint_fast16_t a[61] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61 };
  uint64_t t1 = Time(0);
  for(register uint_fast16_t i = 0; i < 10000000; ++i) {
    binary_search(rand() % 62, a, 61);
  }
  printf("%ld\n", Time(0) - t1);
  return 0;
}

Now, this program runs just fine. The problem begins when I add more lines of code, for instance:

uint_fast16_t a[61] __attribute__ ((aligned (64) )) = /* ... */

In this case I would expect faster code, yet performance has not changed across multiple tests (tens of them). I also tested the program with alignments of 8 and 1 - no change. I even expected gcc to throw an error/warning for using an alignment smaller than the type size (on my 64-bit machine uint_fast16_t is 8 bytes), but there was none. Then I made another change: prefetching the array into cache with __builtin_prefetch. I added the following code before the for loop:

caches(a, uint_fast16_t, uint_fast16_t, 61, 0, 3);
// where "caches" is:
#define caches(x, type, data_type, size, rw, l) ({ \
  for(type Q2W0 = 0; Q2W0 < size; Q2W0 += 64 / sizeof(data_type)) { \
    __builtin_prefetch(x + Q2W0, rw, l); \
  } \
})

No change in performance either. I figured that maybe my CPU caches the array automatically after the first binary_search, so I eliminated the for loop and measured a few times again with and without the caches(...) line, but again I noticed no change in performance.
More information:

  1. Using CentOS 8 64-bit, latest kernel
  2. Using GCC 9.2.1 20191120
  3. Compiling with -O3 -Wall -pthread -lanl -Wno-missing-braces -Wmissing-field-initializers, no errors / warnings during compilation
  4. Things are not optimised away (checked asm output)

I am pretty sure I am missing something or doing something wrong.

Full code available here.


1 Answer

Answered by Lundin (accepted answer):
  • register uint_fast16_t is premature optimization; let the compiler decide which variables to place in registers. Regard register as a mostly obsolete keyword.

  • As noted in the comments, uint_fast16_t i = 0; i < 10000000 is either a bug or bad practice: uint_fast16_t is only guaranteed to be at least 16 bits wide, so on a target where it is exactly 16 bits the condition i < 10000000 is always true and the loop never terminates. You should perhaps do something like this instead:

    const uint_fast16_t MAX = 10000000; 
    ... i < MAX
    

    In which case you should get a compiler diagnostic upon initialization if the value does not fit. Alternatively, check the value with static assertions (a small sketch of that approach follows after this list).

    Better yet, use size_t for the loop iterator in this case.

  • __attribute__ ((aligned (64) )) "In this case I would expect faster code"

    Why? What makes you think the array was misaligned to begin with? The compiler will not misalign variables just for the sake of it. Particularly not when the array members are declared as uint_fastnn - the whole point of using uint_fast16_t is in fact to get correct alignment.

    In this case, the array makes both gcc and clang for x86-64 emit a bunch of .quad assembler directives, resulting in perfectly aligned data. (A quick way to check the alignment yourself is sketched after this list.)

  • Regarding the cache commands, I know too little of how they work to comment on them. It is however likely that you already have ideal data cache performance in this case - the array should be in data cache.

    As for instruction cache, it's unlikely to do much good during binary search, which by its nature comes with a tonne of branches. In some cases a brute force linear search might outperform binary search for this very reason. Benchmark and see (a linear search to time against it is sketched after this list). And make sure to bludgeon your old computer science algorithm teacher with a big O when brute force proves to be much faster than binary search.

  • rand() % 62 may or may not be a bottleneck. Both the rand function and the modulo operation can mean a lot of overhead depending on the system (one way to keep them out of the timed loop is sketched below).
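
A minimal sketch of the static-assertion approach, assuming a C11 compiler; ITERATIONS and run_benchmark are names made up for this illustration:

#include <stdint.h>
#include <stddef.h>

#define ITERATIONS 10000000u

/* Fails at compile time if the counter type cannot represent the bound. */
_Static_assert(ITERATIONS <= UINT_FAST16_MAX,
               "loop bound does not fit in uint_fast16_t");

void run_benchmark(void)
{
    /* Or simply use size_t for the iterator and sidestep the question. */
    for (size_t i = 0; i < ITERATIONS; ++i) {
        /* ... binary_search(rand() % 62, a, 61); ... */
    }
}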
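
A quick way to check the alignment claim yourself, again assuming a C11 compiler (a matches the array from the question):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint_fast16_t a[61] = { 0 };

    /* The type's guaranteed alignment, and the array's actual address
       modulo a 64-byte cache line. */
    printf("_Alignof(uint_fast16_t) = %zu\n", _Alignof(uint_fast16_t));
    printf("address of a %% 64      = %zu\n", (size_t)((uintptr_t)a % 64));
    return 0;
}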
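
For the "benchmark and see" part, a plain linear search to time against the question's binary_search; linear_search is a name made up here:

#include <stdint.h>
#include <stddef.h>

/* Returns the index of key in arr[0..n-1], or -1 if it is not present. */
static ptrdiff_t linear_search(uint_fast16_t key,
                               const uint_fast16_t *arr, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (arr[i] == key)
            return (ptrdiff_t)i;
    }
    return -1;
}

Run it over the same sequence of keys as the binary search; for 61 elements the predictable, branch-friendly scan may well win.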
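
Finally, a sketch of keeping rand() and the modulo outside the timed region so that only the searches are measured; make_keys is a helper invented for this illustration:

#include <stdint.h>
#include <stdlib.h>

/* Pre-generate n random keys in [0, 61] before starting the timer. */
static uint_fast16_t *make_keys(size_t n)
{
    uint_fast16_t *keys = malloc(n * sizeof *keys);
    if (keys == NULL)
        return NULL;
    for (size_t i = 0; i < n; ++i)
        keys[i] = (uint_fast16_t)(rand() % 62);
    return keys;
}

Fill the array first, start the timer, loop over keys[i], then free the array after reading the clock.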