fastest u_int64_t[8] array compare in C/C++


What is the fastest method to compare two u_int64_t[8] arrays in C/C++?

Array 1 is inside a std::vector (~10k elements); array 2 is inside a dynamically allocated struct. (Is memcmp() free of false positives here?)

My (pseudo-C) implementation:

typedef struct {
    u_int64_t array[8];
} work_t;

/* alloc and fill array work_t* work = new (std::nothrow) work_t etc... */

for (u_int32_t i = 0; i < some_std_vector.size(); i++) {
    if ((some_std_vector[i]->array[0] == work->array[0]) &&
        (some_std_vector[i]->array[1] == work->array[1]) &&
        (some_std_vector[i]->array[2] == work->array[2]) &&
        (some_std_vector[i]->array[3] == work->array[3]) &&
        (some_std_vector[i]->array[4] == work->array[4]) &&
        (some_std_vector[i]->array[5] == work->array[5]) &&
        (some_std_vector[i]->array[6] == work->array[6]) &&
        (some_std_vector[i]->array[7] == work->array[7])) {
        //...do some stuff...
    }
}

The target platform is Linux x86_64 with gcc 4.9.2; the loop runs inside a pthread, tcmalloc is used, and the code is compiled with -O2.


There are 4 answers

Thomas Matthews (BEST ANSWER)

Here are some suggestions to improve the speed.

Use Local Variables if possible

Instead of going through pointers (the -> operator), use local variables or pass the variables as references. The compiler may otherwise generate extra code to load a pointer into a register and then dereference the register to get the value.
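For example, hoisting the dereferences into local references might look like this (a sketch using the names from the question, whose vector holds work_t pointers):

// Sketch: dereference each pointer once, then compare through plain references.
const work_t &w = *work;                       // loaded once, outside the loop
for (std::size_t i = 0; i < some_std_vector.size(); ++i) {
    const work_t &e = *some_std_vector[i];     // one dereference per iteration
    if (e.array[0] == w.array[0] && e.array[1] == w.array[1] &&
        e.array[2] == w.array[2] && e.array[3] == w.array[3] &&
        e.array[4] == w.array[4] && e.array[5] == w.array[5] &&
        e.array[6] == w.array[6] && e.array[7] == w.array[7]) {
        // ...do some stuff...
    }
}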

Use Processor's Data Cache

Most modern processors have a data cache. If you load several values first and then compare them, you make better use of the processor's data cache.

Also, design your data to fit efficiently into a data cache line. This means that data members (arrays included) should be next to each other or very close.
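As a sketch, the question's struct already fits this pattern; making the alignment explicit could look like the following (alignas is C++11, and the 64-byte figure assumes the usual x86_64 cache-line size):

#include <cstdint>

// 8 x 8 bytes = 64 bytes, i.e. exactly one cache line on typical x86_64 CPUs.
struct alignas(64) work_t {
    uint64_t array[8];
};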

Block Compare

At the lowest level you are comparing many consecutive bytes. As others have mentioned, you may get better performance by using a memory compare function.
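A sketch of the memcmp variant (comparing a plain u_int64_t array is free of false positives, since the compared bytes contain no padding):

#include <cstring>

if (std::memcmp(some_std_vector[i]->array, work->array,
                sizeof(work->array)) == 0) {
    // ...do some stuff...
}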

Another suggestion is to help the compiler by loading the values into separate variables, then comparing the values:

for (/*...*/)
{
//...
    uint64_t a1 = some_std_vector[i]->array[0];
    uint64_t a2 = some_std_vector[i]->array[1];
    uint64_t a3 = some_std_vector[i]->array[2];
    uint64_t a4 = some_std_vector[i]->array[3];

    uint64_t b1 = work->array[0];
    uint64_t b2 = work->array[1];
    uint64_t b3 = work->array[2];
    uint64_t b4 = work->array[3];

    if ((a1 == b1) && (a2 == b2) && (a3 == b3) && (a4 == b4))
    {
       //...
    }
}

The concept here is to load the variables first into multiple registers and then compare the registers.

Review Assembly Language & Profile

With all of the techniques presented in the answers, the best method is to code one up, review the assembly language, and profile. Remember to set the optimization level high for speed.

If your processor has special instructions that can make this faster, you want to verify that the compiler is using them, or that there is a justification for not using them.

MateuszZawadzki

Depending on the device and compiler you are using, you can try some platform-specific tricks. E.g. some compilers offer techniques that allow a wide load from memory and, as a result, faster multiple comparisons. There are also ways to manually unroll the loop so it executes faster, but this depends on the compiler. You can always try several approaches and check the assembler output to see which way is the fastest.
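For example, on x86_64 a wide-load compare could be sketched with AVX2 intrinsics (an assumption beyond the question's stated target: it needs -mavx2 and a CPU with AVX2, though gcc 4.9.2 does ship the intrinsics):

#include <cstdint>
#include <immintrin.h>   // AVX2 intrinsics

// Sketch: two 256-bit loads cover all eight 64-bit values per array.
static bool equal8(const uint64_t *a, const uint64_t *b)
{
    __m256i a_lo = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(a));
    __m256i a_hi = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(a + 4));
    __m256i b_lo = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(b));
    __m256i b_hi = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(b + 4));

    __m256i eq_lo = _mm256_cmpeq_epi64(a_lo, b_lo);   // all-ones per equal lane
    __m256i eq_hi = _mm256_cmpeq_epi64(a_hi, b_hi);

    // Equal only if every lane of both halves compared equal.
    return _mm256_movemask_epi8(_mm256_and_si256(eq_lo, eq_hi)) == -1;
}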

Amadeus

I would imagine the only way to truly answer this question would be to write two routines, one using the loop you provided, and the other using memcmp. Then, analyze and look at the disassembly to see which looks to be the most efficient. (You could also be obsessive and use a profiler.)

One might also write a custom routine in assembly to compare them directly (i.e., a custom version of memcmp that works specifically for comparing exactly what you're looking at) and compare it alongside the other two as well.

Either way, I agree with the others that everything is probably going to be pretty close (with a modern compiler); however, if you really wanted to be retentive about it, you'd have to test it with a profiler and/or have the skills to look at the assembly created and know which one would be faster by sight.
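A minimal timing sketch along those lines (C++11 <chrono>, so compile with -std=c++11; equal8 is just a placeholder name for whichever compare variant is under test):

#include <chrono>
#include <cstdio>

// Time one pass over the vector with the chosen compare routine.
auto start = std::chrono::steady_clock::now();
std::size_t hits = 0;
for (std::size_t i = 0; i < some_std_vector.size(); ++i)
    if (equal8(some_std_vector[i]->array, work->array))
        ++hits;
auto stop = std::chrono::steady_clock::now();

// Print the hit count so the compiler cannot optimize the loop away.
std::printf("hits=%zu  time=%lld ns\n", hits, (long long)
    std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());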

gj13

I did some tests and looked at gcc memcmp, glibc memcmp and my code above. glibc-2.20 memcmp is the fastest way, because it uses platform-specific optimisations (in my case).

gcc's built-in memcmp is much slower (bug 43052; compile with -fno-builtin-memcmp to get the glibc version).