The task is to set each element of a C integer array to its absolute value. I'm trying to do it as efficiently as possible. Below is a progression of optimizations I've made. Please tell me whether these are actually optimizations at all, and whether more can be made!
The function's first parameter is the integer array; the second is the integer size of that array.
Here's the standard implementation:
void absolute(int array[], int n) {
    for (int i = 0; i < n; i++)
        if (array[i] < 0)
            array[i] = -array[i];
}
That's plenty to satisfy any introductory programming course professor, but I'd like to play with it a little more and maybe learn something about optimization along the way.
Based on https://stackoverflow.com/a/2074403, here's a branchless absolute value:
#include <stdint.h>

void absolute(int array[], int n) {
    for (int i = 0; i < n; i++) {
        uint32_t temp = array[i] >> 31; // make a mask of the sign bit (all ones if negative)
        array[i] ^= temp;               // flip the bits if the value is negative
        array[i] += temp & 1;           // add one if the value was negative
    }
}
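To see why this works, take array[i] = -5 (0xFFFFFFFB in two's complement): temp becomes 0xFFFFFFFF, the XOR gives 4, and adding 1 yields 5. For a non-negative value, temp is 0, so both steps leave it untouched.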
Based on comparisons against zero being more efficient, and on not needing an extra variable sitting around:
void absolute(int array[], int n) {
    for (n--; n >= 0;) {
        uint32_t temp = array[n] >> 31;
        array[n] ^= temp;
        array[n] += temp & 1;
    }
}
(does this still vectorize, though?)
That's as far as I got. Can more be done to optimize this function?
Personally, I rather like this question. It's questions like these that get me wondering whether there's a way to make my own code better.
Your final optimization is incorrect: it initializes with n--, but n is never decremented again. To correct it you need for (n--; n >= 0; n--). That said, my results showed no noticeable advantage to decrementing rather than incrementing the loop counter.

If the values of the array are not truly randomly distributed, I found that the simple if (array[i] < 0) used in the first implementation is actually significantly faster.

The harness I used to benchmark was along these lines (a minimal sketch; the array size, repeat count, and clock()-based timing are illustrative):
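#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N    (1 << 20)  /* illustrative array size */
#define RUNS 100        /* illustrative repeat count */

/* the branching version; swap in any of the variants above */
void absolute(int array[], int n) {
    for (int i = 0; i < n; i++)
        if (array[i] < 0)
            array[i] = -array[i];
}

int main(void) {
    int *original = malloc(N * sizeof *original);
    int *array    = malloc(N * sizeof *array);

    srand(42);
    for (int i = 0; i < N; i++)
        original[i] = rand() - RAND_MAX / 2;  /* roughly half negative */

    clock_t start = clock();
    for (int run = 0; run < RUNS; run++) {
        /* restore the signs each run; the copy cost is the same for every variant */
        memcpy(array, original, N * sizeof *array);
        absolute(array, N);
    }
    clock_t end = clock();

    printf("%f seconds\n", (double)(end - start) / CLOCKS_PER_SEC);
    free(original);
    free(array);
    return 0;
}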
Test Results
Here are my results (times are in seconds). The tests were run on an Intel(R) Xeon(R) CPU W3580 @ 3.33GHz with gcc (Debian 4.9.2-10) 4.9.2.
Conclusion
My takeaway from this is that hardware branch prediction might significantly speed up code execution, more than any software-based optimization. By trying to optimize out the branching, you've created code that executes the same steps no matter what data is being processed. So while it executes in constant time, if the data is not perfectly randomly distributed you might actually be making execution slower.

Update: I did some tests with compiler optimizations turned on and found different results that don't entirely support the conclusions I came to earlier.
In my experience, if you can simply write less code, that is normally the best optimization. The fewer instructions, the faster the execution, regardless of hardware features.
I look forward to reading any comments on this exercise.
Update
I've added the results of Rotem's implementation. This code is super fast and proves that the fewer instructions you have, the faster the execution time. Nice work, Rotem!
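Rotem's code isn't reproduced here, but the idea is an SSE loop along these lines (a sketch of the approach, not Rotem's exact code; it assumes SSSE3 for _mm_abs_epi32, which compiles to PABSD, and that n is a multiple of 4):

#include <tmmintrin.h>  /* SSSE3: _mm_abs_epi32 */

void absolute_simd(int array[], int n) {
    /* process 4 ints per iteration; assumes n is a multiple of 4 */
    for (int i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128((__m128i *)&array[i]);  /* unaligned load */
        v = _mm_abs_epi32(v);                               /* PABSD: per-lane absolute value */
        _mm_storeu_si128((__m128i *)&array[i], v);          /* unaligned store */
    }
}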
Update 2
I did some extensive testing today and found that micro-optimizations like changing the way the for loop counts have absolutely no effect when compiler optimizations like gcc -O3 are turned on. The compiler ends up generating assembly that does a pointer comparison on the array pointer to test whether we've reached the end.

Optimizing the SSE code provided by Rotem also makes no difference when the compiler is run with gcc -O3, as it properly aligns the memory on a 16-byte boundary, which removes the need for _mm_loadu_si128()/_mm_storeu_si128().

Final Update
I added another implementation using the simple and intuitive abs() function. It turns out abs() on gcc and MSVC is actually a compiler intrinsic. I redid all the test results using just gcc -O3 optimizations.
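The implementation itself is about as simple as it gets (sketched here; abs() comes from <stdlib.h>):

#include <stdlib.h>  /* abs() */

void absolute(int array[], int n) {
    for (int i = 0; i < n; i++)
        array[i] = abs(array[i]);  /* gcc and MSVC treat abs() as an intrinsic */
}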
As you can see, Rotem's SIMD implementation and the abs() implementation are the fastest, followed by the two XOR implementations, and finally the branching implementations. Of the two XOR implementations, the one that decrements the for loop is actually slightly slower, as its loop contains 14 instructions whereas the increment loop contains only 13.
Rotem's SIMD implementation and the abs() implementation actually both rely on the PABSD instruction, and both have loops containing 7 instructions. The slight difference in speed (SIMD being slightly faster) comes from the fact that the optimized SIMD implementation assumes the memory will always contain multiples of 4 integers (128 bits), whereas the abs() implementation requires extra instructions to handle the cases where it does not.

What's amazing here is that by simply using abs() we can achieve almost exactly the same speed as SIMD, with the simplicity of calling a C library function. The abs() loop without -march=native is only 4 instructions longer; instead of utilizing PABSD, it uses the PSRAD, PXOR, and PSUBD instructions.
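For reference, the two builds being compared are simply along the lines of:

gcc -O3 absolute.c -o absolute                  # portable: PSRAD/PXOR/PSUBD
gcc -O3 -march=native absolute.c -o absolute    # native: PABSD (needs SSSE3)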
Why is portable abs() faster than the XOR implementation?

It turns out the portable (or non-native) abs() assembly is nearly exactly that of the XOR implementation.
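Here's the gist of the abs() inner loop (a hand sketch based on the instructions named above; the labels and register choices are mine, not a verbatim gcc listing):

.Labs_loop:
        movdqa  xmm0, XMMWORD PTR [rdi]    ; load 4 ints
        movdqa  xmm1, xmm0
        psrad   xmm1, 31                   ; mask = x >> 31
        pxor    xmm0, xmm1                 ; x ^= mask
        psubd   xmm0, xmm1                 ; x -= mask (subtracting -1 adds 1)
        movdqa  XMMWORD PTR [rdi], xmm0    ; store
        add     rdi, 16
        cmp     rdi, rsi
        jne     .Labs_loop

And here's the XOR loop, with its extra PAND (xmm2 is preloaded with four 1s):

.Lxor_loop:
        movdqa  xmm0, XMMWORD PTR [rdi]    ; load 4 ints
        movdqa  xmm1, xmm0
        psrad   xmm1, 31                   ; mask = x >> 31
        pxor    xmm0, xmm1                 ; x ^= mask
        pand    xmm1, xmm2                 ; mask &= 1
        paddd   xmm0, xmm1                 ; x += mask & 1
        movdqa  XMMWORD PTR [rdi], xmm0    ; store
        add     rdi, 16
        cmp     rdi, rsi
        jne     .Lxor_loop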
Now let's convert these back into C code:
Here's the abs() version (my reading of the assembly back into C):
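void absolute(int array[], int n) {
    for (int i = 0; i < n; i++) {
        int mask = array[i] >> 31;            /* all ones if negative, else zero */
        array[i] = (array[i] ^ mask) - mask;  /* flip bits, then subtract -1 (i.e. add 1) */
    }
}

And here's the XOR version:

void absolute(int array[], int n) {
    for (int i = 0; i < n; i++) {
        int mask = array[i] >> 31;
        array[i] ^= mask;       /* flip bits if negative */
        array[i] += mask & 1;   /* the extra AND operation */
    }
}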
The difference is the extra bitwise AND in the original XOR implementation: since the mask is either 0 or -1, subtracting it achieves in one instruction what the AND and the add take two to do.
Final Conclusion
Use abs()! :D