I'm currently developing a C-module for a Java-application that needs some performance improvements (see Improving performance of network coding-encoding for a background). I've tried to optimize the code using SSE-intrinsics and it executes somewhat faster than the Java-version (~20%). However, it's still not fast enough.
Unfortunately my experience with optimizing C-code is somewhat limited. I therefore would love to get some ideas on how to improve the current implementation.
The inner loop that constitutes the hot-spot looks like this:
for (i = 0; i < numberOfGFVectorsInFragment; i++) {
// Load the 4 GF-elements from the message-fragment and add the log of the coefficeint to them.
__m128i currentMessageFragmentVector = _mm_load_si128 (currentMessageFragmentPtr);
__m128i currentEncodedResult = _mm_load_si128(encodedFragmentResultArray);
__m128i logSumVector = _mm_add_epi32(coefficientLogValueVector, currentMessageFragmentVector);
__m128i updatedResultVector = _mm_xor_si128(currentEncodedResult, valuesToXor);
_mm_store_si128(encodedFragmentResultArray, updatedResultVector);
encodedFragmentResultArray++;
currentMessageFragmentPtr++;
}
Even without looking at the assembly, I can tell right away that the bottleneck is from the 4-element gather memory access and from the
_mm_set_epi32
packing operations. Internally,_mm_set_epi32
, in your case will probably be implemented as a series ofunpacklo/hi
instructions.Most of the "work" in this loop is from packing these 4 memory accesses. In the absence of SSE4.1, I would go so far to say that the loop could be faster non-vectorized, but unrolled.
If you're willing to use SSE4.1, you can try this. It might be faster, it might not:
I suggest unrolling the loop at least 4 iterations and interleaving all the instructions to give this code any chance of performing well.
What you really need is Intel's AVX2 gather/scatter instructions. But that's a few years down the road...