I have a computation that produces a coefficient vector and returns the dot product of this vector with a data vector taken from a large array. To speed things up, I do this for eight vectors at a time using AVX2 SIMD intrinsics. The problem is that the bulk of the time ends up being consumed by the gather operation getting the data for the dot product.
I tried different ways of implementing the gather, and the intrinsic seems to work best. I would greatly appreciate some advice on speeding this up.
Here is a sketch:
__m256 Compute(__m256 input)
{
__m256 coefficients[56] = ComputeCoefficients(input);
__m256i indices = ComputeIndices(input);
__m256 sum = _mm256_setzero_ps();
for (size_t i = 0; i != 56; ++i)
{
__m256 data = _mm256_i32gather_ps(bigArray + i, indices, sizeof(float)); //
sum = _m256_fmadd_ps(coefficients[i], data, sum);
}
return sum;
}
I would first make sure that you are using the most recent Intel processor possible. Intel has invested a lot of engineering in improving the gather instruction.
This being said, it is not magical. If there are cache misses, you will pay a price for them.
I would try to write the same code without SIMD instructions. Is it about the same speed? If it is, then chances are good that your are limited by memory access. Vectorization is good to solve computational limitations, and to write and store data in vector-size units, but even in principle, it cannot be expected to help much with problems bound by random access and cache issues.
Your code repeatedly calls
VPGATHERDPS
. According to Agner Fog, this instruction has a latency of 12 cycles and a throughput of one instruction every 4 cycles. The latency is, of course, a best-case scenario, cache misses will increase the latency.You should benchmark your code and ensure that you are close to 4 cycles per loop iteration. The main loop should complete in about 300 cycles, and that's quite fast all things said.
You do not tell us a lot about your problem but we can guess that it is much slower than 300 cycles. If so, then you are probably having cache issues. If your table is large and you are accessing it randomly, then it is a hard problem. If you need better performance, you may need to reengineer the problem.