I have a section of my code that needs to run millions of times. It seems that the most costly part of it is this computation:
```cpp
Vector2d directedInverseSquare(const Vector2d &pos1, const Vector2d &pos2) {
    Vector2d diff = pos2 - pos1;
    Vector2d rHat = diff.normalized();
    double rSquared = diff.squaredNorm();
    return rHat / rSquared;
}
```
This computes 1/r^2 in the direction of pos1 -> pos2, where r is the distance between the two points. It is also equivalent to R/r^3, where R is the original difference vector, but I have found the form written above to be faster.
It seems to me that the obvious way to make this code faster would be to somehow produce both the squared distance and the normalized vector at once. Their computations are quite similar, yet I currently call them one after the other without sharing any work.

I've tried to do it myself, but Eigen always comes out faster. So perhaps an Eigen function that does this would be best, but I haven't found one.
Okay, with the discussion dying off, here is my test code and improved versions.
First, my benchmark. Nothing fancy.
Compile flags:

```
-DNDEBUG -march=native -fno-math-errno
```

compiled with gcc-12.3. On my TigerLake CPU, this takes 6.2 seconds with your implementation.

Switching to the straightforward `diff / r^3` gives us 4.5 seconds. chtz's version is about the same speed but with fewer instructions; objectively I think this is the best.
To improve upon this, we should vectorize by using a 2x3 matrix. This allows us to do the `b-a` and `c-b` cases in one step; the `a-c` operation has to be done separately. Unfortunately this is only slightly faster in my tests, at 4.3 seconds, although the assembly looks fine. I also tested a transposed version, which is slightly worse.
However, this matrix style, especially the transposed version, works very well if you have many more vectors: the benefit of vectorization increases while the relative cost of that last scalar element decreases.
The many-vectors version takes 2.8 seconds. As usual, be careful with `auto` variables in Eigen.

As a bonus, here is the best version I could come up with for the 2x3 matrix case using Intel intrinsics with AVX + FMA instructions. It takes 4.1 seconds.
(Note that `unpacklo` and `unpackhi` have rather unintuitive results for 256-bit vectors.)