I have a 2 dimensional array of doubles representing a matrix which can potentially be large e.g. 200x200.
I need to be able to efficiently calculate the sum of this matrix. How can I achieve this using vectorization in C#?
The current vanilla approach is:
double[,] matrix =
{
{ 0.0, 1, 2, 3 },
{ 4, 5, 6, 7 },
{ 8, 9, 10, 11 },
{ 12, 13, 14, 15 }
};
int rows = matrix.GetLength(0);
int cols = matrix.GetLength(1);
double sum = 0;
for (uint i = 0; i < rows; i++)
{
for (uint j = 0; j < cols; j++)
{
sum += matrix[i, j];
}
}
First of all you should do some benchmarking and/or profiling and ask yourself if it really matters? Summing is a very simple calculation, and 200x200 is not very large. I would guess it may take on the order of a microsecond, but this is just a guess. You also need a benchmark to decide if you have actually achieved any improvements, or if you just made the code more complicated for no good reason.
But is this really the largest bottleneck of your application? Optimization is often about avoiding doing work, or avoid redoing work. The best any SIMD optimization could give you is a constant speedup. There is no point in wasting hours optimizing a function that have no noticeable impact on your user.
If you decide you need optimization then I would start by getting rid of the index-calculations. When you do
matrix[i, j]
the framework essentially does ai * width + j
-calculation. This will likely take longer than the actual summing of values. It is possible the optimizer may remove some of this, but I would not assume anything from the optimizer without actually confirming it. You can either do the unsafe route withfixed (double* ptr = matrix )
, or create a custom matrix class that uses a 1D array for storage that just lets you sum values with a single loop, and implement a 2D indexer yourself if you want the[x, y]
syntax for other reasons.If you really need the performance of SIMD you can go two ways
Vector<T>
See the comparison. In short, intrinstics give a lot better performance, at the cost tying it to a specific cpu platform.
In either case you need to understand the memory layout to correctly load elements. But once that is done it should be really simple, just add all the vectors together and do a sum of the elements in the end. Possibly with some scalar code in the end if the element count is not evenly dividable with the vector length.