How can I calculate the sum of a matrix of doubles using Vectorization in C#?


I have a 2-dimensional array of doubles representing a matrix, which can potentially be large, e.g. 200x200.

I need to be able to efficiently calculate the sum of this matrix. How can I achieve this using vectorization in C#?

The current vanilla approach is:

double[,] matrix =
{
    { 0.0, 1, 2, 3 },
    { 4, 5, 6, 7 },
    { 8, 9, 10, 11 },
    { 12, 13, 14, 15 }
};

int rows = matrix.GetLength(0);
int cols = matrix.GetLength(1);

double sum = 0;

for (uint i = 0; i < rows; i++)
{
    for (uint j = 0; j < cols; j++)
    {
        sum += matrix[i, j];
    }
}

There are 2 answers

Answer by JonasH:

First of all you should do some benchmarking and/or profiling and ask yourself if it really matters. Summing is a very simple calculation, and 200x200 is not very large; I would guess it takes on the order of a microsecond, but that is just a guess. You also need a benchmark to decide whether you have actually achieved any improvement, or whether you have just made the code more complicated for no good reason.

But is this really the largest bottleneck in your application? Optimization is often about avoiding work, or avoiding redoing work. The best any SIMD optimization can give you is a constant speedup. There is no point in wasting hours optimizing a function that has no noticeable impact on your users.
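
To answer those questions, a quick benchmark is easy to set up. A minimal sketch, assuming the BenchmarkDotNet NuGet package is referenced (the class and method names here are just placeholders):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class MatrixSumBenchmark
{
    // A 200x200 matrix of zeros is enough to measure the loop itself
    private readonly double[,] matrix = new double[200, 200];

    [Benchmark(Baseline = true)]
    public double NaiveSum()
    {
        double sum = 0;
        for (int i = 0; i < matrix.GetLength(0); i++)
            for (int j = 0; j < matrix.GetLength(1); j++)
                sum += matrix[i, j];
        return sum;
    }

    // Add one [Benchmark] method per SIMD variant to compare against this baseline
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<MatrixSumBenchmark>();
}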

If you decide you need optimization then I would start by getting rid of the index calculations. When you do matrix[i, j] the framework essentially does an i * width + j calculation, which will likely take longer than the actual summing of values. It is possible the optimizer removes some of this, but I would not assume anything about the optimizer without actually confirming it. You can either go the unsafe route with fixed (double* ptr = matrix), or create a custom matrix class that uses a 1D array for storage, which lets you sum the values with a single loop; you can implement a 2D indexer yourself if you want the [x, y] syntax for other purposes.
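
As a rough sketch (the name and layout here are just illustrative), such a 1D-backed matrix type could look like this:

public sealed class Matrix
{
    private readonly double[] _data;   // row-major, flat storage
    public int Rows { get; }
    public int Cols { get; }

    public Matrix(int rows, int cols)
    {
        Rows = rows;
        Cols = cols;
        _data = new double[rows * cols];
    }

    // 2D indexer kept for convenience; the sum below never pays for it
    public double this[int row, int col]
    {
        get => _data[row * Cols + col];
        set => _data[row * Cols + col] = value;
    }

    // A single loop over the flat array, no per-element index math
    public double Sum()
    {
        double sum = 0;
        foreach (double value in _data)
            sum += value;
        return sum;
    }
}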

If you really need the performance of SIMD, you can go one of two ways:

  1. Vector<T>
  2. Intrinsics, like Vector256

See the comparison. In short, intrinsics give much better performance, at the cost of tying the code to a specific CPU platform.

In either case you need to understand the memory layout to load elements correctly. But once that is done it should be really simple: add all the vectors together and sum the elements of the accumulator at the end, possibly with some scalar code for the tail if the element count is not evenly divisible by the vector length.
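
As a rough illustration of the Vector<T> route over a flat double[] (an untuned sketch, assuming a 1D layout like the one described above):

using System.Numerics;

static double VectorSum(double[] values)
{
    Vector<double> acc = Vector<double>.Zero;
    int vlen = Vector<double>.Count;
    int i = 0;

    // Add whole vectors into the accumulator
    for (; i <= values.Length - vlen; i += vlen)
        acc += new Vector<double>(values, i);

    // Horizontal sum of the accumulator, then the scalar tail
    double sum = Vector.Dot(acc, new Vector<double>(1.0));
    for (; i < values.Length; i++)
        sum += values[i];
    return sum;
}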

Answer by harold:

This can be done fairly well with the System.Numerics vector API, at least with liberal use of the Unsafe class.

There is, as far as I know, no good "standard" way to load a vector from a 2D array. None of the overloads of the normal loads apply, and there is no normal way to get a Span<T> of a 2D array. But with Unsafe, we can get it done anyway.
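
As an aside, MemoryMarshal.CreateSpan can also linearize the 2D array (it is another escape hatch rather than a "normal" load). A rough, untested sketch for comparison, without the unrolling and alignment handling used below:

using System;
using System.Numerics;
using System.Runtime.InteropServices;

static double SpanSum(double[,] matrix)
{
    // Reinterpret the 2D array as a flat span, then as a span of vectors (assumes a non-empty matrix)
    Span<double> flat = MemoryMarshal.CreateSpan(ref matrix[0, 0], matrix.Length);
    Span<Vector<double>> vectors = MemoryMarshal.Cast<double, Vector<double>>(flat);

    Vector<double> acc = Vector<double>.Zero;
    foreach (Vector<double> v in vectors)
        acc += v;

    // Horizontal sum, then the scalar tail that did not fill a vector
    double sum = Vector.Dot(acc, new Vector<double>(1.0));
    for (int i = vectors.Length * Vector<double>.Count; i < flat.Length; i++)
        sum += flat[i];
    return sum;
}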

Using unrolling by 8 with 8 separate accumulators (see Unrolling FP loops with multiple accumulators), and treating the 2D matrix as if it were a 1D array by using Unsafe to manipulate a reference, we can do this (not tested, but it compiles on sharplab.io):

// Requires: using System.Numerics; and using System.Runtime.CompilerServices;
static unsafe double Sum(double[,] matrix)
{
    Vector<double> sum0 = Vector<double>.Zero;
    Vector<double> sum1 = Vector<double>.Zero;
    Vector<double> sum2 = Vector<double>.Zero;
    Vector<double> sum3 = Vector<double>.Zero;
    Vector<double> sum4 = Vector<double>.Zero;
    Vector<double> sum5 = Vector<double>.Zero;
    Vector<double> sum6 = Vector<double>.Zero;
    Vector<double> sum7 = Vector<double>.Zero;
    double sum8 = 0;
    uint vlen = (uint)Vector<double>.Count;

    ref double unaligneddata = ref matrix[0, 0];
    uint i = 0;
    uint alignmask = vlen * sizeof(double) - 1;
    // Scalar pre-loop: consume elements until the data address is aligned to the vector size
    for (; i < matrix.Length && ((IntPtr)Unsafe.AsPointer(ref unaligneddata) & alignmask) != 0; i++)
    {
        sum8 += unaligneddata;
        unaligneddata = ref Unsafe.Add(ref unaligneddata, 1);
    }
    uint alignment_skipped = i;
    ref Vector<double> data = ref Unsafe.As<double, Vector<double>>(ref unaligneddata);
    // Round the remaining element count down to a multiple of 8 vectors for the unrolled main loop
    uint bigChunk = ((uint)matrix.Length - alignment_skipped & (0u - (vlen * 8))) + alignment_skipped;
    for (; i < bigChunk; i += vlen * 8)
    {
        sum0 += data;
        sum1 += Unsafe.Add(ref data, 1);
        sum2 += Unsafe.Add(ref data, 2);
        sum3 += Unsafe.Add(ref data, 3);
        sum4 += Unsafe.Add(ref data, 4);
        sum5 += Unsafe.Add(ref data, 5);
        sum6 += Unsafe.Add(ref data, 6);
        sum7 += Unsafe.Add(ref data, 7);
        data = ref Unsafe.Add(ref data, 8);
    }
    // Remaining whole vectors that did not fill an unrolled-by-8 block
    uint smallChunk = ((uint)matrix.Length - alignment_skipped & (0u - vlen)) + alignment_skipped;
    for (; i < smallChunk; i += vlen)
    {
        sum0 += data;
        data = ref Unsafe.Add(ref data, 1);
    }
    // Scalar tail: fewer than one vector's worth of elements left
    ref double remainder = ref Unsafe.As<Vector<double>, double>(ref data);
    for (; i < matrix.Length; i++)
    {
        sum8 += remainder;
        remainder = ref Unsafe.Add(ref remainder, 1);
    }

    // Combine the 8 vector accumulators pairwise before the final horizontal sum
    sum0 += sum1;
    sum2 += sum3;
    sum4 += sum5;
    sum6 += sum7;
    sum0 += sum2;
    sum4 += sum6;
    sum0 += sum4;
    return Vector.Dot(sum0, new Vector<double>(1.0)) + sum8;
}

Using Vector.Dot at the end to do a horizontal sum is a bit silly, but it is short and it only happens once.
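
The plain alternative would be to index the lanes of the final vector directly, something like:

// Equivalent horizontal sum without Vector.Dot (sketch of the alternative)
double total = sum8;
for (int k = 0; k < Vector<double>.Count; k++)
    total += sum0[k];
return total;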

The loop at the start that tries to align the address is mostly for when AVX is not used. Unfortunately this requires unsafe (the keyword, not the class), as far as I know, even though the raw pointer is immediately converted to an integer and never used as a pointer.

When AVX2 is available (Vector<T> is 128-bit without AVX2, even if you only use float/double), the main loop may look like this in assembly:

L008c: vaddpd ymm0, ymm0, [rax]
L0091: vaddpd ymm1, ymm1, [rax+0x20]
L0097: vaddpd ymm2, ymm2, [rax+0x40]
L009d: vaddpd ymm3, ymm3, [rax+0x60]
L00a3: vaddpd ymm4, ymm4, [rax+0x80]
L00ac: vaddpd ymm5, ymm5, [rax+0xa0]
L00b5: vaddpd ymm6, ymm6, [rax+0xc0]
L00be: vaddpd ymm7, ymm7, [rax+0xe0]
L00c7: add rax, 0x100
L00cd: add r8d, 0x20
L00d1: cmp r8d, ecx
L00d4: jb short L008c

Looks good to me. We could save an add here by comparing the address directly rather than keeping a redundant index, but that's not a huge deal.