Why is calling masm from C# so slow?

191 views Asked by At

Here is a 64-bit masm program to multiply two 64-bit numbers and add a 64-bit number to give a 128-bit result (using the standard 64-bit calling conventions):

; public static extern ulong MulAdd64(ulong U, ulong V, ref ulong k);
; Return  (U*V + k) % ß  and set  k = (U*V + k) / ß.
;  U in rcx, V in rdx, &k in r8
; Note  0  <=  0*0 + 0  <=  (ß-1)*(ß-1) + (ß-1) = ß*(ß-1)  <  ß^2
MulAdd64  proc  public
        mov     rax,rcx
        mul     rdx
        add     rax,qword ptr [r8]      ; low part of product
        adc     rdx,0
        mov     qword ptr [r8],rdx      ; high part of product
        ret
MulAdd64  endp

This is imported into the C# code via:

    [DllImport(@"C:\path\MulAdd64.dll")]
    public static extern ulong MulAdd64 (ulong U, ulong V, ref ulong k);

Now here is the same function written in C# along with a test program:

public static void TestCS_Masm_Speed ()
{
    ulong x = 3141592653589793238, y = 2718281828459045, aux = 1234567890123456789, lo = 0;
    // just in case the first invocation is excessivly slow
    //lo = MulAdd64(x, y, ref aux);
    lo = CS_MulAdd64(x, y, ref aux);
    Stopwatch sw = new Stopwatch();
    sw.Restart();
    for (int i = 0; i < 10000000; i++) {
        //lo = MulAdd64(x, y, ref aux);
        lo = CS_MulAdd64(x, y, ref aux);
    }
    Console.WriteLine("Ticks = {0}", sw.ElapsedTicks);
    // verify low-order results  hi:lo = x*y + aux;
    if (x*y+aux != lo) Console.WriteLine("Error in low order result");
    else Console.WriteLine("Low order result is OK");
}

[DllImport(@"C:\path\MulAdd64.dll")]
public static extern ulong MulAdd64 (ulong U, ulong V, ref ulong k);

/*
    Multiplication. We need to multiply two unsigned 64-bit integers, obtaining an
    unsigned 128-bit product. Using Algorithm 4.3.1M of Seminumerical Algorithms,
    with ß = 2^32, the following subroutine computes hi:lo = y*z + aux .  Then
    sets aux to hi and return lo.
*/
public static ulong CS_MulAdd64 (ulong y, ulong z, ref ulong aux)
{
    ulong[] u = new ulong[2],  v = new ulong[2],  w = new ulong[4];
    // Unpack the multiplier, multiplicand, and aux  to  u, v, and w 
    u[1] = (ulong)y >> 32;      u[0] = (ulong)y & 0xFFFFFFFF;
    v[1] = (ulong)z >> 32;      v[0] = (ulong)z & 0xFFFFFFFF;
    w[1] = (ulong)aux >> 32;    w[0] = (ulong)aux & 0xFFFFFFFF;
    // Multiply
    for (int j = 0; j < 2; j++) {
        ulong k = 0;
        for (int i = 0; i < 2; i++) {
            ulong t = u[i] * v[j] + w[i+j] + k;
            w[i+j] = t & 0xFFFFFFFF; k = t >> 32;
        }
        w[j+2] = k;
    }
    // Pack w into the outputs aux and return w
    aux = ((ulong)w[3] << 32) + (ulong)w[2];
    return ((ulong)w[1] << 32) + (ulong)w[0];
}   

The optimized C# code is significantly longer than the masm code (153 instructions versus 6 instructions) but runs almost twice as fast (941694 ticks vs 1722289 ticks)! How can this be? Everything is passed in registers and there is no memory to pin! Apparently something is happening between the call from C# and the execution in masm, but what? I can't step into that code.

1

There are 1 answers

0
Phil1970 On

By using Google, you can have a better understanding of the overhead associated to managed / native transitions. Here are some articles:

In a case like your, the overhead of adjusting calling convention and other required adjustment is probably higher than the call itself.

Usually, if you do enough such multiplications, you would batch them together to minimize the overhead. And you would do some testing to find the optimal size. However, if you are really serious about performance, then you would consider everything like cache size, parallel processing...

I believe that C# compiler can relatively well optimize your code so that at CPU level instructions can be properly sequenced so that there is almost no wasted cycles. It might be able to use SIMD or other tricks to optimize the code. And otherwise, you can probably write much more efficient multiplication in C#.

By the way, why would you write your own code when Windows already provide a function to do such multiplication: UnsignedMultiply128 function?

And have you compared BigInteger performance?