I'm playing around with .NET Core 3.0's new support for hardware intrinsics, in the System.Runtime.Intrinsics namespace.

I have some code where I perform 4 XOR operations in a loop - below is a simplified example (I didn't write this in an IDE, so please ignore any syntax mistakes:

private static unsafe ulong WyHashCore(byte[] array)
{
    fixed (byte* pData = array)
    {
        byte* ptr = pData;

        // Consume 32-byte chunks
        for (int i = 0; i < array.Length; i += 32)
        {
            ulong a = Read64(ptr, i);
            ulong b = Read64(ptr, i + 8);
            ulong c = Read64(ptr, i + 16);
            ulong d = Read64(ptr, i + 24);

            // XOR them with some constants
            ulong xor1 = a ^ SOME_CONSTANT1;
            ulong xor2 = b ^ SOME_CONSTANT2;
            ulong xor3 = c ^ SOME_CONSTANT3;
            ulong xor4 = d ^ SOME_CONSTANT4;

            // Use the resulting values
        }
    }
}

The Read64 method looks like:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static unsafe ulong Read64(byte* ptr, int start)
    => *(ulong*)(ptr + start);

I tried replacing the 4 XOR lines with:

byte[] array; // An array from elsewhere

private static unsafe ulong WyHashCore(byte[] array)
{
    var bVector = Vector256.Create(SOME_CONSTANT1, SOME_CONSTANT2, SOME_CONSTANT3, SOME_CONSTANT4);

    fixed (byte* pData = array)
    {
        byte* ptr = pData;

        // Consume 32-byte chunks
        for (int i = 0; i < array.Length; i += 32)
        {
            ulong a = Read64(ptr, i);
            ulong b = Read64(ptr, i + 8);
            ulong c = Read64(ptr, i + 16);
            ulong d = Read64(ptr, i + 24);

            // Create a 256-bit vector from the 4 64-bit integers
            var aVector = Vector256.Create(a, b, c, d);

            // XOR the 2 vectors
            var res = Avx2.Xor(aVector, bVector);

            // Get the resulting values out of the result vector
            ulong xor1 = res.GetElement(0);
            ulong xor2 = res.GetElement(1);
            ulong xor3 = res.GetElement(2);
            ulong xor4 = res.GetElement(3);

            // Use the resulting values
        }
    }
}

This does give the expected results - but it's 15x slower than just multiplying the scalars!

Am I going wrong somewhere, or mis-using SIMD?

** Update ** I've updated the code to use the 'correct' ways of loading and offloading data to/from the vector, and it's now around 3.75x faster than the scalar code!

byte[] array; // An array from elsewhere
private static readonly Vector256<ulong> PrimeVector = Vector256.Create(SOME_CONSTANT1, SOME_CONSTANT2, SOME_CONSTANT3, SOME_CONSTANT4);

private static unsafe ulong WyHashCore(byte[] array)
{
    // Create space on the stack to hold XOR results
    var xorResult = stackalloc ulong[4];

    fixed (byte* pData = array)
    {
        byte* ptr = pData;

        // Consume 32-byte chunks
        for (int i = 0; i < array.Length; i += 32)
        {
            // Create a 256-bit vector from the 4 64-bit integers
            var vector = Avx.LoadVector256((ulong*)(ptr + i));

            // XOR the 2 vectors
            var res = Avx2.Xor(vector, PrimeVector);

            // Store the resulting vector in memory
            Avx2.Store(xorResult, res);

            // Get the resulting values out of the result vector
            var xor1 = *xorResult;
            var xor2 = *(xorResult + 1);
            var xor3 = *(xorResult + 2);
            var xor4 = *(xorResult + 3);

            // Use the resulting values
        }
    }
}

1 Answers

2
Jacek Blaszczynski On

TL;DR The AVX2 HW intrinsics are used incorrectly what results in generation of very inefficient SIMD code.

The error lies in a way the instructions load, operate on and store data in a buffer. The operation should be performed using AVX/AVX2 Avx2.Xor intrinsics with memory what will speedup load time 4x and would return Vector256. This on the other hand would make call to Vector256.Create redundant and would speedup execution further. Finally, the data should be stored in the array by using Avx2.Store() intrinsic. This again would speedup the code roughly 4x.

The last optimization which should be applied is an exploitation of CPU instruction level parallelism. Usually SIMD instructions are executed over predefined number of CPU cycles with latency which maybe larger than 1 CPU cycle. These parameters are CPU specific and can be found in:

As all optimizations which could be applied are quite complex I will explain them in longer writeup a bit later but in general I would expect up to 4x speedup due to vectorisation in comparison to base case for the problem you are working on.

The code sample you are using is a simple loop modifying data in quad unsigned quadword steps and is a perfect candidate for autovectorization by optimizing compilers. When identical C++ loop is optimized by GCC 9.1 with options -O3 -march=haswell the resulting machine code shows all standard optimizations applied to the loop:

#include <cstdint>
void hash(uint64_t* buffer, uint64_t length) {

    uint64_t* pBuffer = buffer;
    const uint64_t CONST1 = 0x6753ul;
    const uint64_t CONST2 = 0x7753ul;
    const uint64_t CONST3 = 0x8753ul;
    const uint64_t CONST4 = 0x9753ul;

    for(uint64_t i = 0; i < length; i += 4)
    {
        *pBuffer ^= CONST1;
        *(pBuffer + 1) ^= CONST2;
        *(pBuffer + 2) ^= CONST3;
        *(pBuffer + 3) ^= CONST4;
    }
}

Godbolt Compiler Explorer result GCC 9.1

    test    rsi, rsi
    je      .L11
    cmp     rsi, -4
    ja      .L6
    lea     rdx, [rsi-1]
    vmovdqa ymm1, YMMWORD PTR .LC0[rip]
    xor     eax, eax
    shr     rdx, 2
    inc     rdx
.L5:
    vpxor   ymm0, ymm1, YMMWORD PTR [rdi]
    inc     rax
    add     rdi, 32
    vmovdqu YMMWORD PTR [rdi-32], ymm0
    cmp     rax, rdx
    jb      .L5
    vzeroupper
.L11:
    ret
.L6:
    vmovdqa ymm1, YMMWORD PTR .LC0[rip]
    xor     eax, eax
.L3:
    vpxor   ymm0, ymm1, YMMWORD PTR [rdi]
    add     rax, 4
    add     rdi, 32
    vmovdqu YMMWORD PTR [rdi-32], ymm0
    cmp     rsi, rax
    ja      .L3
    vzeroupper
    jmp     .L11
.LC0:
    .quad   26451
    .quad   30547
    .quad   34643
    .quad   38739

Godbolt Compiler Explorer result Clang 8.0

 .LCPI0_0:
    .quad   26451                   # 0x6753
    .quad   30547                   # 0x7753
    .quad   34643                   # 0x8753
    .quad   38739                   # 0x9753
 hash(unsigned long*, unsigned long):                             # @hash(unsigned long*, unsigned long)
    test    rsi, rsi
    je      .LBB0_3
    xor     eax, eax
    vmovaps ymm0, ymmword ptr [rip + .LCPI0_0] # ymm0 = [26451,30547,34643,38739]
 .LBB0_2:                                # =>This Inner Loop Header: Depth=1
    vxorps  ymm1, ymm0, ymmword ptr [rdi + 8*rax]
    vmovups ymmword ptr [rdi + 8*rax], ymm1
    add     rax, 4
    cmp     rax, rsi
    jb      .LBB0_2
 .LBB0_3:
    vzeroupper
    ret