gcc not autovectorising matrix-vector multiplication


I have just begun playing around with vectorising my code. My matrix-vector multiplication code is not being autovectorised by gcc, and I'd like to know why. This pastebin contains the output from -fopt-info-vec-missed.

I’m having trouble understanding what the output is telling me and seeing how it matches up to what I’ve written in code.

For instance, I see a number of lines saying "not enough data-refs in basic block", but I can't find much detail about that message with a Google search. I also see issues relating to memory alignment, e.g. "Unknown misalignment, naturally aligned" and "vector alignment may not be reachable". All of my memory allocation was for double types using malloc, which I believed was guaranteed to be suitably aligned for that type.
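
For reference, here is a minimal sketch (not part of my actual code) of requesting 32-byte alignment explicitly with C11 aligned_alloc; plain malloc only guarantees alignment suitable for double / max_align_t, not the 32-byte alignment AVX vectors prefer:

#include <stdlib.h>

/* Hypothetical helper: allocate n doubles aligned to 32 bytes for AVX loads.
 * aligned_alloc requires the size to be a multiple of the alignment, which
 * holds here whenever n is a multiple of 4 (e.g. N = 4000). */
static double *alloc_doubles_aligned32(unsigned long n)
{
    return (double *)aligned_alloc(32, n * sizeof(double));
}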

Environment: compiling with gcc on WSL2

gcc -v: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 4000 // Matrix size will be N x N
#define T 1

//gcc -fopenmp -g vectorisation.c -o main -O3 -march=native -fopt-info-vec-missed=missed.txt

void doParallelComputation(double *A, double *V, double *results, unsigned long matrixSize, int numThreads)
{
    omp_set_num_threads(numThreads);
    unsigned long i, j;

    #pragma omp parallel for simd private(j)
    for (i = 0; i < matrixSize; i++)
    {
        // double *AHead = &A[i * matrixSize];
        // double tmp = 0;

        for (j = 0; j < matrixSize; j++)
        {
            results[i] += A[i * matrixSize + j] * V[j];
            // also tried tmp += A[i * matrixSize + j] * V[j];
        }
        // results[i] = tmp;
    }

}

void genRandVector(double *S, unsigned long size)
{
    srand(time(0));
    unsigned long i;

    for (i = 0; i < size; i++)
    {
        double n = rand() % 5;
        S[i] = n;
    }
}

void genRandMatrix(double *A, unsigned long size)
{
    srand(time(0));
    unsigned long i, j;
    for (i = 0; i < size; i++)
    {
        for (j = 0; j < size; j++)
        {
            double n = rand() % 5;
            A[i*size + j] = n;
        }
    }
}

int main(int argc, char *argv[])
{

    double *V = (double *)malloc(N * sizeof(double));     // v in our A*v = parV computation
    double *parV = (double *)malloc(N * sizeof(double));  // Parallel computed vector
    double *A = (double *)malloc(N * N * sizeof(double)); // NxN Matrix to multiply by V
    genRandVector(V, N);
    genRandMatrix(A, N); // fill A; otherwise the multiply reads uninitialized memory
    doParallelComputation(A, V, parV, N, T);

    free(parV);
    free(A);
    free(V);
    
    return 0;
}

1 Answer

Answered by Peter Cordes (accepted answer, score 7)

Adding double *restrict results to promise non-overlapping input/output helped, without OpenMP but with -ffast-math. https://godbolt.org/z/qaPh1v

You need to tell OpenMP about reductions specifically, to let it relax FP-math associativity. (-ffast-math doesn't help the OpenMP vectorizer). With that as well, we get what you want:
#pragma omp simd reduction(+:tmp)


With just restrict and no -ffast-math or -fopenmp, you get total garbage: it does a SIMD FP multiply, but then unpacks that for 4x vaddsd into the scalar accumulator, not helping hide FP latency at all.

With restrict and -fopenmp (without fast-math), it just does scalar FMA.

With restrict and -ffast-math (without -fopenmp, or with the #pragma commented out) it auto-vectorizes nicely: vfmadd231pd ymm inside the loop, shuffle / add horizontal sum outside. (But it doesn't parallelize.) https://godbolt.org/z/f36oG3

With restrict and -ffast-math (with -fopenmp) it still doesn't auto-vectorize. The OpenMP vectorizer is different, and maybe doesn't take advantage of fast-math, instead needing you to tell it about reductions?


Also note that with your data layout, the loop you want to parallelize (outer) is different from the loop you want to vectorize with SIMD (inner). Both the input "vectors" for the inner dot-product loop are in contiguous memory, so it makes the most sense to read those, instead of trying to SIMD shuffle data from 4 different rows into one vector to accumulate 4 results[i+0..3] values in 1 vector.

However, unrolling the outer loop by 4 to use each V[j+0..3] load with data from 4 different rows would improve computational intensity (closer to 1 load per FMA, rather than 2).

(As long as V[] and a row of the matrix fit in L1d cache, this is good. If not, it's actually pretty bad and should get cache-blocked. Or, if you unroll the outer loop, V[] and 4 rows of the matrix.)
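
A rough sketch of that 4-row unroll (not from the original answer, untested, and assuming matrixSize is a multiple of 4 plus the restrict qualifiers used below):

void doParallelComputationUnroll4(double *restrict A, double *restrict V, double *restrict results, unsigned long matrixSize, int numThreads)
{
    omp_set_num_threads(numThreads);
    unsigned long i, j;

    #pragma omp parallel for private(j)
    for (i = 0; i < matrixSize; i += 4)        // assumes matrixSize % 4 == 0
    {
        double t0 = 0, t1 = 0, t2 = 0, t3 = 0;

        #pragma omp simd reduction(+:t0,t1,t2,t3)
        for (j = 0; j < matrixSize; j++)
        {
            double v = V[j];                        // each V load reused across 4 rows
            t0 += A[(i + 0) * matrixSize + j] * v;
            t1 += A[(i + 1) * matrixSize + j] * v;
            t2 += A[(i + 2) * matrixSize + j] * v;
            t3 += A[(i + 3) * matrixSize + j] * v;
        }
        results[i + 0] = t0;
        results[i + 1] = t1;
        results[i + 2] = t2;
        results[i + 3] = t3;
    }
}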


Also note that double tmp = 0; would be a good idea: your current version adds into results[i], reading it before writing, so the array would need zero-init before you could treat it as a pure output.

Auto-vec auto-par version:

I think this is correct; the asm looks like it auto-parallelized as well as auto-vectorizing the inner loop.

void doParallelComputation(double *restrict A, double *restrict V, double *restrict results, unsigned long matrixSize, int numThreads)
{
    omp_set_num_threads(numThreads);
    unsigned long i, j;

    #pragma omp parallel for private(j)
    for (i = 0; i < matrixSize; i++)
    {
        // double *AHead = &A[i * matrixSize];
        double tmp = 0;

         // TODO: unroll outer loop and cache-block it.
        #pragma omp simd reduction(+:tmp)
        for (j = 0; j < matrixSize; j++)
        {
            //results[i] += A[i * matrixSize + j] * V[j];
            tmp += A[i * matrixSize + j] * V[j];
        }
        results[i] = tmp;  // write-only to results, not adding to old value.
    }

}

Compiles (Godbolt) with a vectorized inner loop inside the OpenMPified helper function doParallelComputation._omp_fn.0:

# gcc7.5 -xc -O3 -fopenmp -march=skylake
.L6:
        add     rdx, 1               # loop counter; newer GCC just compares the end-pointer
        vmovupd ymm2, YMMWORD PTR [rcx+rax]            # 32-byte load
        vfmadd231pd     ymm0, ymm2, YMMWORD PTR [rsi+rax]  # 32-byte memory-source FMA
        add     rax, 32                                # pointer increment
        cmp     rdi, rdx
        ja      .L6

Then a horizontal sum of mediocre efficiency after the loop; unfortunately the OpenMP vectorizer isn't as smart as the "normal" -ftree-vectorize vectorizer, but that requires -ffast-math to do anything here.
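
For comparison, this is roughly the kind of horizontal sum you'd write by hand with AVX intrinsics (a sketch, not taken from the compiler output or the original answer):

#include <immintrin.h>

// Sketch: sum the 4 doubles in a __m256d accumulator down to one scalar.
static inline double hsum256_pd(__m256d v)
{
    __m128d lo = _mm256_castpd256_pd128(v);       // low 128 bits (no instruction needed)
    __m128d hi = _mm256_extractf128_pd(v, 1);     // high 128 bits
    lo = _mm_add_pd(lo, hi);                      // add the two halves
    __m128d high64 = _mm_unpackhi_pd(lo, lo);     // move element 1 down to element 0
    return _mm_cvtsd_f64(_mm_add_sd(lo, high64)); // final scalar add
}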