I have just begun playing around with my vectorising code. My matrix-vector multiplication code is not being autovectorised by gcc, I’d like to know why. This pastebin contains the output from -fopt-info-vec-missed.
I’m having trouble understanding what the output is telling me and seeing how it matches up to what I’ve written in code.
For instance, I see a number of lines saying not enough data-refs in basic block, I can’t find much detail online with a google search about this. I also see that there’s issues relating to memory alignment e.g. Unknown misalignment, naturally aligned and vector alignment may not be reachable. All of my memory allocation was for double types using malloc, which I believed was guaranteed to be aligned for that type.
Environment: compiling with gcc on WSL2
gcc -v: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define N 4000 // Matrix size will be N x N
#define T 1
//gcc -fopenmp -g vectorisation.c -o main -O3 -march=native -fopt-info-vec-missed=missed.txt
void doParallelComputation(double *A, double *V, double *results, unsigned long matrixSize, int numThreads)
{
    omp_set_num_threads(numThreads);
    unsigned long i, j;
    #pragma omp parallel for simd private(j)
    for (i = 0; i < matrixSize; i++)
    {
        // double *AHead = &A[i * matrixSize];
        // double tmp = 0;
        for (j = 0; j < matrixSize; j++)
        {
            results[i] += A[i * matrixSize + j] * V[j];
            // also tried tmp += A[i * matrixSize + j] * V[j];
        }
        // results[i] = tmp;
    }
}
void genRandVector(double *S, unsigned long size)
{
    srand(time(0));
    unsigned long i;
    for (i = 0; i < size; i++)
    {
        double n = rand() % 5;
        S[i] = n;
    }
}
void genRandMatrix(double *A, unsigned long size)
{
    srand(time(0));
    unsigned long i, j;
        for (i = 0; i < size; i++)
        {
            for (j = 0; j < size; j++)
            {
                double n = rand() % 5;
                A[i*size + j] = n;
            }
        }
    }
int main(int argc, char *argv[])
{
    double *V = (double *)malloc(N * sizeof(double));     // v in our A*v = parV computation
    double *parV = (double *)malloc(N * sizeof(double));  // Parallel computed vector
    double *A = (double *)malloc(N * N * sizeof(double)); // NxN Matrix to multiply by V
    genRandVector(V, N);
    doParallelComputation(A, V, parV, N, T);
    free(parV);
    free(A);
    free(V);
    
    return 0;
}
 
                        
Adding
double *restrict resultsto promise non-overlapping input/output helped, without OpenMP but with-ffast-math. https://godbolt.org/z/qaPh1vYou need to tell OpenMP about reductions specifically, to let it relax FP-math associativity. (
-ffast-mathdoesn't help the OpenMP vectorizer). With that as well, we get what you want:#pragma omp simd reduction(+:tmp)With just
restrictand no-ffast-mathor-fopenmp, you get total garbage: it does a SIMD FP multiply, but then unpacks that for 4xvaddsdinto the scalar accumulator, not helping hide FP latency at all.With
restrictand-fopenmp(without fast-math), it just does scalar FMA.With
restrictand-ffast-math(without-fopenmpor#pragmacommented) it auto-vectorizes nicely:vfmadd231pd ymminside the loop, shuffle / add horizontal sum outside. (But doesn't parallelize). https://godbolt.org/z/f36oG3With
restrictand-ffast-math(with-fopenmp) it still doesn't auto-vectorize. The OpenMP vectorizer is different, and maybe doesn't take advantage of fast-math, instead needing you to tell it about reductions?Also note that with your data layout, the loop you want to parallelize (outer) is different from the loop you want to vectorize with SIMD (inner). Both the input "vectors" for the inner dot-product loop are in contiguous memory so it makes the most sense to read those, instead of trying to SIMD shuffle data from 4 different columns into one vector to accumulate 4
result[i+0..3]results in 1 vector.However, unrolling the outer loop by 4 to use each
V[j+0..3]with data from 4 different columns would improve computational intensity (closer to 1 load per FMA, rather than 2)(As long as
V[]and a row of the matrix fits in L1d cache, this is good. If not, it's actually pretty bad and should get cache-blocked. Or actually if you unroll the outer loop, 4 rows of the matrix.)Also note that
double tmp = 0;would be a good idea: your current version adds intoresult[i], reading it before writing. That would require zero-init before you could use it as a pure output.Auto-vec auto-par version:
I think this is correct; the asm looks like it auto-parallelized as well as auto-vectorizing the inner loop.
Compiles (Godbolt) with a vectorized inner loop inside the OpenMPified helper function
doParallelComputation._omp_fn.0:Then a horizontal sum of mediocre efficiency after the loop; unfortunately the OpenMP vectorizer isn't as smart as the "normal"
-ftree-vectorizevectorizer, but that requires-ffast-mathto do anything here.