How to use OpenMP with OpenBLAS on Apple Silicon M1 Max (macOS Sonoma 14.3.1)?

I have been trying to make OpenMP and OpenBLAS work together, so far without success. I have reviewed many posts and much of the documentation on this topic.

I want to call a BLAS routine inside an OpenMP parallel region. Here is a minimal example showing what I am trying to do:

#include <cblas.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

#ifdef _OPENMP
#include <omp.h>
#endif

#ifndef VAL_N
#define VAL_N 2400
#endif

#ifndef VAL_M
#define VAL_M 24
#endif

#ifndef VAL_LDA
#define VAL_LDA 2401
#endif

// Initialize random array
void random_number(double* array, int size) {
    for (int i = 0; i < size; i++) {
        // Generate number in [0, 1): dividing by RAND_MAX + 1 keeps r below 1
        double r = (double)rand() / ((double)RAND_MAX + 1.0);
        array[i] = r;
    }
}

int main() {
    const int n = VAL_N, m = VAL_M, lda = VAL_LDA, ldb = lda, ldx = lda;
    double a[n][lda];  // column-major order (Fortran)
    double b[m][ldb];
    double x[m][ldx];
    double pr[n];
    double norme;
    int nb_iter_min, nb_iter_max;

#ifdef _OPENMP
    int num_threads = 1;
#pragma omp parallel
    {
        // Only one thread writes the shared variable
#pragma omp single
        num_threads = omp_get_num_threads();
    }
    printf("\nMulti-threaded mode on with %d thread(s).\n\n", num_threads);
#endif

    // Initialize matrix and right-hand side
    random_number(&a[0][0], n * lda);
    random_number(&b[0][0], m * ldb);

    // Muscle up the diagonal
    for (int i = 0; i < n; i++) a[i][i] += 100;

    // Jacobi preconditioner: diagonal of a
    for (int i = 0; i < n; i++) pr[i] = a[i][i];

    // Initial guess
    for (int i = 0; i < m; i++)
        for (int j = 0; j < ldx; j++) x[i][j] = 1;

    // Initial CPU time
    clock_t t_cpu_0 = clock();

    // Pointers to the first element of each array (plain double*, as BLAS expects)
    double* pb = &b[0][0];
    double* pa = &a[0][0];
    double* px = &x[0][0];

    // Initial time.
    struct timeval t_elapsed_0;
    gettimeofday(&t_elapsed_0, NULL);

    double r[n];

#pragma omp parallel for schedule(runtime) private(r)
    for (int j = 0; j < m; j++) {
        double* b_j = pb + j * ldb;  // b(1:n, j)
        double* x_j = px + j * ldx;  // x(1:n, j)

        // r(:) = b(1:n,j) - matmul(a(1:n,1:n),x(1:n,j))
        memcpy(&r[0], b_j, n * sizeof(double));
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, -1.0, pa, lda, x_j, 1,
                    1.0, r, 1);
    }

    // Final time
    struct timeval t_elapsed_1;
    gettimeofday(&t_elapsed_1, NULL);
    double t_elapsed =
        (t_elapsed_1.tv_sec - t_elapsed_0.tv_sec) +
        (t_elapsed_1.tv_usec - t_elapsed_0.tv_usec) / (double)1000000;

    // Final CPU time (clock() reports processor time summed over all threads)
    clock_t t_cpu_1 = clock();
    double t_cpu = (t_cpu_1 - t_cpu_0) / (double)CLOCKS_PER_SEC;

    // Print results
    fprintf(stdout,
            "\n\n"
            "   Elapsed time              : %10.3E sec.\n"
            "   CPU time                  : %10.3E sec.\n",
            t_elapsed, t_cpu);

    return EXIT_SUCCESS;
}
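
One thing that may be worth checking in the listing above: with the default VAL_N = 2400 and VAL_LDA = 2401, the array a alone takes 2400 × 2401 × 8 bytes, about 46 MB, which may exceed the default stack limit. Below is a minimal sketch, assuming heap storage is acceptable, that allocates the column-major data with calloc and computes the same residual r = b_j - A x_j in plain C, so the cblas_dgemv result can be cross-checked on small sizes. The names alloc_doubles and residual_ref are hypothetical helpers introduced for illustration; they are not part of OpenBLAS.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: zero-initialized heap storage for `count` doubles.
   Returns NULL on failure; the caller releases it with free(). */
double *alloc_doubles(size_t count) {
    return calloc(count, sizeof(double));
}

/* Hypothetical reference for r = b - A*x with A stored column-major:
   element (i, k) of A lives at a[i + k * lda]. Intended only for
   cross-checking the cblas_dgemv call on small sizes. */
void residual_ref(int n, const double *a, int lda,
                  const double *x, const double *b, double *r) {
    for (int i = 0; i < n; i++) r[i] = b[i];
    for (int k = 0; k < n; k++) {
        for (int i = 0; i < n; i++) {
            r[i] -= a[(size_t)i + (size_t)k * (size_t)lda] * x[k];
        }
    }
}
```

In the listing, the stack arrays could then be replaced by calls such as alloc_doubles((size_t)n * lda) for a, with the per-thread work vector r allocated inside the parallel loop.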

However, whatever I try, using multiple threads brings no significant speedup. Profiling tools such as Instruments show that the code does not run on multiple threads, regardless of the environment variables I set.
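
Independently of the profiler, the program itself can report which thread executed each column. The following is a small sketch of such a diagnostic, not a fix: record_loop_owners and print_loop_owners are hypothetical names, and the loop mirrors the schedule(runtime) loop of the listing without the BLAS call. If the OpenMP flag is not applied at compile time, the pragma is ignored and every column is reported on thread 0.

```c
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical diagnostic: run the same kind of loop as the solver and store,
   for each column j, the id of the thread that executed it in owner[j].
   Returns the number of distinct thread ids seen (1 means the loop ran
   serially, for example because the OpenMP flag was not applied). */
int record_loop_owners(int m, int *owner) {
#pragma omp parallel for schedule(runtime)
    for (int j = 0; j < m; j++) {
#ifdef _OPENMP
        owner[j] = omp_get_thread_num();
#else
        owner[j] = 0;
#endif
    }

    int distinct = 0;
    for (int j = 0; j < m; j++) {
        int seen_before = 0;
        for (int k = 0; k < j; k++) {
            if (owner[k] == owner[j]) seen_before = 1;
        }
        if (!seen_before) distinct++;
    }
    return distinct;
}

/* Print one line per column so the distribution is visible at a glance. */
void print_loop_owners(int m, const int *owner) {
    for (int j = 0; j < m; j++) {
        printf("column %2d -> thread %d\n", j, owner[j]);
    }
}
```

Calling record_loop_owners(VAL_M, owner) just before the timed loop and printing the result would show whether the OpenMP runtime distributes the columns at all, separately from whatever the BLAS library does internally.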

Does anyone have a clue how to make OpenMP and OpenBLAS work together?

Here is the whole procedure that I used to build the OpenBLAS library, compile, and run my code.

Building OpenBLAS library

I build the OpenBLAS library from source inside an interactive Homebrew session.

$ brew install -i openblas
$ make clean
$ make USE_OPENMP=1 USE_LOCKING=1 CC=gcc FC=gfortran
$ make USE_OPENMP=1 USE_LOCKING=1 CC=gcc FC=gfortran PREFIX=/your-path-to/openblas install

At the end of the procedure, I get the message Install OK!. However, leaving the Homebrew shell with the command exit produces the error message Error: Empty installation. Since the library files are in the expected location, I ignore this message.

Compiling the code

I compile my code with the following command.

$ gcc -Xclang -fopenmp -I/your-path-to/libomp/include/ -L/your-path-to/libomp/lib/ -lomp \
    -I/your-path-to/openblas/include -L/your-path-to/openblas/lib -lcblas main.c -o my_app
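
The command links with -lcblas; whether that resolves to the OpenBLAS build under /your-path-to/openblas/lib or to another CBLAS implementation on the system is not established by anything shown here. A hedged way to check from inside the program is to look for a symbol that OpenBLAS exports, openblas_get_config, among the libraries already loaded. The helper name openblas_config_or_null is hypothetical, and the sketch assumes the BLAS library is linked dynamically.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: look up openblas_get_config among the symbols already
   loaded into the process. It is exported by OpenBLAS, so finding it suggests
   that the BLAS the program was linked against is OpenBLAS.
   Returns the OpenBLAS build configuration string, or NULL if the symbol is
   not present. Assumes the BLAS library is linked dynamically. */
const char *openblas_config_or_null(void) {
    typedef char *(*config_fn)(void);

    void *self = dlopen(NULL, RTLD_LAZY);  /* handle for the running program */
    if (self == NULL) return NULL;

    /* POSIX permits converting the void* returned by dlsym
       to a function pointer. */
    config_fn get_config = (config_fn)dlsym(self, "openblas_get_config");
    return get_config ? get_config() : NULL;
}
```

Printing the returned string at the start of main (or a message when it is NULL) shows which case applies. If it is NULL, the cblas_dgemv being called does not come from a dynamically loaded OpenBLAS, and the USE_OPENMP=1 build options would have no effect on the program.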

Running the code

Then I run my code exporting the environment variables to get a multithreaded application.

$ export OPENBLAS_NUM_THREADS=1 OMP_SCHEDULE="STATIC,4" OMP_NUM_THREADS=4
$ ./my_app
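
As a side note on these settings: with a static schedule and a chunk size, OpenMP hands out chunks of consecutive iterations round-robin in thread-number order. The sketch below works that out for the values used here (VAL_M = 24 columns, chunk 4, 4 threads); static_chunk_owner and static_chunk_counts are hypothetical helper names, and the model assumes every column costs the same.

```c
/* Thread that executes iteration j under schedule(static, chunk) with
   nthreads threads: chunks of `chunk` consecutive iterations are handed
   out round-robin in thread-number order. */
int static_chunk_owner(int j, int chunk, int nthreads) {
    return (j / chunk) % nthreads;
}

/* Count how many of the m iterations each thread receives.
   counts must have room for nthreads entries. */
void static_chunk_counts(int m, int chunk, int nthreads, int *counts) {
    for (int t = 0; t < nthreads; t++) counts[t] = 0;
    for (int j = 0; j < m; j++) {
        counts[static_chunk_owner(j, chunk, nthreads)]++;
    }
}
```

For m = 24, chunk = 4 and 4 threads there are 6 chunks, so threads 0 and 1 receive 8 columns each while threads 2 and 3 receive 4 each. Even when the loop does run in parallel, the best possible speedup with these settings is therefore 24 / 8 = 3 rather than 4.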

EDIT: OpenMP environment

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201611'
  [host] OMP_AFFINITY_FORMAT='OMP: pid %P tid %i thread %n bound to OS proc set {%A}'
  [host] OMP_ALLOCATOR='omp_default_mem_alloc'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_AFFINITY='FALSE'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='1'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED: deprecated; max-active-levels-var=1
  [host] OMP_NUM_TEAMS='0'
  [host] OMP_NUM_THREADS='4'
  [host] OMP_PROC_BIND='false'
  [host] OMP_SCHEDULE='static,4'
  [host] OMP_STACKSIZE='16M'
  [host] OMP_TARGET_OFFLOAD=DEFAULT
  [host] OMP_TEAMS_THREAD_LIMIT='0'
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_TOOL='enabled'
  [host] OMP_TOOL_LIBRARIES: value is not defined
  [host] OMP_TOOL_VERBOSE_INIT: value is not defined
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END