I am developing my own implementation of sparse BLAS functions for the CSC (compressed sparse column) storage format. To do so, I created the following data structure:
typedef struct SparseMatrixCSC {
    int m;           // Number of rows
    int n;           // Number of columns
    int nnz;         // Number of stored values
    double *val;     // Stored values
    int *row_idx;    // Row indices of stored values
    int *col_start;  // Column j holds the entries with indices col_start[j] .. col_start[j+1]-1
} SparseMatrixCSC;
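For concreteness, here is a tiny example of what the structure encodes (purely illustrative values, not part of the library):

/* Illustrative 3x4 matrix stored in the structure above:
 *
 *     [ 1  0  0  2 ]
 * A = [ 0  3  0  0 ]
 *     [ 4  0  5  0 ]
 */
SparseMatrixCSC A = {
    .m = 3, .n = 4, .nnz = 5,
    .val       = (double[]){1., 4., 3., 5., 2.},  // nonzeros in column-major order
    .row_idx   = (int[]){0, 2, 1, 2, 0},          // row of each stored value
    .col_start = (int[]){0, 2, 3, 4, 5}           // col_start[n] == nnz
};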
Then I wanted to use OpenMP to parallelize the matrix-vector product (SpMV), y = A*x. I tried several approaches to avoid the race conditions on y.
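For reference, the serial kernel being parallelized looks roughly like this (a sketch of the obvious column-by-column loop):

// Serial CSC SpMV, y = A*x (sketch). Different columns may touch the same
// row of y, which is why splitting the j loop across threads creates races.
void dcscmv_seq(SparseMatrixCSC *A, double *x, double *y) {
    for (int i = 0; i < A->m; i++) y[i] = 0.;
    for (int j = 0; j < A->n; j++) {
        for (int ii = A->col_start[j]; ii < A->col_start[j+1]; ii++) {
            y[A->row_idx[ii]] += A->val[ii] * x[j];
        }
    }
}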
First, I used atomic operations as follows:
void dcscmv_atomic(SparseMatrixCSC *A, double *x, double *y) {
    for (int i = 0; i < A->m; i++) y[i] = 0.;
    #pragma omp parallel for
    for (int j = 0; j < A->n; j++) {
        for (int ii = A->col_start[j]; ii < A->col_start[j+1]; ii++) {
            #pragma omp atomic
            y[A->row_idx[ii]] += A->val[ii] * x[j];
        }
    }
}
This works correctly, but it is terribly slow and actually yields a slowdown rather than a speedup.
Second, I tried OpenMP's built-in array reduction (a reduction over an array section) as follows:
void dcscmv_builtin_array_reduction(SparseMatrixCSC *A, double *x, double *y) {
    for (int i = 0; i < A->m; i++) y[i] = 0.;
    #pragma omp parallel for reduction(+:y[:A->m])
    for (int j = 0; j < A->n; j++) {
        for (int ii = A->col_start[j]; ii < A->col_start[j+1]; ii++) {
            y[A->row_idx[ii]] += A->val[ii] * x[j];
        }
    }
}
This code compiles and runs correctly with a single thread, but it crashes with a segmentation fault as soon as multiple threads are used.
Third, since I could not get OpenMP's built-in array reduction to work, I tried coding my own reduction as follows:
void dcscmv_array_reduction_from_scratch(SparseMatrixCSC *A, double *x, double *y) {
    for (int i = 0; i < A->m; i++) y[i] = 0.;
    double *YP;  // P stacked private copies of y, one per thread
    #pragma omp parallel
    {
        int P = omp_get_num_threads();
        int p = omp_get_thread_num();
        #pragma omp single
        {
            YP = (double*)mkl_malloc(A->m * P * sizeof(double), sizeof(double));
            for (int i = 0; i < A->m * P; i++) YP[i] = 0.;
        }  // implicit barrier: every thread sees the allocated, zeroed buffer
        #pragma omp for
        for (int j = 0; j < A->n; j++) {
            for (int ii = A->col_start[j]; ii < A->col_start[j+1]; ii++) {
                YP[p * A->m + A->row_idx[ii]] += A->val[ii] * x[j];
            }
        }
        #pragma omp for
        for (int i = 0; i < A->m; i++) {  // reduce the per-thread copies into y
            for (int q = 0; q < P; q++) {
                y[i] += YP[A->m * q + i];
            }
        }
    }
    mkl_free(YP);
}
This function works correctly, but it still gives me a slowdown, although not as severe as with dcscmv_atomic.
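To compare the variants against the serial kernel, a minimal wall-clock harness along these lines is enough (just a sketch; it assumes A, x, and y are already allocated and filled):

#include <omp.h>

// Run one SpMV variant several times and report the best wall-clock time.
double time_spmv(void (*spmv)(SparseMatrixCSC*, double*, double*),
                 SparseMatrixCSC *A, double *x, double *y, int reps) {
    double best = 1e30;
    for (int r = 0; r < reps; r++) {
        double t0 = omp_get_wtime();
        spmv(A, x, y);
        double t1 = omp_get_wtime();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best;
}

A call like time_spmv(dcscmv_atomic, &A, x, y, 10) then gives the best of ten runs for each variant.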
I still have hope that if I get dcscmv_builtin_array_reduction to work, I might be able to get some speedup. Hence, my question is: what is wrong with the way I implemented the array reduction in dcscmv_builtin_array_reduction, and how can I get rid of this segmentation fault?
In short: I applied OpenMP's built-in array reduction with #pragma omp parallel for reduction(+:y[:A->m]); the code compiles, but it crashes with a segmentation fault at run time whenever more than one thread is used.
Your problem is the indirect addressing of y through A->row_idx, which makes the loop not trivially parallel. As you have noticed, the simple solutions either don't work or are slow. There are two ways out.