Matrix Multiplication on SYCL using 2D std::vector

1.5k views Asked by At

I'm new to SYCL and C++. This is my kernel for simple matrix multiplication using 2D std::vector.


void MatrixMulParallel(queue& q, 
    const std::vector<std::vector<double>>& a_host,
    const std::vector<std::vector<double>>& b_host,
    std::vector<std::vector<double>>& c_gpu) {
    /*
        To Multiply: C[M][P] = A[M][N] * B[N][P]
    */
    PROFILE_FUNCTION();
    try {
        size_t M = a_host.size();
        size_t N = a_host[0].size();
        size_t P = b_host[0].size();
        // Create device buffers for A, B, C
        buffer a(a_host.data(), range<2>{M, N});
        buffer b(b_host.data(), range<2>{N, P});
        buffer c(c_gpu.data(), range<2>{M, P});

        PROFILE_SCOPE("Starting Multiply on GPU");
        std::cout << "GPU::Multiplying A and B into C.\n";
        auto e = q.submit([&](handler& h) {

            auto A = a.get_access<access::mode::read>(h);
            auto B = b.get_access<access::mode::read>(h);
            auto C = c.get_access<access::mode::write>(h);

            h.parallel_for(range<2>{M, P}, [=](id<2> index) {
                // index[0] allows accessing ROW index, index[1] is column index
                
                int row = index[0];
                int col = index[1];
                auto sum = 0.0;
                for (int i = 0; i < N; i++)
                    sum += A[row][i] * B[i][col]; // Error #1
                C[index] = sum; // Error #2
                });
            });
        e.wait();
    }
    catch (sycl::exception const& e) {
        std::cout << "An exception is caught while multiplying matrices.\n";
        terminate();
    }
}

I get two errors, indicated along the lines:

  1. Error #1: invalid operands to binary expression ('const std::vector<double, std::allocator<double>>' and 'const std::vector<double, std::allocator<double>>')
  2. Error #2: no viable overloaded '='

I tried looking for errors similar to invalid operands for binary expression (...), but none of them seem to help debug my specific case. Maybe because this ain't beginner-friendly.

From what I understood so far, a_host.data() shows a return type std::vector<double> (shouldn't it be std::vector< std::vector<double> >?).

I have tried using std::array with statically known sizes, and it works.

How can I make this work using 2D std::vector?

Any help would be appreciated.

2

There are 2 answers

0
Karan Shah On BEST ANSWER

A 2D std::vector<std::vector<T>> does not have elements stored contiguously in the memory.

A better way around would be to declare std::vector<T> with sizes M*N, i.e. linear arrays, and operate on them as contiguous blocks.

Since the destination vector C, is supposed to be 2D, create a kernel that indexes in both rows and cols. SYCL index actually fills up in linearly-accessible memory blocks.

Here's what I did to make it work using std::vector:

template <typename T>
void MatrixMulParallelNaive(queue& q, 
    const std::vector<T>& a_host,
    const std::vector<T>& b_host,
    std::vector<T>& c_gpu) {
    /*
        To Multiply: C[M][P] = A[M][N] * B[N][P]
    */
    PROFILE_FUNCTION();
    try {
        
        buffer<double, 1> a(a_host.data(), range<1>{a_host.size()}); // 1D
        buffer<double, 1> b(b_host.data(), range<1>{b_host.size()}); // 1D
        buffer<double, 2> c(c_gpu.data(), range<2>{M, P}); // Create 2D buffer
        PROFILE_SCOPE("Starting Multiply on GPU");
        std::cout << "GPU::Multiplying A and B into C.\n";
        auto e = q.submit([&](handler& h) {

            auto A = a.get_access<access::mode::read>(h);
            auto B = b.get_access<access::mode::read>(h);
            auto C = c.get_access<access::mode::write>(h);
            
            h.parallel_for(range<2>{M, P}, [=](id<2> index) {
                // Threading index that iterates over C.
                int row = index[0];
                int col = index[1];
                auto sum = 0.0;
                // Compute result of ONE element of C
                for (int i = 0; i < N; i++)
                    sum += A[row * M + i] * B[i * N + col];
                C[index] = sum;
                });
            });
        e.wait();
    }
    catch (sycl::exception const& e) {
        std::cout << "An exception is caught while multiplying matrices.\n";
        terminate();
    }
}


0
Ronan Keryell On

More generally, avoid non compact data structure when doing HPC. It is less friendly for the memory hierarchy than contiguous array elements and the initialization is complex. Use instead things similar to md_span and md_array (basically Fortran arrays on steroids :-) ).