I'm new to CUDA and Thrust, and I'm trying to implement matrix multiplication using only Thrust algorithms, because I want to avoid calling a kernel manually.
Is there a way I can achieve this efficiently? (At least without using 2 nested for loops)
Or do I have to resign myself and call a CUDA kernel?
// My data
thrust::device_vector<float> data(n*m);
thrust::device_vector<float> other(m*r);
thrust::device_vector<float> result(n*r);

// To make indexing faster, not really needed
transpose(other);

// My current approach
for (int i = 0; i < n; ++i)
{
    for (int j = 0; j < r; ++j)
    {
        result[i*r + j] = thrust::inner_product(data.begin() + (i*m), data.begin() + ((i+1)*m),
                                                other.begin() + (j*m), 0.0f);
    }
}
If you are interested in performance (which is usually why people use GPUs for computing tasks), you should not use thrust and you should not call or write your own CUDA kernel. You should use the CUBLAS library. As a learning exercise, if you want to write your own CUDA kernel, you can refer to a first-level-optimized (shared-memory tiled) version in the shared memory section of the CUDA programming guide. If you really want to do this with thrust, it is possible to do it with a single thrust call.
The basic idea is to use an element-wise operation like thrust::transform, as described here. The dot product for each output array element is computed with a functor that contains a loop.
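For illustration, here is a minimal sketch of that idea, assuming the same row-major storage as in the question. The functor name, the small test sizes, and the straightforward (non-transposed) traversal order are my own choices for clarity, not the layout used in the benchmark discussed below:

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>

// Functor: computes one element of the n x r result matrix.
// A is n x m, B is m x r, both row-major (illustrative layout, see note above).
struct dot_functor
{
    const float *A, *B;
    int m, r;
    dot_functor(const float *A_, const float *B_, int m_, int r_)
        : A(A_), B(B_), m(m_), r(r_) {}

    __host__ __device__
    float operator()(int idx) const
    {
        int row = idx / r;
        int col = idx % r;
        float sum = 0.0f;
        for (int k = 0; k < m; ++k)          // per-output-element dot-product loop
            sum += A[row*m + k] * B[k*r + col];
        return sum;
    }
};

int main()
{
    const int n = 4, m = 3, r = 5;                  // small sizes, just for illustration
    thrust::device_vector<float> data(n*m, 1.0f);   // A
    thrust::device_vector<float> other(m*r, 2.0f);  // B
    thrust::device_vector<float> result(n*r);       // A*B

    // One element-wise thrust call: each output index maps to one dot product
    thrust::transform(thrust::device,
                      thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(n*r),
                      result.begin(),
                      dot_functor(thrust::raw_pointer_cast(data.data()),
                                  thrust::raw_pointer_cast(other.data()),
                                  m, r));
    return 0;
}
```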
Here's a worked example comparing 3 methods: your original double-nested-loop method (relatively slow), a single-thrust-call method (faster), and the cublas method (fastest, certainly for larger matrix sizes). The code below only runs method 1 for matrix side dimensions of 200 or less because it is so slow. The observations below are from a run on a Tesla P100.
For the default dimension 200 case, the single thrust call and cublas methods are fairly close, but both are much faster than the loop method. For a side dimension of 1024, the cublas method is almost 20x faster than the single-thrust-call method.
Note that I have chosen "optimal" transpose configurations for all 3 methods. For method 1, the best-case timing is when the inner_product is using a "row" from each input matrix (effectively the transpose of the 2nd input matrix). For method 2, the best-case timing is when the functor is traversing a "column" from each input matrix (effectively the transpose of the first input matrix). For method 3, choosing CUBLAS_OP_T for both input matrices seems to be fastest. In reality, only the cublas method has the flexibility to be useful for a variety of input cases with good performance.
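For reference, here is a minimal sketch of how such a cublasSgemm call with CUBLAS_OP_T on both inputs might look. The handle setup and the small sizes are only illustrative and error checking is omitted. Because cublas uses column-major storage, with this arrangement the n x r product is written to the output buffer in column-major order, i.e. transposed relative to the row-major layout used in the question:

```cpp
#include <cublas_v2.h>
#include <thrust/device_vector.h>

int main()
{
    const int n = 4, m = 3, r = 5;                  // illustrative sizes
    thrust::device_vector<float> data(n*m, 1.0f);   // A, row-major n x m
    thrust::device_vector<float> other(m*r, 2.0f);  // B, row-major m x r
    thrust::device_vector<float> result(n*r);       // A*B, stored column-major n x r

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Row-major A looks like A^T to column-major cublas, so CUBLAS_OP_T on
    // both inputs recovers A*B; the result lands in column-major order.
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_T,
                n, r, m,                                     // M, N, K
                &alpha,
                thrust::raw_pointer_cast(data.data()),   m,  // lda
                thrust::raw_pointer_cast(other.data()),  r,  // ldb
                &beta,
                thrust::raw_pointer_cast(result.data()), n); // ldc

    cublasDestroy(handle);
    return 0;
}
```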