I have written this program and I am having some trouble understanding how to use multiple blocks by using a dim3 variable in the kernel call line. This code works fine when I am doing 1000*1000 matrix multiplication, but I am not getting the correct answer for lower dimensions like 100*100 or 200*200.
#include <stdio.h>
#include <cuda.h>

#define width 1000

__global__ void kernel(int *a, int *b, int *c)
{
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    int ty = threadIdx.y + blockIdx.y * blockDim.y;
    int sum = 0, k;
    for (k = 0; k < width; ++k)
    {
        sum += a[ty * width + k] * b[k * width + tx];
    }
    c[ty * width + tx] = sum;
}

int main()
{
    int a[width * width], c[width * width], b[width * width];
    int *dev_a, *dev_b, *dev_c;
    int i, count = 0;
    int size = (width * width) * sizeof(int);

    for (i = 0; i < (width * width); i++)
    {
        a[i] = 1;
        b[i] = 1;
    }

    cudaMalloc((void **)&dev_a, size);
    cudaMalloc((void **)&dev_b, size);
    cudaMalloc((void **)&dev_c, size);
    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    dim3 dimBlock(20, 20);
    dim3 blockID(50, 50);
    kernel<<<blockID, dimBlock>>>(dev_a, dev_b, dev_c);
    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    for (i = 0; i < (width * width); i++)
    {
        count++;
        if (count == (width + 1))
        {
            count = 1;
            printf("\n");
        }
        printf("%d ", c[i]);
    }
    printf("\n");
    return 0;
}
This code will work for very specific dimensions but not for others. It will work for square matrix multiplication when width is exactly equal to the product of your block dimension (number of threads per block: 20 in the code you have shown) and your grid dimension (number of blocks: 50 in the code you have shown). So when width is 20*50 (1000) it will work as shown. But if you change width to some other value (say 800) and make no other changes, the code won't work. In the case of 800, however, you could get the code working by changing the grid dimension from 50 to 40, so that width = 800 = 20*40. But what if you need to multiply two matrices of width 799? There is no product of grid and block dimensions that matches that width exactly.

This is a fairly standard problem in CUDA programming: we cannot always come up with convenient block and grid dimensions that exactly match the work (i.e. data) size, and if we launch too many threads/blocks, things don't work.
To fix this problem we must do 2 things:

1. Launch at least enough threads (and blocks) to fully cover the data size, even when it is not a whole multiple of the block dimensions.
2. Make the kernel check its computed indices, so that the extra threads (those that fall outside the valid data) do nothing.
To address item 1 above, we modify our grid dimension calculations to something like this:
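A sketch of the standard ceiling-division idiom, keeping the 20x20 block from your code (the variable name dimGrid is my choice):

```cuda
    dim3 dimBlock(20, 20);
    // Integer ceiling division: round the number of blocks up so the grid
    // covers width even when width is not a multiple of the block size,
    // e.g. for width = 799: (799 + 20 - 1) / 20 = 40 blocks -> 800 threads.
    dim3 dimGrid((width + dimBlock.x - 1) / dimBlock.x,
                 (width + dimBlock.y - 1) / dimBlock.y);
```

For width = 1000 this reproduces your original 50x50 grid; for width = 799 it gives a 40x40 grid, whose last row and column of blocks are only partially used.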
To address item 2 above we modify our kernel code to condition thread behavior on whether or not the thread corresponds to valid data:
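Something like the following, with the matrix width passed in as a kernel parameter (I am calling it mwidth here to distinguish it from the compile-time macro; the name is my choice):

```cuda
__global__ void kernel(int *a, int *b, int *c, int mwidth)
{
    int tx = threadIdx.x + blockIdx.x * blockDim.x;
    int ty = threadIdx.y + blockIdx.y * blockDim.y;
    // Only threads that map to a valid (row, column) do any work;
    // extra threads in the partially-filled edge blocks simply exit.
    if ((tx < mwidth) && (ty < mwidth))
    {
        int sum = 0;
        for (int k = 0; k < mwidth; ++k)
            sum += a[ty * mwidth + k] * b[k * mwidth + tx];
        c[ty * mwidth + tx] = sum;
    }
}
```

Without this check, out-of-range threads would read and write past the ends of the arrays, corrupting the result.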
And since we've modified the kernel with a new parameter, we have to pass that parameter on invocation:
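Assuming the grid and block dim3 variables are named dimGrid and dimBlock, the launch becomes:

```cuda
    // width now travels to the device as a runtime argument instead of
    // being baked into the kernel at compile time.
    kernel<<<dimGrid, dimBlock>>>(dev_a, dev_b, dev_c, width);
```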
That should be what is needed to logically extend the code you have shown to handle "arbitrary" dimensions. I would also suggest adding proper CUDA error checking any time you are having trouble with CUDA code.
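One common pattern for that error checking is a macro around cudaGetLastError(), placed after each CUDA API call and kernel launch (a sketch; the macro name is arbitrary):

```cuda
#include <stdio.h>
#include <stdlib.h>

// Report and abort if the most recent CUDA call or kernel launch failed.
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                    msg, cudaGetErrorString(__err), __FILE__, __LINE__); \
            exit(1); \
        } \
    } while (0)
```

For example, `cudaCheckErrors("kernel launch failed");` right after the kernel launch catches launch-configuration errors; a `cudaDeviceSynchronize()` followed by another check catches errors that occur while the kernel is executing.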