CUDA: Allocating 2D array on GPU


I have already read the following thread, but I couldn't get my code to work.
I am trying to allocate a 2D array on GPU, fill it with values, and copy it back to the CPU. My code is as follows:

__global__ void Kernel(char **result,int N)
{
    //do something like result[0][0]='a';
}
int N=20;
int Count=5;
char **result_h=(char**)malloc(sizeof(char*)*Count);
char **result_d; 
cudaMalloc(&result_d, sizeof(char*)*Count);
for(int i=0;i<Count;i++)
{
    result_h[i] = (char*)malloc(sizeof(char)*N);    
    cudaMalloc(&result_d[i], sizeof(char)*N); //get exception here
}

//call kernel
//copy values from result_d to result_h
printf("%c",result_h[0][0]); //should print a

How can I achieve this?


There are 4 answers

peakxu

For the simplest 2D operations on a GPU, I'd recommend treating the data as a 1D array. cudaMalloc a block of w*h*sizeof(char) bytes; you can then access element (i,j) at index j*w+i, where i is the column and j is the row.
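The flattened layout described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the kernel `fill` and the helper `flat_demo` are hypothetical names, the 16x16 block size is an arbitrary choice, and error handling is reduced to returning '\0'.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread writes one element of a w x h grid stored as one flat buffer.
__global__ void fill(char *data, int w, int h)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (i < w && j < h)
        data[j * w + i] = 'a' + j;                  // row j holds the letter 'a' + j
}

// Hypothetical helper: fills a w x h grid on the GPU and returns element (i, j),
// or '\0' if an allocation or copy fails.
char flat_demo(int w, int h, int i, int j)
{
    size_t bytes = (size_t)w * h * sizeof(char);
    char *data_h = (char *)malloc(bytes);
    char *data_d = NULL;
    char out = '\0';
    if (data_h && cudaMalloc(&data_d, bytes) == cudaSuccess) {
        dim3 block(16, 16);
        dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
        fill<<<grid, block>>>(data_d, w, h);
        // A single copy brings the whole grid back to the host.
        if (cudaMemcpy(data_h, data_d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess)
            out = data_h[j * w + i];
    }
    cudaFree(data_d);
    free(data_h);
    return out;
}
```

With the sizes from the question, `flat_demo(20, 5, 0, 0)` should return 'a'. Because the data is one contiguous block, one cudaMalloc and one cudaMemcpy are enough.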

Alternatively, you could use cudaMallocArray to get a 2D CUDA array. Its layout is opaque and optimized for texture fetches, so it gives better 2D locality than linearly mapped memory, and you can easily bind it to a texture.

As for your example, it fails because cudaMalloc writes the address of the new device block into a pointer that must live in host memory. Your code allocated the pointer table result_d on the device, so &result_d[i] is a device address that the host cannot write to. If you change the cudaMalloc call for result_d to a regular malloc, the loop will succeed; the table of device pointers then has to be copied into a device-side array with cudaMemcpy before the kernel can index it.
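If you do want to keep the char** interface, one way to arrange it is sketched below, assuming a host-side table of device pointers that is copied to the device before the launch. The names `fill_rows`, `table_demo`, `rows_h` and `rows_d` are hypothetical, and error handling is kept minimal.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// One block per row: row r is filled with the letter 'a' + r.
__global__ void fill_rows(char **rows, int n)
{
    int r = blockIdx.x;
    for (int c = threadIdx.x; c < n; c += blockDim.x)
        rows[r][c] = 'a' + r;
}

// Hypothetical helper: returns element (r, c) of a count x n array of rows,
// or '\0' if an allocation or copy fails.
char table_demo(int count, int n, int r, int c)
{
    // Host-side table whose entries are device addresses.
    char **rows_h = (char **)calloc(count, sizeof(char *));
    char **rows_d = NULL;                  // device copy of that table
    char *row_out = (char *)malloc(n);
    char out = '\0';
    bool ok = rows_h && row_out;

    // The host writes each new device address into host memory, which is legal.
    for (int i = 0; ok && i < count; i++)
        ok = cudaMalloc(&rows_h[i], n) == cudaSuccess;
    ok = ok && cudaMalloc(&rows_d, count * sizeof(char *)) == cudaSuccess;
    // Copy the table itself to the device so the kernel can index it.
    ok = ok && cudaMemcpy(rows_d, rows_h, count * sizeof(char *),
                          cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        fill_rows<<<count, 32>>>(rows_d, n);
        // Rows come back one at a time, using the addresses kept on the host.
        if (cudaMemcpy(row_out, rows_h[r], n, cudaMemcpyDeviceToHost) == cudaSuccess)
            out = row_out[c];
    }
    for (int i = 0; rows_h && i < count; i++)
        cudaFree(rows_h[i]);
    cudaFree(rows_d);
    free(rows_h);
    free(row_out);
    return out;
}
```

With the question's sizes, `table_demo(5, 20, 0, 0)` should return 'a'. The cost of this layout is one allocation and one copy per row, which is part of why a single contiguous block is usually easier to maintain.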

That said, perhaps one of the two options I outlined above might work better from an ease of code maintenance perspective.

fabmilo

When you allocate that way, you get addresses that are valid only in CPU memory. The address values can be transferred to the device as plain numbers without problems, but once they are in device memory the char* values no longer point to anything meaningful.

Create an array of N * max text length, and another array of length N that tells how long each word is.
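The two-array layout suggested above might look like the sketch below. It assumes a maximum text length of 8 (`MAX_LEN`), three sample words, and an upper-casing kernel chosen purely for illustration; `upper` and `words_demo` are hypothetical names.

```cuda
#include <cstring>
#include <cuda_runtime.h>

#define MAX_LEN 8   // assumed maximum text length for this sketch

// One thread per word: upper-case the first lengths[w] characters of word w.
__global__ void upper(char *words, const int *lengths, int n)
{
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w >= n) return;
    char *word = words + w * MAX_LEN;      // word w starts at a fixed stride
    for (int i = 0; i < lengths[w]; i++)
        if (word[i] >= 'a' && word[i] <= 'z')
            word[i] -= 'a' - 'A';
}

// Hypothetical helper: packs three sample words, processes them on the GPU,
// and returns the first character of word `which` (0..2), or '\0' on failure.
char words_demo(int which)
{
    const char *src[3] = {"cuda", "gpu", "array"};
    char words_h[3 * MAX_LEN] = {0};       // N * max text length
    int lengths_h[3];                      // how long each word is
    for (int w = 0; w < 3; w++) {
        lengths_h[w] = (int)strlen(src[w]);
        memcpy(words_h + w * MAX_LEN, src[w], lengths_h[w]);
    }
    char *words_d = NULL;
    int *lengths_d = NULL;
    char out = '\0';
    if (cudaMalloc(&words_d, sizeof(words_h)) == cudaSuccess &&
        cudaMalloc(&lengths_d, sizeof(lengths_h)) == cudaSuccess &&
        cudaMemcpy(words_d, words_h, sizeof(words_h),
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(lengths_d, lengths_h, sizeof(lengths_h),
                   cudaMemcpyHostToDevice) == cudaSuccess) {
        upper<<<1, 32>>>(words_d, lengths_d, 3);
        if (cudaMemcpy(words_h, words_d, sizeof(words_h),
                       cudaMemcpyDeviceToHost) == cudaSuccess)
            out = words_h[which * MAX_LEN];
    }
    cudaFree(words_d);
    cudaFree(lengths_d);
    return out;
}
```

Here `words_demo(0)` should return 'C'. Padding every word to `MAX_LEN` wastes some memory but keeps indexing trivial and avoids any device-side pointer table.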

This is a bit more advanced, but if you are processing a defined set of texts (passwords, for example), I would suggest grouping them by text length and creating a specialized kernel for each length:

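template<int text_width>
__global__ void Kernel(char *result, int N)
{
    // one thread per word; words are stored back to back, text_width chars each
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;
    for (int i = 0; i < text_width; ++i)
        result[idx * text_width + i] = 'a';
}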

and in the kernel invocation code you specify:

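switch (text_length) {
case 16:
    // blocks, threads and result_d are placeholders for your own launch setup
    Kernel<16><<<blocks, threads>>>(result_d, N);
    break;
}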

Paul R

You can't dereference device pointers in host code. The cudaMalloc call inside the loop asks the host to write the new pointer to &result_d[i], which is an address in device memory, and that is why it fails. You should probably just allocate a single contiguous block of memory and then treat that as a flattened 2D array.

user1210922

The following code sample allocates a width×height 2D array of floating-point values and shows how to loop over the array elements in device code [1]:

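// host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;   // cudaMallocPitch expects a size_t*, not an int*
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
myKernel<<<100, 192>>>(devPtr, pitch, width, height);

// device code
__global__ void myKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        // pitch is in bytes, so step through rows with a char* cast
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}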
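To get pitched data back to the host, the row padding has to be taken into account. The sketch below assumes cudaMemcpy2D for the transfer and a tightly packed host buffer; `fill_pitched` and `pitched_demo` are hypothetical names, and the value written to each element is arbitrary.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread writes one element; rows are `pitch` bytes apart on the device.
__global__ void fill_pitched(float *devPtr, size_t pitch, int width, int height)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (c < width && r < height) {
        float *row = (float *)((char *)devPtr + r * pitch);
        row[c] = (float)(r * width + c);
    }
}

// Hypothetical helper: returns element (r, c) after a device round trip,
// or -1 if an allocation or copy fails.
float pitched_demo(int width, int height, int r, int c)
{
    float *host = (float *)malloc((size_t)width * height * sizeof(float));
    float *devPtr = NULL;
    size_t pitch = 0;
    float out = -1.0f;
    if (host && cudaMallocPitch(&devPtr, &pitch,
                                width * sizeof(float), height) == cudaSuccess) {
        dim3 block(16, 16);
        dim3 grid((width + 15) / 16, (height + 15) / 16);
        fill_pitched<<<grid, block>>>(devPtr, pitch, width, height);
        // Host rows are tightly packed (width * sizeof(float) bytes apart);
        // device rows are `pitch` bytes apart.
        if (cudaMemcpy2D(host, width * sizeof(float), devPtr, pitch,
                         width * sizeof(float), height,
                         cudaMemcpyDeviceToHost) == cudaSuccess)
            out = host[r * width + c];
    }
    cudaFree(devPtr);
    free(host);
    return out;
}
```

For example, `pitched_demo(10, 4, 2, 3)` should return 23, since element (2, 3) is written as 2 * 10 + 3.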

The following code sample allocates a width×height CUDA array of one 32-bit floating-point component [1]:

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
cudaArray* cuArray;
cudaMallocArray(&cuArray, &channelDesc, width, height);

The following code sample copies the 2D array to the CUDA array allocated in the previous code samples[1]:

cudaMemcpy2DToArray(cuArray, 0, 0, devPtr, pitch, width * sizeof(float), height,
                    cudaMemcpyDeviceToDevice);

The following code sample copies a host memory array to device memory [1]:

float data[256]; 
int size = sizeof(data); 
float* devPtr; 
cudaMalloc((void**)&devPtr, size); 
cudaMemcpy(devPtr, data, size, cudaMemcpyHostToDevice);

Once you understand these examples, you can adapt them to your purpose.

[1] NVIDIA CUDA Compute Unified Device Architecture