CUDA: Allocating 2D array on GPU


I have already read the following thread, but I couldn't get my code to work.
I am trying to allocate a 2D array on GPU, fill it with values, and copy it back to the CPU. My code is as follows:

__global__ void Kernel(char **result, int N)
{
    //do something like result[0][0]='a';
}

int N=20;
int Count=5;
char **result_h=(char**)malloc(sizeof(char*)*Count);
char **result_d;
cudaMalloc(&result_d, sizeof(char*)*Count);
for(int i=0;i<Count;i++)
{
    result_h[i] = (char*)malloc(sizeof(char)*N);
    cudaMalloc(&result_d[i], sizeof(char)*N); //get exception here
}

//call kernel
//copy values from result_d to result_h
printf("%c",result_h[0][0]); //should print a

How can I achieve this?


There are 4 answers

peakxu On

For the simplest 2D operations on a GPU, I'd recommend you just treat the data as a 1D array. cudaMalloc a block of size w*h*sizeof(char). You can then access the element at column i, row j through index j*w+i.
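The flattened layout can be sketched as follows. This is a minimal example, not the asker's code: the names fillFlat and makeFlat2D are hypothetical, and the 16x16 block size is an arbitrary assumption.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Each thread writes one element of a w x h array stored row by row.
__global__ void fillFlat(char *data, int w, int h)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // column
    int j = blockIdx.y * blockDim.y + threadIdx.y; // row
    if (i < w && j < h)
        data[j * w + i] = 'a' + (j % 26);          // row j holds a single letter
}

// Hypothetical helper: allocates, fills, and copies back a w x h char array.
// Returns an empty vector if a CUDA call fails.
std::vector<char> makeFlat2D(int w, int h)
{
    assert(w > 0 && h > 0);
    std::vector<char> host(static_cast<size_t>(w) * h);
    char *dev = nullptr;
    if (cudaMalloc(&dev, host.size()) != cudaSuccess) return {};

    dim3 block(16, 16);                            // assumed block shape
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    fillFlat<<<grid, block>>>(dev, w, h);

    // One contiguous copy brings the whole array back.
    cudaError_t err = cudaMemcpy(host.data(), dev, host.size(),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return {};
    return host;
}
```

With the question's sizes, makeFlat2D(20, 5) would give 100 chars, and element (i, j) on the host is host[j * 20 + i].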

Alternatively, you could use cudaMallocArray to get a 2D array. This has a better sense of locality than linear mapped 2D memory. You can easily bind this to a texture, for example.

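As for your example, it fails because cudaMalloc writes the new device address into the pointer you pass, and that pointer must live in host memory. You allocated result_d on the device, so &result_d[i] is a device address, and the call crashes when the host tries to write to it. If you change the allocation of result_d to a regular malloc, the loop works: result_d then becomes a host array of device pointers. To pass it to a kernel as char**, you still have to copy that pointer array into device memory.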
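If you do want to keep the char** layout, one way it could look is sketched below. The names fillRows and makeRows are hypothetical, the block size of 128 is an assumption, and error checks are left out to keep it short.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// One thread per row; rows is a device array of device pointers.
__global__ void fillRows(char **rows, int count, int n)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < count)
        for (int c = 0; c < n; ++c)
            rows[r][c] = 'a';
}

// Hypothetical helper; CUDA error checks omitted for brevity.
std::vector<std::vector<char>> makeRows(int count, int n)
{
    assert(count > 0 && n > 0);

    // 1. Host-side table that holds the device row pointers.
    std::vector<char*> rows_h(count, nullptr);
    for (int i = 0; i < count; ++i)
        cudaMalloc(&rows_h[i], n);   // &rows_h[i] is a host address, so this is safe

    // 2. Device-side copy of that table; this is what the kernel receives.
    char **rows_d = nullptr;
    cudaMalloc(&rows_d, count * sizeof(char*));
    cudaMemcpy(rows_d, rows_h.data(), count * sizeof(char*),
               cudaMemcpyHostToDevice);

    fillRows<<<(count + 127) / 128, 128>>>(rows_d, count, n);

    // 3. Copy each row back through the device pointers kept on the host.
    std::vector<std::vector<char>> result(count, std::vector<char>(n));
    for (int i = 0; i < count; ++i) {
        cudaMemcpy(result[i].data(), rows_h[i], n, cudaMemcpyDeviceToHost);
        cudaFree(rows_h[i]);
    }
    cudaFree(rows_d);
    return result;
}
```

Note the cost: Count + 1 allocations and Count copies back, compared with one of each for a flat array.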

That said, perhaps one of the two options I outlined above might work better from an ease of code maintenance perspective.

fabmilo On

When allocating in that way you are allocating addresses that are valid on the CPU memory. The value of the addresses is transferred as a number without problems, but once on the device memory the char* address will not have meaning.

Create an array of N * max text length, and another array of length N that tells how long each word is.
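A rough sketch of that layout, assuming fixed-width slots of maxLen chars per word. The upper-casing kernel is only a stand-in for real work, and upperWords and upperOnGpu are hypothetical names; error checks are omitted.

```cuda
#include <cassert>
#include <cstring>
#include <string>
#include <vector>
#include <cuda_runtime.h>

// Word i occupies text[i*maxLen] .. text[i*maxLen + len[i] - 1].
__global__ void upperWords(char *text, const int *len, int n, int maxLen)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    char *word = text + i * maxLen;
    for (int k = 0; k < len[i]; ++k)
        if (word[k] >= 'a' && word[k] <= 'z')
            word[k] -= 'a' - 'A';
}

// Hypothetical helper; CUDA error checks omitted for brevity.
std::vector<std::string> upperOnGpu(const std::vector<std::string> &words,
                                    int maxLen)
{
    int n = static_cast<int>(words.size());
    assert(n > 0 && maxLen > 0);

    // Pack all words into one n * maxLen buffer plus a lengths array.
    std::vector<char> text_h(static_cast<size_t>(n) * maxLen, 0);
    std::vector<int> len_h(n);
    for (int i = 0; i < n; ++i) {
        assert(static_cast<int>(words[i].size()) <= maxLen);
        len_h[i] = static_cast<int>(words[i].size());
        std::memcpy(&text_h[static_cast<size_t>(i) * maxLen],
                    words[i].data(), len_h[i]);
    }

    char *text_d = nullptr;
    int *len_d = nullptr;
    cudaMalloc(&text_d, text_h.size());
    cudaMalloc(&len_d, n * sizeof(int));
    cudaMemcpy(text_d, text_h.data(), text_h.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(len_d, len_h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    upperWords<<<(n + 127) / 128, 128>>>(text_d, len_d, n, maxLen);

    cudaMemcpy(text_h.data(), text_d, text_h.size(), cudaMemcpyDeviceToHost);
    cudaFree(text_d);
    cudaFree(len_d);

    // Unpack each slot back into a string of its original length.
    std::vector<std::string> out(n);
    for (int i = 0; i < n; ++i)
        out[i].assign(&text_h[static_cast<size_t>(i) * maxLen], len_h[i]);
    return out;
}
```

Only two device buffers are needed regardless of how many words there are, and no device pointer is ever dereferenced on the host.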

This is a bit more advanced, but if you are processing a set of defined texts (passwords, for example), I would suggest grouping them by text length and creating a specialized kernel for each length:

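template<int text_width>
__global__ void Kernel(char *result, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // one text per thread
    if (idx >= N) return;
    for (int i = 0; i < text_width; ++i)
        result[idx * text_width + i] = 'a';          // result is a flat N * text_width array
}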

and in the kernel invocation code you specify:

switch (text_length) {
case 16:
    Kernel<16><<<blocks, threads>>>(result_d, N); // blocks, threads: your launch configuration
    break;
}

Paul R On

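You can't dereference or write through device pointers in host code. The cudaMalloc call inside the loop asks the host to store the new pointer at &result_d[i], which is an address in device memory, and that is why it fails. You should probably just allocate a single contiguous block of memory and then treat that as a flattened 2D array.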

user1210922 On

The following code sample allocates a width×height 2D array of floating-point values and shows how to loop over the array elements in device code [1]:

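// host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;   // cudaMallocPitch expects a size_t*, not an int*
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
myKernel<<<100, 192>>>(devPtr, pitch, width, height);

// device code
__global__ void myKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        // pitch is in bytes, so step through rows with a char* cast
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}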
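To get pitched data back to the CPU, cudaMemcpy2D can strip the row padding during the copy. The sketch below assumes a tightly packed host buffer; fillPitched and makePitched are hypothetical names, and error checks are omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// One thread per element; each row starts pitch bytes after the previous one.
__global__ void fillPitched(float *devPtr, size_t pitch, int width, int height)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (c < width && r < height) {
        float *row = (float*)((char*)devPtr + r * pitch);
        row[c] = static_cast<float>(r * width + c);
    }
}

// Hypothetical helper; CUDA error checks omitted for brevity.
std::vector<float> makePitched(int width, int height)
{
    assert(width > 0 && height > 0);
    float *devPtr = nullptr;
    size_t pitch = 0;
    cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);

    dim3 block(16, 16);   // assumed block shape
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    fillPitched<<<grid, block>>>(devPtr, pitch, width, height);

    // Destination pitch is width * sizeof(float): the host rows have no padding.
    std::vector<float> host(static_cast<size_t>(width) * height);
    cudaMemcpy2D(host.data(), width * sizeof(float),
                 devPtr, pitch,
                 width * sizeof(float), height,
                 cudaMemcpyDeviceToHost);
    cudaFree(devPtr);
    return host;
}
```

On the host, element (row r, column c) is then host[r * width + c], regardless of the pitch the device chose.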

The following code sample allocates a width×height CUDA array of one 32-bit floating-point component [1]:

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
cudaArray* cuArray;
cudaMallocArray(&cuArray, &channelDesc, width, height);

The following code sample copies the 2D array to the CUDA array allocated in the previous code samples [1]:

cudaMemcpy2DToArray(cuArray, 0, 0, devPtr, pitch, width * sizeof(float), height, cudaMemcpyDeviceToDevice);

The following code sample copies a host memory array to device memory [1]:

float data[256]; 
int size = sizeof(data); 
float* devPtr; 
cudaMalloc((void**)&devPtr, size); 
cudaMemcpy(devPtr, data, size, cudaMemcpyHostToDevice);

Once you understand these examples, you can adapt them to your purpose.

[1] NVIDIA CUDA Compute Unified Device Architecture