I've been working with JCuda for some months now and I can't copy a multidimensional array from device memory to host memory. The funny thing is that I have no problems in doing so in the opposite direction (I can invoke my kernel with multidimensional arrays and everything works with the correct values).
In a few words, I put the results of my kernel in a bi-dimensional array of shorts, where the first dimension of such array is the number of threads, so that each one can write in different locations.
Here an example:
CUdeviceptr pointer_dev = new CUdeviceptr();
cuMemAlloc(pointer_dev, Sizeof.POINTER); // in this case, as an example, it's an array with one element (one thread), but it doesn't matter
// Invoke kernel with pointer_dev as parameter. Now it should contain some results
CUdeviceptr[] arrayPtr = new CUdeviceptr[1]; // It will point to the result
arrayPtr[0] = new CUdeviceptr();
short[] resultArray = new short[3]; // an array of 3 shorts was allocated in the kernel
cuMemAlloc(arrayPtr[0], 3 * Sizeof.SHORT);
cuMemcpyDtoH(Pointer.to(arrayPtr), pointer_dev, Sizeof.POINTER); // Its seems, using the debugger, that the value of arrayPtr[0] isn't changed here!
cuMemcpyDtoH(Pointer.to(resultArray), arrayPtr[0], 3 * Sizeof.SHORT); // Not the expected values in resultArray, probably because of the previous instruction
What am I doing wrong?
EDIT:
Apparently, there are some limitations that doesn't allow device allocated memory to be copied back to host, as stated in this (and many more) threads: link
Any workaround? I'm using CUDA Toolkit v5.0
Here we are copying a two dimensional array of integers from the device to host.
First, create a single dimensional array with size equal to size of another single dimension array (here
blockSizeX
).Allocate device memory for the array of pointers that point to the other array, and copy array pointers from the host to the device.
Launch the kernel.
Transfer the output from the device to host.