How can I pass a struct to a kernel in JCuda


I have already looked at this http://www.javacodegeeks.com/2011/10/gpgpu-with-jcuda-good-bad-and-ugly.html which says I must modify my kernel to take only single dimensional arrays. However I refuse to believe that it is impossible to create a struct and copy it to device memory in JCuda.

I would imagine the usual implementation would be to create a case class (scala terminology) that extends some native api, which can then be turned into a struct that can be safely passed into the kernel. Unfortunately I haven't found anything on google, hence the question.

1 Answer

Marco13 (accepted answer)

(The author of JCuda here (not "JCUDA", please))

As mentioned in the forum post linked from the comment: It is not impossible to use structs in CUDA kernels and fill them from the JCuda side. It is just very complicated, and rarely beneficial.

For the reason why structs are rarely beneficial in GPU programming at all, search for the differences between "Array Of Structures" and "Structure Of Arrays". Usually, the latter is preferred for GPU computations, due to improved memory coalescing, but a thorough treatment is beyond the scope of this answer. Here, I will only summarize why using structs in GPU computing is difficult in general, and particularly difficult in JCuda/Java.
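To illustrate the "Structure Of Arrays" idea: instead of one object per vertex, one keeps one primitive array per field. This is a minimal sketch (the class name and fields are illustrative); each array can then be copied to the device individually with a plain `Pointer.to(array)`, sidestepping struct layout questions entirely:

```java
// "Structure Of Arrays" sketch: one primitive array per field.
// Each array maps directly to a flat device buffer, so JCuda can
// pass it to a kernel without any struct layout concerns.
public class VertexSoA {
    short[] a;
    float[] x;
    float[] y;
    float[] z;
    short[] b;

    VertexSoA(int n) {
        a = new short[n];
        x = new float[n];
        y = new float[n];
        z = new float[n];
        b = new short[n];
    }
}
```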


In plain C, structs are (theoretically!) very simple, regarding the memory layout. Imagine a structure like

struct Vertex {
    short a;
    float x;
    float y;
    float z;
    short b;
};

Now you can create an array of these structs:

Vertex* vertices = (Vertex*)malloc(n*sizeof(Vertex));

These structs are guaranteed to be laid out as one contiguous memory block:

            |   vertices[0]      ||   vertices[1]      |
            |                    ||                    |
vertices -> [ a|  x |  y |  z | b][ a|  x |  y |  z | b]....

Since the CUDA kernel and the C code are compiled with the same compiler, there is not much room for misunderstandings. The host side says "Here is some memory, interpret this as Vertex objects", and the kernel will receive the same memory and work with it.

Still, even in plain C, there is in practice some potential for unexpected problems. Compilers will often introduce paddings into these structs, to achieve certain alignments. The example structure might thus in fact have a layout like this:

struct Vertex {
    short a;         // 2 bytes
    char PADDING_0;  // Padding byte
    char PADDING_1;  // Padding byte
    float x;         // 4 bytes
    float y;         // 4 bytes
    float z;         // 4 bytes
    short b;         // 2 bytes
    char PADDING_2;  // Padding byte
    char PADDING_3;  // Padding byte
};

Something like this may be done in order to make sure that the structures are aligned to 32-bit (4-byte) word boundaries. Moreover, there are certain pragmas and compiler directives that may influence this alignment. CUDA additionally prefers certain memory alignments, and therefore these directives are used heavily in the CUDA headers.

In short: When you define a struct in C and then print sizeof(YourStruct) (or the actual layout of the struct) to the console, you will have a hard time predicting what it will actually print. Expect some surprises.
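One can at least model what a compiler *typically* does. The following sketch assumes the common (but by no means guaranteed) rule that each member is aligned to its own size, and the total struct size is rounded up to the largest member alignment. Applied to the Vertex struct above, it reproduces the 20-byte padded layout:

```java
// Rough model of common C struct layout rules. This is an
// assumption for illustration, NOT a guarantee of what any
// particular compiler will do.
public class StructLayout {
    // Round 'offset' up to the next multiple of 'alignment'
    static int alignUp(int offset, int alignment) {
        return (offset + alignment - 1) / alignment * alignment;
    }

    // Member offsets for Vertex { short a; float x, y, z; short b; },
    // plus the total size as the last element.
    static int[] vertexLayout() {
        int[] sizes = {2, 4, 4, 4, 2}; // a, x, y, z, b
        int maxAlign = 4;              // widest member (float)
        int[] result = new int[sizes.length + 1];
        int offset = 0;
        for (int i = 0; i < sizes.length; i++) {
            offset = alignUp(offset, sizes[i]); // align each member to its size
            result[i] = offset;
            offset += sizes[i];
        }
        result[sizes.length] = alignUp(offset, maxAlign); // padded sizeof
        return result;
    }
}
```

Under these assumptions, `x` lands at offset 4 (not 2), and the total size is 20 bytes rather than the "naive" 16. The only reliable check, however, is to print the offsets and sizeof from the actual kernel compiler.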


In JCuda/Java, the world is different. There simply are no structs. When you create a Java class like

class Vertex {
    short a;
    float x;
    float y;
    float z;
    short b;
}

and then create an array of these

Vertex[] vertices = new Vertex[2];
vertices[0] = new Vertex();
vertices[1] = new Vertex();

then these Vertex objects may be arbitrarily scattered in memory. You don't even know how large one Vertex object is, and you will hardly be able to find out. Thus, trying to create an array of structures in JCuda and pass it to a CUDA kernel simply does not make sense.


However, as mentioned above: It is still possible, in some form. If you know the memory layout that your structures will have in the CUDA kernel, then you can create a memory block that is "compatible" with this structure layout, and fill it from Java side. For something like the struct Vertex mentioned above, this could roughly (involving some pseudocode) look like this:

// 1 short + 3 floats + 1 short, assuming no padding
int sizeOfVertex = 2 + 4 + 4 + 4 + 2;

// Allocate data for 2 vertices, in native byte order
ByteBuffer data = ByteBuffer.allocateDirect(sizeOfVertex * 2)
    .order(ByteOrder.nativeOrder());

// Set vertices[0].a and vertices[0].x and vertices[0].y
data.putShort(0, a0);
data.putFloat(2, x0);
data.putFloat(6, y0);

// Set vertices[1].a and vertices[1].x and vertices[1].y
data.putShort(sizeOfVertex + 0, a1);
data.putFloat(sizeOfVertex + 2, x1);
data.putFloat(sizeOfVertex + 6, y1);

// Copy the vertex data to the device
cudaMemcpy(deviceData, Pointer.to(data), sizeOfVertex * 2, cudaMemcpyHostToDevice);

It basically boils down to keeping the memory in a ByteBuffer, and to manually access the memory regions that correspond to the desired fields of the desired structs.
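The accesses can be wrapped in a small helper class. The sketch below assumes the 20-byte padded layout discussed earlier, with hard-coded offsets; the class name, the offsets, and the fixed size are assumptions that would have to match whatever layout the kernel compiler actually produces:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Host-side mirror of the (assumed) padded Vertex layout:
// a at offset 0, x/y/z at 4/8/12, b at 16, sizeof == 20.
public class VertexBuffer {
    static final int SIZEOF_VERTEX = 20; // must match the kernel's layout!

    final ByteBuffer data;

    VertexBuffer(int n) {
        // Direct buffer in native byte order, as the device expects
        data = ByteBuffer.allocateDirect(n * SIZEOF_VERTEX)
                         .order(ByteOrder.nativeOrder());
    }

    // Write one vertex at index i, using the assumed offsets
    void set(int i, short a, float x, float y, float z, short b) {
        int base = i * SIZEOF_VERTEX;
        data.putShort(base, a);      // offset 0 (2 padding bytes follow)
        data.putFloat(base + 4, x);
        data.putFloat(base + 8, y);
        data.putFloat(base + 12, z);
        data.putShort(base + 16, b);
    }

    float getX(int i) {
        return data.getFloat(i * SIZEOF_VERTEX + 4);
    }
}
```

The whole buffer could then be copied to the device in one call, e.g. with cudaMemcpy(deviceData, Pointer.to(data), n * SIZEOF_VERTEX, cudaMemcpyHostToDevice).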

However, a warning: You have to consider the possibility that this will not be perfectly portable across CUDA-C compiler versions or platforms. When you compile your kernel (containing the struct definition) once on a 32-bit Linux machine and once on a 64-bit Windows machine, the structure layout might be different, and your Java code would have to be aware of this.

(Note: One could define an interface to simplify these accesses. And for JOCL, I tried to create utility classes that feel a bit more like C structs and automate the copying process to some extent. But in any case, it will be inconvenient, and will not achieve really good performance, compared to plain C.)