I'm a beginner at CUDA programming and have a question.
When I pass parameters by value, like this:
__global__ void add(int a, int b, int *c) {
// some operations
}
Since the variables a and b are passed to the kernel function add by value, as copies on the call stack, I guessed that some memory space would be needed to hold those copies.
If I'm right, is that additional memory space, where those parameters are copied, on the GPU or in the host's main memory?
The reason I wonder about this is that I need to pass a big struct to a kernel function.
I also thought about passing a pointer to the struct, but that approach seems to require calling cudaMalloc for the struct and for each of its member variables, as in the sketch below.
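For concreteness, here is roughly what I mean by the pointer approach (a sketch with made-up names and sizes):

#include <cuda_runtime.h>

struct BigStruct {
    int n;
    float *data;   // pointer member: the buffer it points to needs its own allocation
};

__global__ void process(BigStruct *s) { /* some operations */ }

int main() {
    BigStruct h_s, *d_s;
    float *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(float));  // one cudaMalloc per pointer member
    h_s.n = 1024;
    h_s.data = d_data;                          // the struct must hold the device pointer
    cudaMalloc(&d_s, sizeof(BigStruct));        // plus one for the struct itself
    cudaMemcpy(d_s, &h_s, sizeof(BigStruct), cudaMemcpyHostToDevice);
    process<<<1, 1>>>(d_s);
    cudaFree(d_s);
    cudaFree(d_data);
    return 0;
}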
The very short answer is that all arguments to CUDA kernels are passed by value, and those arguments are copied by the host via an API into a dedicated argument buffer on the GPU. At present, this buffer is stored in constant memory and there is a limit of 4 KB of arguments per kernel launch -- see here.
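This is why a sufficiently small struct can simply be passed by value, with no cudaMalloc for the struct itself. A minimal sketch, assuming a plain-old-data struct with no pointer members:

#include <cstdio>
#include <cuda_runtime.h>

struct Params {   // plain-old-data, well under the 4 KB argument limit
    int a;
    int b;
};

__global__ void add(Params p, int *c) {
    // p was copied into the driver-managed argument buffer at launch time
    *c = p.a + p.b;
}

int main() {
    int h_c, *d_c;
    cudaMalloc(&d_c, sizeof(int));
    Params p = {2, 3};
    add<<<1, 1>>>(p, d_c);   // struct passed by value: no cudaMalloc needed for p
    cudaMemcpy(&h_c, d_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h_c);     // prints 5
    cudaFree(d_c);
    return 0;
}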
In more detail, the PTX standard (technically since compute capability 2.0 hardware and the CUDA ABI appeared) defines a dedicated logical state space called .param which holds kernel and device parameter arguments. See here. That documentation describes kernel parameters as addressable, read-only variables declared in the .param state space, and it further notes that the location of the parameter space is implementation specific.
So the precise location of the parameter state space is implementation specific. In the first iteration of CUDA hardware, it actually mapped to shared memory for kernel arguments and to registers for device function arguments. However, since compute capability 2.0 hardware and the PTX 2.2 standard, it maps to constant memory for kernels under most circumstances: beyond the 64 KB of constant memory available for statically sized constant variables, the documentation allows the driver to allocate and initialize constant buffers in additional regions of constant memory and to pass pointers to those buffers as kernel function parameters.
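You can see the .param state space directly by compiling with nvcc -ptx and inspecting the output. A sketch of roughly what the kernel from the question compiles to (register numbers and exact layout vary by toolkit version; the body is elided):

// nvcc -ptx add.cu  -- illustrative output for __global__ void add(int, int, int*)
.visible .entry _Z3addiiPi(
    .param .u32 _Z3addiiPi_param_0,   // int a   -- declared in the .param state space
    .param .u32 _Z3addiiPi_param_1,   // int b
    .param .u64 _Z3addiiPi_param_2    // int *c
)
{
    ld.param.u32  %r1,  [_Z3addiiPi_param_0];  // reads come from the parameter space,
    ld.param.u32  %r2,  [_Z3addiiPi_param_1];  // which the hardware backs with constant
    ld.param.u64  %rd1, [_Z3addiiPi_param_2];  // memory on compute 2.0+ devices
    ...
}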
So while kernel arguments are stored in constant memory, this is not the same constant memory which maps to the .const state space, accessible by defining a variable as __constant__ in CUDA C or the equivalent in Fortran or Python. Rather, it is an internal pool of device memory managed by the driver and not directly accessible to the programmer.
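To make the distinction concrete, here is a minimal sketch contrasting the two pools: the __constant__ variable lives in the user-visible .const state space and is set with cudaMemcpyToSymbol, while the kernel argument is copied into the driver-managed buffer by the launch itself (names and values are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

__constant__ int limit;   // user-visible .const state space

__global__ void clamp(int x, int *out) {
    // x arrives through the driver-managed argument buffer; limit through .const.
    // Both physically live in constant memory, but in separately managed pools.
    *out = (x > limit) ? limit : x;
}

int main() {
    int h_limit = 10, h_out, *d_out;
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpyToSymbol(limit, &h_limit, sizeof(int));  // programmer-managed constant pool
    clamp<<<1, 1>>>(42, d_out);   // 42 is copied into the argument buffer by the launch
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h_out);        // prints 10
    cudaFree(d_out);
    return 0;
}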