I was reading the Even Easier Introduction to CUDA, and I was thinking about examples like this:
__global__
void add(int n, float *x, float *y)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
Here each thread strides through the array. In normal CPU computing, one would instead split the array into contiguous sub-arrays, one per thread, so that each thread can better exploit spatial locality.
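For concreteness, the CPU-style partitioning I have in mind would look roughly like this as a kernel (a hypothetical add_chunked variant, assuming a single-block launch as in the tutorial, just to illustrate the contrast):

__global__
void add_chunked(int n, float *x, float *y)
{
    // Hypothetical alternative: give each thread one contiguous chunk,
    // the way one would typically partition work across CPU threads.
    int chunk = (n + blockDim.x - 1) / blockDim.x;   // elements per thread
    int begin = threadIdx.x * chunk;
    int end   = min(begin + chunk, n);
    for (int i = begin; i < end; i++)
        y[i] = x[i] + y[i];
}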
Does this concept apply to CUDA's unified memory as well? I would like to understand what the most efficient approach would be in such a situation.
The reason a grid-stride loop is beneficial for memory access is that it promotes "coalesced" access to global memory. In a nutshell, coalesced access means that, on any given read or write operation considered warp-wide, adjacent threads in the warp access adjacent locations in memory.
The grid-stride loop arranges the indices across the warp to promote this pattern.
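For example, in the full grid-stride form of your kernel (the version the tutorial arrives at), threads 0..31 of a warp touch elements i, i+1, ..., i+31 on each iteration, so their accesses coalesce:

__global__
void add(int n, float *x, float *y)
{
    // On each iteration, adjacent threads of a warp read/write adjacent
    // elements of x and y, so the accesses coalesce into few transactions.
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

By contrast, the "contiguous chunk per thread" scheme makes adjacent threads access locations that are a chunk apart on every iteration, which breaks coalescing and wastes memory bandwidth on the GPU.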
This is orthogonal to whether the memory was allocated with an "ordinary" device allocator (e.g. cudaMalloc) or a "unified" allocator (e.g. cudaMallocManaged). In either case, the best way for device code to access such an allocation is using coalesced access.

You didn't ask about it, but CUDA shared memory also has one of its "optimal access patterns" consisting of adjacent threads in the warp accessing adjacent locations in (shared) memory.
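As a minimal sketch (a hypothetical reverse_tile kernel, assuming blockDim.x == 256 and a data length that is a multiple of the block size), having adjacent threads read and write adjacent shared-memory locations also avoids bank conflicts:

__global__
void reverse_tile(float *data)
{
    __shared__ float tile[256];        // assumes blockDim.x == 256

    int t = threadIdx.x;
    int g = blockIdx.x * blockDim.x + t;

    // Adjacent threads write adjacent shared-memory words:
    // thread 0 -> tile[0], thread 1 -> tile[1], ... (no bank conflicts).
    tile[t] = data[g];
    __syncthreads();

    // Consecutive threads read consecutive addresses (in reverse order
    // across the block), which is also conflict-free.
    data[g] = tile[blockDim.x - 1 - t];
}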