Spatial locality in CUDA loops


I was reading An Even Easier Introduction to CUDA, and I was thinking about examples like this:

__global__
void add(int n, float *x, float *y)
{
  int index = threadIdx.x;   // this thread's position within the block
  int stride = blockDim.x;   // total number of threads in the block
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

Here each thread strides through the whole array. In conventional CPU multithreading, one would instead split the array into contiguous sub-arrays, one per thread, so that each thread can better exploit spatial locality.
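For contrast, here is a hypothetical CPU-style partitioning (a sketch, not from the tutorial), in which thread t of nthreads processes its own contiguous chunk:

#include <algorithm>

// Hypothetical chunked partitioning: each thread gets one contiguous sub-array.
void add_chunk(int n, float *x, float *y, int t, int nthreads)
{
  int chunk = (n + nthreads - 1) / nthreads;   // ceiling division
  int begin = t * chunk;
  int end   = std::min(begin + chunk, n);
  for (int i = begin; i < end; i++)
      y[i] = x[i] + y[i];
}

Each CPU thread then walks sequentially through its chunk, which is friendly to its private cache lines and the hardware prefetcher.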

Does this concept apply to CUDA's unified memory as well? I would like to understand what the most efficient approach would be in such a situation.

Answer by Robert Crovella (accepted):

The reason a grid-stride loop is beneficial for memory access is that it promotes "coalesced" access to global memory. In a nutshell, coalesced access means that on any given read or write operation, considered warp-wide, adjacent threads in the warp access adjacent locations in memory.

The grid-stride loop arranges the indices across the warp to promote this pattern.
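For reference, the full grid-stride version of this kernel, presented later in the same tutorial, incorporates the block index, so consecutive threads across the whole grid get consecutive indices:

__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;  // globally unique thread index
  int stride = blockDim.x * gridDim.x;                // total number of threads in the grid
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

On any one iteration, the 32 threads of a warp read x[i] through x[i+31] (and likewise for y), which the hardware can service with a minimal number of memory transactions.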

This is orthogonal to whether the memory was allocated with an "ordinary" device allocator (e.g. cudaMalloc) or a "unified" allocator (e.g. cudaMallocManaged). In either case, the best way for device code to access such an allocation is with coalesced access.
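As a minimal host-side sketch (sizes and launch parameters are illustrative, and it assumes the grid-stride add kernel above), managed memory is used the same way:

#include <cuda_runtime.h>

int main()
{
  int n = 1 << 20;
  float *x, *y;

  // "Unified" allocations, accessible from both host and device.
  cudaMallocManaged(&x, n * sizeof(float));
  cudaMallocManaged(&y, n * sizeof(float));

  for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  int blockSize = 256;
  int numBlocks = (n + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(n, x, y);  // same coalesced device-side access pattern as with cudaMalloc
  cudaDeviceSynchronize();

  cudaFree(x);
  cudaFree(y);
  return 0;
}

Swapping cudaMallocManaged for cudaMalloc (plus explicit cudaMemcpy calls) would change how the data gets to the device, but not how the kernel should index it.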

You didn't ask about it, but CUDA shared memory also has an optimal access pattern in which adjacent threads in the warp access adjacent locations in (shared) memory.
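A hypothetical illustration (assuming blockDim.x <= 256): each thread stages one element into shared memory at its own threadIdx.x, so adjacent threads in a warp touch adjacent 4-byte words, which land in distinct banks and avoid bank conflicts:

__global__ void scale_via_shared(const float *in, float *out, int n, float a)
{
  __shared__ float tile[256];            // one element per thread

  int i = blockIdx.x * blockDim.x + threadIdx.x;

  // Adjacent threads write adjacent shared-memory words: conflict-free.
  tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
  __syncthreads();                       // barrier reached by all threads in the block

  if (i < n)
      out[i] = a * tile[threadIdx.x];    // adjacent reads, also conflict-free
}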