It is my understanding that CUDA's memory model is hierarchical, with each streaming multiprocessor getting its own region of contiguous memory. I therefore expected that if a single thread accesses two indexes that are far enough apart, they'd be associated with different streaming multiprocessors, and even if that's not impossible, it ought to be slower.
And yet, that's not what I found. In the code below, which uses CuPy to set up the experiment but whose kernel is written in CUDA C++, a calculation involving two distant array indexes, data[i] and data[ii], takes the same time to within 50%.
- I know that the time is dominated by kernel execution, rather than by dispatch or Python overhead, because it scales with the number of loop iterations inside the kernel. (And my GPU temperature rises by 15°C.) There's also an event-timing cross-check sketched after this list.
- I know that the calculation in the loop can't be optimized away (I checked the equivalent x86 code on https://godbolt.org/).
- I know that any trends in the output depend only on the offset independent variable, because I randomized the order in which the offsets were measured.
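As a cross-check on the first point, the launch can also be timed with CUDA events, which measure only what runs on the GPU. This is just a sketch; it assumes the kernel, data, blocks_per_grid, and threads_per_block variables from the listing below.

start_evt, stop_evt = cp.cuda.Event(), cp.cuda.Event()
start_evt.record()                                   # marker enqueued before the launch
kernel((blocks_per_grid,), (threads_per_block,), (data, 1))
stop_evt.record()                                    # marker enqueued after the launch
stop_evt.synchronize()                               # wait for the GPU to reach the marker
print(cp.cuda.get_elapsed_time(start_evt, stop_evt), "ms on the GPU")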
Also, I'm aware that the results depend on a race condition: some threads may be writing to an array element before or after other threads read from it. In this experiment, I'm only measuring the time, not correctness. (A race-free variant is sketched after the listing.)
import random
import time

import numpy as np
import cupy as cp

billyon = 1024**3   # 2**30 int32 elements -> a 4 GiB array

kernel = cp.RawKernel(fr"""
typedef signed int int32_t;

extern "C" __global__
void kernel(int32_t* data, int offset) {{
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    int ii = (i + offset) % {billyon};
    for (int _ = 0; _ < 100; _++) {{
        data[ii] += data[i];   // depends on indexes that differ by 'offset'
        data[ii] /= 3;
    }}
}}""", "kernel")

data = cp.arange(billyon, dtype=cp.int32)
threads_per_block = kernel.attributes["max_threads_per_block"]   # this is 1024
blocks_per_grid = (len(data) + (threads_per_block - 1)) // threads_per_block

# [offset, time] pairs: offsets just below, exactly at, and just above powers of 2
results = [[2**n - 1, None] for n in range(0, 21)]
results += [[2**n + 0, None] for n in range(0, 21)]
results += [[2**n + 1, None] for n in range(0, 21)]

# measure time-to-complete for each 'offset' in a random order
for i in random.sample(range(len(results)), len(results)):
    offset = results[i][0]
    start = time.monotonic()
    kernel((blocks_per_grid,), (threads_per_block,), (data, offset))
    cp.cuda.Stream.null.synchronize()   # wait for the kernel to finish
    stop = time.monotonic()
    results[i][1] = stop - start

for offset, result in results:
    print(f"{str(offset):8s} {result}")
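As an aside, the race could be removed without changing the two-address access pattern by having each thread write only to its own slot of a separate output array. This is an untimed sketch; kernel_norace and out are placeholder names, and it reuses billyon, data, and cp from the listing above.

kernel_norace = cp.RawKernel(fr"""
typedef signed int int32_t;

extern "C" __global__
void kernel_norace(const int32_t* data, int32_t* out, int offset) {{
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    int ii = (i + offset) % {billyon};
    int32_t acc = data[i];
    for (int _ = 0; _ < 100; _++) {{
        acc += data[ii];   // still touches two indexes that differ by 'offset'
        acc /= 3;
    }}
    out[i] = acc;          // each thread writes only its own element
}}""", "kernel_norace")

out = cp.empty_like(data)   # separate output array, so reads and writes never collide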
Running the experiment above prints:
0 0.4711587410001812
1 0.6806426660004945
3 0.6757751169998301
7 0.680451019999964
15 0.6785046710001552
31 0.677074382999308
63 0.6586064880002596
127 0.6477001470002506
255 0.6375246979996518
511 0.6210620290003135
1023 0.5479212069994901
2047 0.53610887800005
4095 0.5359455879997768
8191 0.5361576279992732
16383 0.535943772999417
32767 0.5361106939999445
65535 0.532854672999747
131071 0.5325410839996039
262143 0.5357767109999259
524287 0.5369865089996892
1048575 0.5330084649995115
1 0.6803918630002954
2 0.6803237680005623
4 0.6803114760004974
8 0.6009420999998838
16 0.611940212999798
32 0.5155299479993118
64 0.5243394320004882
128 0.4690608370001428
256 0.46456107500034705
512 0.4617370249998203
1024 0.47083607199965627
2048 0.47074143800000456
4096 0.47017746600067767
8192 0.4700965889996951
16384 0.4682883170007699
32768 0.4705678110003646
65536 0.47058820399979595
131072 0.4701041320004151
262144 0.47007004900024185
524288 0.4720496420004565
1048576 0.47133613899950433
2 0.6800517339997896
3 0.6804547910005567
5 0.6759346360004201
9 0.6780069090000325
17 0.6807731999997486
33 0.6569047249995492
65 0.6672576410001057
129 0.6313282720002462
257 0.6193459490004898
513 0.5945520719997148
1025 0.5358869210003832
2049 0.535689688000275
4097 0.535666919000505
8193 0.5326644940005281
16385 0.535492455000167
32769 0.5357592509999449
65537 0.5357777580002221
131073 0.5352709169992522
262145 0.5355789889999869
524289 0.5369978230000925
1048577 0.5360762620002788
While there's something interesting going on here (having an offset just below or just above a power of 2 is up to ~50% worse), it's not orders of magnitude worse, which is what I would have expected as the consequence of ignoring data locality.
Is it really the case that we can write CUDA code that ignores data locality (assuming we avoid race conditions) and the worst that can happen is a ~50% slowdown?
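For reference, the "50%" figure is just the slowest measurement divided by the fastest. Assuming the results list from the script above is still in memory, it can be checked with:

times = [t for _, t in results]
print(f"slowest / fastest: {max(times) / min(times):.2f}")   # about 1.47 with the numbers above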
Oh, for completeness: I tested this on an NVIDIA GeForce RTX 3060 (Compute Capability 8.6, PCI bus) with 12 GiB of memory (for this 4 GiB test), hardware driver version 525.89.02, CUDA driver version 12.0, and CUDA runtime version 11.8.
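Most of those numbers can be read back through CuPy (the 525.89.02 hardware driver version itself comes from nvidia-smi rather than CuPy). A quick sketch:

import cupy as cp

props = cp.cuda.runtime.getDeviceProperties(0)   # properties of device 0
print(props["name"].decode(),
      f'CC {props["major"]}.{props["minor"]},',
      f'{props["totalGlobalMem"] / 1024**3:.0f} GiB')
print("CUDA driver version: ", cp.cuda.runtime.driverGetVersion())    # 12000 -> 12.0
print("CUDA runtime version:", cp.cuda.runtime.runtimeGetVersion())   # 11080 -> 11.8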
