I compute trajectories of N particles which move in their mutual gravitational field. I wrote the following OpenCL kernel:
#define G   100.0f
#define EPS 1.0f

/* Acceleration of the body at position r_me due to all other bodies. */
float2 f (float2 r_me, __constant float *m, __global float2 *r, size_t s, size_t n)
{
    size_t i;
    float2 res = (float2)(0.0f, 0.0f); /* the (float2) cast is required; a bare (0.0f, 0.0f) is the comma operator */
    for (i=1; i<n; i++) {
        size_t idx = i;
        // size_t idx = (i + s) % n;
        float2 dir = r[idx] - r_me;
        float dist = length (dir);
        res += G*m[idx]/pown(dist + EPS, 3) * dir;
    }
    return res;
}
__kernel void take_step_rk2 (__constant float *m,
                             __global float2 *r,
                             __global float2 *v,
                             float delta)
{
    size_t n = get_global_size(0);
    size_t s = get_global_id(0);
    /* Predictor (Euler) step of Heun's second-order scheme. */
    float2 mv = f(r[s], m, r, s, n);
    float2 mr = v[s];
    float2 vpred1 = v[s] + mv * delta;
    float2 rpred1 = r[s] + mr * delta;
    /* Corrector slopes evaluated at the predicted position. */
    float2 nv = f(rpred1, m, r, s, n);
    float2 nr = vpred1;
    /* Note: a barrier only synchronizes work items within one
       work-group, not across the whole NDRange. */
    barrier (CLK_GLOBAL_MEM_FENCE);
    r[s] += (mr + nr) * delta / 2;
    v[s] += (mv + nv) * delta / 2;
}
Then I run this kernel multiple times as a one-dimensional problem with global work size = [number of bodies]:
void take_step (struct cl_state *state)
{
    size_t n = state->nbodies;
    clEnqueueNDRangeKernel (state->queue, state->step, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish (state->queue);
}
This is a quote from the AMD OpenCL Optimization Guide (2015):
Under certain conditions, one unexpected case of a channel conflict is that reading from the same address is a conflict, even on the FastPath. This does not happen on the read-only memories, such as constant buffers, textures, or shader resource view (SRV); but it is possible on the read/write UAV memory or OpenCL global memory.
Work items in my NDRange all access the same memory locations in this loop, so there must be a channel conflict:
for (i=1; i<n; i++) {
    size_t idx = i;
    // size_t idx = (i + s) % n;
    float2 dir = r[idx] - r_me;
    float dist = length (dir);
    res += G*m[idx]/pown(dist + EPS, 3) * dir;
}
I replaced
size_t idx = i;
// size_t idx = (i + s) % n;
with
// size_t idx = i;
size_t idx = (i + s) % n;
so that the first work item (with global id 0) accesses the first element of the array r, the second work item accesses the second element, and so on.
I expected this change to improve performance, but on the contrary, it caused significant performance degradation (roughly by a factor of 2). What am I missing? Why is all-to-the-same-address access better in this situation?
If you have any other tips on how to improve performance, please share them. The OpenCL optimization guide is very confusing.
The loop in the f function has no barrier to reconverge the work items for coalesced access. Once some work items get their r data, they start computing, while those that could not must wait for theirs, so the coalescing integrity is lost. To regroup them, you could add a barrier, at least once per 10 iterations, or every 2 iterations, or maybe even every iteration. But access to global memory has high latency, and barrier + latency is bad for performance. You need local memory here, since it has low latency and a broadcasting ability, which lets it lose coalescedness only at granularities bigger than the local work-group size (64?); that is not bad for the global memory accesses either, because you fill local memory from global memory only on every K-th iteration, where N is divided into groups of size K (see the sketch below).
A source from 2013 (http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf) covers this:
Using LDS (__local) for this will give good performance. Since LDS is small, you should do it in small patches, like 256 particles at a time. Also, using i as idx is very cache friendly, but the modulus version is very cache hostile. Once data exists in the cache, it doesn't matter how many requests are made; they are served from the cache. With the modulus version, however, you evict cache contents before they are reused, depending on N. For small N it should be faster, as you foresaw; for big N, with a small GPU cache, it is much worse, something like 1 global request per cycle versus N - cache_size global requests per cycle.
I guess that with such a strong GPU you had a high N value, such as 64k bodies, which needs 2 variables per body at 4 bytes per variable, totaling 512 kB, which cannot fit in L1. Maybe it fits only in L2, which is slower than idx = i served through L1.
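Here is a minimal sketch of that blocking idea, assuming the work-group size equals the tile size and that n is a multiple of it; the TILE constant and the f_tiled name are my own illustration, not code from the question:

#define TILE 64 /* assumed to equal the local work size */

float2 f_tiled (float2 r_me, __constant float *m, __global float2 *r,
                __local float2 *r_tile, size_t n)
{
    float2 res = (float2)(0.0f, 0.0f);
    size_t lid = get_local_id(0);
    size_t base, j;
    for (base = 0; base < n; base += TILE) {
        /* One coalesced global read per work item fills the tile. */
        r_tile[lid] = r[base + lid];
        barrier (CLK_LOCAL_MEM_FENCE);
        /* All work items now read the same LDS element in lockstep:
           a broadcast, not a conflict. The self term adds zero since
           dir is (0, 0) there. */
        for (j = 0; j < TILE; j++) {
            float2 dir = r_tile[j] - r_me;
            float dist = length (dir);
            res += G*m[base + j]/pown(dist + EPS, 3) * dir;
        }
        barrier (CLK_LOCAL_MEM_FENCE);
    }
    return res;
}

The mass array m stays __constant, which per the guide you quoted does not suffer from the same-address conflict. The kernel would declare the tile as __local float2 tile[TILE]; (or take it as a __local argument) and pass it to f_tiled.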
In short:
- all work items hitting the same L1-cached address is faster than all of them hitting global memory or L2;
- use local memory in a "blocking/patching" algorithm to achieve high speed.
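For completeness, a hedged host-side sketch: the local work size must match TILE, nbodies is assumed to be a multiple of it, and arg index 4 assumes the __local buffer was appended to the kernel's parameter list:

void take_step_tiled (struct cl_state *state)
{
    size_t global = state->nbodies; /* assumed multiple of TILE */
    size_t local = 64;              /* must equal TILE in the kernel */
    /* A __local argument is set with size only and a NULL pointer. */
    clSetKernelArg (state->step, 4, local * sizeof(cl_float2), NULL);
    clEnqueueNDRangeKernel (state->queue, state->step, 1, NULL,
                            &global, &local, 0, NULL, NULL);
    clFinish (state->queue);
}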