I'm beginning to architect my first serious OpenCL program, and I want to make sure I understand how my AMD R9 290x (GCN 2.0 architecture) is set up. So I'll just lay out what I understand, and hopefully someone out there can tell me where I'm right or wrong.
It seems to me that a major bottleneck for optimized kernels is being memory-bound. I don't really want to do "premature" optimization here, but at least thinking about memory seems very important for OpenCL code in general. (See "Sorting with GPUs: A Survey", Arkhipov et al.)
According to AMD's optimization guide, each vector unit executes a 64-work-item wavefront every 4 clocks, and each work-item has access to up to 256 32-bit vGPRs. In effect, __private data is (ideally) stored in vGPRs.
That works out to 1 KiB (256 registers x 4 bytes) of "easy-access" register space per work-item for OpenCL kernels.
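To make that concrete, here's a minimal kernel sketch of what I mean by "easy-access" storage. The accumulator array `acc` is just something I made up for illustration, and whether it really stays in vGPRs is ultimately up to the compiler:

```c
/* Sketch: small __private data that the GCN compiler should be able to
 * keep entirely in vGPRs (8 floats = 32 bytes, well under the 1 KiB
 * per-work-item register budget). If __private usage grows past what
 * the register file can hold, it spills to "scratch" (global memory),
 * which is exactly the slow path to avoid. */
__kernel void private_demo(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    float acc[8];                      /* __private by default */

    for (int i = 0; i < 8; ++i)        /* compile-time bounds: unrollable */
        acc[i] = in[gid * 8 + i] * 2.0f;

    float sum = 0.0f;
    for (int i = 0; i < 8; ++i)
        sum += acc[i];

    out[gid] = sum;
}
```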
It seems like the LDS (aka __local) can be used as efficient storage for OpenCL kernels, but its primary design seems to be communicating data across the work-group. Since the 64 KiB of LDS on a compute unit is shared between 4 vector units (each of which ideally has at least one 64-work-item wavefront executing), there are at most 256 bytes of LDS per work-item (65536 / (4 x 64)) if the system is to run as "wide" as possible.
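To illustrate the communication role (as opposed to the pure-storage role), here's a sketch of a work-group staging data through __local. It assumes a local size of 64 (one wavefront), which is just a choice I've made for the example:

```c
/* Sketch: each work-item stages one value into LDS, then reads its
 * neighbour's value. This uses only 4 bytes of LDS per work-item,
 * well under the 256-byte ceiling computed above. The barrier is what
 * makes LDS a cross-work-item channel. Assumes local size == 64. */
__kernel void lds_demo(__global const float *in, __global float *out)
{
    __local float tile[64];           /* one wavefront's worth */
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    tile[lid] = in[gid];              /* stage one value per work-item */
    barrier(CLK_LOCAL_MEM_FENCE);     /* make stores visible group-wide */

    /* read the next work-item's element (wrapping around) */
    out[gid] = tile[(lid + 1) % 64];
}
```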
So with some careful __local and __private variable allocation, it seems like 1280 "quick-access" bytes (1024 + 256) are available per work-item. I guess the 16 KiB L1 cache grants another 64 bytes per work-item, but that would take some careful juggling to use. (I originally thought L1 also holds the code / instructions, but GCN appears to keep instructions in a separate instruction cache shared between compute units, so the 16 KiB vector L1 is data-only.) "__constant" space seems to be served by a separate scalar/constant cache, and the "multiplexing" magic seems to be that when every work-item in a wavefront reads the same index, the fetch can be done once by the scalar unit and broadcast to all 64 lanes.
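Here's a sketch of the uniform-index case I'm describing. The `coeffs` table is a hypothetical set of filter weights I invented for the example, and I'm assuming the caller pads `in` so that `gid + k` stays in bounds:

```c
/* Sketch: because the loop index k is the same across the whole
 * wavefront, each coeffs[k] read is uniform, so (as I understand it)
 * the compiler can turn it into a single scalar fetch broadcast to all
 * 64 lanes rather than 64 separate vector loads. */
__kernel void constant_demo(__constant float *coeffs,   /* 9 weights, hypothetical */
                            __global const float *in,
                            __global float *out)
{
    size_t gid = get_global_id(0);
    float sum = 0.0f;

    for (int k = 0; k < 9; ++k)           /* k is uniform across the wave */
        sum += coeffs[k] * in[gid + k];   /* coeffs[k]: one scalar fetch  */

    out[gid] = sum;
}
```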
None of the above calculations take into account the theoretical 10 wavefronts per vector unit (which share the vGPR file). If each compute unit of the R9 290x actually had 40 wavefronts in flight (10 per vector unit x 4 vector units), each work-item would be down to only 100 bytes (25 vGPRs) of fast-access memory.
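As far as I know there's no portable way to query vGPR usage directly, but standard OpenCL can at least report how the compiled kernel constrains work-group size and how much LDS it statically consumes. Here's a host-side sketch (the `kernel` and `device` handles are assumed to already exist, and error checking is omitted for brevity):

```c
/* Sketch using the standard OpenCL 1.x host API: after building a
 * kernel, query the limits the runtime derived from its resource usage.
 * A kernel that burns lots of vGPRs/LDS will show up here as a reduced
 * maximum work-group size. */
#include <stdio.h>
#include <CL/cl.h>

void report_kernel_limits(cl_kernel kernel, cl_device_id device)
{
    size_t max_wg_size = 0;
    cl_ulong local_mem = 0;

    /* Largest work-group this kernel can be launched with on this device. */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg_size), &max_wg_size, NULL);

    /* LDS the kernel statically consumes per work-group. */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_mem), &local_mem, NULL);

    printf("max work-group size: %zu, static LDS: %llu bytes\n",
           max_wg_size, (unsigned long long)local_mem);
}
```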
So... is this the correct understanding of how much "quick" memory each work-item gets on a GCN device like the R9 290x? If we consider vGPRs, L1, and LDS to be the totality of "quick" storage, we're only looking at 1024 bytes (vGPRs) + 64 bytes (L1) + 256 bytes (LDS) = 1344 bytes per work-item to work with.
I do realize that kernels can reach out to global memory (~4 GB on a typical R9 290x, which is very roughly ~1 MB per concurrently-executing work-item). Since global memory on a GPU is still highly parallelized GDDR5, I'd expect it to be pretty fast, but it's still an order of magnitude slower than vGPRs / LDS / L1. So I'd expect optimal programs to try to avoid global RAM when they don't need it.
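That said, when global memory is unavoidable, the access pattern seems to matter almost as much as the volume. Here's a sketch contrasting a coalesced copy with a strided one (the `stride` parameter is just for illustration):

```c
/* Sketch: consecutive work-items touching consecutive addresses let the
 * hardware coalesce the wavefront's accesses into a few wide GDDR5
 * transactions; a strided pattern scatters them across many partial
 * transactions and wastes bandwidth. */
__kernel void copy_coalesced(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid];              /* lane i -> address i: coalesced */
}

__kernel void copy_strided(__global const float *in, __global float *out,
                           int stride)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid * stride];     /* scattered: many more transactions */
}
```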