Summary:
I'm trying to write a memory-bound OpenCL program that comes close to the advertised memory bandwidth of my GPU. In reality I'm off by a factor of ~50.
Setup:
I only have a relatively old Polaris card (RX 580), so I can't use CUDA and have to settle for OpenCL for now. I know this is suboptimal, and I can't get any debugging/performance counters to work, but it's all I have.
I'm new to GPU computing and want to get a feel for the performance I can expect from the GPU vs. the CPU. The first thing to work on for me is memory bandwidth.
I wrote a very small OpenCL kernel that reads from strided memory locations in such a way that all workers in a wavefront together perform contiguous memory accesses over a large memory segment, coalescing the accesses. All the kernel then does with the loaded data is sum the values up and write the sum back to another memory location at the very end. The code (which I shamelessly copied together from various sources for the most part) is quite simply:
__kernel void ThroughputTestKernel(
__global float* vInMemory,
__global float* vOutMemory,
const int iNrOfIterations,
const int iNrOfWorkers
)
{
const int gtid = get_global_id(0);
__private float fAccumulator = 0.0f;
for (int k = 0; k < iNrOfIterations; k++) {
fAccumulator += vInMemory[gtid + k * iNrOfWorkers]; // stride of iNrOfWorkers: in each iteration the whole wavefront reads consecutive addresses
}
vOutMemory[gtid] = fAccumulator;
}
I spawn iNrOfWorkers of these kernels and measure the time it takes them to finish processing. For my tests I set iNrOfWorkers = 1024 and iNrOfIterations = 64*1024. From the processing time and iMemorySize = iNrOfWorkers * iNrOfIterations * sizeof(float) I calculate a memory bandwidth of around 5 GByte/s.
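For reference, the host-side measurement boils down to something like the following sketch (the OpenCL context, queue, kernel and buffer setup is omitted, error checking is left out, and the function name is just a placeholder):

#include <chrono>
#include <CL/cl.h>

// Simplified sketch of the measurement described above (hypothetical helper).
double MeasureBandwidthGBs(cl_command_queue queue, cl_kernel kernel)
{
    const size_t iNrOfWorkers = 1024;
    const size_t iNrOfIterations = 64 * 1024;
    const size_t iMemorySize = iNrOfWorkers * iNrOfIterations * sizeof(float); // 256 MiB read in total

    const auto tStart = std::chrono::high_resolution_clock::now();
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &iNrOfWorkers, NULL, 0, NULL, NULL);
    clFinish(queue); // block until the kernel has finished
    const auto tEnd = std::chrono::high_resolution_clock::now();

    const double fSeconds = std::chrono::duration<double>(tEnd - tStart).count();
    return (double)iMemorySize / fSeconds * 1.0E-9; // bytes/s -> GByte/s
}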
Expectations:
My problem is that memory accesses seem to be one to two orders of magnitude slower than the 256 GByte/s that I was led to believe I have available.
The GCN ISA Manual [1] has me assuming that I have 36 CUs, each of which contains 4 SIMD units, each of which processes vectors of 16 elements. Therefore I should have 36 * 4 * 16 = 2304 processing elements available.
I spawn fewer than that, i.e. 1024 global work units ("threads"). The threads access memory locations in order, 1024 locations apart, so that in each iteration of the loop the entire wavefront accesses 1024 consecutive elements. Therefore I believe that the GPU should be able to produce consecutive memory address accesses with no breaks in between.
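To put the factor of ~50 in numbers, here is a quick sanity check with the sizes from above (just illustrative arithmetic, not part of my benchmark code):

#include <cstdio>

// Expected vs. measured transfer time for the 256 MiB that the kernel reads.
int main()
{
    const double fBytes = 1024.0 * 64.0 * 1024.0 * 4.0;             // iNrOfWorkers * iNrOfIterations * sizeof(float)
    printf("at 256 GByte/s: %6.2f ms\n", fBytes / 256.0E9 * 1.0E3); // ~1 ms, what the spec sheet suggests
    printf("at   5 GByte/s: %6.2f ms\n", fBytes / 5.0E9 * 1.0E3);   // ~54 ms, what I actually measure
    return 0;
}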
My guess is that, instead of 1024, it only spawns very few threads, one per CU maybe? That way it would have to re-read the data over and over again. I don't know how I would be able to verify that, though.
[1] http://developer.amd.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf
A few issues with your approach:
- Unroll the loop with #pragma unroll; then the compiler does all the index calculation already. You can also bake the constants iNrOfIterations and iNrOfWorkers right into the OpenCL code with #define iNrOfIterations 16 / #define iNrOfWorkers 15728640 via C++ string concatenation or by hardcoding (the kernel sketch below bakes the constants in this way).
- There are 4 different memory bandwidths based on your access pattern: coalesced/misaligned reads/writes. Coalesced access is much faster than misaligned access, and the performance penalty for misaligned reads is smaller than for misaligned writes. Only coalesced memory access gets you anywhere near the advertised bandwidth. You measure iNrOfIterations coalesced reads and 1 coalesced write. To measure all four types separately, you can use one kernel per access type, as in the sketch below.
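A minimal sketch of four such kernels, assuming the constants are baked in with #define as described above (the names N, M and benchmark_1 ... benchmark_4 are placeholders; here kernel 1 does coalesced writes, kernel 2 coalesced reads, kernel 3 misaligned writes and kernel 4 misaligned reads):

#define N 15728640 // number of work items, baked in via string concatenation
#define M 16       // reads/writes per work item

__kernel void benchmark_1(__global float* data) { // coalesced write
    const uint n = get_global_id(0);
    #pragma unroll
    for (uint i = 0u; i < M; i++) data[i * N + n] = 0.0f; // stride N: the wavefront writes consecutive addresses
}
__kernel void benchmark_2(__global float* data) { // coalesced read
    const uint n = get_global_id(0);
    float x = 0.0f;
    #pragma unroll
    for (uint i = 0u; i < M; i++) x += data[i * N + n]; // stride N: the wavefront reads consecutive addresses
    data[n] = x; // one coalesced write so the reads are not optimized away
}
__kernel void benchmark_3(__global float* data) { // misaligned write
    const uint n = get_global_id(0);
    #pragma unroll
    for (uint i = 0u; i < M; i++) data[n * M + i] = 0.0f; // each work item writes its own contiguous block
}
__kernel void benchmark_4(__global float* data) { // misaligned read
    const uint n = get_global_id(0);
    float x = 0.0f;
    #pragma unroll
    for (uint i = 0u; i < M; i++) x += data[n * M + i]; // each work item reads its own contiguous block
    data[n] = x; // one coalesced write so the reads are not optimized away
}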
Here the data array has the size N*M and each kernel is executed across the range N. For the bandwidth calculation, execute each kernel a few hundred times (for a better average) and get the average execution times time1, time2, time3 and time4 of the four kernels. With the kernel numbering from the sketch above, the bandwidths are then computed like this:
coalesced read bandwidth: 4.0E-9f*M*N/(time2-time1/M)
coalesced write bandwidth: 4.0E-9f*M*N/(time1)
misaligned read bandwidth: 4.0E-9f*M*N/(time4-time1/M)
misaligned write bandwidth: 4.0E-9f*M*N/(time3)
The factor 4.0E-9f converts the number of floats to GBytes, and the time1/M term subtracts the cost of the single coalesced write per work item that the read kernels perform to keep the compiler from optimizing the reads away.
For reference, here are a few bandwidth values measured with this benchmark.
Edit: How to measure kernel execution time: average the runtime over K executions of a kernel.
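A minimal sketch of one way to do this with std::chrono on the host (queue and kernel are assumed to be an already created cl_command_queue and cl_kernel; the function name is a placeholder and error checking is omitted):

#include <chrono>
#include <CL/cl.h>

// Average the wall-clock runtime of a kernel over K back-to-back launches.
double average_kernel_time(cl_command_queue queue, cl_kernel kernel, size_t globalSize, int K)
{
    clFinish(queue); // make sure no earlier work is still in flight
    const auto t0 = std::chrono::high_resolution_clock::now();
    for (int k = 0; k < K; k++) {
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
    }
    clFinish(queue); // wait until all K launches have completed
    const auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / (double)K; // seconds per launch
}

Calling this once per benchmark kernel, with globalSize = N and K in the order of a few hundred, yields the averaged times time1 to time4 used in the bandwidth formulas above.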