Determine limiting factor of OpenCL workgroup size?


I am trying to run some OpenCL kernels written for desktop graphics cards on an embedded GPU with fewer resources. In particular, the desktop version assumes a work-group size of at least 256 is always supported, but the ARM Mali-T628 GPU only guarantees a work-group size of at least 64.

Indeed, some kernels report CL_KERNEL_WORK_GROUP_SIZE of only 64, and I can't figure out why. I checked the CL_KERNEL_LOCAL_MEM_SIZE for the kernels in question and it is <2 KiB, whereas the CL_DEVICE_LOCAL_MEM_SIZE is 32 KiB, so I think I can rule out __local storage.

What other factors (e.g., registers/__private memory?) contribute to a low CL_KERNEL_WORK_GROUP_SIZE, and how do I check usage? I am open to both programmatic introspection (such as clGetKernelWorkGroupInfo(), which I have already done some of; see the sketch below) and any development tools I may not know about.
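
For reference, the per-kernel introspection I have done so far is roughly along the lines of the sketch below (error checking omitted; kernel and device are assumed to be valid handles created elsewhere in the program):

#include <stdio.h>
#include <CL/cl.h>

/* Rough sketch of the per-kernel queries (error checking omitted;
   kernel and device are assumed to be valid handles). */
static void print_kernel_limits(cl_kernel kernel, cl_device_id device)
{
    size_t wg_size = 0;
    cl_ulong local_mem = 0, private_mem = 0;

    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(wg_size), &wg_size, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_mem), &local_mem, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(private_mem), &private_mem, NULL);

    printf("CL_KERNEL_WORK_GROUP_SIZE  = %zu\n", wg_size);
    printf("CL_KERNEL_LOCAL_MEM_SIZE   = %llu bytes\n", (unsigned long long)local_mem);
    printf("CL_KERNEL_PRIVATE_MEM_SIZE = %llu bytes\n", (unsigned long long)private_mem);
}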

EDIT:

The kernels are part of the OpenCL (ocl) module of OpenCV 2.4. In particular, the kernel is icvCalcOrientation in surf.cl. The code is fairly complex and several compile-time parameters are set, which makes it infeasible to manually analyze the kernel for the issue without some hint of what to look for.

If there is a way to troubleshoot this on NVidia or AMD hardware (which I have access to), I am open to it.


There are 2 answers

Baiz On

EDIT

Since my previous answer was plainly wrong, I need more info on the problem.

By saying "some kernels report CL_KERNEL_WORK_GROUP_SIZE of only 64" you are implying that kernels exist for which a larger work-group size is available. Is that the case? If not, then the answer unfortunately is that the device is simply not capable of supporting more than 64 work-items.

Could you please query all available information from the device and the kernel, after setting all kernel arguments and before executing the kernel. The parameters to query (mostly taken from (Source)) are listed below; see the sketch after the list:

  • CL_DEVICE_GLOBAL_MEM_SIZE
  • CL_DEVICE_LOCAL_MEM_SIZE
  • CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
  • CL_DEVICE_MAX_MEM_ALLOC_SIZE
  • CL_DEVICE_MAX_WORK_GROUP_SIZE
  • CL_DEVICE_MAX_WORK_ITEM_SIZES
  • CL_KERNEL_WORK_GROUP_SIZE
  • CL_KERNEL_LOCAL_MEM_SIZE
  • CL_KERNEL_PRIVATE_MEM_SIZE

There might be more, but currently none come to mind.
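
For the device-side values, a minimal sketch of how I would query them (error checking omitted; device is the cl_device_id of your Mali GPU, and I am assuming the usual 3 work-item dimensions):

#include <stdio.h>
#include <CL/cl.h>

static void print_device_limits(cl_device_id device)
{
    cl_ulong global_mem = 0, local_mem = 0, const_buf = 0, max_alloc = 0;
    size_t max_wg = 0, max_items[3] = {0, 0, 0};  /* assumes 3 work-item dimensions */

    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(local_mem), &local_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, sizeof(const_buf), &const_buf, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(max_alloc), &max_alloc, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg), &max_wg, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(max_items), max_items, NULL);

    printf("CL_DEVICE_GLOBAL_MEM_SIZE          = %llu\n", (unsigned long long)global_mem);
    printf("CL_DEVICE_LOCAL_MEM_SIZE           = %llu\n", (unsigned long long)local_mem);
    printf("CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE = %llu\n", (unsigned long long)const_buf);
    printf("CL_DEVICE_MAX_MEM_ALLOC_SIZE       = %llu\n", (unsigned long long)max_alloc);
    printf("CL_DEVICE_MAX_WORK_GROUP_SIZE      = %zu\n", max_wg);
    printf("CL_DEVICE_MAX_WORK_ITEM_SIZES      = %zu x %zu x %zu\n",
           max_items[0], max_items[1], max_items[2]);
}

The kernel-side values (CL_KERNEL_WORK_GROUP_SIZE, CL_KERNEL_LOCAL_MEM_SIZE, CL_KERNEL_PRIVATE_MEM_SIZE) come from clGetKernelWorkGroupInfo, which you are already using.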

General information:

A work-group size can be limited because local memory is limited, and this limit can be reached if a kernel uses a lot of private memory (“a lot” is relative: on weaker hardware the limit may be reached even with seemingly few variables). "However this limit is just under ideal conditions. If your kernel uses high amount of WI per WG maybe some of the private WI data is being spilled out to local memory. [...]" (Source).

So some of this private memory may be spilled to local memory without you realizing it, so that the combined size of the local memory you use explicitly and the local memory needed for the spilled private data exceeds the available local memory.

CL_DEVICE_LOCAL_MEM_SIZE returns the available size of local memory, and CL_KERNEL_LOCAL_MEM_SIZE tells you how much local memory the kernel uses. Apparently this also takes dynamic local memory (allocated via clSetKernelArg) into consideration, however I am unsure how this is supposed to work if you query CL_KERNEL_LOCAL_MEM_SIZE before setting the kernel argument (which is what you would want to do in order to determine the size of local memory...)
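
To illustrate the clSetKernelArg interaction, a small sketch (the argument index 3 and the 1 KiB size are made up for illustration; kernel and device are your existing handles):

#include <CL/cl.h>

static void show_dynamic_local(cl_kernel kernel, cl_device_id device)
{
    cl_ulong local_before = 0, local_after = 0;

    /* Only statically declared __local usage is visible at this point. */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_before), &local_before, NULL);

    /* Dynamic __local allocation: pass 1 KiB for a __local pointer argument. */
    clSetKernelArg(kernel, 3, 1024, NULL);

    /* local_after should now also include the 1 KiB passed via clSetKernelArg. */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_after), &local_after, NULL);
}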

Anyway, OpenCL knows exactly how much local memory you use, so it can calculate how many work-items (each of which has private memory that may need spilling to local memory) it can support. This reduced work-group size may be what you get when querying CL_KERNEL_WORK_GROUP_SIZE.
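
Purely as a back-of-the-envelope illustration of the kind of budget calculation I mean (the per-work-item spill size is a made-up number; real drivers do not expose it directly):

#include <stdio.h>

int main(void)
{
    const unsigned long device_local   = 32 * 1024;  /* CL_DEVICE_LOCAL_MEM_SIZE          */
    const unsigned long kernel_local   = 2 * 1024;   /* CL_KERNEL_LOCAL_MEM_SIZE (static) */
    const unsigned long spill_per_item = 512;        /* hypothetical private spill per WI */

    /* If private data were spilled to local memory, the work-group size
       would be capped by the remaining local memory budget. */
    unsigned long max_items = (device_local - kernel_local) / spill_per_item;
    printf("work-group size capped at ~%lu work-items\n", max_items);
    return 0;
}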

After looking at the kernel you posted I don't think that local memory is the problem here (which is what you already suspected), especially since you only use 2 of the 32 KiB of local memory.

solidpixel On

What other factors (eg, registers/__private memory?) contribute to low CL_KERNEL_WORK_GROUP_SIZE, and how do I check usage?

On Mali all memory used by compute workloads is global (i.e. backed by system RAM), so memory pressure shouldn't cause any problems except through secondary effects (such as cache thrashing). I suspect register allocation constraints may come into play here - larger workgroups mean more concurrent threads active in the shader core, which means higher pressure on the register file - although I don't know for sure.

The Mali offline compiler for OpenGL ES reports work register usage - for example it can report this type of information:

./malisc -c Mali-T760 -r r1p0 -d Mali-T600_r5p0-00rel0 --fragment -V test.frag 
ARM Mali Offline Compiler v4.5.0
(C) Copyright 2007-2014 ARM Limited.
All rights reserved.


1 work registers used, 0 uniform registers used, spilling not used.

                 A    L/S  T    Total  Bound
Cycles:          2    0    0    2      A
Shortest Path:   1    0    0    1      A
Longest Path:    1    0    0    1      A

Note: The cycles counts do not include possible stalls due to cache misses.

I'm not sure if ARM have an offline compiler for OpenCL which can report similar information - it might be worth asking over on the ARM Connected Community site.