How to cleanly exit OpenCL code

815 views Asked by At

I'm looking for a simple and clean way, to tell my host, that some code in OpenCL lead to an error and it should abandon further work. I (think to) know, that try, catch or assert doesn't work in OpenCL C. Furthermore, Kernels have to be defined as non-returning functions, so simply Error-Code returning is out of the image as well. The only idea I had is passing a cl_mem object between host and kernel and check its value between Kernel enqueues or launches, which somehow enforces a very strong kind of serialization. Is there a better idea, maybe using events?

2

There are 2 answers

5
huseyin tugrul buyukisik On BEST ANSWER

If you need a similar thing like

for(i 0 to N)
{
    do work (i)
    error ? break;
}

in a parallel fashion,

int threadId=get_global_id(0);

// a broadcast read(for a new gpu) so no performance hit
mem_fence(CLK_GLOBAL_MEM_FENCE_READ)
if(error[0]==0) // or use atomic_add(&error[0],0) to read atomically(when total number of threads is low like thousands)
{
     do work (threadId);
     error? atomic_add(&error[0],errCode)
     mem_fence(CLK_GLOBAL_MEM_FENCE_WRITE)
}

so at least you save cycles at thread-group-level which should let later threads complete quickly if they started after an atomic error write. Atomic operations are slow but an error handling should make it less important right? Also depending on device type and drivers, it could need at least thousands of threads between an atomic write and a proper non-atomic read so for a million threads it could be efficient but for a thousand threads, you should use atomic read (atomic add with a zero value) so each thread will add extra cycle(s) before actual work begins but at least its latency may be hidden with heavy compute.

If you have multiple devices that need to inform each other about errors, you should use USE_HOST_PTR for error buffer to read/write error codes on host memory directly instead of using device memory. This could be less performant than device memory since error buffer will not be cached and will be far away from device so maybe 5GB/s of pci-e bandwidth bottleneck instead of 5TB/s of device memory(assuming broadcasts to all cores with single cycle, for latest graphics cards)

0
Dithermaster On

In order to not cause serialization use an asynchronous solution. Each kernel takes a cl_mem object and if it wants to tell the host side to shut down, it writes a sentinal value. When the host later detects this, it stops enqueueing work. Because this is async, some extra work items might be enqueued between the writing and the reading, and your system will need to handle this condition. You may need a ring buffer of cl_mem objects, since you'll need to read or map them on the host to check their contents, and you don't want to block any kernels to do this.

In OpenCL 2.x you could perhaps use a pipe for this communication, but I haven't used them so I can't offer more details on this solution.