How to reduce CUDA synchronize latency / delay


This question is related to using CUDA streams to run many kernels.

In CUDA there are several synchronization commands: cudaStreamSynchronize, cudaDeviceSynchronize, and cudaThreadSynchronize, as well as cudaStreamQuery to check whether a stream is empty.

When using the profiler, I noticed that these synchronize commands introduce a large delay into the program. Does anyone know of a way to reduce this latency, apart from, of course, using as few synchronization commands as possible?

Also, are there any figures for judging the most efficient synchronization method? For example, consider an application that uses 3 streams, and two of them need to complete before I can launch work into a fourth stream: should I use two cudaStreamSynchronize calls or just one cudaDeviceSynchronize? Which incurs less overhead? (The sketch below illustrates the two options.)
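To make the scenario concrete, here is a minimal hypothetical sketch of the two options; the kernel names, launch configuration, and buffer are placeholders, and error checking is omitted:

```cpp
#include <cuda_runtime.h>

__global__ void producerKernel(float *d) { /* ... */ }
__global__ void consumerKernel(float *d) { /* ... */ }

void launchExample(float *dData)
{
    cudaStream_t s[3], s4;
    for (int i = 0; i < 3; ++i) cudaStreamCreate(&s[i]);
    cudaStreamCreate(&s4);

    for (int i = 0; i < 3; ++i)
        producerKernel<<<64, 256, 0, s[i]>>>(dData);

    // Option A: the host waits only on the two streams the new work depends on.
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);

    // Option B (coarser): the host waits for all outstanding work on the device.
    // cudaDeviceSynchronize();

    consumerKernel<<<64, 256, 0, s4>>>(dData);

    cudaStreamSynchronize(s4);
    for (int i = 0; i < 3; ++i) cudaStreamDestroy(s[i]);
    cudaStreamDestroy(s4);
}
```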


1 Answer

Answered by ArchaeaSoftware:

The main difference between the synchronization methods is "polling" versus "blocking."

"Polling" is the default mechanism for the driver to wait for the GPU - it waits for a 32-bit memory location to attain a certain value written by the GPU. It may return the wait more quickly after the wait is resolved, but while waiting, it burns a CPU core looking at that memory location.

"Blocking" can be requested by calling cudaSetDeviceFlags() with cudaDeviceScheduleBlockingSync, or calling cudaEventCreate() with cudaEventBlockingSync. Blocking waits cause the driver to insert a command into the DMA command buffer that signals an interrupt when all preceding commands in the buffer have been executed. The driver can then map the interrupt to a Windows event or a Linux file handle, enabling the synchronization commands to wait without constantly burning CPU, as do the default polling methods.

A query is basically a manual check of the 32-bit memory location used for polling waits, so in most situations it is very cheap. But if ECC is enabled, the query will dive into kernel mode to check for ECC errors; and on Windows, any pending commands will be flushed to the driver (which requires a kernel thunk).
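For completeness, a minimal sketch of a non-blocking progress check with cudaStreamQuery; the host-work helper is a placeholder:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

void doOtherHostWork() { /* ... */ }

void drainStream(cudaStream_t stream)
{
    // cudaStreamQuery returns cudaSuccess when the stream is empty and
    // cudaErrorNotReady (not a real error) while work is still pending.
    cudaError_t status;
    while ((status = cudaStreamQuery(stream)) == cudaErrorNotReady) {
        doOtherHostWork();  // overlap useful CPU work with the GPU
    }
    if (status == cudaSuccess)
        printf("stream drained\n");
}
```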