How can I have a CUDA stream wait for not-yet-scheduled work? (i.e. user-event-like pattern)


I have some work I want to do on a CUDA stream, say a kernel K, which depends on previous work that needs to be done on the CPU. The exact details of the CPU work are not known to me when I'm scheduling K; I just want K not to start until it is given an indication that everything is ready.

Now, if I had known exactly what CPU work is to be done, e.g. that K could start after some function foo() concludes, I could do the following:

  • Enqueue a call to foo() on stream SideStream
  • Enqueue an event E1 on SideStream
  • Enqueue a wait on event E1 on MainStream
  • Enqueue K on MainStream
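For reference, these four steps map directly onto the CUDA runtime API. A sketch (fooWrapper is a hypothetical adapter, since cudaLaunchHostFunc expects a C-style void(void*) callback; requires a CUDA toolchain and device to run):

```cuda
#include <cuda_runtime.h>

void foo() { /* the known CPU work */ }

// cudaLaunchHostFunc takes a void(void*) callback, so wrap foo().
void fooWrapper(void*) { foo(); }

__global__ void K() { /* the dependent kernel */ }

void schedule(cudaStream_t mainStream, cudaStream_t sideStream)
{
    cudaEvent_t e1;
    cudaEventCreateWithFlags(&e1, cudaEventDisableTiming);

    cudaLaunchHostFunc(sideStream, fooWrapper, nullptr); // step 1
    cudaEventRecord(e1, sideStream);                     // step 2
    cudaStreamWaitEvent(mainStream, e1, 0);              // step 3
    K<<<1, 1, 0, mainStream>>>();                        // step 4
}
```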

but - what if my CUDA scheduling code doesn't have access to such a foo()? I want to allow some other, arbitrary place in my code to fire E1 when it is good and ready, and have that trigger K on MainStream. ... but I can't do that, since in CUDA, you can only wait on an already-enqueued (already "recorded") event.

This seems to be one of the few niches in which OpenCL offers a richer API than CUDA's: "User Events" (clCreateUserEvent). They can be waited upon, and their execution-completion status can be set by the user via clSetUserEventStatus.

But certainly CUDA is able to provide something like this itself, if only to implement the OpenCL API call. So, what is the idiomatic way to achieve this effect with CUDA?


There are 3 answers

Answer by einpoklum

Here's a possible idea - based on @AbatorAbetor's comment, although I have no idea if that's what people use in practice.

  • Write a function foo() which takes a condition variable as a parameter and waits on it. You can use std::condition_variable, for example.
  • Define a condition variable.

Now proceed as in your question - as you have exactly the function you were missing:

  • Enqueue a call to foo() on stream SideStream
  • Enqueue an event E1 on SideStream
  • Enqueue a wait on event E1 on MainStream
  • Enqueue K on MainStream

but you are not quite done: Your scheduler now passes the condition variable (while keeping it alive!) onwards/outwards, so that finally, the "CPU work" you mentioned has a reference to it. When it is done, all it needs to do is a notify operation on the condition variable: This will wake up foo(), which will immediately trigger E1 and then K.
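The host-side handshake itself is plain C++ and independent of CUDA; a minimal sketch (Gate, foo, and signalDone are illustrative names, with foo() being what the stream callback would run):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// The state the scheduler keeps alive and hands out.
struct Gate {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
};

// What foo() does inside the stream callback: block until
// some other part of the program signals readiness.
void foo(Gate& g) {
    std::unique_lock<std::mutex> lock(g.m);
    g.cv.wait(lock, [&] { return g.ready; });
}

// The arbitrary "CPU work" elsewhere calls this when it is done.
void signalDone(Gate& g) {
    {
        std::lock_guard<std::mutex> lock(g.m);
        g.ready = true;
    }
    g.cv.notify_one();
}
```

The predicate-taking overload of wait() guards against spurious wakeups, which is why the bool flag accompanies the condition variable.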

Caveat: I am assuming that letting a CUDA callback function block like this doesn't interfere with other CUDA runtime/driver work.

Answer by einpoklum

One can probably use CUDA's "stream-ordered memory operations" functionality, avoiding host function scheduling altogether:

  • Designate a well-aligned 32-bit value v for signaling MainStream.
  • Register v with cuMemHostRegister(), and obtain its device address with cuMemHostGetDevicePointer() (which may be the same as its host address).
  • Enqueue a wait on v (using cuStreamWaitValue32()) on MainStream.
  • Enqueue K on MainStream.
  • Pass v's address to whatever code schedules the extra CPU work.
  • Make sure 1 is written to v when the CPU work is done.
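A sketch of those steps with the driver API (error checking omitted; scheduleOnMainStream and signalDone are illustrative names; note that stream memory operations require driver/device support, so this may not work everywhere):

```cuda
#include <cuda.h>

// v lives in ordinary host memory; registering it makes it
// visible to the GPU and usable with cuStreamWaitValue32().
volatile unsigned int v = 0;

void scheduleOnMainStream(CUstream mainStream)
{
    cuMemHostRegister((void*)&v, sizeof(v), 0);

    CUdeviceptr vDev;
    cuMemHostGetDevicePointer(&vDev, (void*)&v, 0);

    // MainStream stalls at this point until v becomes 1 ...
    cuStreamWaitValue32(mainStream, vDev, 1, CU_STREAM_WAIT_VALUE_EQ);
    // ... and only then runs K (launch K on mainStream here).
}

// Called from wherever the CPU work finishes:
void signalDone() { v = 1; }
```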
Answer by Abator Abetor

One could launch a kernel before K that simply waits until a flag is set from the host. For newer GPUs, cuda::latch may be more efficient, since its wait() appears to use the nanosleep instruction while spinning.

#include <cstdio>
#include <chrono>
#include <thread>

#include <cuda/latch>

__global__ 
void kernel(){
    printf("kernel\n");
}

// Spin until the host sets *flag to 1.
__global__ 
void waitKernel(volatile int* flag){
    while(*flag != 1);
}

// Block until the host counts the latch down to zero.
__global__ 
void waitKernelLatch(cuda::latch<cuda::thread_scope_system>* latchPtr){
    latchPtr->wait();
}

int main(){
    int* waitFlag;
    cudaMallocHost(&waitFlag, sizeof(int));

    cuda::latch<cuda::thread_scope_system>* latchPtr;
    cudaMallocHost(&latchPtr, sizeof(cuda::latch<cuda::thread_scope_system>));

    printf("wait using flag\n");
    *waitFlag = 0;
    waitKernel<<<1,1>>>(waitFlag);  // blocks the stream ...
    kernel<<<1,1>>>();              // ... so this cannot start yet

    printf("do some cpu stuff\n");
    std::this_thread::sleep_for(std::chrono::seconds(3));

    *waitFlag = 1;                  // release waitKernel
    cudaDeviceSynchronize();

    printf("wait using latch\n");
    new (latchPtr) cuda::latch<cuda::thread_scope_system>(1);
    waitKernelLatch<<<1,1>>>(latchPtr);
    kernel<<<1,1>>>();

    printf("do some cpu stuff\n");
    std::this_thread::sleep_for(std::chrono::seconds(3));

    latchPtr->count_down();         // release waitKernelLatch
    cudaDeviceSynchronize();

    latchPtr->~latch();
    cudaFreeHost(latchPtr);
    cudaFreeHost(waitFlag);
}