CUDA streams are blocking despite Async


I'm working on a real-time video stream that I try to process with a GeForce GTX 960M. (Windows 10, VS 2013, CUDA 8.0)

Each frame has to be captured, lightly blurred, and whenever I can, I need to do some hard-work calculations on the 10 latest frames. So I need to capture ALL the frames at 30 fps, and I expect to get the hard-work result at 5 fps.

My problem is that I cannot keep the capture running at the right pace: it seems that the hard-work calculation slows down the capture of frames, either at the CPU level or at the GPU level. I miss some frames...

I tried many solutions. None worked:

  1. I tried to set up the jobs on 2 streams (image below):
    • the host gets a frame
    • First stream (called Stream2): cudaMemcpyAsync copies the frame to the Device. Then a first kernel does the basic blurring calculations. (In the attached image, blurring appears as short slots at 3.07 s and 3.085 s, and then nothing... until the big part has finished.)
    • the host checks whether the second stream is "available" thanks to a CudaEvent, and launches it if possible. In practice, the stream is available about half of the time.
    • Second stream (called Stream4): starts the hard-work calculations in a kernel (kernelCalcul_W2), outputs the result, and records an Event.

NSight capture

In practice, I wrote:

cudaStream_t  sHigh, sLow;
cudaStreamCreateWithPriority(&sHigh, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&sLow, cudaStreamNonBlocking, priority_low);

cudaEvent_t event_1;
cudaEventCreate(&event_1);

if (frame has arrived)
{
    cudaMemcpyAsync(..., sHigh);        // HtoD, to upload images in the GPU
    blur_Image <<<... , sHigh>>> (...);
    if (cudaEventQuery(event_1) == cudaSuccess) hard_work(sLow);
    else printf("Event 1 not ready\n");
}

void hard_work( cudaStream_t sLow_)
{
    kernelCalcul_W2<<<... , sLow_>>> (...);
    cudaMemcpyAsync(... the result..., sLow_); //DtoH
    cudaEventRecord(event_1, sLow_);    
}
  2. I tried to use only one stream. It's the same code as above, except that I change one parameter when launching hard_work:
    • the host gets a frame
    • Stream: cudaMemcpyAsync copies the frame to the Device. Then the kernel does the basic blurring calculations. Then, if the CudaEvent Event_1 is ok, I launch the hard-work, and I record an Event_1 to get the status on the next round. In practice, the stream is ALWAYS available: I never fall into the "else" part.

This way, while the hard-work is running, I expected to "buffer" all the frames to copy, and not to lose any. But I do lose some: it turns out that each time I get a frame and copy it, Event_1 seems ok, so I launch the hard-work, and I only get the next frame very late. (A minimal sketch of this one-stream variant is shown right after the list below.)

  3. I tried to put the two streams in two different threads (in C). It was not better (even worse).
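
For clarity, here is a minimal sketch of the one-stream variant from attempt 2 (same pseudocode as above; the only change is the stream passed to hard_work):

if (frame has arrived)
{
    cudaMemcpyAsync(..., sHigh);        // HtoD, as before
    blur_Image <<<... , sHigh>>> (...);
    // same-stream launch: hard_work now queues behind the blur kernel
    if (cudaEventQuery(event_1) == cudaSuccess) hard_work(sHigh);
    else printf("Event 1 not ready\n");
}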

So the question is: how to ensure that the first stream captures ALL frames? I really have the feeling that the different streams block the CPU.

I display the images with OpenGL. Would it interfere?

Any idea of ways to improve this? Thanks a lot!

EDIT: As requested, here is an MCVE.

There is a parameter you can tune (#define ADJUST) to see what's happening. Basically, the main procedure sends CUDA requests in Async mode, but it seems to block the main thread. As you will see in the image, I have "memory access" (i.e., images captured) every 30 ms, except when the hard-work is running (then I just don't get images).

Last detail: I'm using CUDA 7.5 to run this. I tried to install 8.0, but apparently the compiler is still 7.5.
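
A quick way to check which toolkit nvcc actually comes from, in case several are installed:

nvcc --version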

#define _USE_MATH_DEFINES 1
#define _CRT_SECURE_NO_WARNINGS 1

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <Windows.h>

#define ADJUST  400
// adjusting this parameter may make the problem occur.
// Too high => the watchdog will probably stop the kernel
// Too low  => the kernel will probably run smoothly

unsigned short * images_as_Unsigned_in_Host;
unsigned short * Images_as_Unsigned_in_Device;
unsigned short * camera;
float * images_as_Output_in_Host;
float *  Images_as_Float_in_Device;
float * imageOutput_in_Device;

unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
unsigned long imagePixelSize;
unsigned short lastImageFromCamera;


cudaStream_t  s1, s2;
cudaEvent_t event_2;
clock_t timeRef;

// Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
// This kernel runs fast, and that's the point.
__global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_, 
    unsigned long  imagePixelSize_, short blur_distance)
{
    // we start from 'blur_distance' from the edge
    // p0 is the point we will calculate. p is a pointer which will move around for average
    unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
    unsigned long p = p0;
    unsigned short * us;
    if (p >= imagePixelSize_) return;
    unsigned long tot = 0;
    short a, b, n, k;
    k = 0;
    // p starts from the top edge and will move to the right-bottom
    p -= blur_distance + blur_distance * imageWidth_;
    us = Images_as_Unsigned_in_Device_ + p;
    for (a = 2 * blur_distance; a >= 0; a--)
    {
        for (b = 2 * blur_distance; b >= 0; b--)
        {
            n = *us;
            if (n > 0) { tot += n; k++; }
            us++;
        }
        us += imageWidth_ - 2 * blur_distance - 1;
    }
    if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
    else Images_as_Float_in_Device_[p0] = 128.f;
}


__global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long  imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
{
    // point the pixel and crunch it
    unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
    if (p >= imagePixelSize_)   { return; }
    float result = 0.f;  // accumulator: must be initialized before the loop below
    long a, n, n0;
    float input;

    // this is not the right algorithm (which is pretty complex). 
    // I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
    for (n = 0; n < 10; n++)
    {
        n0 = slot - n;
        if (n0 < 0) n0 += totImages;
        input = inputImage[p + n0 * imagePixelSize_]; 
        for (a = 0; a < ADJUST ; a++)
                result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
    }
    outputImage[p] = result;
}


void hard_work( cudaStream_t s){

    cudaError err;
    // launch the hard work
    printf("Hard work is launched after image %d is captured  ==> ", imageSlot);
    kernelCalcul_W2 << <340, 500, 0, s >> >(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
    err = cudaPeekAtLastError();
    if (err != cudaSuccess) printf( "running error: %s \n", cudaGetErrorString(err));
    else printf("running ok\n");

    // copy the result back to Host
    //printf(" %p  %p  \n", images_as_Output_in_Host, imageOutput_in_Device);
    cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) *  imagePixelSize, cudaMemcpyDeviceToHost, s);
    cudaEventRecord(event_2, s);
}


void createStorageSpace()
{
    imageWidth = 640;
    imageHeight = 480;
    totNbOfImages = 300;
    imageSlot = 0;
    imagePixelSize = 640 * 480;
    lastImageFromCamera = 0;

    camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
    for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
    // storing the images in the Host memory. I know I could optimize with cudaHostAlloc.
    images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
    images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));

    cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
    cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);

    cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));



    int priority_high, priority_low;
    cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
    cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
    cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
    cudaEventCreate(&event_2);

}

void releaseMapFile()
{
    cudaFree(Images_as_Unsigned_in_Device);
    cudaFree(Images_as_Float_in_Device);
    cudaFree(imageOutput_in_Device);
    free(images_as_Output_in_Host);
    free(camera);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaEventDestroy(event_2);
}

void putImageCUDA(const void * data)
{       
    // We put the image in a round-robin. The slot to put the image is imageSlot
    printf("\nDealing with image %d\n", imageSlot);
    // Copy the image in the Round Robin
    cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) *  imagePixelSize, cudaMemcpyHostToDevice, s1);

    // We will blur the image. Let's prepare the memory to get the results as floats
    cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0, sizeof(float) *  imagePixelSize, s1);

    // blur image
    blurImage << <imageHeight - 140, imageWidth - 140, 0, s1 >> > (Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
                Images_as_Float_in_Device + imageSlot * imagePixelSize,
                imageWidth, imagePixelSize, 3);


    // launches the hard-work
    if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
    else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);

    imageSlot++;
    if (imageSlot >= totNbOfImages) {
        imageSlot = 0;
    }
}

int main()
{
    createStorageSpace();
    printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");

    for (int i = 0; i < 10; i++)
    {
        putImageCUDA(camera);  // Puts an image in the GPU, does the blurring, and tries to do the hard-work
        Sleep(30);  // to simulate Camera
    }
    releaseMapFile();
    getchar();
}
1 Answer

Robert Crovella:

The primary issue here is that cudaMemcpyAsync is only a properly non-blocking async operation if the host memory involved is pinned, i.e. allocated using cudaHostAlloc. This characteristic is covered in several places, including the API documentation and the relevant programming guide section.

The following modification to your code (to run on linux, which I prefer) demonstrates the behavioral difference:

$ cat t33.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define ADJUST  400
// adjusting this parameter may make the problem occur.
// Too high => the watchdog will probably stop the kernel
// Too low  => the kernel will probably run smoothly

unsigned short * images_as_Unsigned_in_Host;
unsigned short * Images_as_Unsigned_in_Device;
unsigned short * camera;
float * images_as_Output_in_Host;
float *  Images_as_Float_in_Device;
float * imageOutput_in_Device;

unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
unsigned long imagePixelSize;
unsigned short lastImageFromCamera;


cudaStream_t  s1, s2;
cudaEvent_t event_2;
clock_t timeRef;

// Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
// This kernel runs fast, and that's the point.
__global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_,
    unsigned long  imagePixelSize_, short blur_distance)
{
    // we start from 'blur_distance' from the edge
    // p0 is the point we will calculate. p is a pointer which will move around for average
    unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
    unsigned long p = p0;
    unsigned short * us;
    if (p >= imagePixelSize_) return;
    unsigned long tot = 0;
    short a, b, n, k;
    k = 0;
    // p starts from the top edge and will move to the right-bottom
    p -= blur_distance + blur_distance * imageWidth_;
    us = Images_as_Unsigned_in_Device_ + p;
    for (a = 2 * blur_distance; a >= 0; a--)
    {
        for (b = 2 * blur_distance; b >= 0; b--)
        {
            n = *us;
            if (n > 0) { tot += n; k++; }
            us++;
        }
        us += imageWidth_ - 2 * blur_distance - 1;
    }
    if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
    else Images_as_Float_in_Device_[p0] = 128.f;
}


__global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long  imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
{
    // point the pixel and crunch it
    unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
    if (p >= imagePixelSize_)   { return; }
    float result = 0.f;  // accumulator: must be initialized before the loop below
    long a, n, n0;
    float input;

    // this is not the right algorithm (which is pretty complex).
    // I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
    for (n = 0; n < 10; n++)
    {
        n0 = slot - n;
        if (n0 < 0) n0 += totImages;
        input = inputImage[p + n0 * imagePixelSize_];
        for (a = 0; a < ADJUST ; a++)
                result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
    }
    outputImage[p] = result;
}


void hard_work( cudaStream_t s){
#ifndef QUICK
    cudaError err;
    // launch the hard work
    printf("Hard work is launched after image %d is captured  ==> ", imageSlot);
    kernelCalcul_W2 << <340, 500, 0, s >> >(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
    err = cudaPeekAtLastError();
    if (err != cudaSuccess) printf( "running error: %s \n", cudaGetErrorString(err));
    else printf("running ok\n");

    // copy the result back to Host
    //printf(" %p  %p  \n", images_as_Output_in_Host, imageOutput_in_Device);
    cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) *  imagePixelSize, cudaMemcpyDeviceToHost, s);
    cudaEventRecord(event_2, s);
#endif
}


void createStorageSpace()
{
    imageWidth = 640;
    imageHeight = 480;
    totNbOfImages = 300;
    imageSlot = 0;
    imagePixelSize = 640 * 480;
    lastImageFromCamera = 0;
#ifdef USE_HOST_ALLOC
    cudaHostAlloc(&camera, imagePixelSize*sizeof(unsigned short), cudaHostAllocDefault);
    cudaHostAlloc(&images_as_Unsigned_in_Host, imagePixelSize*sizeof(unsigned short)*totNbOfImages, cudaHostAllocDefault);
    cudaHostAlloc(&images_as_Output_in_Host, imagePixelSize*sizeof(float), cudaHostAllocDefault);  // buffer holds floats
#else
    camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
    images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
    images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
#endif
    for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
    cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
    cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);

    cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));



    int priority_high, priority_low;
    cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
    cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
    cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
    cudaEventCreate(&event_2);
    cudaEventRecord(event_2, s2);
}

void releaseMapFile()
{
    cudaFree(Images_as_Unsigned_in_Device);
    cudaFree(Images_as_Float_in_Device);
    cudaFree(imageOutput_in_Device);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaEventDestroy(event_2);
}

void putImageCUDA(const void * data)
{
    // We put the image in a round-robin. The slot to put the image is imageSlot
    printf("\nDealing with image %d\n", imageSlot);
    // Copy the image in the Round Robin
    cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) *  imagePixelSize, cudaMemcpyHostToDevice, s1);

    // We will blur the image. Let's prepare the memory to get the results as floats
    cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0, sizeof(float) *  imagePixelSize, s1);

    // blur image
    blurImage << <imageHeight - 140, imageWidth - 140, 0, s1 >> > (Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
                Images_as_Float_in_Device + imageSlot * imagePixelSize,
                imageWidth, imagePixelSize, 3);


    // launches the hard-work
    if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
    else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);

    imageSlot++;
    if (imageSlot >= totNbOfImages) {
        imageSlot = 0;
    }
}

int main()
{
    createStorageSpace();
    printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");

    for (int i = 0; i < 10; i++)
    {
        putImageCUDA(camera);  // Puts an image in the GPU, does the blurring, and tries to do the hard-work
        usleep(30000);  // to simulate Camera
    }
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("some CUDA error: %s\n", cudaGetErrorString(err));
    releaseMapFile();
}
$ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu
$ time ./t33
The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
You may adjust a #define ADJUST parameter to see what's happening.
Dealing with image 0
Hard work is launched after image 0 is captured  ==> running ok

Dealing with image 1
Hard work is launched after image 1 is captured  ==> running ok

Dealing with image 2
Hard work is launched after image 2 is captured  ==> running ok

Dealing with image 3
Hard work is launched after image 3 is captured  ==> running ok

Dealing with image 4
Hard work is launched after image 4 is captured  ==> running ok

Dealing with image 5
Hard work is launched after image 5 is captured  ==> running ok

Dealing with image 6
Hard work is launched after image 6 is captured  ==> running ok

Dealing with image 7
Hard work is launched after image 7 is captured  ==> running ok

Dealing with image 8
Hard work is launched after image 8 is captured  ==> running ok

Dealing with image 9
Hard work is launched after image 9 is captured  ==> running ok

real    0m2.790s
user    0m0.688s
sys     0m0.966s
$ nvcc -arch=sm_52 -lineinfo -o t33 t33.cu -DUSE_HOST_ALLOC
$ time ./t33
The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...
You may adjust a #define ADJUST parameter to see what's happening.
Dealing with image 0
Hard work is launched after image 0 is captured  ==> running ok

Dealing with image 1
Hard_work still running, so unable to process after image 1

Dealing with image 2
Hard_work still running, so unable to process after image 2

Dealing with image 3
Hard_work still running, so unable to process after image 3

Dealing with image 4
Hard_work still running, so unable to process after image 4

Dealing with image 5
Hard_work still running, so unable to process after image 5

Dealing with image 6
Hard_work still running, so unable to process after image 6

Dealing with image 7
Hard work is launched after image 7 is captured  ==> running ok

Dealing with image 8
Hard_work still running, so unable to process after image 8

Dealing with image 9
Hard_work still running, so unable to process after image 9

real    0m1.721s
user    0m0.028s
sys     0m0.629s
$

In the USE_HOST_ALLOC case above, the launch pattern for the low-priority kernel is intermittent, as expected, and the overall run time is considerably shorter.

In short, if you want the expected behavior out of cudaMemcpyAsync, make sure any participating host allocations are page-locked.
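
For reference, a minimal sketch of that pattern, reusing the buffer and stream names from the code above:

// pinned (page-locked) allocation instead of malloc:
unsigned short *camera;
cudaHostAlloc(&camera, imagePixelSize * sizeof(unsigned short), cudaHostAllocDefault);

// this copy can now return immediately and overlap with host work:
cudaMemcpyAsync(Images_as_Unsigned_in_Device, camera,
                imagePixelSize * sizeof(unsigned short),
                cudaMemcpyHostToDevice, s1);

// pinned memory is released with cudaFreeHost, not free():
cudaFreeHost(camera);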

A pictorial (profiler) example of the effect that pinning can have on multi-stream behavior can be seen in this answer.