I'm working on a video stream in real time that I try to process with a GeForce GTX 960M. (Windows 10, VS 2013, CUDA 8.0)
Each frame has to be captured, lightly blured, and whenever I can, I need to do some hard-work calculations on the 10 latest frames. So I need to capture ALL the frames at 30 fps, and I expect to get the hard-work result at 5 fps.
My problems is that I cannot keep the capture running at the right pace : it seems that the hard-work calculation slows down the capture of frames, either at CPU level or at GPU level. I miss some frames...
I tried many solutions. None worked:
- I tried to set-up jobs on 2 streams (image below):
- the host gets a frame
- First stream (called Stream2) : cudaMemcpyAsync copies the frame on the Device. Then, a first kernel does the basic bluring calculations. (In the attached image, bluring appears as a short slot at 3.07 s and 3.085 s. And then nothing... until the big part has finished)
- the host checks if the second stream is "available" thanks to a CudaEvent, and lauches it if possible. Practically, the stream is available 1/2 of tries.
- Second stream (called Stream4) : starts hard-work calculations in a kernel ( kernelCalcul_W2), outputs the result, and records an Event.
Practically, I wrote :
cudaStream_t sHigh, sLow;
cudaStreamCreateWithPriority(&sHigh, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&sLow, cudaStreamNonBlocking, priority_low);
cudaEvent_t event_1;
cudaEventCreate(&event_1);
if (frame has arrived)
{
cudaMemcpyAsync(..., sHigh); // HtoD, to upload images in the GPU
blur_Image <<<... , sHigh>>> (...)
if (cudaEventQuery(event_1)==cudaSuccess)) hard_work(sLow);
else printf("Event 2 not ready\n");
}
void hard_work( cudaStream_t sLow_)
{
kernelCalcul_W2<<<... , sLow_>>> (...);
cudaMemcpyAsync(... the result..., sLow_); //DtoH
cudaEventRecord(event_1, sLow_);
}
- I tried to use only one stream. It's the same code as above, but change 1 parameter while launching hard_work.
- host gets a frame
- Stream: cudaMemcpyAsync copies the frame on the Device. Then, the kernel does the basic bluring calculations. Then, if the CudaEvent Event_1 is ok, I lauch the hard-work, and I add an Event_1 to get the status on next round. Practically, the stream is ALWAYS available: I never fall in the "else" part.
This way, while the hard-work is running, I expected to "buffer" all the frames to copy, and not to lose any. But I do lose some: it turns out that each time I get a frame and I copy it, Event_1 seems ok so I launch the hard-work, and only get the the next frame very late.
- I tried to put the two streams in two different threads (in C). Not better (even worse).
So the question is: how to ensure that the first stream captures ALL frames? I really have the feeling that the different streams block the CPU.
I display the images with OpenGL. Would it interfere?
Any idea of ways to improve this? Thanks a lot!
EDIT: As requested, I put here a MCVE.
There is a parameter you can tune (#define ADJUST) to see what's happening. Basically, the main procedure sends CUDA requests in Async mode, but it seems to block the main thread. As you will see in the image, I have "memory access" (i.e. images captured ) every 30 ms except when the hard-work is running (then, I just don't get images).
Last detail: I'm using CUDA 7.5 to run this. I tried to install 8.0 but apparently the compiler is still 7.5
#define _USE_MATH_DEFINES 1
#define _CRT_SECURE_NO_WARNINGS 1
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <Windows.h>
#define ADJUST 400
// adjusting this paramter may make the problem occur.
// Too high => probably watchdog will stop the kernel
// too low => probably the kernel will run smothly
unsigned short * images_as_Unsigned_in_Host;
unsigned short * Images_as_Unsigned_in_Device;
unsigned short * camera;
float * images_as_Output_in_Host;
float * Images_as_Float_in_Device;
float * imageOutput_in_Device;
unsigned short imageWidth, imageHeight, totNbOfImages, imageSlot;
unsigned long imagePixelSize;
unsigned short lastImageFromCamera;
cudaStream_t s1, s2;
cudaEvent_t event_2;
clock_t timeRef;
// Basically, in the middle of the image, I average the values. I removed the logic behind to make it simpler.
// This kernel runs fast, and that's the point.
__global__ void blurImage(unsigned short * Images_as_Unsigned_in_Device_, float * Images_as_Float_in_Device_, unsigned short imageWidth_,
unsigned long imagePixelSize_, short blur_distance)
{
// we start from 'blur_distance' from the edge
// p0 is the point we will calculate. p is a pointer which will move around for average
unsigned long p0 = (threadIdx.x + blur_distance) + (blockIdx.x + blur_distance) * imageWidth_;
unsigned long p = p0;
unsigned short * us;
if (p >= imagePixelSize_) return;
unsigned long tot = 0;
short a, b, n, k;
k = 0;
// p starts from the top edge and will move to the right-bottom
p -= blur_distance + blur_distance * imageWidth_;
us = Images_as_Unsigned_in_Device_ + p;
for (a = 2 * blur_distance; a >= 0; a--)
{
for (b = 2 * blur_distance; b >= 0; b--)
{
n = *us;
if (n > 0) { tot += n; k++; }
us++;
}
us += imageWidth_ - 2 * blur_distance - 1;
}
if (k > 0) Images_as_Float_in_Device_[p0] = (float)tot / (float)k;
else Images_as_Float_in_Device_[p0] = 128.f;
}
__global__ void kernelCalcul_W2(float *inputImage, float *outputImage, unsigned long imagePixelSize_, unsigned short imageWidth_, unsigned short slot, unsigned short totImages)
{
// point the pixel and crunch it
unsigned long p = threadIdx.x + blockIdx.x * imageWidth_;
if (p >= imagePixelSize_) { return; }
float result;
long a, b, n, n0;
float input;
b = 3;
// this is not the right algorithm (which is pretty complex).
// I know this is not optimal in terms of memory management. Still, I want a "long" calculation here so I don't care...
for (n = 0; n < 10; n++)
{
n0 = slot - n;
if (n0 < 0) n0 += totImages;
input = inputImage[p + n0 * imagePixelSize_];
for (a = 0; a < ADJUST ; a++)
result += pow(input, inputImage[a + n0 * imagePixelSize_]) * cos(input);
}
outputImage[p] = result;
}
void hard_work( cudaStream_t s){
cudaError err;
// launch the hard work
printf("Hard work is launched after image %d is captured ==> ", imageSlot);
kernelCalcul_W2 << <340, 500, 0, s >> >(Images_as_Float_in_Device, imageOutput_in_Device, imagePixelSize, imageWidth, imageSlot, totNbOfImages);
err = cudaPeekAtLastError();
if (err != cudaSuccess) printf( "running error: %s \n", cudaGetErrorString(err));
else printf("running ok\n");
// copy the result back to Host
//printf(" %p %p \n", images_as_Output_in_Host, imageOutput_in_Device);
cudaMemcpyAsync(images_as_Output_in_Host, imageOutput_in_Device, sizeof(float) * imagePixelSize, cudaMemcpyDeviceToHost, s);
cudaEventRecord(event_2, s);
}
void createStorageSpace()
{
imageWidth = 640;
imageHeight = 480;
totNbOfImages = 300;
imageSlot = 0;
imagePixelSize = 640 * 480;
lastImageFromCamera = 0;
camera = (unsigned short *)malloc(imagePixelSize * sizeof(unsigned short));
for (int i = 0; i < imagePixelSize; i++) camera[i] = rand() % 255;
// storing the images in the Host memory. I know I could optimize with cudaHostAllocate.
images_as_Unsigned_in_Host = (unsigned short *) malloc(imagePixelSize * sizeof(unsigned short) * totNbOfImages);
images_as_Output_in_Host = (float *)malloc(imagePixelSize * sizeof(float));
cudaMalloc(&Images_as_Unsigned_in_Device, imagePixelSize * sizeof(unsigned short) * totNbOfImages);
cudaMalloc(&Images_as_Float_in_Device, imagePixelSize * sizeof(float) * totNbOfImages);
cudaMalloc(&imageOutput_in_Device, imagePixelSize * sizeof(float));
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
cudaStreamCreateWithPriority(&s1, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&s2, cudaStreamNonBlocking, priority_low);
cudaEventCreate(&event_2);
}
void releaseMapFile()
{
cudaFree(Images_as_Unsigned_in_Device);
cudaFree(Images_as_Float_in_Device);
cudaFree(imageOutput_in_Device);
free(images_as_Output_in_Host);
free(camera);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
cudaEventDestroy(event_2);
}
void putImageCUDA(const void * data)
{
// We put the image in a round-robin. The slot to put the image is imageSlot
printf("\nDealing with image %d\n", imageSlot);
// Copy the image in the Round Robin
cudaMemcpyAsync(Images_as_Unsigned_in_Device + imageSlot * imagePixelSize, data, sizeof(unsigned short) * imagePixelSize, cudaMemcpyHostToDevice, s1);
// We will blur the image. Let's prepare the memory to get the results as floats
cudaMemsetAsync(Images_as_Float_in_Device + imageSlot * imagePixelSize, 0., sizeof(float) * imagePixelSize, s1);
// blur image
blurImage << <imageHeight - 140, imageWidth - 140, 0, s1 >> > (Images_as_Unsigned_in_Device + imageSlot * imagePixelSize,
Images_as_Float_in_Device + imageSlot * imagePixelSize,
imageWidth, imagePixelSize, 3);
// launches the hard-work
if (cudaEventQuery(event_2) == cudaSuccess) hard_work(s2);
else printf("Hard_work still running, so unable to process after image %d\n", imageSlot);
imageSlot++;
if (imageSlot >= totNbOfImages) {
imageSlot = 0;
}
}
int main()
{
createStorageSpace();
printf("The following loop is supposed to push images in the GPU and do calculations in Async mode, and to wait 30 ms before the next image, so we should have the output on the screen in 10 x 30 ms. But it's far slower...\nYou may adjust a #define ADJUST parameter to see what's happening.");
for (int i = 0; i < 10; i++)
{
putImageCUDA(camera); // Puts an image in the GPU, does the bluring, and tries to do the hard-work
Sleep(30); // to simulate Camera
}
releaseMapFile();
getchar();
}
The primary issue here is that
cudaMemcpyAsync
is only a properly non-blocking async operation if the host memory involved is pinned, i.e. allocated usingcudaHostAlloc
. This characteristic is covered in several places, including the API documentation and the relevant programming guide section.The following modification to your code (to run on linux, which I prefer) demonstrates the behavioral difference:
In the
USE_HOST_ALLOC
case above, the launch pattern for the low-priority kernel is intermittent, as expected, and the overall run time is considerably shorter.In short, if you want the expected behavior out of
cudaMemcpyAsync
, make sure any participating host allocations are page-locked.A pictorial (profiler) example of the effect that pinning can have on multi-stream behavior can be seen in this answer.