I have a 3D point cloud and I project the pixels to an image plane. Since some 3D points get mapped to the same pixel, I only want the pixel with the lowest Z-value to my camera. I use a Z-Buffer - a float array - to keep track of my depth values. Here's some pseudocode:
// Initialize z-Buffer with max depth (99999.9f)
// Go through every point of point cloud...
// Project Point (x,y,z) to image plane (u,v)
int newIndex = v*imgWidth+u;
float oldDepth = zbuffer[newIndex];
if (z < oldDepth){
zbuffer[newIndex] = z; // put z value in buffer
outputImg[newIndex] = pointColor[i]; // put pixel in resulting image
}
I have a perfectly working single-core CPU version of this.
The cuda version looks fine and is extremely fast, but only the areas where a z-test play a role are very "streaky", which means that some background points are overwriting the foreground pixels, I think. Also, when I look at the color image I see random color streaks with colors that are not present in the image.
The CUDA version looks more like this:
//Initialize, kernel, project, new coordinates...
const float oldDepth = outputDepth[v * outputMaxWidth + u];
if (z < oldDepth){
outputDepth[v * outputMaxWidth + u] = z;
const int inputColorIndex = yIndex * inputImageStep + 3*xIndex;
const int outputColorIndex = yIndex * outputImageStep + 3*xIndex;
outputImage[outputColorIndex] = inputImage[inputColorIndex]; //B
outputImage[outputColorIndex + 1] = inputImage[inputColorIndex + 1]; //G
outputImage[outputColorIndex + 2] = inputImage[inputColorIndex + 2]; //R
}
I think that the concurrency is a problem here. One thread may write the nearest z-value of this pixel in the Z-Buffer, but just at the same time another thread reads the old value and overwrites the correct value.
How do I prevent this in CUDA?
Edit1: Reducing the block size from (16,16) to (1,1) will result in less streaky patterns, but it will look like 1 pixel holes.
Edit2: Here's a minimal example:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
cudaError_t insertToZBuffer(int *z, const int *a, unsigned int size);
__global__ void zbufferKernel(int *z, const int *a)
{
int i = threadIdx.x;
if (a[i] < z[0]){
z[0] = a[i]; // all mapped to pixel index 0
}
}
int main(){
for (int i = 0; i < 20; ++i){
const int arraySize = 5;
const int a[arraySize] = { 1, 7, 3, 40, 5 }; // some depth values which get mapped all to index 0
int z[arraySize] = { 999 }; // large depth value
insertToZBuffer(z, a, arraySize);
printf("{%d,%d,%d,%d,%d}\n", z[0], z[1], z[2], z[3], z[4]);
cudaDeviceReset();
}
return 0;
}
cudaError_t insertToZBuffer(int *z, const int *a, unsigned int size){
int *dev_a = 0;
int *dev_z = 0;
cudaError_t cudaStatus;
cudaStatus = cudaSetDevice(0);
cudaStatus = cudaMalloc((void**)&dev_z, size * sizeof(int));
cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
cudaStatus = cudaMemcpy(dev_z, z, size * sizeof(int), cudaMemcpyHostToDevice);
zbufferKernel<<<1, size >>>(dev_z, dev_a);
cudaStatus = cudaGetLastError();
cudaStatus = cudaDeviceSynchronize();
cudaStatus = cudaMemcpy(z, dev_z, size * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_z);
cudaFree(dev_a);
return cudaStatus;
}
The value from z at index 0 should be 1 because it is the lowest value, but it is 5, which is the last value of a.
Here's how I solved it thanks to the comments:
I've used atomicCAS (casting floats to ints) to write to my z-buffer if it has a smaller z value. When the current thread has a larger z value, I simply return. In the end I synchronize all the remaining threads (__syncthreads()), which have written to the buffer and check if their z value was the final one. If that was indeed true, I write the point color to my pixel value at this position.
Edit: I should have used just atomicMin...