I am trying to modify the original YUV->RGB kernel provided in the sample code of the NVIDIA Video SDK, and I need help understanding some of its parts.
Here is the kernel code:
template<class YuvUnitx2, class Rgb, class RgbIntx2>
__global__ static void YuvToRgbKernel(uint8_t* pYuv, int nYuvPitch, uint8_t* pRgb, int nRgbPitch, int nWidth, int nHeight) {
    int x = (threadIdx.x + blockIdx.x * blockDim.x) * 2;
    int y = (threadIdx.y + blockIdx.y * blockDim.y) * 2;
    if (x + 1 >= nWidth || y + 1 >= nHeight) {
        return;
    }
    uint8_t* pSrc = pYuv + x * sizeof(YuvUnitx2) / 2 + y * nYuvPitch;
    uint8_t* pDst = pRgb + x * sizeof(Rgb) + y * nRgbPitch;
    YuvUnitx2 l0 = *(YuvUnitx2*)pSrc;
    YuvUnitx2 l1 = *(YuvUnitx2*)(pSrc + nYuvPitch);
    YuvUnitx2 ch = *(YuvUnitx2*)(pSrc + (nHeight - y / 2) * nYuvPitch);
    //YuvToRgbForPixel - returns rgba encoded in uint32_t (.d)
    *(RgbIntx2*)pDst = RgbIntx2{
        YuvToRgbForPixel<Rgb>(l0.x, ch.x, ch.y).d,
        YuvToRgbForPixel<Rgb>(l0.y, ch.x, ch.y).d,
    };
    *(RgbIntx2*)(pDst + nRgbPitch) = RgbIntx2{
        YuvToRgbForPixel<Rgb>(l1.x, ch.x, ch.y).d,
        YuvToRgbForPixel<Rgb>(l1.y, ch.x, ch.y).d,
    };
}
Here are my basic assumptions, some of them are possibly wrong:
- NV12 has two planes: the first holds luma and the second holds interleaved chroma.
- The kernel tries to write 4 pixels at a time.
If assumption 2 is correct, why are the same chroma (ch) values used for all 4 pixels? And if I am wrong about assumption 2, please explain what exactly happens here.
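The chroma plane in NV12 or NV21 is subsampled by a factor of 2, both horizontally and vertically (4:2:0).
For every 2x2 block of pixels in the output there are 4 luma (Y) samples but only 1 Cb and 1 Cr sample. That is why the kernel reads a single chroma pair (ch) and uses it for all 4 pixels that each thread writes.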
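To make the addressing concrete, here is a minimal host-side sketch of the byte offsets involved. It assumes an 8-bit NV12 frame in a single buffer (nHeight rows of luma followed by nHeight / 2 rows of interleaved CbCr, both with the same pitch), so sizeof(YuvUnitx2) is taken to be 2. The helper names LumaOffset, ChromaOffset, KernelChromaOffset and CheckBlock are made up for illustration and are not part of the SDK.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of the luma sample for pixel (x, y).
static size_t LumaOffset(int x, int y, int pitch) {
    return static_cast<size_t>(y) * pitch + x;
}

// Byte offset of the CbCr pair covering pixel (x, y).
// The chroma plane starts after `height` luma rows; each chroma row
// serves two pixel rows, and each 2-byte CbCr pair serves two columns.
static size_t ChromaOffset(int x, int y, int pitch, int height) {
    return static_cast<size_t>(height + y / 2) * pitch
         + static_cast<size_t>(x / 2) * 2;
}

// Offset the kernel effectively computes for `ch`, assuming
// sizeof(YuvUnitx2) == 2 and even x, y (as produced by the `* 2`
// in the kernel's index computation).
static size_t KernelChromaOffset(int x, int y, int pitch, int height) {
    size_t src = static_cast<size_t>(x) * 2 / 2
               + static_cast<size_t>(y) * pitch;      // pSrc - pYuv
    return src + static_cast<size_t>(height - y / 2) * pitch;
}

// Checks that the 2x2 block whose top-left corner is (x, y) shares one
// CbCr pair, and that the kernel's arithmetic lands on that same pair.
static void CheckBlock(int x, int y, int pitch, int height) {
    size_t c = ChromaOffset(x, y, pitch, height);
    assert(ChromaOffset(x + 1, y, pitch, height) == c);
    assert(ChromaOffset(x, y + 1, pitch, height) == c);
    assert(ChromaOffset(x + 1, y + 1, pitch, height) == c);
    assert(KernelChromaOffset(x, y, pitch, height) == c);
}
```

Calling CheckBlock(2, 2, 16, 4), for example, passes: the four pixels have distinct luma offsets, but all map to the same chroma offset, and y * pitch + (height - y / 2) * pitch in the kernel simplifies to (height + y / 2) * pitch for even y.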