I'm using libjpeg (or libjpeg-turbo) for JPEG encoding, and I was wondering whether there is any performance benefit to passing multiple scanlines at once to the jpeg_write_scanlines function. I ran some tests on 720x288 images, but I only get a 0.5% improvement when passing the whole image in a single call.
I guess this increase is just due to the removal of call stack overhead, but I was expecting a bit more, at least with libjpeg-turbo.
The performance test was run with Callgrind (in Valgrind), so maybe I'm missing something. Or maybe I really misunderstood how the JPEG encoder works.
JPEG processes the image in rows of blocks, and each such row has a minimum height, called the MCU height. It is 8 lines in images without subsampling (4:4:4 mode) or 16 lines if chroma is subsampled vertically (4:2:0 mode).
If you feed libjpeg these 8 or 16 lines at a time, it will be able to process the whole MCU row in one go. Otherwise it'll need to do extra bookkeeping or buffering.
Writing multiple MCU heights at a time, or the whole image, won't hurt.
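To make the chunking concrete, here is a rough sketch of how the number of lines per call could be worked out. It is not taken from libjpeg: the helper names (mcu_row_height, next_chunk, count_calls) are hypothetical, and it assumes the MCU height is 8 times the largest vertical sampling factor, which matches the 8 and 16 line cases above. The actual jpeg_write_scanlines call is only indicated in a comment.

```c
#include <assert.h>

/* DCT blocks are 8x8 samples (libjpeg calls this constant DCTSIZE). */
enum { BLOCK_SIZE = 8 };

/* Hypothetical helper: height in scanlines of one MCU row, given the
   largest vertical sampling factor (1 for 4:4:4, 2 for 4:2:0). */
static unsigned mcu_row_height(unsigned max_v_samp_factor)
{
    assert(max_v_samp_factor >= 1 && max_v_samp_factor <= 4);
    return BLOCK_SIZE * max_v_samp_factor;
}

/* Hypothetical helper: how many scanlines to pass in the next call.
   A full MCU row, or whatever is left at the bottom of the image. */
static unsigned next_chunk(unsigned next_line, unsigned image_height,
                           unsigned mcu_height)
{
    unsigned remaining;
    assert(next_line <= image_height);
    remaining = image_height - next_line;
    return remaining < mcu_height ? remaining : mcu_height;
}

/* Hypothetical helper: count the calls needed to feed the whole image
   one MCU row at a time. */
static unsigned count_calls(unsigned image_height, unsigned max_v_samp_factor)
{
    unsigned mcu_height = mcu_row_height(max_v_samp_factor);
    unsigned line = 0;
    unsigned calls = 0;

    while (line < image_height) {
        unsigned n = next_chunk(line, image_height, mcu_height);
        /* In a real encoder, this is where you would call
           jpeg_write_scanlines(&cinfo, &rows[line], n)
           and advance by the number of lines it reports as written. */
        line += n;
        calls++;
    }
    return calls;
}
```

For the 720x288 images in the question, this works out to 36 calls without subsampling and 18 calls in 4:2:0 mode, compared with 288 calls when writing one scanline at a time.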