How to perform fast software binning of an image in C++?


I am trying to perform fast binning of a 2D image (stored in row-major order in a 1D array). The image is 12-bit, so I use the uint16_t data type.

Binning means that, for example with 2x2 binning, every 2 pixels in each direction (x and y) become one pixel; essentially, the one superpixel contains the mean of those 4 pixels. (In the code below the factor is actually 3x3, since 768 / 256 = 3.)

I have written the following code to calculate the mean of the pixels and store it in the superpixel while avoiding integer overflow. However, the per-pixel divide by DIVISIONFACTOR slows the function down, as the code has to perform many more divisions (which are expensive compared to other arithmetic operations). The code runs significantly faster if I add up the values of all pixels first (so overflow can happen if the sum exceeds 65535) and divide the sum for a superpixel by DIVISIONFACTOR only once. (Dividing each pixel separately also truncates each quotient, so the two versions do not necessarily produce identical results.)

To be more precise, the overflow-avoiding code runs in 0.75 ms and the overflow-prone code in 0.13 ms (owing to the fewer division operations).

Please suggest ways to optimize further. I am using the Intel C++ compiler, so please don't shy away from suggesting Intel-specific techniques.

Thank you in advance.

Regards,

Harsh

Image definitions: WIDTH and HEIGHT are the actual image dimensions received from the camera; NX and NY are the final (binned) dimensions.

#define NX 256
#define NY 256
#define WIDTH 768
#define HEIGHT 768
#define BINNINGFACTORWIDTH (WIDTH / NX)
#define BINNINGFACTORHEIGHT (HEIGHT / NY)
#define DIVISIONFACTOR (BINNINGFACTORWIDTH * BINNINGFACTORHEIGHT)

Code without overflow:

int binning(uint16_t* image, uint16_t* binnedImage){
    int i, j, k, l;

    for (j=0; j < NY; j++) {
        for (i=0; i < NX; i++) {
            binnedImage[j * NX + i] = 0;
            for (k=i * BINNINGFACTORWIDTH; k < i * BINNINGFACTORWIDTH + BINNINGFACTORWIDTH; k++) {
                for (l=j * BINNINGFACTORHEIGHT; l < j * BINNINGFACTORHEIGHT + BINNINGFACTORHEIGHT; l++) {
                    // row-major index is row * WIDTH + column; WIDTH and HEIGHT
                    // only coincide here because the image is square
                    binnedImage[j * NX + i] += (image[l * WIDTH + k] / DIVISIONFACTOR);
                }
            }
        }
    }

    return 0;
}

Code with potential for overflow:

int binning(uint16_t* image, uint16_t* binnedImage){
    int i, j, k, l;

    for (j=0; j < NY; j++) {
        for (i=0; i < NX; i++) {
            binnedImage[j * NX + i] = 0;
            for (k=i * BINNINGFACTORWIDTH; k < i * BINNINGFACTORWIDTH + BINNINGFACTORWIDTH; k++) {
                for (l=j * BINNINGFACTORHEIGHT; l < j * BINNINGFACTORHEIGHT + BINNINGFACTORHEIGHT; l++) {
                    binnedImage[j * NX + i] += image[l * WIDTH + k];
                }
            }
            binnedImage[j * NX + i] /= DIVISIONFACTOR;
        }
    }

    return 0;
}
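A third variant (a sketch, not part of my measured code) avoids both problems at once: accumulate into a wider local variable and divide only once per superpixel, so there are no overflows and no per-pixel divisions.

```cpp
#include <cstdint>

#define NX 256
#define NY 256
#define WIDTH 768
#define HEIGHT 768
#define BINNINGFACTORWIDTH (WIDTH / NX)
#define BINNINGFACTORHEIGHT (HEIGHT / NY)
#define DIVISIONFACTOR (BINNINGFACTORWIDTH * BINNINGFACTORHEIGHT)

int binning(const uint16_t* image, uint16_t* binnedImage){
    for (int j = 0; j < NY; j++) {
        for (int i = 0; i < NX; i++) {
            // A 3x3 bin of 12-bit pixels sums to at most 9 * 4095 = 36855,
            // which fits in 16 bits; uint32_t leaves headroom for any factor.
            uint32_t sum = 0;
            for (int l = j * BINNINGFACTORHEIGHT; l < (j + 1) * BINNINGFACTORHEIGHT; l++) {
                for (int k = i * BINNINGFACTORWIDTH; k < (i + 1) * BINNINGFACTORWIDTH; k++) {
                    sum += image[l * WIDTH + k];
                }
            }
            binnedImage[j * NX + i] = (uint16_t)(sum / DIVISIONFACTOR);
        }
    }
    return 0;
}
```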

Full executable code:

#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#define NX 256
#define NY 256
#define WIDTH 768
#define HEIGHT 768
#define BINNINGFACTORWIDTH (WIDTH / NX)
#define BINNINGFACTORHEIGHT (HEIGHT / NY)
#define DIVISIONFACTOR (BINNINGFACTORWIDTH * BINNINGFACTORHEIGHT)

using namespace std;

int binning(uint16_t* image, uint16_t* binnedImage);


int main() {
    chrono::high_resolution_clock::time_point t0, t1;
    chrono::duration<double> dt;

    double totalTime = 0;
    uint64_t count = 0;
    uint32_t i;
    uint16_t *image = (uint16_t*) malloc(sizeof(uint16_t) * WIDTH * HEIGHT);

    for (i=0; i < WIDTH * HEIGHT; i++) {
        image[i] = 4095;
    }

    uint16_t *binnedImage = (uint16_t*) calloc(NX * NY, sizeof(uint16_t));

    while(1) {
        t0 = chrono::high_resolution_clock::now();

        binning(image, binnedImage);

        t1 = chrono::high_resolution_clock::now();

        dt = chrono::duration_cast<chrono::duration<double>>(t1 - t0);

        totalTime += dt.count();

        count += 1;

        if (count % 3000 == 0) {
            cout<<"Time (ms): "<< totalTime * 1000 / count<<"\r";
            cout.flush();
        }
    }

    return 0;
}


int binning(uint16_t* image, uint16_t* binnedImage){
    int i, j, k, l;

    for (j=0; j < NY; j++) {
        for (i=0; i < NX; i++) {
            binnedImage[j * NX + i] = 0;
            for (k=i * BINNINGFACTORWIDTH; k < i * BINNINGFACTORWIDTH + BINNINGFACTORWIDTH; k++) {
                for (l=j * BINNINGFACTORHEIGHT; l < j * BINNINGFACTORHEIGHT + BINNINGFACTORHEIGHT; l++) {
                    binnedImage[j * NX + i] += (image[l * WIDTH + k] / DIVISIONFACTOR);
                }
            }
        }
    }

    return 0;
}

There are 2 answers

Answer by fabian (accepted):

If you don't need to keep the original image content, one pass doing horizontal binning followed by a second pass doing vertical binning can provide a performance boost of > 50%. (Even if the source cannot hold sufficiently large values, you could use a separately allocated array to store intermediate results; you can likely determine the maximum array size before starting image conversions, so you don't need to allocate memory during the run of the function itself.)

Source

 1  2  3 | 10 11 12 | ...
 4  5  6 | 13 14 15 | ...
 7  8  9 | 16 17 18 | ...
--------------------------- ...
...

Step 1

 6 | 33 | ...
15 | 42 | ...
24 | 51 | ...
--------------------------- ...
...

Step 2

45 | 126 | ...
--------------------------- ...
...
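The two passes above can be sketched as follows (a minimal illustration with a scratch buffer, not the benchmarked implementation below; the function name and parameters are illustrative):

```cpp
#include <cstdint>
#include <vector>

// Two-pass binning matching the diagrams above: pass 1 sums horizontally
// into a wide scratch buffer, pass 2 sums vertically and divides once.
std::vector<uint16_t> binTwoPass(const std::vector<uint16_t>& image,
                                 int width, int height, int bin) {
    const int nx = width / bin, ny = height / bin;
    std::vector<uint32_t> horiz(nx * height);  // wide scratch: no overflow
    for (int row = 0; row < height; row++)
        for (int i = 0; i < nx; i++) {
            uint32_t s = 0;
            for (int k = 0; k < bin; k++) s += image[row * width + i * bin + k];
            horiz[row * nx + i] = s;
        }
    std::vector<uint16_t> out(nx * ny);
    for (int j = 0; j < ny; j++)
        for (int i = 0; i < nx; i++) {
            uint32_t s = 0;
            for (int l = 0; l < bin; l++) s += horiz[(j * bin + l) * nx + i];
            out[j * nx + i] = static_cast<uint16_t>(s / (bin * bin));
        }
    return out;
}
```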

Additional recommendations

  • Use constexpr variables instead of preprocessor defines.
  • Add static_asserts to make sure overflow can't happen. Also add static_asserts for things that could go wrong when modifying the code without much thought, such as choosing a WIDTH that isn't divisible by NX.
  • Use a benchmarking framework instead of doing the timing manually. Otherwise aggressive optimization could get rid of the binning call altogether, since the result is never read. Doing the timing yourself is also hard, since the compiler is allowed to reorder your logic as long as the observable outcome is the same.
  • Use the fast versions of the integral types for intermediate results.

Here is an implementation using Google Benchmark.

#include <benchmark/benchmark.h>

#include <cstdlib>
#include <cstdint>
#include <iostream>
#include <limits>
#include <type_traits>
#include <memory>

namespace
{
struct Deallocator
{
    void operator()(void* mem) const
    {
        std::free(mem);
    }
};

constexpr unsigned WIDTH = 768;
constexpr unsigned HEIGHT = 768;

constexpr unsigned NX = 256;
constexpr unsigned NY = 256;

static_assert(WIDTH % NX == 0);
static_assert(HEIGHT % NY == 0);

constexpr auto BINNINGFACTORWIDTH = WIDTH / NX;
constexpr auto BINNINGFACTORHEIGHT = HEIGHT / NY;

constexpr auto DIVISIONFACTOR = BINNINGFACTORWIDTH * BINNINGFACTORHEIGHT;

class ImageAllocationFixture : public benchmark::Fixture
{
protected:
    std::unique_ptr<uint16_t[], Deallocator> image;
    std::unique_ptr<uint16_t[], Deallocator> binnedImage;
public:
    void SetUp(const ::benchmark::State& state)
    {
        image.reset(static_cast<uint16_t*>(std::malloc(sizeof(uint16_t) * WIDTH * HEIGHT)));

        for (unsigned i = 0; i < WIDTH * HEIGHT; ++i) {
            image[i] = 4095;
        }

        binnedImage.reset(static_cast<uint16_t*>(calloc(sizeof(uint16_t), NX * NY)));
    }

    void TearDown(const ::benchmark::State& state)
    {
        image.reset();
        binnedImage.reset();
    }
};

}

namespace with_overflow
{

void binning(uint16_t* image, uint16_t* binnedImage) {
    int i, j, k, l;

    for (j = 0; j < NY; j++) {
        for (i = 0; i < NX; i++) {
            binnedImage[j * NX + i] = 0;
            for (k = i * BINNINGFACTORWIDTH; k < i * BINNINGFACTORWIDTH + BINNINGFACTORWIDTH; k++) {
                for (l = j * BINNINGFACTORHEIGHT; l < j * BINNINGFACTORHEIGHT + BINNINGFACTORHEIGHT; l++) {
                    binnedImage[j * NX + i] += image[l * WIDTH + k];
                }
            }
            binnedImage[j * NX + i] /= DIVISIONFACTOR;
        }
    }

}

}

namespace without_overflow
{

void binning(uint16_t* image, uint16_t* binnedImage) {
    int i, j, k, l;

    for (j = 0; j < NY; j++) {
        for (i = 0; i < NX; i++) {
            binnedImage[j * NX + i] = 0;
            for (k = i * BINNINGFACTORWIDTH; k < i * BINNINGFACTORWIDTH + BINNINGFACTORWIDTH; k++) {
                for (l = j * BINNINGFACTORHEIGHT; l < j * BINNINGFACTORHEIGHT + BINNINGFACTORHEIGHT; l++) {
                    binnedImage[j * NX + i] += (image[l * WIDTH + k] / DIVISIONFACTOR);
                }
            }
        }
    }
}

}

namespace bin_separately
{


void binning(uint16_t* const image, uint16_t* const binnedImage)
{
    // just some stuff for static_assert
    using PixelValueType = std::remove_cvref_t<decltype(*image)>;

    constexpr PixelValueType AllOnes = ~static_cast<PixelValueType>(0);
    constexpr unsigned BitCount = 12;
    constexpr uint64_t PixelInValueMax = static_cast<PixelValueType>(~(AllOnes << BitCount)); // use 64 bit to prevent overflow issues
    constexpr uint64_t PixelTypeMax = (std::numeric_limits<PixelValueType>::max)();
    // end static_assert stuff


    {
        // compress horizontally
        static_assert(PixelInValueMax * BINNINGFACTORWIDTH <= PixelTypeMax,
            "cannot compress horizontally without risking overflow");

        auto out = image;
        for (auto inPos = image, end = image + WIDTH * HEIGHT; inPos != end;)
        {
            uint_fast16_t sum = 0;
            for (unsigned i = 0; i != BINNINGFACTORWIDTH; ++i)
            {
                sum += *(inPos++);
            }
            *(out++) = sum;
        }
    }

    {
        // compress vertically, divide and write to out

        //read pointers
        uint16_t* inPoss[BINNINGFACTORHEIGHT];
        for (unsigned i = 0; i != BINNINGFACTORHEIGHT; ++i)
        {
            inPoss[i] = image + (NX * i);
        }

        for (auto out = binnedImage, end = binnedImage + NX * NY; out != end;) // for all output rows
        {
            for (auto const rowEnd = out + NX; out != rowEnd;)
            {
                uint_fast16_t sum = 0;

                static_assert(PixelInValueMax * BINNINGFACTORWIDTH * BINNINGFACTORHEIGHT <= (std::numeric_limits<decltype(sum)>::max)(),
                    "type of sum needs replacement, since it cannot hold the result of adding up all source pixels for one target pixel");

                for (unsigned i = 0; i != BINNINGFACTORHEIGHT; ++i)
                {
                    sum += *(inPoss[i]++);
                }
                *(out++) = sum / DIVISIONFACTOR;
            }

            // we advanced each pointer by one row -> advance by (BINNINGFACTORHEIGHT - 1) more
            for (unsigned i = 0; i != BINNINGFACTORHEIGHT; ++i)
            {
                inPoss[i] += NX * (BINNINGFACTORHEIGHT - 1);
            }
        }
    }

}

}

BENCHMARK_F(ImageAllocationFixture, WithOverflow)(benchmark::State& st)
{
    for (auto _ : st)
    {
        with_overflow::binning(image.get(), binnedImage.get());
        auto outData = binnedImage.get();
        benchmark::DoNotOptimize(outData);
        benchmark::ClobberMemory();
    }
}

BENCHMARK_F(ImageAllocationFixture, WithoutOverflow)(benchmark::State& st)
{
    for (auto _ : st)
    {
        without_overflow::binning(image.get(), binnedImage.get());
        auto outData = binnedImage.get();
        benchmark::DoNotOptimize(outData);
        benchmark::ClobberMemory();
    }
}

BENCHMARK_F(ImageAllocationFixture, BinSeparately)(benchmark::State& st)
{
    for (auto _ : st)
    {
        bin_separately::binning(image.get(), binnedImage.get());
        auto outData = binnedImage.get();
        benchmark::DoNotOptimize(outData);
        benchmark::ClobberMemory();
    }
}

BENCHMARK_MAIN();

Output for my compiler/machine (MSVC 19.34.31937 x86_64, /O2)

Run on (12 X 3593 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 512 KiB (x6)
  L3 Unified 16384 KiB (x2)
---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
ImageAllocationFixture/WithOverflow        996287 ns       941685 ns          896
ImageAllocationFixture/WithoutOverflow    1171365 ns      1098633 ns          640
ImageAllocationFixture/BinSeparately       350364 ns       353021 ns         2036
Answer by Marcus Müller:

This binning is just a crude downscaling operator. You can do better, image-wise: a plain box average will alias like hell. Essentially, unless you have done the math to prove that an N×N average is appropriate for the spectrum of your image, you're probably not doing the right scaling.

Anyways.

  • Your PC can do short multiplications just as fast as int32_t multiplications, and you could get rid of your numerical-overflow potential by doing the summing into an int, then saving the result of the scaling into the target image. This would also have the advantage of giving the compiler a local variable to optimize, which potentially increases its ability to emit efficient code.
  • I'm not willing to bet this too strongly, but % 3000 is an actual integer modulo, and is much more work than multiplying a few integers. So, you're skewing your benchmark by doing that. Maybe just test the lowest 12 bits, if (!(count & ((1<<12)-1))) {…}, i.e., report every 4096 iterations instead of every 3000.
  • It's a bit wasteful to get the current time after every binning, add it to the elapsed time, and then do the next binning. Getting the current time before the thing you want to benchmark, getting it after, subtracting the two and summing up the results takes far more time than a loop over 3000 invocations of binning with one time reading before and one after. At this point, this might contribute very significantly to your time measurement; I wouldn't trust your measurement as it stands.
    • I'd strongly suggest you just use Google Benchmark or some other micro-benchmarking tooling instead.
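As a sketch of that batching point (an illustrative helper, not from the original post; it accepts any callable to be measured), time the whole batch with a single pair of clock reads and divide:

```cpp
#include <chrono>

// Time `iterations` invocations of a callable with one pair of clock reads,
// returning the mean time per call in milliseconds. This keeps the clock
// overhead out of the measured region, unlike reading the clock every call.
template <class F>
double timeBatchMs(F&& f, int iterations = 3000) {
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; i++) {
        f();
    }
    auto t1 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double, std::milli> dt = t1 - t0;
    return dt.count() / iterations;
}
```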

Stylistically and code-quality-wise, there is a lot to improve:

  • Your binning functions return an int without actually returning anything useful. Make them void, not int, and remove the return at the end of the function.
  • Let your code formatter run on your code. The arbitrarily distributed spacing around operators, especially comparison operators, makes it easy to hide bugs from a human reader.
  • You don't need WIDTH, HEIGHT, NX or NY to be constants. There's little difference between looping up to a compile-time constant and up to a value that gets calculated, so don't use constants for these; just pass them as arguments. In particular, NX can be computed from WIDTH knowing BINNINGFACTORWIDTH, or vice versa. Your compiler can do constant folding. Keeping constants around that can contradict each other is just a recipe for bugs. DIVISIONFACTOR really doesn't need to exist as a constant at all.

Whoever taught you C++ didn't write C++. This is all how you'd write a C89 program, not the post-C++11 program you're otherwise writing.

  • Instead of #define CONSTANT 42, use constexpr int CONSTANT = 42; to actually get foot-gun-free constants. Expert programmers who decide these algorithms need to run with a variety of scaling factors will probably opt for template parameters instead, so that they don't need different code files with different constant definitions for a binning that scales by 3 and one that scales by 2.
  • Don't use malloc/calloc and pointer casts; your code, leaking that memory, is a good illustration of why you shouldn't. A simple std::vector<uint16_t> image(WIDTH*HEIGHT); is easier to read, does the same memory allocation, but also deallocates the memory on destruction. You'd then have a function void binning(const std::vector<uint16_t>& image, std::vector<uint16_t>& binnedImage) instead of int binning(uint16_t* image, uint16_t* binnedImage).
  • int i, j, k, l; — just... don't. Make loop variables local to their loops. C99 and every version of C++ support that: for (int j = 0; j < NY; j++) {…} is better, since it shows where j lives, and in more complex code it also allows the compiler to reason more easily about optimization of local variables (since it doesn't need to keep them around after the loop).
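Pulling those bullets together, a sketch of what the modernized function might look like (names and the parameter list are illustrative, not from the original post):

```cpp
#include <cstdint>
#include <vector>

// Dimensions passed as arguments instead of macros, vectors instead of
// malloc, loop variables local to their loops, void return, and a wide
// local accumulator so no overflow can occur.
void binning(const std::vector<uint16_t>& image,
             std::vector<uint16_t>& binnedImage,
             int width, int height, int binX, int binY) {
    const int nx = width / binX;
    const int ny = height / binY;
    const int divisor = binX * binY;
    binnedImage.assign(nx * ny, 0);
    for (int j = 0; j < ny; j++) {
        for (int i = 0; i < nx; i++) {
            uint32_t sum = 0;  // local accumulator the compiler can keep in a register
            for (int l = j * binY; l < (j + 1) * binY; l++) {
                for (int k = i * binX; k < (i + 1) * binX; k++) {
                    sum += image[l * width + k];
                }
            }
            binnedImage[j * nx + i] = static_cast<uint16_t>(sum / divisor);
        }
    }
}
```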