Optimizing Mandelbrot Set Calculation in C++ on a High-Performance CPU

I've been experimenting with calculating the Mandelbrot set in C++ on my i9 CPU, which has 10 cores and 20 threads. Despite expecting fast computations given my CPU's capabilities, the actual performance has been surprisingly slow, and I'm trying to understand why.

Here's the core function used to calculate the number of iterations for each point in the set:

double calculateMandelbrotIterations(std::complex<double> c, int iterations) {
    std::complex<double> z = 0;
    double n = 0;
    while (abs(z) <= 2.0 && n < iterations) {
        z = z * z + c;
        n++;
    } 
    return n;
}

According to its technical specifications, my CPU is capable of 800 GFLOPS, or 80 GFLOPS per core. For rendering a 1000x1000 pixel image with a maximum iteration depth of 1000, the inner loop body executes up to about 10^9 times (one million pixels, up to 1000 iterations each). Given an estimated 20-30 floating-point operations per iteration, the expected computation time should be around:

width * height * iterations * operations / gflops 
1000 * 1000 * 1000 * 20 / 80e9 = 0.25 seconds

However, the actual computation takes around 6.5 seconds on average, which is 26 times slower than expected.

Inspecting the generated assembly (on godbolt), the loop is not being vectorized: the compiler emits only scalar SSE instructions. That cuts the theoretical throughput from 8 double operations per cycle per core down to just 1, suggesting a ceiling of about 10 GFLOPS for one core, yet the measured time is still a factor of ~3 slower than even that revised estimate.

I suspect that cache performance isn't the bottleneck since the data for a 1000x1000 image should fit within the L3 cache, and all computations are successively performed on contiguous memory locations.

It's worth noting that I'm focusing on optimizing single-threaded performance and not looking to use multiple threads or GPU acceleration at this stage. My goal is to understand and improve the performance of the current setup.

  1. Why is there such a significant discrepancy between the theoretical and actual performance of my Mandelbrot set calculations?
  2. How can I optimize my C++ code to better utilize the CPU's capabilities, specifically regarding SIMD instructions and maximizing FLOPS utilization?
  3. Are there any compiler flags or code modifications that I could use to enhance SIMD utilization for this calculation?

Core Code

#pragma once

#include <complex>
#include <cmath>

#define SMOOTH

struct Context {
    unsigned int width = 1000;
    unsigned int height = 1000;
    unsigned int iterations = 256;
    double zoom = 1.0;
    std::complex<double> center = {0, 0};
};

template<typename Func>
void forEachPixel(int start, int end, int width, Func func) {
    for (int y = start; y < end; ++y)
        for (int x = 0; x < width; ++x)
            func(x, y);
}

double calculateMandelbrotIterations(std::complex<double> c, const Context &ctx) {
    std::complex<double> z = 0;
    double n = 0;
    while (abs(z) <= 2.0 && n < ctx.iterations) {
        z = z * z + c;
        n++;
    }

#ifdef SMOOTH
    if (n < ctx.iterations) {
        double log_zn = log(std::norm(z)) / 2;  // log|z| = log(|z|^2) / 2; avoids complex-valued log(pow(z, 2))
        double nu = log2(log_zn / log(2));
        n = n + 1 - nu;
    }
#endif

    return n;
}

void fillMandelbrotRegion(int start, int end, double *output, const Context &ctx) {
    const double aspectRatio = static_cast<double>(ctx.width) / ctx.height;
    const double scaleHeight = 4.0 / (ctx.zoom * aspectRatio);
    const double scaleWidth = 4.0 / ctx.zoom;

    forEachPixel(start, end, ctx.width, [&](int x, int y) {
        std::complex<double> c(
            ctx.center.real() + (x - ctx.width / 2.0) * scaleWidth / ctx.width,
            ctx.center.imag() + (y - ctx.height / 2.0) * scaleHeight / ctx.height
        );
        output[y * ctx.width + x] = calculateMandelbrotIterations(c, ctx);
    });
}

 

1 Answer

harold:

The classic Mandelbrot calculation is not easy for a compiler to vectorize (though it's not that hard for humans). To do it, a compiler would have to inline calculateMandelbrotIterations into its caller, notice that it is both legal and desirable to lengthen the Mandelbrot iteration loop (as long as it remembers the n at which each point "should have ended", and the last z if smoothing is enabled), and then vectorize by running multiple copies of that loop simultaneously, rather than vectorizing the body of a single loop. There may be alternative routes for a compiler to get there, but in any case it's hard.
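To make the "multiple copies of the loop in lockstep" idea concrete, here is a hedged sketch using SSE2 intrinsics (baseline on x86-64, so it compiles without extra flags; the same pattern widens to 4 lanes with AVX2 or 8 with AVX-512). The function name and layout are mine, not from the question; each lane carries one pixel, and a compare mask stops counting in lanes that have escaped:

```cpp
#include <immintrin.h>

// Iterate two points (cr[i], ci[i]) in lockstep, one per SSE2 lane.
// Writes the per-lane iteration counts to out[0..1].
void mandelbrot2(const double* cr, const double* ci, int maxIter, double* out) {
    __m128d zr  = _mm_setzero_pd();
    __m128d zi  = _mm_setzero_pd();
    __m128d crv = _mm_loadu_pd(cr);
    __m128d civ = _mm_loadu_pd(ci);
    __m128d cnt = _mm_setzero_pd();
    const __m128d one  = _mm_set1_pd(1.0);
    const __m128d four = _mm_set1_pd(4.0);
    for (int i = 0; i < maxIter; ++i) {
        __m128d zr2 = _mm_mul_pd(zr, zr);
        __m128d zi2 = _mm_mul_pd(zi, zi);
        // A lane stays active while |z|^2 <= 4; escaped lanes keep updating z,
        // but their mask stays false (even once z overflows to inf/NaN).
        __m128d mask = _mm_cmple_pd(_mm_add_pd(zr2, zi2), four);
        if (_mm_movemask_pd(mask) == 0) break;          // both lanes escaped
        cnt = _mm_add_pd(cnt, _mm_and_pd(mask, one));   // count only active lanes
        __m128d zrzi = _mm_mul_pd(zr, zi);
        zr = _mm_add_pd(_mm_sub_pd(zr2, zi2), crv);     // re(z^2) + c.re
        zi = _mm_add_pd(_mm_add_pd(zrzi, zrzi), civ);   // im(z^2) + c.im
    }
    _mm_storeu_pd(out, cnt);
}
```

The vector loop runs until every lane has escaped (or maxIter is reached), which is exactly the "lengthening" described above: a lane that escapes early simply stops accumulating while its neighbors finish.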

Since the calculation involves a high-latency loop-carried dependency (each z depends on the previous one through a multiply and an add), you would also need extra independent calculations on top of vectorization in order to utilize the processor well.
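A minimal scalar illustration of that point (helper names are mine, not from the question): computing two points in one loop gives the out-of-order core two independent dependency chains to overlap, whereas one chain alone leaves the FP units idle for most of each multiply-add's latency. A real implementation would combine this with vectorization, keeping several vectors in flight:

```cpp
#include <complex>

// Baseline: one point at a time, one dependency chain.
int iterate(std::complex<double> c, int maxIter) {
    std::complex<double> z = 0;
    int n = 0;
    while (std::norm(z) <= 4.0 && n < maxIter) { z = z * z + c; ++n; }
    return n;
}

// Two interleaved, independent chains; each advances only while its
// point is still inside the escape radius.
void iterate2(std::complex<double> c0, std::complex<double> c1,
              int maxIter, int* n0, int* n1) {
    std::complex<double> z0 = 0, z1 = 0;
    *n0 = *n1 = 0;
    for (int i = 0; i < maxIter; ++i) {
        bool a0 = std::norm(z0) <= 4.0;
        bool a1 = std::norm(z1) <= 4.0;
        if (!a0 && !a1) break;
        if (a0) { z0 = z0 * z0 + c0; ++*n0; }  // chain 0
        if (a1) { z1 = z1 * z1 + c1; ++*n1; }  // chain 1
    }
}
```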

Additionally, the code does a couple of unnecessary/inefficient things:

  • cabs is called rather than inlined; -ffast-math makes that call go away.
  • The code is being unnecessarily careful with NaN, which also goes away with -ffast-math (this is a common problem with std::complex).

-ffast-math is a risky option in general but for a Mandelbrot renderer it's probably fine.
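If -ffast-math is unpalatable, the cabs call alone can also be removed by hand: |z| <= 2 is equivalent to |z|^2 <= 4, and std::norm computes re²+im² with no sqrt and no libm call. A minimal sketch (the NaN-careful complex multiply may still remain unless something like GCC's -fcx-limited-range is in effect):

```cpp
#include <complex>

// The question's loop with the escape test rewritten via std::norm,
// so the condition no longer calls cabs.
double iterationsNoCabs(std::complex<double> c, int maxIter) {
    std::complex<double> z = 0;
    double n = 0;
    while (std::norm(z) <= 4.0 && n < maxIter) {  // |z|^2 <= 4  <=>  |z| <= 2
        z = z * z + c;
        ++n;
    }
    return n;
}
```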