Optimizing alpha blending for two colors with alpha

122 views Asked by At

I'm looking for a way to optimize alpha blending, but for two colors with alpha (what differs from the question How to alpha blend RGBA unsigned byte color fast? )

Initially I used a solution with floats (RGB ranging from 0.0f to 255.0f and A ranging from 0.0f to 1.0f):

inline void alphaBlend(Color& baseColor, Color targetColor)
{
    float newAlpha = (1 - targetColor.A) * baseColor.A + targetColor.A;
    baseColor.R = ((1 - targetColor.A) * baseColor.A * baseColor.R + targetColor.A * targetColor.R) / newAlpha;
    baseColor.G = ((1 - targetColor.A) * baseColor.A * baseColor.G + targetColor.A * targetColor.G) / newAlpha;
    baseColor.B = ((1 - targetColor.A) * baseColor.A * baseColor.B + targetColor.A * targetColor.B) / newAlpha;
}

I changed the algorithm to work on unsigned int RGBA colors. I replaced every reference to alpha with (alpha / 255) and then corrected the formulas, so that values are still within the proper ranges.

baseColor.R = ((1 - targetColor.A) * baseColor.A * baseColor.R + targetColor.A * targetColor.R) / newAlpha;

Shorthand (targetColor.A -> tA etc.):

R = ((1 - tA) * bA * bR + tA * tR) / newAlpha

(introducing 255-based alpha requires replacing all A instances with A/255)

  = ((1 - (tA / 255)) * (bA / 255) * bR + (tA / 255) * tR) / (newAlpha / 255)

(remove 255 from the denominator's denominator)

  = (((1 - (tA / 255)) * (bA / 255) * bR + (tA / 255) * tR) * 255) / newAlpha

(get rid of direct alpha divisions by 255 by multiplying parethesis by 255/255)

  = (( ((255 - tA) * bA * bR) / 255^2 + (tA * tR) / 255) * 255) / newAlpha

(multiplying by the last 255 causes denominators to reduce)

  = ( ((255 - tA) * bA * bR) / 255 + (tA * tR * 255) / 255 ) / newAlpha
  
(Pushing numerator's denominator (255) to the denominator)

  = ( (255 - tA) * bA * bR) + (tA * tR * 255) ) / (255 * newAlpha)

(Expanding first multiplication in numerator)

  = ( 255 * bA * bR - tA * bA * bR + tA * tR * 255) / (255 * newAlpha)
                      ^^^^^^^^^^^^   ^^^^^^^^^^^^^
(reordering not to fall below 0 during calculations)

  = ( 255 * bA * bR + tA * tR * 255 - tA * bA * bR ) / (255 * newAlpha)

(grouping to minimize multiplications)

 = ( (ba * bR + tA * tR) * 255 - tA * bA * bR ) / (255 * newAlpha)

(introducing bit shifting - losing precision, but in an acceptable range)

 ~= ( ((ba * bR + tA * tR) << 8) - tA * bA * bR) / (newAlpha << 8)

I managed to write the following code:

inline void alphaBlend(IntColor& baseColor, IntColor targetColor)
{
    unsigned int a = (((baseColor.A + targetColor.A) << 8) - targetColor.A * baseColor.A) >> 8;

    if (a > 0)
    {
        unsigned int divisor = a << 8;

        unsigned int baseAR = baseColor.A * baseColor.R;
        baseColor.R = (((targetColor.A * targetColor.R + baseAR) << 8) - (baseAR * targetColor.A)) / divisor;

        unsigned int baseAG = baseColor.A * baseColor.G;
        baseColor.G = (((targetColor.A * targetColor.G + baseAG) << 8) - (baseAG * targetColor.A)) / divisor;

        unsigned int baseAB = baseColor.A * baseColor.B;
        baseColor.B = (((targetColor.A * targetColor.B + baseAB) << 8) - (baseAB * targetColor.A)) / divisor;

        baseColor.A = a;
    }
    else
    {
        baseColor.R = 0;
        baseColor.G = 0;
        baseColor.B = 0;
        baseColor.A = 0;
    }
}

This change reduced the rendering of sample data from 27559 ms to 17751 ms. Since the alpha blending seems to be the most common operation in the rendering workflow I'm curious if there is a way to optimize it even further.

I thought about doing calculations on R and B at the same time, but unfortunately in some circumstances the calculations will exceed two bytes (for instance if bA = bR = tA = tR = 255, the left part of the subtraction will equal to 33162750 = 0x1faa05fe).

Is there any other optimization I could apply to make this code faster?


Edit: responding to comments:

  • Target architecture is x64, target processor may be Intel Core family
  • Input type is guaranteed to be 32-bit RGBA
  • Memory layout is BGRA (8888)
  • Regarding SIMD, my application is a vector animation renderer. Every object is rendered on a separate bitmap and then alpha-blended into result one, because every single object may have applied alpha/mask/transformations/effects or may consist of multiple sub-objects, from which every single one also may have those applied.
  • Compiler is the one from Microsoft Visual Studio 2022. Application is Windows-only.
1

There are 1 answers

0
Spook On

I'm leaving this answer for people also looking for integer-calculation-based alpha-blending of two colors with alpha (that is, allowing the "background" or "base" color to be semi-transparent as well). It is not very fast, but certainly faster than its floating-point equivalent.

The code presented in my question is unfortunately flawed and sometimes gives results of 256, which causes in some cases ugly black pixels ((unsigned char)256 == 0).

The code below presents the solution and also serves as a correctness check. It validates that:

  • Neither result alpha nor result color differs from the floating point solution by more than a unit (1). The difference comes from lack of precision during integral division
  • Neither result alpha nor result color exceeds boundaries of [0..255]

Informatively, the floating-point alpha, which normally would be left in range [0.0f..1.0f] is normalized to [0.0f..255.0f], so that I can compare it with its int counterpart.

The validation code/solution follows.

#include <iostream>

int main()
{
    uint64_t diffs = 0;

    for (unsigned int baseAlpha = 0; baseAlpha < 256; baseAlpha++) 
    {
        printf("Processing a1 = %d\n", baseAlpha);

        for (unsigned int baseColor = 0; baseColor < 256; baseColor++)
            for (unsigned int targetAlpha = 0; targetAlpha < 256; targetAlpha++)
                for (unsigned int targetColor = 0; targetColor < 256; targetColor++)
                {
                    // Evaluate float result (FLOAT ALPHA BLENDING)
                    // R, G, B in [0.0f,255.0f]; 
                    // A in [0.0f, 1.0f]

                    float floatBaseAlpha = baseAlpha / 255.0f;
                    float floatTargetAlpha = targetAlpha / 255.0f;

                    float floatResultAlpha = (1 - floatTargetAlpha) * floatBaseAlpha + floatTargetAlpha;
                    float floatResultColor = 0.0f;
                    if (floatResultAlpha >= 1.0f / 255.0f)
                    {
                        floatResultColor = ((1 - floatTargetAlpha) * floatBaseAlpha * baseColor + floatTargetAlpha * targetColor) / floatResultAlpha;
                    }
                    else
                    {
                        floatResultColor = 0.0f;
                    }

                    floatResultAlpha *= 255.0f;

                    // Evaluate int result (INT ALPHA BLENDING)
                    // R, G, B, A in [0, 255]

                    int intResultAlpha = (((baseAlpha + targetAlpha) * 255) - targetAlpha * baseAlpha);
                    int intResultColor;
                    if (intResultAlpha > 0)
                    {
                        unsigned int divisor = intResultAlpha;

                        unsigned int baseAR = baseAlpha * baseColor;
                        intResultColor = (((targetAlpha * targetColor + baseAR) * 255) - (baseAR * targetAlpha)) / divisor;
                    }
                    else
                    {
                        intResultColor = 0;
                    }

                    intResultAlpha = intResultAlpha / 255;

                    // Compare

                    int alphaFromFloat = (int)floatResultAlpha;
                    int colorFromFloat = (int)floatResultColor;

                    int alphaFromInt = (int)intResultAlpha;
                    int colorFromInt = (int)intResultColor;

                    int aDiff = std::abs(alphaFromFloat - alphaFromInt);
                    int cDiff = std::abs(colorFromFloat - colorFromInt);

                    if (colorFromInt > 255 || colorFromInt < 0)
                    {
                        printf("Int color outside range!");
                    }

                    if (aDiff > 1 || cDiff > 1)
                    {
                        printf("Critical difference: bA: %u, bC: %u, tA: %u, tC: %u\n", baseAlpha, baseColor, targetAlpha, targetColor);
                        printf("Float result: A: %d, C: %d\n", alphaFromFloat, colorFromFloat);
                        printf("Int result: A: %d, C: %d\n", alphaFromInt, colorFromInt);
                        printf("Alpha difference: %d\n", aDiff);
                        printf("Color difference: %d\n", cDiff);
                    }

                    if (aDiff > 0 || cDiff > 0)
                        diffs++;
                }
    }

    printf("Total differences: %lld (%lld%%)\n", diffs, (100Ui64 * diffs / (1Ui64 << 32)));

    getchar();
}

The result is surprisingly good - application tests all possible combinations of colors and alphas and the numbfer of those that differ from the floating-point calculations (which I treat as valid) is below 1% (total number of combinations is 4 294 967 296):

Total differences: 4959508 (0%)

The popular optimization for alpha blending is achieved by replacing (* 255, / 255) operations with, respectively (<< 8, >> 8), which equals to (* 256, / 256). Since we need to multiply and divide by 255 and not 256, the price for optimization is decrease in precision. The bad news is that number of incorrect results increases dramatically, but the good news is that the error still does not exceed unit value of alpha and color:

int intResultAlpha = (((baseAlpha + targetAlpha) << 8) - targetAlpha * baseAlpha);
int intResultColor;
if (intResultAlpha > 0)
{
    unsigned int divisor = intResultAlpha;

    unsigned int baseAR = baseAlpha * baseColor;
    intResultColor = (((targetAlpha * targetColor + baseAR) << 8) - (baseAR * targetAlpha)) / divisor;
}
else
{
    intResultColor = 0;
}

intResultAlpha = intResultAlpha >> 8;
Total differences: 1218912093 (28%)

So now:

  • If you care about precise results, pick the slow, but accurate floating point solution (remember to keep alpha value in [0.0f..1.0f] range, NOT [0.0f..255.0f]!
  • If you want faster, but (negligibly) less accurate solution, pick the integer solution with divisions and multiplications by 255.
  • If you want even faster, but noticably less accurate solution (though still not by naked eye), pick the bit-shift solution.
  • If you want even faster solution, pick one of those and optimize them further (ideas include attempts to merge two channels in one 64-bit int and attempt to process two channels at once - improvement by noticeable 33% or implementing the code in assembler, using various extensions such as MMX).

If you decide to pick the last solution, don't forget to post another answer here - I'm sure everyone will benefit from having a fast alpha-blending algorithm.