Optimization of branch prediction inside hot loop in C++


I have performance-critical code that calculates an inter-atomic force field. It is controlled by variables such as bPBC, shifts, doBonds, doPiSigma, and doPiPiI, which the user can switch on and off but which change very rarely.

I'm wondering how well the compiler and the processor can do branch prediction in this situation, and whether there is something I can do to make it easier for them.

// --------- Bonds Step
for (int i = 0; i < 4; i++) {  // loop over bonds of this atom
    int ing = ings[i];
    if (ing < 0) break;
    Vec3d pi = apos[ing];
    Quat4d h;
    h.f.set_sub(pi, pa);
    if (bPBC) {
        [[likely]]
        if (shifts) {
            [[likely]]
            int ipbc = ingC[i];
            h.f.add(shifts[ipbc]);
        } else {
            Vec3i g = invLvec.nearestCell(h.f);
            Vec3d sh = lvec.a * g.x + lvec.b * g.y + lvec.c * g.z;
            h.f.add(sh);
        }
    }
    double l = h.f.normalize();
    h.e = 1 / l;
    hs[i] = h;

    if (ia < ing) {
        [[likely]]
        if (doBonds) {
            [[likely]] E += evalBond(h.f, l - bL[i], bK[i], f1);
            fbs[i].sub(f1);
            fa.add(f1);
        }
        double kpp = Kppi[i];
        if ((doPiPiI) && (ing < nnode) && (kpp > 1e-6)) {
            E += evalPiAling(hpi, pipos[ing], 1., 1., kpp, f1, f2);
            fpi.add(f1);
            fps[i].add(f2);
        }
    }

    double ksp = Kspi[i];
    if (doPiSigma && (ksp > 1e-6)) {
        E += evalAngleCos(hpi, h.f, 1., h.e, ksp, piC0, f1, f2);
        fpi.add(f1);
        fa.sub(f2);
        fbs[i].add(f2);
    }
}

Additional questions:

  • The variables bPBC, shifts, doBonds, doPiSigma, doPiPiI are class properties, so they are allocated on the heap. Would it help the speed of the if evaluation and the branch prediction if I copied them into const local variables before the loop? For example:
const bool bPBC_ = bPBC;
for(int i=0; i<4; i++){ // loop over bonds
    if(bPBC_){ ... }
    ...
}  
  • I compile with g++ and get warnings that the [[likely]] attributes are ignored. Should I be concerned? Or is it just that the compiler considers this branch likely by default, so the attribute is redundant?
/home/prokophapala/git/FireCore/cpp/common/molecular/MMFFsp3_loc.h:320:21: warning: ‘likely’ attribute ignored [-Wattributes]
  320 |                 int ipbc = ingC[i];

There is 1 answer

Homer512

As suggested in comments, one way to optimize this is to create specialized implementations for each flag combination. Doing this manually is tedious but we don't need to.

std::visit is very useful when implementing this. The idea is very simple: We turn each flag value into a type (handily available as std::bool_constant since C++17) and move them into variants. Then let std::visit dispatch all variations of parameter combinations to a generic lambda.

The bool_constants that are passed to the lambda have the same effect as a constexpr bool, so at this point the compiler's dead-code elimination can do the rest. This would even work with a plain if instead of if constexpr, should you ever need to backport it to C++14 or 11 (where std::variant and std::visit, both C++17, would also need a replacement).

You did not post a complete enough code sample, so I'll just demonstrate this on some generic code. Just move your actual code into the impl lambda.

#include <type_traits>
#include <variant>


void implBPC();
void implShifts();
void implBonds();
void implPiSigma();
void implPiPiI();


void hot(bool bPBC, bool shifts, bool doBonds, bool doPiSigma, bool doPiPiI)
{
    using boolVar = std::variant<std::false_type, std::true_type>;
    auto impl = [](auto bPBC, auto shifts, auto doBonds,
                   auto doPiSigma, auto doPiPiI) {
        if constexpr (bPBC)
            implBPC();
        if constexpr (shifts)
            implShifts();
        if constexpr (doBonds)
            implBonds();
        if constexpr (doPiSigma)
            implPiSigma();
        if constexpr (doPiPiI)
            implPiPiI();
    };
    auto toBoolVar = [](bool flag) -> boolVar {
        if(flag)
            return std::true_type{};
        return std::false_type{};
    };
    std::visit(impl, toBoolVar(bPBC), toBoolVar(shifts), toBoolVar(doBonds),
               toBoolVar(doPiSigma), toBoolVar(doPiPiI));
}
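To illustrate the remark that this does not strictly depend on if constexpr, here is a minimal sketch using an ordinary if on a flag type. The function step, its two flags, and the numbers it returns are hypothetical placeholders, not part of the code in the question.

```cpp
#include <type_traits>

// Hypothetical stand-in for the hot function; valid C++11, no if constexpr.
// Each flag arrives as std::true_type or std::false_type, so Flag::value
// is a compile-time constant and the optimizer can drop the dead branch.
template <class PBC, class Bonds>
int step(PBC, Bonds)
{
    int work = 0;
    if (PBC::value)     // ordinary if on a constant condition
        work += 1;      // placeholder for the PBC code path
    if (Bonds::value)
        work += 10;     // placeholder for the bond code path
    return work;
}
```

The trade-off is that with a plain if, both branches still have to compile for every flag combination, whereas if constexpr inside a template discards the untaken branch entirely.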

One thing that you need to look out for: if you call functions inside your code, the compiler will now see many more call sites (one for each implementation). It may decide to no longer inline calls that it inlined before.

If the visit call turns out to have too much overhead, we can instead do it once when we set the flags to decide between various abstract implementations. Then we just need to do a virtual method call when we use it later instead of checking 5 boolean flags.

#include <memory>  // std::unique_ptr, std::make_unique

// abstract interface
struct HotFunctor
{
    virtual ~HotFunctor() = default;
    virtual void operator()() = 0;
};

// template-specialized implementation
template<bool bPBC, bool shifts, bool doBonds, bool doPiSigma, bool doPiPiI>
struct HotImpl final: HotFunctor
{
    virtual void operator()() override
    {
        if constexpr (bPBC)
            implBPC();
        if constexpr (shifts)
            implShifts();
        if constexpr (doBonds)
            implBonds();
        if constexpr (doPiSigma)
            implPiSigma();
        if constexpr (doPiPiI)
            implPiPiI();
    }
};

void hot(bool bPBC, bool shifts, bool doBonds, bool doPiSigma, bool doPiPiI)
{
    // select which implementation to use
    using boolVar = std::variant<std::false_type, std::true_type>;
    auto makeImpl = [](auto bPBC, auto shifts, auto doBonds,
                   auto doPiSigma, auto doPiPiI)
            -> std::unique_ptr<HotFunctor> {
        return std::make_unique<HotImpl<
                bPBC, shifts, doBonds, doPiSigma, doPiPiI>>();
    };
    auto toBoolVar = [](bool flag) -> boolVar {
        if(flag)
            return std::true_type{};
        return std::false_type{};
    };
    std::unique_ptr<HotFunctor> impl = std::visit(
            makeImpl, toBoolVar(bPBC), toBoolVar(shifts), toBoolVar(doBonds),
            toBoolVar(doPiSigma), toBoolVar(doPiPiI));
    // use repeatedly
    for(...)
        (*impl)();
}

A std::function should also work well for this, as would a plain function pointer to a template function if you are looking for the lowest possible call overhead.
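A rough sketch of the function-pointer variant, reduced to three flags to keep it short. The names kernel, makeTable, selectKernel, and KernelFn are made up for illustration, and the kernel only returns a bitmask so that the chosen specialization is observable; the real loop body would go there instead.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for the hot kernel: returns a bitmask of the
// flags it was compiled with instead of doing real work.
template <bool bPBC, bool doBonds, bool doPiSigma>
int kernel()
{
    int mask = 0;
    if constexpr (bPBC)      mask |= 1;
    if constexpr (doBonds)   mask |= 2;
    if constexpr (doPiSigma) mask |= 4;
    return mask;
}

using KernelFn = int (*)();

// Build a table with one instantiation per flag combination; bit k of
// the index I decides the k-th template argument.
template <std::size_t... I>
constexpr std::array<KernelFn, sizeof...(I)>
makeTable(std::index_sequence<I...>)
{
    return {{ &kernel<(I & 1) != 0, (I & 2) != 0, (I & 4) != 0>... }};
}

// Called once when the flags change; the result is an ordinary
// function pointer that can be invoked repeatedly afterwards.
inline KernelFn selectKernel(bool bPBC, bool doBonds, bool doPiSigma)
{
    static constexpr auto table = makeTable(std::make_index_sequence<8>{});
    const std::size_t idx = (bPBC ? 1u : 0u)
                          | (doBonds ? 2u : 0u)
                          | (doPiSigma ? 4u : 0u);
    return table[idx];
}
```

The idea would be to store the result of selectKernel(...) whenever the user changes a flag and then call that pointer in the hot path. With five flags the table would have 32 entries, and the inlining caveat mentioned above still applies.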