I have performance-critical code that calculates an inter-atomic force field. It is controlled by variables like bPBC, shifts, doBonds, doPiSigma, doPiPiI, which the user can switch on and off but which change very rarely.
I'm wondering how well the compiler and processor can do branch prediction in this situation, and whether there is something I can do to make it easier for them.
// --------- Bonds Step
for (int i = 0; i < 4; i++) { // loop over bonds of this atom
    int ing = ings[i];
    if (ing < 0) break;
    Vec3d pi = apos[ing];
    Quat4d h;
    h.f.set_sub(pi, pa);
    if (bPBC) {
        [[likely]]
        if (shifts) {
            [[likely]]
            int ipbc = ingC[i];
            h.f.add(shifts[ipbc]);
        } else {
            Vec3i g = invLvec.nearestCell(h.f);
            Vec3d sh = lvec.a * g.x + lvec.b * g.y + lvec.c * g.z;
            h.f.add(sh);
        }
    }
    double l = h.f.normalize();
    h.e = 1 / l;
    hs[i] = h;
    if (ia < ing) {
        [[likely]]
        if (doBonds) {
            [[likely]] E += evalBond(h.f, l - bL[i], bK[i], f1);
            fbs[i].sub(f1);
            fa.add(f1);
        }
        double kpp = Kppi[i];
        if ((doPiPiI) && (ing < nnode) && (kpp > 1e-6)) {
            E += evalPiAling(hpi, pipos[ing], 1., 1., kpp, f1, f2);
            fpi.add(f1);
            fps[i].add(f2);
        }
    }
    double ksp = Kspi[i];
    if (doPiSigma && (ksp > 1e-6)) {
        E += evalAngleCos(hpi, h.f, 1., h.e, ksp, piC0, f1, f2);
        fpi.add(f1);
        fa.sub(f2);
        fbs[i].add(f2);
    }
}
Additional questions:
- The variables bPBC, shifts, doBonds, doPiSigma, doPiPiI are class properties, so they are allocated on the heap. Would it speed up the evaluation of the if conditions and help branch prediction if I copy them to const local variables before the loop? For example, like this:
const bool bPBC_ = bPBC;
for(int i=0; i<4; i++){ // loop over bonds
    if(bPBC_){ ... }
    ...
}
- I compile it with g++ and I get warnings that the [[likely]] attributes are ignored. Should I be concerned? Or is it just because the compiler considers this branch likely by default, so the attribute is redundant?
/home/prokophapala/git/FireCore/cpp/common/molecular/MMFFsp3_loc.h:320:21: warning: ‘likely’ attribute ignored [-Wattributes]
320 | int ipbc = ingC[i];
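On the warning itself: in C++20, [[likely]] applies to a statement or a label, and the usual placement is directly after the if condition, before the block. In the code above it comes after the opening brace, so at line 320 it ends up in front of the declaration int ipbc = ingC[i];, which would explain why g++ reports it as ignored there. A minimal sketch of the usual placement follows; applyShift is a made-up helper for illustration, not part of the original code, and the else branch is only a placeholder.

```cpp
// Hypothetical helper, only to show where the attribute goes (C++20).
inline double applyShift(double h, bool bPBC, const double* shifts, int ipbc) {
    if (bPBC) [[likely]] {           // attribute sits between the condition and the block
        if (shifts) [[likely]] {
            h += shifts[ipbc];       // precomputed shift
        } else {
            h += 1.0;                // placeholder for the nearest-cell branch
        }
    }
    return h;
}
```

Whether the hint changes the generated code at all depends on the compiler and optimization level, so it is worth checking the assembly or timing it.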
As suggested in comments, one way to optimize this is to create specialized implementations for each flag combination. Doing this manually is tedious but we don't need to.
std::visit is very useful when implementing this. The idea is very simple: We turn each flag value into a type (handily available as std::bool_constant since C++17) and move them into variants. Then let std::visit dispatch all variations of parameter combinations to a generic lambda.
The bool_constants that are passed to the lambda have the same effect as a constexpr bool. So at this point the dead-code elimination of the compiler can do the rest. This would even work without if constexpr if you ever need to backport this to C++14 or 11.
You did not post a complete enough code sample, so I just demonstrate this on some generic code. Just move your actual code into the impl lambda.
One thing that you need to look out for: If you call functions inside your code, the compiler will now see many more call sites (one for each implementation). It may decide to no longer inline calls it did before.
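The std::visit dispatch described above could look roughly like this. It is a minimal sketch on generic code, assuming C++17: the Evaluator class, the asVariant helper and the arithmetic inside the loop are made up for illustration, and only two of the flags are shown.

```cpp
#include <type_traits>
#include <variant>

// A runtime bool turned into "one of two types".
using BoolVariant = std::variant<std::false_type, std::true_type>;

// Hypothetical helper: wrap a runtime flag into the variant.
inline BoolVariant asVariant(bool b) {
    return b ? BoolVariant{std::true_type{}} : BoolVariant{std::false_type{}};
}

// Hypothetical stand-in for the force-field class.
struct Evaluator {
    bool doBonds = true;
    bool doPiSigma = false;

    double eval(const double* x, int n) const {
        // Generic lambda: instantiated once per flag combination (4 here).
        auto impl = [&](auto bonds, auto piSigma) -> double {
            double E = 0.0;
            for (int i = 0; i < n; i++) {
                // The flags are compile-time constants inside each instantiation,
                // so the untaken branches are removed entirely.
                if constexpr (decltype(bonds)::value)   E += x[i] * x[i]; // placeholder for the bond term
                if constexpr (decltype(piSigma)::value) E += 0.5 * x[i];  // placeholder for the pi-sigma term
            }
            return E;
        };
        // One dispatch per call, outside the hot loop.
        return std::visit(impl, asVariant(doBonds), asVariant(doPiSigma));
    }
};
```

With five flags the lambda is instantiated 32 times, which is where the inlining caveat above comes from.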
If the visit call turns out to have too much overhead, we can instead do it once when we set the flags to decide between various abstract implementations. Then we just need to do a virtual method call when we use it later instead of checking 5 boolean flags.
An std::function should also work well for this, as would a plain function pointer to a template function if you are looking for the lowest possible call overhead.
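The function-pointer variant could be sketched like this, again on generic code and assuming C++17. The names evalKernel, selectKernel, ForceField and setFlags are hypothetical, and the loop body is a placeholder for the real force evaluation.

```cpp
// Hypothetical kernel, specialized at compile time on two flags.
template <bool DoBonds, bool DoPiSigma>
double evalKernel(const double* x, int n) {
    double E = 0.0;
    for (int i = 0; i < n; i++) {
        if constexpr (DoBonds)   E += x[i] * x[i]; // placeholder for the bond term
        if constexpr (DoPiSigma) E += 0.5 * x[i];  // placeholder for the pi-sigma term
    }
    return E;
}

using KernelFn = double (*)(const double*, int);

// Map the runtime flags to the matching specialization.
inline KernelFn selectKernel(bool doBonds, bool doPiSigma) {
    static constexpr KernelFn table[2][2] = {
        {&evalKernel<false, false>, &evalKernel<false, true>},
        {&evalKernel<true, false>,  &evalKernel<true, true>},
    };
    return table[doBonds ? 1 : 0][doPiSigma ? 1 : 0];
}

class ForceField {
    KernelFn kernel_ = selectKernel(true, false);
public:
    // Called rarely, when the user changes the flags.
    void setFlags(bool doBonds, bool doPiSigma) {
        kernel_ = selectKernel(doBonds, doPiSigma);
    }
    // Hot path: a single indirect call, no flag checks.
    double eval(const double* x, int n) const { return kernel_(x, n); }
};
```

The selection cost is paid only in setFlags, so the hot path is one indirect call into a kernel with no flag checks.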