I wrote a (qsort-compatible) comparison function for a struct that has some unsigned fields in it:
typedef struct {
int a;
unsigned b;
} T;
int cmp(T t1, T t2)
{
// Decreasing order in "a"
if (t1.a < t2.a) return +1;
if (t1.a > t2.a) return -1;
// Increasing order in "b"
if (t1.b < t2.b) return -1;
if (t1.b > t2.b) return +1;
return 0;
}
Is there a way to write this function without needing two comparisons per field? I can't use the t1.b - t2.b trick because subtraction for unsigned integer wraps around.
I'm willing to accept answers using GCC extensions.
When you may ignore this answer: all reasoning about branching is useless if compiler will generate branchless code both for Keit's answer and even for original OP's code (Keit's one is treated as
condition ? ~0 : 0and OP's one will generateCMOV).Of course you may target a CPU without
SETccandCMOVccinstructions. In this case yes, I'd avoid branches (if possible) using subtraction (doing a small performance test to determine what is faster betweenlong longanddouble). If you other preconditions and range limitation isn't an issue you may even go with plain integer math.When you shouldn't ignore this answer: if your target CPU has not
CMOVccand/orSETcc(or equivalent) instructions.You don't need to return exactly +1 and -1, any positive or negative value works well (assuming you want to optimize this function to reduce jumps, math operations are relatively cheap). If we can make assumptions about platform specific signed integers implementation (2's complement) and unsigned/signed conversion then first step to remove branches (introducing cheap subtractions) is:
To remove 2nd branch we can rely on a well-defined behavior of
unsigned(notsigned) integers math:t1.b - t2.bwill wrap (whent1.bis smaller thant2.b) then(int)(t1.b - t2.b)will be a negative number and 2ndifmay be omitted. With that assumption code can be:Note 1: 2nd optimization works just in your case because you're ordering descending for
T.bthen it's not a general rule.Note 2: here you're working with copied structures. Compiler may optimize your code to remove
Tcopies but it's not required to do it then IMO you should check generated code or use pointersT*forcmparguments (if possible). Expansive copies will vanish any other micro-optimization we may do here.Explanation
I see some explanation is needed, if we're trying to reduce (to avoid AFAIK is impossible) branches then we have to consider both visible and invisible ones (otherwise no reason to even start this possibly micro-optimization).
Branches
Every condition (like
t2->b > t1->b) is compiled with branches. Let me pick nice peace of code from Keith's answer:For
t2.a > t1.acompiler will produce this:Similar code is produced for 2nd part
t2.a < t1.a. Same code is then repeated for right side of||((t2.b > t1.b) - (t2.b < t1.b)). Let's count branches: fastest code path has five branches (jle,jmpin first part,jge,jmpin second part) for each sub-expression plus an extra jump for short-circuit of||(for a total of six branches). Slowest one has even few more. It's worse than original implementation with manyifs.For comparison let's see what is generate for comparison with subtraction:
This is our best code path, just two branches. Let's see 2nd part:
No more branches here. Our fastest and slowest code paths always have same number of branches.
Subtractions
Why subtractions work? Let's see with simple values and some edge cases Suma picked in comments. Let's say:
Then we have:
t2.a - t1.a == 10 - 1 == 9. Positive number as required in original code (if (t1.a < t2.a) return +1;). Opposite case is trivial. Here we're assuming signed integer math will wrap.(int)(t1.b - t2.b) == 10 - 1 == 9. Positive number as required (inverse ordering forT.aandT.b). This needs more explanation because of edge cases. Imaginet1.bisUINT_MAXandt2.bis0.t1.b - t2.bis stillUINT_MAXand it has to be casted toint, it's bit pattern is0xFFFFFFFF, exactly bit pattern of-1for asigned int. Result is again correct. Note that here we're assuming two important things: signed numbers are represented in 2's complement and unsigned to signed conversion simply reinterpret raw memory value with new given type (no explicit calculation is done).As noted by Suma problems arise when numbers are big, if you want full
intandunsigned intrange then you may simply cast them todouble:Extract of generated assembly code:
In this way the only tuple you can't use is
INT_MINfort1.atogether withINT_MAXfort2.a.