I am looking to implement something similar to
int memcmp ( const void * ptr1, const void * ptr2, size_t num );
for comparing numerical types such as floats, doubles and integers, but distinguishing between only two cases, (1) `<` and (2) `>=`, rather than between three cases, (1) `<`, (2) `==` and (3) `>`. My aim is to reduce the number of instructions used, assuming the code runs on a standard laptop (x86 architecture).
The solution to the problem is likely just `#define less(a,b,n) (memcmp(a,b,n) < 0)`.

There are several advantages to using `memcmp`, since the compiler is likely to optimize its use heavily. It may look at what you pass as input and inline `memcmp` accordingly, giving the most efficient code.

For example, `memcmp` is required to treat each byte as `unsigned char` internally and to work on misaligned data. But if you provide, let's say, two chunks of 8-byte-aligned data on x86_64, there's probably no reason for the machine code to chew through them byte by byte.

I hacked together a semi-naive version of a "less" function working similarly to `memcmp`. When implementing it, I soon recognized the problem: although we are looking for the `<` result, we have to keep looping while the bytes are equal. Only when they aren't can we start looking for `<`, at the cost of an additional comparison. That is because C has no operator working like "use `<=`, but store the less and equal statuses separately, so we can loop based on the equal flag but return the less flag". At the assembler level we can likely do that, however, which makes this function a good candidate for inline assembler in case we care deeply about performance. And yet, unless we happen to be some x86 assembler guru, we can probably not hope to beat the compiler even with hand-crafted assembler.
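To make the double comparison concrete, here is a minimal sketch of what such a semi-naive function might look like. It is not the listing from the original answer; the name `less_naive` is hypothetical, and the function simply assumes the same byte-by-byte, `unsigned char` rule that `memcmp` uses.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (name assumed for illustration): returns true if the
   first n bytes at a order before the first n bytes at b, comparing each
   byte as unsigned char, like memcmp does. */
bool less_naive(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;

    for (size_t i = 0; i < n; i++) {
        if (pa[i] != pb[i]) {        /* first comparison: are we still equal? */
            return pa[i] < pb[i];    /* second comparison: which way did it go? */
        }
    }
    return false;                    /* all n bytes equal, so not "less" */
}
```

A call such as `less_naive("abc", "abd", 3)` yields true, while two equal buffers yield false. The loop has to test for inequality on every byte and then test again for `<` once the bytes differ, which is the extra comparison described above.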
Looking at the generated code (gcc -O3 x86) in Compiler Explorer, we can conclude that my home-made function is a mess: `cmp` all over the place - it has more branches than a Christmas tree! This will not perform well at all.

By contrast, the equivalent `memcmp` calls are sometimes inlined, resulting in various fancy x86 intrinsics, hard-coded "magic numbers" etc., and very few if any branches. Way more efficient.

And so the conclusion is that "premature optimization" remains the root of all evil, and `memcmp(...) < 0` is likely the best solution for this purpose no matter the target.
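As a rough sketch of how the `memcmp`-based approach could be used, the macro can be wrapped in a small fixed-size helper. The wrapper name `less8` and the 8-byte size are assumptions for illustration only. Whether the call actually gets inlined depends on the compiler, its version and the optimization flags.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The macro from the answer, with its arguments parenthesized. */
#define less(a, b, n) (memcmp((a), (b), (n)) < 0)

/* Hypothetical fixed-size wrapper: because the size is a compile-time
   constant, the compiler is free to inline the memcmp call instead of
   looping byte by byte. */
bool less8(const unsigned char a[8], const unsigned char b[8])
{
    return less(a, b, 8);
}
```

Note that `memcmp` orders buffers by their raw bytes, each compared as `unsigned char`. On little-endian x86 that is not the same as the numeric order of multi-byte integers or floating-point values, so the result is only a byte-wise "less".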