This is related, but not the same, as this question: Performance optimisations of x86-64 assembly - Alignment and branch prediction and is slightly related to my previous question: Unsigned 64-bit to double conversion: why this algorithm from g++
The following is a not real-world test case. This primality testing algorithm is not sensible. I suspect any real-world algorithm would never execute such a small inner-loop quite so many times (num
is a prime of size about 2**50). In C++11:
using nt = unsigned long long;
bool is_prime_float(nt num)
{
for (nt n=2; n<=sqrt(num); ++n) {
if ( (num%n)==0 ) { return false; }
}
return true;
}
Then g++ -std=c++11 -O3 -S
produces the following, with RCX containing n
and XMM6 containing sqrt(num)
. See my previous post for the remaining code (which is never executed in this example, as RCX never becomes large enough to be treated as a signed negative).
jmp .L20
.p2align 4,,10
.L37:
pxor %xmm0, %xmm0
cvtsi2sdq %rcx, %xmm0
ucomisd %xmm0, %xmm6
jb .L36 // Exit the loop
.L20:
xorl %edx, %edx
movq %rbx, %rax
divq %rcx
testq %rdx, %rdx
je .L30 // Failed divisibility test
addq $1, %rcx
jns .L37
// Further code to deal with case when ucomisd can't be used
I time this using std::chrono::steady_clock
. I kept getting weird performance changes: from just adding or deleting other code. I eventually tracked this down to an alignment issue. The command .p2align 4,,10
tried to align to a 2**4=16 byte boundary, but only uses at most 10 bytes of padding to do so, I guess to balance between alignment and code size.
I wrote a Python script to replace .p2align 4,,10
by a manually controlled number of nop
instructions. The following scatter plot shows the fastest 15 of 20 runs, time in seconds, number of bytes padding at the x-axis:
From objdump
with no padding, the pxor instruction will occur at offset 0x402f5f. Running on a laptop, Sandybridge i5-3210m, turboboost disabled, I found that
- For 0 byte padding, slow performance (0.42 secs)
- For 1-4 bytes padding (offset 0x402f60 to 0x402f63) get slightly better (0.41s, visible on the plot).
- For 5-20 bytes padding (offset 0x402f64 to 0x402f73) get fast performance (0.37s)
- For 21-32 bytes padding (offset 0x402f74 to 0x402f7f) slow performance (0.42 secs)
- Then cycles on a 32 byte sample
So a 16-byte alignment doesn't give the best performance-- it puts us in the slightly better (or just less variation, from the scatter plot) region. Alignment of 32 plus 4 to 19 gives the best performance.
Why am I seeing this performance difference? Why does this seem to violate the rule of aligning branch targets to a 16-byte boundary (see e.g. the Intel optimisation manual)
I don't see any branch-prediction problems. Could this be a uop cache quirk??
By changing the C++ algorithm to cache sqrt(num)
in an 64-bit integer and then make the loop purely integer based, I remove the problem-- alignment now makes no difference at all.
From what I can see in your algorithm, there is certainly not much you can do to improve it.
The problem you are hitting is probably not so much the branch to an aligned position, although that can still help, you're current problem is much more likely the pipeline mechanism.
When you write two instructions one after another such as:
In order to execute the second instruction, the first one has to be complete. For that reason compilers tend to mix instructions. Say you need to set
%ecx
to zero, you could do this:In this case, the
mov
and thexor
can both be executed in parallel. This makes things go faster... The number of instructions that can be handled in parallel vary very much between processors (Xeons are generally better at that).The branch adds another parameter where the best processors may start executing both sides of the branch (the true and the false...) simultaneously. But really most processors will make a guess and hope they are right.
Finally, it is obvious that converting the
sqrt()
result to an integer will make things a lot faster since you will avoid all that non-sense with SSE2 code which is definitively slower if used only for a conversion + compare when those two instructions could be done with integers.Now... you are probably still wondering why the alignment does not matter with the integers. The fact is that if your code fits in the L1 instruction cache, then the alignment is not important. If you lose the L1 cache, then it has to reload the code and that's where the alignment becomes quite important since on each loop it could otherwise be loading useless code (possibly 15 bytes of useless code...) and memory access is still dead slow.