I assume everyone knows what "unrolling loops" means, but just in case, I'll give a concrete example in a moment. The question I will ask is: does unrolling loops in x86-64 assembly language actually make code faster? I will explain why I have begun to question this notion.
For those not familiar with the term "unrolling loops", here is one example of a loop from code I am writing now:
movq $15, %rcx # rcx = loop iterations
s1024_divide_compare_loop:
movq (%r14, %rcx, 8), %rax # rax = numerator
subq (%r15, %rcx, 8), %rax # flags = numerator - denominator
js s1024_divide_done # divide done: (numerator < denominator)
jnz s1024_upshift_done # do the divide: (numerator > denominator)
subq $1, %rcx # rcx = rcx - 1 : one less loop iteration
jns s1024_divide_compare_loop # check next lower 64-bit portion of n & d
And here is what that loop looks like unrolled:
movq 120(%r14), %rax # rax = numerator
subq 120(%r15), %rax # flags = numerator - denominator
js s1024_divide_done # divide done: (numerator < denominator)
jnz s1024_upshift_done # do the divide: (numerator > denominator)
movq 112(%r14), %rax # rax = numerator
subq 112(%r15), %rax # flags = numerator - denominator
js s1024_divide_done # divide done: (numerator < denominator)
jnz s1024_upshift_done # do the divide: (numerator > denominator)
movq 104(%r14), %rax # rax = numerator
subq 104(%r15), %rax # flags = numerator - denominator
js s1024_divide_done # divide done: (numerator < denominator)
jnz s1024_upshift_done # do the divide: (numerator > denominator)
movq 96(%r14), %rax # rax = numerator
subq 96(%r15), %rax # flags = numerator - denominator
js s1024_divide_done # divide done: (numerator < denominator)
jnz s1024_upshift_done # do the divide: (numerator > denominator)
#
# insert 11 more copies of the above 4 lines (with different offsets) here
#
movq 0(%r14), %rax # rax = numerator
subq 0(%r15), %rax # flags = numerator - denominator
js s1024_divide_done # divide done: (numerator < denominator)
jnz s1024_upshift_done # do the divide: (numerator > denominator)
I should note that the above example is representative, but may well not be optimal for this discussion. The reason is the large number of conditional branches. While I believe branches that are not taken cost about the same as other simple instructions, that assumption might not hold in some (possibly obscure) cases. So if this aspect is relevant, we can assume for this discussion that those two conditional branches are just simple instructions like movq or addq (though addressing both cases separately is desirable if they differ).
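To make that concrete, here is what one unrolled block would look like with the two branches replaced by simple stand-in instructions. The stand-ins are my own invention, purely for the sake of this discussion:
movq 120(%r14), %rax # rax = numerator
subq 120(%r15), %rax # rax = numerator - denominator
movq %rax, %rdx # stand-in for "js ..." : a simple register-register move
addq $1, %rbx # stand-in for "jnz ..." : a simple ALU instruction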
Oh, two final conditions:
#1: This question ONLY applies to code running on a single thread.
#2: This question ONLY applies to fast modern ~4GHz CPUs (FX-8350, etc).
Okay, now for thoughts that make me question whether unrolling loops is actually wise.
These new processors run at 4GHz, and can sometimes execute two or three instructions in each cycle. In my code above, the first two instructions cannot execute in parallel (the subq depends on the result of the movq), and probably the last three cannot either. But a loop whose instructions can execute in parallel only makes my question more relevant.
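For contrast, here is a hypothetical sequence with two independent dependency chains, which a superscalar CPU could overlap. This is not my actual code, just an illustration of the parallel case:
movq 0(%r14), %rax # chain 1: load first numerator word
movq 8(%r14), %rdx # chain 2: independent load, can issue in parallel
subq 0(%r15), %rax # chain 1: depends only on rax
subq 8(%r15), %rdx # chain 2: depends only on rdx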
So, each instruction may execute in 0.25 nanoseconds (or less for instructions that execute in parallel). This means 4 instructions take 1 nanosecond to execute. Each set of 4 instructions consumes roughly 16 bytes, which I assume is 1/4 of a cache line. Therefore, 4 sets of 4 instructions should take 4ns to execute, at which point the CPU has exhausted that cache line and needs another.
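Restating that arithmetic (the 16-bytes-per-group figure is my estimate from the encodings, not a measurement):
1 cycle at 4GHz = 0.25ns
4 instructions ~= 16 bytes ~= 1ns (at ~1 instruction per cycle)
64-byte cache line / 16 bytes = 4 groups of 4 instructions
4 groups x 1ns = 4ns to consume one cache line of code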
This is where the question becomes more complicated.
So, after 16 instructions and 1/4 of my entire unrolled loop, the CPU needs more instructions to execute. If the CPU were running the loop version of the code, it would execute the exact same instructions again, which will definitely still be in the L1 instruction cache, and therefore continue to execute at full CPU speed.
However, can we reasonably expect the CPU to have loaded the next cache line in only 4ns? Or in the case of instructions that could execute in parallel, can we reasonably expect the CPU to have loaded the next cache line in only 2ns?
From what I know about dynamic RAM, that seems... tight. I know when the CPU accesses contiguous addresses, it can leave the row address (RAS, the upper address bits) latched and clock out consecutive 64-bit or 128-bit chunks of memory faster with CAS alone. But can we really expect the CPU to read 64 bytes in 2ns or 4ns? That is 4 or 8 read-from-DRAM cycles, depending on whether the CPU reads 64 bits (8 bytes) or 128 bits (16 bytes) per read operation.
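A rough back-of-envelope check, assuming a single channel of DDR3-1600 (typical for an FX-8350 system; these are my assumed numbers, not measurements):
DDR3-1600 transfers 8 bytes per beat at 1600 MT/s
64-byte burst = 8 beats = 8 / 1.6e9 s = 5ns of raw transfer time
CAS latency alone (CL11 at an 800MHz clock) ~= 13.75ns before the first beat
That is already well over a 2-4ns budget, before counting any row activation.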
My specific code may exercise this question even further. By the nature of this algorithm, my code needs to compare the most-significant portions of the numerator and denominator first, then work downward toward lower addresses with each access. Does this make automatic prefetching less likely to work?
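If the hardware prefetcher does not recognize a descending walk, one could hint it manually. A minimal sketch, assuming %rcx still indexes the current 64-bit word as in my loop above:
prefetcht0 -64(%r14, %rcx, 8) # hint: fetch the next-lower line of the numerator
prefetcht0 -64(%r15, %rcx, 8) # hint: fetch the next-lower line of the denominator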
I have seen various people test loops versus unrolled loops. But every instance I have seen has a fatal design flaw: it calls the same routine over and over and over again, usually millions of times, in order to accumulate a timer value large enough to make sense of. But wait! Like most applications, code will likely call my 1024-bit divide function only occasionally. That is nothing like these tests, which by their very nature ensure that both the instructions stay in the L1 instruction cache and the accessed data stays in the L1 data cache.
Of course unrolled loops are faster if you make sure code and data are already in L1 cache! Duh!
Those are not representative tests - not even close!
Those tests do measure "best case performance", but fail to represent normal program execution at all. To decide how best to write 64-bit assembly-language code (or the code-emitter portion of compilers), we need to be more realistic in our premises. Or at least I think so, which is why I ask this question.
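One way a test could approximate the occasional, cold-cache call would be to evict the routine's code and data between timed iterations. A minimal sketch (the label and registers come from my code above; the approach itself is just my suggestion, and in practice every 64-byte line of the routine and its operands would need flushing):
leaq s1024_divide_compare_loop(%rip), %rax # address of the routine's code
clflush (%rax) # evict one line of the code from all cache levels
clflush (%r14) # evict one line of the numerator
clflush (%r15) # evict one line of the denominator
mfence # ensure the flushes complete before the timed call begins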
Has anyone been through these questions in a thorough and realistic way?
Intel provides an optimization manual for people who are interested in tuning code for their processors, and it contains some treatment of loop unrolling. I wouldn't expect a nice simple answer, but it may help.
By the way, the loop instruction should be avoided. It has been slower than the equivalent instructions for many years now.
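For example, instead of something like this:
loop s1024_divide_compare_loop # decrements %rcx and jumps if nonzero, but slow on many CPUs
the usual replacement is:
subq $1, %rcx # same decrement, and it also sets the flags
jnz s1024_divide_compare_loop # taken while %rcx is nonzero
The loop in the question already follows this pattern (with jns rather than jnz, since its index counts down through zero).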