does unrolling loops in x86-64 actually make code faster?


I assume everyone knows what "unrolling loops" means; just in case, I'll give a concrete example in a moment. The question is this: does unrolling loops in x86-64 assembly language actually make code faster? I will explain why I have begun to question this notion.

For those not familiar with the term "unrolling loops", here is one example of a loop from code I am writing now:

    movq   $15, %rcx                  # rcx = loop iterations
s1024_divide_compare_loop:
    movq   (%r14, %rcx, 8), %rax      # rax = numerator
    subq   (%r15, %rcx, 8), %rax      # flags = numerator - denominator
    js     s1024_divide_done          # divide done: (numerator < denominator)
    jnz    s1024_upshift_done         # do the divide: (numerator > denominator)
    subq   $1, %rcx                   # rcx = rcx - 1 : one less loop iteration
    jns    s1024_divide_compare_loop  # check next lower 64-bit portion of n & d

And here is what that loop looks like unrolled:

    movq   120(%r14), %rax            # rax = numerator
    subq   120(%r15), %rax            # flags = numerator - denominator
    js     s1024_divide_done          # divide done: (numerator < denominator)
    jnz    s1024_upshift_done         # do the divide: (numerator > denominator)

    movq   112(%r14), %rax            # rax = numerator
    subq   112(%r15), %rax            # flags = numerator - denominator
    js     s1024_divide_done          # divide done: (numerator < denominator)
    jnz    s1024_upshift_done         # do the divide: (numerator > denominator)

    movq   104(%r14), %rax            # rax = numerator
    subq   104(%r15), %rax            # flags = numerator - denominator
    js     s1024_divide_done          # divide done: (numerator < denominator)
    jnz    s1024_upshift_done         # do the divide: (numerator > denominator)

    movq   96(%r14), %rax             # rax = numerator
    subq   96(%r15), %rax             # flags = numerator - denominator
    js     s1024_divide_done          # divide done: (numerator < denominator)
    jnz    s1024_upshift_done         # do the divide: (numerator > denominator)
 #
 # insert 11 more copies of the above 4 lines (with different offsets) here
 #
    movq   0(%r14), %rax              # rax = numerator
    subq   0(%r15), %rax              # flags = numerator - denominator
    js     s1024_divide_done          # divide done: (numerator < denominator)
    jnz    s1024_upshift_done         # do the divide: (numerator > denominator)

I should note that the above example is representative, but may not be ideal for this discussion, because of the large number of conditional branches. While I believe branches that are not taken cost about the same as other instructions, that assumption might not hold in some (possibly obscure) cases. So if this aspect matters, assume for this discussion that those two conditional branches are just simple instructions like movq or addq, as sketched below (though addressing both cases separately is desirable if they differ).
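For concreteness, here is what I mean by that substitution - a purely hypothetical loop body in which the two conditional branches are replaced by simple ALU instructions (it no longer computes anything useful; rdx and rbx are just scratch registers here):

    movq   (%r14, %rcx, 8), %rax      # rax = numerator
    subq   (%r15, %rcx, 8), %rax      # rax = numerator - denominator
    movq   %rax, %rdx                 # stand-in for the js  (a simple movq)
    addq   %rax, %rbx                 # stand-in for the jnz (a simple addq)
    subq   $1, %rcx                   # rcx = rcx - 1 : one less loop iteration
    jns    s1024_divide_compare_loop  # check next lower 64-bit portion of n & d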

Oh, two final conditions:

#1: This question applies ONLY to code running on a single thread.

#2: This question applies ONLY to fast modern ~4GHz CPUs (FX-8350, etc).

Okay, now for thoughts that make me question whether unrolling loops is actually wise.

These new processors run at 4GHz and can sometimes execute two or three instructions per cycle. In my code above, the first two instructions cannot execute in parallel, and probably the last three cannot either. But a loop whose instructions can execute in parallel only makes my question more relevant.

So each instruction may execute in 0.25 nanoseconds (or less, for instructions that execute in parallel). That means 4 instructions take 1 nanosecond to execute. Each set of 4 instructions consumes roughly 16 bytes, which I assume is 1/4 of a cache line. Therefore, 4 sets of 4 instructions should take 4ns to execute, at which point execution has run off the end of the cache line and needs the next one.

This is where the question becomes more complicated.

So, after 16 instructions and 1/4 of my entire unrolled loop, the CPU needs more instructions to execute. If the CPU were running the looped version of the code, it would execute the exact same instructions again, which would definitely still be in the L1 cache, and it would therefore continue to execute at full CPU speed.

However, can we reasonably expect the CPU to have loaded the next cache line in only 4ns? Or in the case of instructions that could execute in parallel, can we reasonably expect the CPU to have loaded the next cache line in only 2ns?

From what I know about dynamic RAM, that seems... tight. I know that when the CPU accesses contiguous addresses, it can leave RAS (the upper address bits) latched and clock out consecutive 64-bit or 128-bit chunks of memory faster with CAS. But can we really expect the CPU to read 64 bytes in 2ns or 4ns? That is 8 or 4 read-from-DRAM cycles, depending on whether the CPU reads 64 bits (8 bytes) or 128 bits (16 bytes) per read operation.

My specific code may push this question even further. By the nature of this algorithm, my code needs to compare the most-significant portions of the numerator and denominator first, then work downward toward lower addresses with each access. Does this make automatic prefetching less likely to work?
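If it turns out the hardware prefetcher does not handle a descending stream well, one workaround I can imagine (a sketch only - untested, and for a 1024-bit operand the data is just two cache lines anyway) is an explicit software prefetch of the next-lower line inside the loop:

    prefetcht0  -64(%r14, %rcx, 8)    # hint: pull the next-lower numerator line toward L1
    prefetcht0  -64(%r15, %rcx, 8)    # hint: pull the next-lower denominator line toward L1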

I have seen various people test loops versus unrolled loops. But every instance I have seen has a fatal design flaw: it calls the same routine over and over and over again - usually millions of times - in order to accumulate a timer value large enough to make sense of. But wait! Like most applications, a program will likely call my 1024-bit divide functions only occasionally. That is nothing like these tests, which by their very nature assure that both the instructions stay in the L1 instruction cache and the accessed data stays in the L1 data cache.

Of course unrolled loops are faster if you make sure the code and data are already in L1 cache! Duh!

Those are not representative tests - not even close!

Those tests do measure "best case performance", but fail to represent normal program execution at all. To decide how best to write 64-bit assembly-language code (or the code-emitter portion of a compiler), we need to be more realistic in our premises. Or at least I think so, which is why I ask this question.
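To make such a measurement at least somewhat cold-cache, I imagine the test harness could evict the operand lines before each timed call - something like the sketch below (untested; it assumes %r14 and %r15 each point at a 64-byte-aligned, 128-byte operand, and it does nothing about the instruction cache, which is harder to evict deliberately):

    clflush  (%r14)                   # evict low half of numerator from all cache levels
    clflush  64(%r14)                 # evict high half of numerator
    clflush  (%r15)                   # evict low half of denominator
    clflush  64(%r15)                 # evict high half of denominator
    mfence                            # ensure the evictions complete before timing starts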

Has anyone been through these questions in a thorough and realistic way?


There are 3 answers

gsg

Intel provides an optimization manual for people interested in tuning code for their processors, and it contains some treatment of loop unrolling. I wouldn't expect a nice simple answer, but it may help.

By the way, the loop instruction should be avoided. It's been slower than the equivalent instructions for many years now.
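For reference, the usual replacement is a plain decrement plus a conditional branch (the code in the question already does essentially this with subq/jns) - a minimal sketch with a hypothetical label:

    # avoid:
    #   loop   top                    # dec %rcx + branch as one microcoded, slow instruction
    # prefer:
    decq   %rcx                       # plain decrement (note: leaves CF unchanged)
    jnz    top                        # separate, cheap conditional branch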

superdesk

It's complicated.

To answer your final question: yes and no. There is a lot of research on optimizations, with thorough analysis. But, for some of the reasons you mention, it is almost impossible to show that one optimization actually has an effect on real-world performance. Every change made to one optimization affects other optimizations, and the workload can have differing effects.

But you could test it yourself pretty easily (well... you will easily start to run into problems and conflicting results, I imagine). GCC lets you turn specific optimizations on and off, so find the source to a program you want to test and give it a go.
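For example (assuming a reasonably recent GCC; bench.c is just a placeholder name for whatever program you pick), you can build the same source with the unroller forced on and off and time both on a realistic workload:

    gcc -O2 -funroll-loops    bench.c -o bench_unrolled
    gcc -O2 -fno-unroll-loops bench.c -o bench_rolled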

My guess? It is going to come down to caches. If (on average) cache performance is improved, then loop unrolling is worth it.

Brett Hale

This level of optimization is highly dependent on the micro-architecture, and the subject is too broad for a comprehensive answer. The GMP library's mpn/x86_64 directory has READMEs and annotated assembly for different u-archs.

So yes - the contributors to GMP have dealt with these issues thoroughly. And unrolling does provide a speedup in some cases, though it's not as effective on modern x86-64. Loops that fit in the decoded instruction / uop cache, loop alignment, branch prediction, avoiding partial stalls, etc., are important as well. Agner Fog's optimization manuals are another excellent resource.
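To make the alignment point concrete, the assembler can pad the loop entry for you - a sketch only, since the profitable alignment (if any) is u-arch specific:

    .p2align 4                        # start the loop top on a 16-byte boundary
s1024_divide_compare_loop:
    movq   (%r14, %rcx, 8), %rax      # rax = numerator
    subq   (%r15, %rcx, 8), %rax      # flags = numerator - denominator
    js     s1024_divide_done          # divide done: (numerator < denominator)
    jnz    s1024_upshift_done         # do the divide: (numerator > denominator)
    subq   $1, %rcx                   # rcx = rcx - 1 : one less loop iteration
    jns    s1024_divide_compare_loop  # check next lower 64-bit portion of n & d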

Finally, using bitwise shift / subtract [non]restoring division written in assembly will never be competitive with a per-word multiple precision division implementation written in C. There's a reason GMP doesn't use it. The classical 'Knuth algorithm D' takes some effort (pen and paper) to understand - particularly the quotient estimate / quotient adjust conditions. Otherwise, I fear your efforts here might be wasted.
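For reference, the estimate and adjust conditions in question look roughly like this (Knuth's notation, quoting from memory: b is the word base, u the partial remainder, v the normalized divisor):

    qhat = floor( (u[j+n]*b + u[j+n-1]) / v[n-1] )

    adjust: while  qhat >= b   or   qhat*v[n-2] > (u[j+n]*b + u[j+n-1] - qhat*v[n-1])*b + u[j+n-2]
            decrement qhat

After that adjustment, qhat is exact or at most one too large; the rare over-estimate is what the add-back step handles.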

With fixed operand sizes you can store normalized divisors and working remainders on the stack. The algorithm is actually dominated by the cost of multiply instructions, as division instructions are only used in the estimate step. The Handbook of Applied Cryptography, Chapter 14, is a good reference for implementation details.