Avoiding branches in managed languages

479 views Asked by At

In C, when compiled to a x86 machine, I would usually replace branches with logical expression, when speed is the most important aspect, even if the conditions are complex, for example, instead of

char isSomething() {
    if (complexExpression01) {
        if (complexExpression02) {
            if(!complexExpression03) {
                return 1;
            }
        }
    }
    return 0;
}

I will write:

char isSomething() {
    return complexExpression01 &&
           complexExpression02 &&
           !complexExpression03 ;
}

Now clearly, this might be a harder to maintain, and less readable code, but it might actually be faster.

Is there any reason to act the same way when working with managed code, such as C#? Are "jumps" expensive in managed code as they are in unmanaged code (at least on x86)?

4

There are 4 answers

3
Ulrik Rasmussen On BEST ANSWER

The two expressions will result in the same number of tests, as the logical and operator (&&) has short-circuit semantics in both C and C#. The premise of your question (that the second way of expressing the program results in less branching) is therefore incorrect.

0
Polity On

This is up to the implementation of the CLR and compiler of the managed language. In case of C#, the following test case proofs that there is no difference in instructions for nested if statements and combined if statements:

            // case 1
            if (value1 < value2)
00000089  mov         eax,dword ptr [ebp-0Ch] 
0000008c  cmp         eax,dword ptr [ebp-10h] 
0000008f  jge         000000A6 
            {
                if (value2 < value3)
00000091  mov         eax,dword ptr [ebp-10h] 
00000094  cmp         eax,dword ptr [ebp-14h] 
00000097  jge         000000A6 
                {
                    result1 = true;
00000099  mov         eax,1 
0000009e  and         eax,0FFh 
000000a3  mov         dword ptr [ebp-4],eax 
                }
            }

            // case 2
            if (value1 < value2 && value2 < value3)
000000a6  mov         eax,dword ptr [ebp-0Ch] 
000000a9  cmp         eax,dword ptr [ebp-10h] 
000000ac  jge         000000C3 
000000ae  mov         eax,dword ptr [ebp-10h] 
000000b1  cmp         eax,dword ptr [ebp-14h] 
000000b4  jge         000000C3 
            {
                result2 = true;
000000b6  mov         eax,1 
000000bb  and         eax,0FFh 
000000c0  mov         dword ptr [ebp-8],eax 
            }
3
Jeffrey Sax On

The only way to know is to measure.

True and false are represented as 1 and 0 by the CLR, so it wouldn't surprise me if using logical expressions had some advantage. Let's see:

static void BenchBranch() {
    Stopwatch sw = new Stopwatch();

    const int NMAX = 1000000000;
    bool a = true;
    bool b = false;
    bool c = true;

    sw.Restart();
    int sum = 0;
    for (int i = 0; i < NMAX; i++) {
        if (a)
            if (b)
                if (c)
                    sum++;
        a = !a;
        b = a ^ b;
        c = b;
    }
    sw.Stop();
    Console.WriteLine("1: {0:F3} ms ({1})", sw.Elapsed.TotalMilliseconds, sum);

    sw.Restart();
    sum = 0;
    for (int i = 0; i < NMAX; i++) {
        if (a && b && c) 
            sum++;
        a = !a;
        b = a ^ b;
        c = b;
    }
    sw.Stop();
    Console.WriteLine("2: {0:F3} ms ({1})", sw.Elapsed.TotalMilliseconds, sum);

    sw.Restart();
    sum = 0;
    for (int i = 0; i < NMAX; i++) {
        sum += (a && b && c) ? 1 : 0;
        a = !a;
        b = a ^ b;
        c = b;
    }
    sw.Stop();
    Console.WriteLine("3: {0:F3} ms ({1})", sw.Elapsed.TotalMilliseconds, sum);
}

The result:

1:  2713.396 ms (250000000)
2:  2477.912 ms (250000000)
3:  2324.916 ms (250000000)

So, from this it seems there is a slight advantage to using logical operators instead of nested conditional statements. However, any specific instance may give somewhat different results.

In the end, whether a micro-optimization such as this is worth it depends on how performance-critical the code is.

1
sehe On

General

In your regular compiler, the generated code will most often be the same, at least when assuming that you are using regular

csc.exe /optimize+
cl.exe /O2
g++ -O2

and related default optimization modes.

The general mantra is: profile, profile, profile (and don't micro-optimize until your profiler tells you to). You can always look at the generated code2 to see whether there is room for improvement.

Think of it this way, e.g. the C# code:

C#/.NET

Each of your complexExpressions is a de-facto function call invocation (call, calli, callvirt opcode3) that requires it's arguments to be pushed onto the stack. The return value will be left pushed onto the stack instead of the parameters on exit.

Now, CLR being a stack-based virtual machine (i.e. register-less) this amounts to exactly the same as an anonymous temporary variable on the stack. The only difference is the number of identifiers used in the code.

Now what the JIT engine does with that is another matter: the JIT engine will have to translate these calls into native assembly and may be doing optimization by tweaking register-allocation, instruction ordering, branch prediction and stuff like that1

1 (though in practice, for this sample it won't be allowed to do the more interesting optimizations because the complex function calls may have sideeffects and the C# specs are very clear about the evaluation order and so-called sequence)point. Note however, that the JIT engine is allowed to inline function calls, to reduce call overhead.

Not only when they are non-virtual, but (IIRC) also when the runtime type can be known statically at compile time for certain .NET framework internals. I'd have to look up a reference for this, but in fact I think there are attributes introduced in .NET framework 4.0 to explicit prevent inlining of framework functions; this is so that Microsoft can patch library code in service packs/updates, even when user assemblies have been ahead-of-time compiled (ngen.exe) into native images.

C/C++

In C/C++ the memory model is much more lax (i.e. at least until C++11) and the code is usually compiled down to native instructions at compile time directly. Add that C/C++ compilers usually do aggressive inlining, the code even in such compilers will usually be the same, unless you compile without optimization enabled


2 I use

  • ildasm or monodis to see the IL code generated
  • mono -aot=full,static or mkbundle to generate native object modules and objdump -CdS to see the annotated native assembly instructions for that.

Note this is purely curiosity, because it rarely happens that I find interesting bottlenecks that way. See however, Jon Skeet's blog posts on performance optimizing Noda.NET for good examples of surprises that may lurk in the generated IL code for Generic classes.

3 Edit not accurate for operators on compiler intrinsics, though even they will just leave their result on the stack.