When will compilers optimize assembly code in C/C++ source?


Most compilers (e.g. VS2015, gcc) do not optimize inline assembly code, which lets us use new instructions that the compiler itself doesn't support.

But when, if ever, should a C/C++ compiler optimize inline assembly?

2

There are 2 answers

6
Peter Cordes On

Never. That would defeat the purpose of inline assembly, which is to get exactly what you ask for.

If you want to use the full power of the target CPU's instruction set in a way that the compiler can understand and optimize, you should use intrinsic functions, not inline asm.

For example, instead of inline asm for popcnt, use int count = __builtin_popcount(x); (GNU C; compile with -mpopcnt so it becomes a single popcnt instruction). Inline asm is compiler-specific too, so if anything intrinsics are more portable, especially Intel's x86 intrinsics, which are supported by all the major compilers that can target x86. Use #include <x86intrin.h> and you can use int _popcnt32 (int a) to reliably get the popcnt x86 instruction. See Intel's intrinsics finder/guide, and other links in the tag wiki.


    int count(){
      int total = 0;
      for(int i=0 ; i<4 ; ++i)
        total += popc(i);
      return total;
    }

Compiled by gcc 6.3 with #define popc _popcnt32, the whole loop is constant-folded (0 + 1 + 1 + 2 = 4):

    mov     eax, 4
    ret

clang 3.9 with an inline-asm definition of popc, on the Godbolt compiler explorer:

    xor     eax, eax
    popcnt  eax, eax
    mov     ecx, 1
    popcnt  ecx, ecx
    add     ecx, eax
    mov     edx, 2
    popcnt  edx, edx
    add     edx, ecx
    mov     eax, 3
    popcnt  eax, eax
    add     eax, edx
    ret

This is a classic example of inline asm defeating constant propagation, and why you shouldn't use it for performance if you can avoid it: https://gcc.gnu.org/wiki/DontUseInlineAsm.


This was the inline-asm definition I used for this test:

    int popc_asm(int x) {
      // force use of the same register because popcnt has a false dependency on its output, on Intel hardware
      // this is just a toy example, though, and also demonstrates how non-optimal constraints can lead to worse code
      asm("popcnt %0,%0" : "+r"(x));
      return x;
    }

If you didn't know that popcnt has a false dependency on its output register on Intel hardware, that's another reason you should leave it to the compiler whenever possible.
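To reproduce the comparison in one file, something like the following sketch could be used. The USE_INLINE_ASM macro is a hypothetical switch introduced here for illustration, not part of the original test, and the default branch uses __builtin_popcount rather than _popcnt32 so that it builds without -mpopcnt or any x86 header.

```c
/* Sketch: choose the popc implementation at compile time.
   USE_INLINE_ASM is a hypothetical macro, not from the original test. */
#if defined(USE_INLINE_ASM) && (defined(__x86_64__) || defined(__i386__))
static inline int popc(int x) {
    /* Opaque to the optimizer: the compiler cannot see what the
       instruction computes, so it cannot fold constant inputs.
       Needs a CPU that supports popcnt. */
    __asm__("popcnt %0,%0" : "+r"(x));
    return x;
}
#else
static inline int popc(int x) {
    /* Visible to the optimizer: constant arguments can be folded. */
    return __builtin_popcount((unsigned)x);
}
#endif

/* Same loop as in the comparison: popcount of 0, 1, 2 and 3 summed. */
int count(void) {
    int total = 0;
    for (int i = 0; i < 4; ++i)
        total += popc(i);
    return total;
}
```

Building once with -O2 and once with -O2 -DUSE_INLINE_ASM, then comparing the assembly for count (gcc -S, or Godbolt), should show the same kind of difference as the two listings above, though the exact output depends on compiler version and options.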


Using special instructions that the compiler doesn't know about is one use-case for inline asm, but if the compiler doesn't know about it, it certainly can't optimize it. Before compilers were good at optimizing intrinsics (e.g. for SIMD instructions), inline asm for this kind of thing was more common. But we're many years beyond that now, and compilers are generally good with intrinsics, even for non-x86 architectures like ARM.

0
BeeOnRope On

In general, compilers will not optimize the content of your inline assembly. That is, they won't remove or change instructions in your assembly block. In particular, gcc simply passes through the body of your inline assembly unchanged to the underlying assembler (gas in this case).

However, good compilers may optimize around your inline assembly, and in some cases may even omit the inline assembly code entirely! Gcc, for example, can do this if it determines that the declared outputs of the assembly are dead. It can also hoist an assembly block out of a loop or combine multiple calls into one. So it never messes with the instructions inside the block, but it is entirely reasonable for it to change the number of times a block is executed. Of course, this behavior can also be disabled (in gcc, by marking the block volatile) if the block has some other important side effect.
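As a rough illustration, the sketch below uses an empty asm template, so it emits no instruction and builds on any target GCC or Clang supports. The function names are made up for this example. The compiler still treats each block as a black box that produces its output from its input, which is enough to show the difference between a plain and a volatile block.

```c
/* Non-volatile asm: assumed to be a pure function of its inputs, so the
   compiler may delete it when `out` is unused, or hoist or merge it. */
static inline int opaque_copy(int x) {
    int out;
    /* "0" ties the input to the same register as output operand 0;
       the empty template then leaves x in `out` unchanged. */
    __asm__("" : "=r"(out) : "0"(x));
    return out;
}

/* volatile asm: the block is kept even when its output is unused. */
static inline int opaque_copy_kept(int x) {
    int out;
    __asm__ volatile("" : "=r"(out) : "0"(x));
    return out;
}

int sum_opaque(int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) {
        (void)opaque_copy(42);       /* dead output: eligible for removal */
        (void)opaque_copy_kept(42);  /* volatile: stays, once per iteration */
        total += opaque_copy(i);     /* output is used, so the block stays */
    }
    return total;
}
```

With optimization enabled, the generated code for sum_opaque would be expected to keep no trace of the first call while still loading 42 into a register for the volatile one. Whether a given compiler version does exactly this is worth checking in its assembly output.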

The gcc docs on extended asm syntax have some good examples of all of this stuff.