Most compilers (e.g. VS2015, gcc) do not optimize inline assembly code, which lets us write instructions that the compiler itself doesn't support.
But when should a C/C++ compiler optimize inline assembly?
In general, compilers will not optimize the content of your inline assembly. That is, they won't remove or change instructions in your assembly block. In particular, gcc simply passes the body of your inline assembly through unchanged to the underlying assembler (gas in this case).
However, good compilers may optimize around your inline assembly, and in some cases may even omit executing the inline assembly code entirely! Gcc, for example, can do this if it determines that the declared outputs of the assembly are dead. It can also hoist an assembly block out of a loop or combine multiple calls into one. So it never messes with the instructions inside the block, but it is entirely reasonable for it to change the number of times a block is executed. Of course, this behavior can also be disabled if the block has some other important side effect (in gcc, by marking the asm as volatile).
The gcc docs on extended asm syntax have some good examples of all of this stuff.
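As a rough illustration of the points above, here is a minimal sketch in GNU C (gcc or clang). The helper names opaque_copy, opaque_copy_volatile and demo_loop are made up for this example, and the asm template is deliberately empty so that it assembles on any target; the "0" constraint just ties the input to the same register as the output. Whether the compiler actually hoists or deletes a given block depends on the compiler version and optimization level, so check the generated asm rather than taking this as a guarantee.

```c
#include <stdio.h>

/* Hypothetical helper: a non-volatile extended asm block.
   The template is empty, and the "0" constraint places the input in the
   same register as output operand 0, so `out` ends up equal to `x`.
   The compiler only knows "this block produces `out` from `x`". */
static inline unsigned opaque_copy(unsigned x) {
    unsigned out;
    __asm__("" : "=r"(out) : "0"(x));
    return out;
}

/* Same block, but marked volatile: the compiler must keep it even if the
   output is never used. */
static inline unsigned opaque_copy_volatile(unsigned x) {
    unsigned out;
    __asm__ volatile("" : "=r"(out) : "0"(x));
    return out;
}

unsigned demo_loop(void) {
    unsigned sum = 0;
    for (unsigned i = 0; i < 4; i++) {
        /* Loop-invariant input: the compiler is allowed to hoist this
           block out of the loop and run it only once. */
        sum += opaque_copy(10);

        /* Output is dead: the compiler is allowed to drop this block. */
        (void)opaque_copy(i);

        /* volatile: stays in the loop even though the result is unused. */
        (void)opaque_copy_volatile(i);
    }
    return sum;
}
```

Compiling this with something like gcc -O2 -S and comparing the loop body against the source is a quick way to see which blocks survived. The result of demo_loop is the same either way; only the number of times each block runs can change.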
Never. That would defeat the purpose of inline assembly, which is to get exactly what you ask for.
If you want to use the full power of the target CPU's instruction set in a way that the compiler can understand and optimize, you should use intrinsic functions, not inline asm.
e.g. instead of inline asm for popcnt, use int count = __builtin_popcount(x); (in GNU C compiled with -mpopcnt). Inline asm is compiler-specific too, so if anything intrinsics are more portable, especially if you use Intel's x86 intrinsics, which are supported across all the major compilers that can target x86. Use #include <x86intrin.h> and you can use int _popcnt32 (int a) to reliably get the popcnt x86 instruction. See Intel's intrinsics finder/guide, and other links in the x86 tag wiki.
Compiled with #define popc _popcnt32 by gcc6.3:
clang 3.9 with an inline-asm definition of popc, on the Godbolt compiler explorer:
This is a classic example of inline asm defeating constant propagation, and why you shouldn't use it for performance if you can avoid it: https://gcc.gnu.org/wiki/DontUseInlineAsm.
This was the inline-asm definition I used for this test:
If you didn't know that popcnt has a false dependency on its output register on Intel hardware, that's another reason you should leave it to the compiler whenever possible.
Using special instructions that the compiler doesn't know about is one use-case for inline asm, but if the compiler doesn't know about it, it certainly can't optimize it. Before compilers were good at optimizing intrinsics (e.g. for SIMD instructions), inline asm for this kind of thing was more common. But we're many years beyond that now, and compilers are generally good with intrinsics, even for non-x86 architectures like ARM.
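To make the constant-propagation point concrete, here is a small sketch comparing the two approaches in GNU C. The wrapper names popc_builtin and popc_asm are hypothetical, and popc_asm is only an illustrative stand-in, not the definition used in the test mentioned earlier. It assumes an x86 CPU that supports popcnt and falls back to the builtin on other targets just so the file still builds.

```c
#include <stdio.h>

/* Builtin route: the compiler understands what this computes, so with a
   constant argument it can typically fold the call to a constant. */
static inline int popc_builtin(unsigned x) {
    return __builtin_popcount(x);
}

#if defined(__x86_64__) || defined(__i386__)
/* Inline-asm route (hypothetical wrapper, AT&T syntax: source, dest).
   The compiler sees only "some block that reads x and writes result",
   so it cannot evaluate it at compile time, even for a constant x. */
static inline int popc_asm(unsigned x) {
    unsigned result;
    __asm__("popcnt %1, %0" : "=r"(result) : "r"(x) : "cc");
    return (int)result;
}
#else
/* Assumption: on non-x86 targets we simply reuse the builtin so the
   example compiles; there is no asm to compare against here. */
static inline int popc_asm(unsigned x) {
    return __builtin_popcount(x);
}
#endif

/* With optimization on, this can typically compile down to returning a
   constant. */
int bits_builtin(void) { return popc_builtin(0xFFu); }

/* This one has to execute the popcnt instruction at run time. */
int bits_asm(void) { return popc_asm(0xFFu); }
```

Both functions return the same value; the difference only shows up in the generated code, for example by comparing the output of gcc -O2 -S for bits_builtin and bits_asm.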