LEA vs MOV imm64 for loading address-constant into register


I have a constant (64-bit) address that I want to load into a register. This address is located in the code segment, so it can also be addressed relative to RIP. What is the difference between

movabs rax, 0x123456789abc

and

lea rax, [rip+0xFF] ; rip-relative offset resolving to 0x123456789abc

in terms of execution speed, and which one is preferable (in a situation where both alternatives could theoretically be used, e.g. in a JIT, or when the address can be fixed up at link time)?

Looking at the disassembly, LEA results in less code, but would it be faster because of that, or potentially slower because the offset has to be encoded relative to RIP?
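The size difference is visible if you emit the two encodings by hand, as a JIT would. A minimal Python sketch (the helper names are hypothetical; the byte patterns follow the standard x86-64 instruction format, with ModRM byte 05 selecting RIP-relative addressing):

```python
import struct

def movabs_rax(imm64: int) -> bytes:
    """movabs rax, imm64 -- REX.W (48) + B8, then the 8-byte immediate."""
    return bytes([0x48, 0xB8]) + struct.pack("<Q", imm64)

def lea_rax_riprel(insn_addr: int, target: int) -> bytes:
    """lea rax, [rip+disp32] -- REX.W (48) 8D with RIP-relative ModRM (05).
    disp32 is relative to the end of this 7-byte instruction."""
    disp = target - (insn_addr + 7)
    return bytes([0x48, 0x8D, 0x05]) + struct.pack("<i", disp)

print(len(movabs_rax(0x123456789ABC)))        # 10 bytes
print(len(lea_rax_riprel(0x1000, 0x2000)))    # 7 bytes
```

So the movabs form costs 10 bytes against 7 for the RIP-relative lea, which is the code-size difference visible in the disassembly.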


There are 2 answers

Jérôme Richard

TL;DR: In a hot loop, the former (movabs) is generally faster because it has a lower (better) reciprocal throughput on most modern processors.


Indeed, on Intel Haswell/Broadwell/Skylake/CoffeeLake/CannonLake/IceLake/TigerLake/RocketLake (too many of those lakes), movabs has a reciprocal throughput of 0.25, while it is 1 for the lea (due to the RIP-relative addressing).

On the quite-recent Intel AlderLake hybrid architecture, things are significantly more complex. AlderLake's P-cores (GoldenCove) have a reciprocal throughput of 0.2 for movabs and 1 for the lea (again mainly due to the RIP-relative addressing). AlderLake's E-cores (Gracemont) are pretty different: the reciprocal throughput is 0.33 for movabs but 0.25 for the lea. This means the best instruction to use depends on which core the thread is scheduled on! This is crazy. Even funnier: it looks like Goldmont/Tremont already had fast lea with RIP-relative addressing, while SunnyCove/WillowCove did not. This is because the P-core and E-core architectures are designed for different purposes (AFAIK the Mont-like architectures were designed for low-power processors, while the Cove-like ones were designed for desktop processors). Not to mention Intel certainly had not initially planned to mix the two kinds of architecture in the same chip.

On AMD Zen1/Zen2, the reciprocal throughput is 0.25 for movabs and 0.5 for the lea, so the former is better there too. On AMD Zen3/Zen4, both have a reciprocal throughput of 0.25, so they are equally fast on those architectures.

That being said, the former (movabs) takes more space and is likely slower to decode than the latter, so the lea might be better outside a hot loop. Indeed, instructions are decoded to µops once and then kept in a cache for relatively short loops, but decoding is typically the bottleneck for large code executed only once (no hot loop, or a very large one, and code that may need to be fetched from RAM or the L3).

Peter Cordes

Compilers prefer RIP-relative LEA because code size matters more for most use-cases than the fact that it can only execute on one port on Intel CPUs (https://uops.info/ and Jérôme Richard's answer). Also because it's position-independent, unlike movabs, so we should at least consider both options. (Runtime fixup of a 64-bit absolute address is supported in PIE executables on Linux, but it means the dynamic linker has to remap the page writeable and then back to executable. "Text relocations" are generally something to avoid, and ld will warn about them. So that's a big negative against movabs for non-JIT use-cases in modern executables on all mainstream OSes.)

movabs on Sandybridge-family with an immediate that actually has more than 32 significant bits takes extra time to fetch from the uop cache, according to Agner Fog's testing, and has limits on how tightly it can pack into the uop cache, since the 64-bit immediate needs to borrow space from another slot. (https://agner.org/optimize/microarchitecture.pdf#page=125 - in the Sandybridge section; this may have changed in later CPUs.) The 10-byte x86 machine-code size still generally makes it something to avoid, even without any uop-cache penalties.

If you need to get an address into a register more than once per clock, copy from another register that you set up outside the loop, or load from memory (with a RIP-relative addressing mode or from stack space.)

In a loop making a function call, the call will be a bigger bottleneck than one address load per clock anyway.


Of course, if your address actually fits in the low 32 bits of virtual address space, use the 5-byte mov eax, 0x1234567 (as in a Linux non-PIE executable for static data, or the x32 ABI; for a JIT, with Linux's MAP_32BIT if you want to enable that.)
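Putting the three options together, a JIT could pick the shortest encoding that is correct for a given target. A hedged sketch (the helper name is hypothetical; encodings as above, plus the 5-byte mov eax, imm32, which zero-extends into rax):

```python
import struct

def load_address_rax(insn_addr: int, target: int) -> bytes:
    """Pick the shortest correct way to put `target` into rax,
    given that the instruction will be placed at insn_addr."""
    if 0 <= target <= 0xFFFFFFFF:
        # mov eax, imm32 (5 bytes) -- zero-extends into rax
        return bytes([0xB8]) + struct.pack("<I", target)
    disp = target - (insn_addr + 7)  # RIP points past the 7-byte lea
    if -2**31 <= disp < 2**31:
        # lea rax, [rip+disp32] (7 bytes, position-independent)
        return bytes([0x48, 0x8D, 0x05]) + struct.pack("<i", disp)
    # movabs rax, imm64 (10 bytes) -- always reaches
    return bytes([0x48, 0xB8]) + struct.pack("<Q", target)
```

Note the middle case only works when the target is within ±2 GiB of the code being emitted; otherwise movabs is the fallback.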

Related: How to load address of function or label into register covers the details of the choices, but without discussing the LEA throughput tradeoff.