Sites like https://uops.info/ and Agner Fog's instruction tables, and even Intel's own manuals, list various forms of the same instruction. For example add m, r (in Agner's tables), or add (m64, r64) on uops.info, or ADD r/m64, r64 in Intel's manual (https://www.felixcloutier.com/x86/add).
Here's a simple example I ran on godbolt:
__thread int a;
void Test() {
    a += 5;
}
The resulting instruction is add DWORD PTR fs:0xfffffffffffffffc,0x5. Its machine code starts with the bytes 64 83 04 25.
There are a few ways to write my real code, but I wanted to look up how many cycles this might take, along with other information. How the heck do I find the reference entry for this instruction? I tried https://uops.info/table.html, typing in "add" and checking off my architecture, but I have no idea which of the entries is the instruction that's being used.
For now, in this specific case, I'm guessing the matching entry is Add m64, r64, but I have no idea whether there's any penalty for using fs: before the address, or whether there's a way to see opcodes so I can confirm I'm looking at the right reference.
http://ref.x86asm.net/coder64.html has an opcode map, but with a bit of experience you won't need one most of the time. Especially when you have disassembly, you can just check the manual entry for that mnemonic (https://www.felixcloutier.com/x86/add) and see which of the possible opcodes it is (83 /0 add r/m32, imm8).
Clearly this has a 32-bit operand-size (dword ptr) memory destination, and the source is an immediate (numeric constant). That rules out an r64 register source for 2 separate reasons: the operand-size is 32-bit, not 64-bit, and the source isn't a register at all. So even without looking at the machine code, it's definitely add r/m32, imm with an imm8 or imm32. Any sane assembler will of course pick imm8 for a small constant that fits in a signed 8-bit integer.
Generally different ways of encoding the same instruction aren't special, so the source-level assembly / disassembly is fine, as long as you understand what's a register, what's memory, and what's an immediate.
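If you do want to confirm the form from the machine code, the check described above can be sketched in a few lines of C. This is only an illustration, not a real decoder: the names split_modrm, is_add_rm32_imm8 and fits_in_imm8 are hypothetical, and it assumes 64-bit mode with no prefix other than an optional FS override (64). In the usage note, the bytes after 64 83 04 25 are reconstructed from the disassembly (a disp32 of -4, then the imm8 of 5) rather than quoted from the compiler output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The three fields of a ModRM byte. */
typedef struct {
    unsigned mod; /* top 2 bits: addressing mode */
    unsigned reg; /* middle 3 bits: a register, or the /digit opcode extension */
    unsigned rm;  /* low 3 bits: register or memory operand (4 = SIB follows) */
} ModRM;

/* Split a ModRM byte into its mod / reg / rm fields. */
ModRM split_modrm(uint8_t byte)
{
    ModRM m;
    m.mod = (byte >> 6) & 3u;
    m.reg = (byte >> 3) & 7u;
    m.rm  = byte & 7u;
    return m;
}

/* Returns 1 if the bytes look like "83 /0" (add r/m32, imm8),
 * optionally preceded by an FS segment-override prefix (64).
 * Assumption: no other prefixes (66, REX, ...) are handled here. */
int is_add_rm32_imm8(const uint8_t *code, size_t len)
{
    assert(code != NULL);
    size_t i = 0;
    if (i < len && code[i] == 0x64) {
        i++;                      /* skip the FS override prefix */
    }
    if (i + 1 >= len) {
        return 0;                 /* need at least opcode + ModRM */
    }
    if (code[i] != 0x83) {
        return 0;                 /* not the "r/m, imm8" ALU group opcode */
    }
    return split_modrm(code[i + 1]).reg == 0; /* /0 selects ADD */
}

/* An assembler can use the imm8 form when the constant
 * survives sign-extension from 8 bits. */
int fits_in_imm8(int64_t value)
{
    return value >= -128 && value <= 127;
}
```

For the instruction in the question, calling is_add_rm32_imm8 on the bytes 64 83 04 25 fc ff ff ff 05 returns 1: the ModRM byte 04 splits into mod=0, reg=0 (the /0), rm=4 (a SIB byte follows), and the constant 5 fits in an imm8.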
But there are a few special cases, e.g. Agner Fog's guide notes that rotates by 1 using the short-form encoding are slower than rol reg, imm8 even when the imm8=1, because the flag-updating special case for rotate-by-1 actually depends on the opcode, not the immediate count. (Intel's documentation apparently assumes your assembler will always pick the short form for rotate by constant 1. The part about "masked count" may only apply to rotate by cl: https://www.felixcloutier.com/x86/rcl:rcr:rol:ror#flags-affected. I haven't tested this recently and am not 100% sure I'm remembering correctly when OF is updated (the other flags in the SPAZO group are always left unmodified), but IIRC that's why rotates by 1 (2 uops) and by cl (3 uops) are slow, vs. rotates by other immediate counts (1 uop) on Intel.)
Or see https://github.com/travisdowns/uarch-bench/wiki/Intel-Performance-Quirks. (Specifically I mean the Q&A "Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?": even on Haswell / Skylake, adc al,0 (using the short form with no modrm byte) is 2 uops, and so is the equivalent adc eax, 12345. But adc edx, 12345 is 1 uop using the non-special case.) In cases like these you have to either check the machine code, or know how your assembler will have chosen to encode a given instruction (normally optimizing for size).
BTW, using a segment with a non-zero base adds 1 cycle of latency to address-generation, IIRC, but isn't a significant throughput penalty. (Unless of course throughput bottlenecks on a latency chain that it's part of...)
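To make "check the machine code" concrete for the special cases above, here is a rough C sketch that tells the encoding forms apart by their first opcode byte. The enum and the classify_form name are made up for illustration, and it assumes the buffer starts at the opcode with no prefixes (no 66, REX, or segment override). It only reports which form was encoded; uop counts still have to come from the tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    FORM_OTHER = 0,
    FORM_ADC_ACC_SHORT, /* 14 ib / 15 id: adc al/eax, imm with no ModRM byte */
    FORM_ADC_MODRM_IMM, /* 80 /2, 81 /2, 83 /2: adc r/m, imm */
    FORM_ROT_BY_1,      /* D0 / D1: rotate by an implicit count of 1 */
    FORM_ROT_BY_IMM8,   /* C0 / C1: rotate by imm8 (even if the imm8 is 1) */
    FORM_ROT_BY_CL      /* D2 / D3: rotate by cl */
} EncodingForm;

/* Classify the encoding form from the opcode byte and, where needed,
 * the /digit field of the ModRM byte.
 * Assumption: code[0] is the opcode; prefixes are not handled. */
EncodingForm classify_form(const uint8_t *code, size_t len)
{
    assert(code != NULL);
    if (len == 0) {
        return FORM_OTHER;
    }
    uint8_t op = code[0];
    if (op == 0x14 || op == 0x15) {
        return FORM_ADC_ACC_SHORT;
    }
    if (len < 2) {
        return FORM_OTHER;
    }
    unsigned ext = (code[1] >> 3) & 7u; /* /digit opcode extension */
    if ((op == 0x80 || op == 0x81 || op == 0x83) && ext == 2) {
        return FORM_ADC_MODRM_IMM;      /* /2 selects ADC in this group */
    }
    if (ext <= 3) {                     /* /0 rol, /1 ror, /2 rcl, /3 rcr */
        if (op == 0xD0 || op == 0xD1) return FORM_ROT_BY_1;
        if (op == 0xC0 || op == 0xC1) return FORM_ROT_BY_IMM8;
        if (op == 0xD2 || op == 0xD3) return FORM_ROT_BY_CL;
    }
    return FORM_OTHER;
}
```

For example, adc eax, 12345 assembled to the short form (15 39 30 00 00) comes back as FORM_ADC_ACC_SHORT, while adc edx, 12345 (81 d2 39 30 00 00) is FORM_ADC_MODRM_IMM. Likewise rol eax, 1 can be either d1 c0 (FORM_ROT_BY_1) or c1 c0 01 (FORM_ROT_BY_IMM8), which is exactly the distinction the disassembly text alone doesn't show.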