How to determine x86 machine opcode values based on real mode offsets and addressing?

I am trying to write raw machine code bytes as 0s and 1s into a text file, and execute it as such through the BIOS.

However, I have trouble understanding how addressing, multiplying, offsets, operands, and instructions combine, e.g. the difference between MOV AL, 07 and MOV BL, AL.

I mean it makes sense in Assembly, but in machine code it becomes highly difficult to get the idea of parameters.

So what I want to know is this: How can I better understand this? There are no tutorials I've found that accurately explain/describe the 0s and 1s from instructions in combinatorial correlations or connections between data passing, MMIO, addressing modes, arithmetic, and the like.

On this site http://ref.x86asm.net/coder32.html#x00 it tries, but I don't understand this.

EXAMPLE: Say I want to move 5 into AL ... would I specify the literal '5' in binary as part of the opcode, chained with the binary for the AL/MOV instruction, or would I have one fixed binary code for each instruction, regardless of value? That is what I want to know ... how to understand how machine code is written.

There are 2 answers

Carl Norum

There is (mostly) a one-to-one mapping between assembler mnemonics and machine instructions. You can find these mappings in the Intel Software Developer's Manual, Volume 2, which contains the complete x86 16-, 32- and 64-bit instruction sets. You'll probably want to start with Chapter 2: Instruction Format, which describes the translations you're trying to come up with.

In the case of mov al, 5 it's just as you say: you put the literal there. The instruction in machine code is:

b0 05

That's the MOV r8, imm8 form of the MOV instruction: the opcode byte is b0 plus the register number (AL is 0), followed by the immediate byte. For mov bl, al, you'd want the MOV r/m8, r8 form, which in your case would encode to:

88 c3

The c3 you can look up in Table 2-2, 32-Bit Addressing Forms with the ModR/M Byte, where you'll see it at the intersection of the BL row and the AL column. (There's a 16-bit table, too, if that's the mode you're in; the value in this case is the same.)
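As a rough illustration of the two encodings above, here is a minimal Python sketch. The helper names (mov_r8_imm8, mov_r8_r8, as_bits) are made up for this example and only cover 8-bit register operands; the register numbers are the standard x86 ones.

```python
# 8-bit register numbers as used in the opcode, reg, and r/m fields.
REG8 = {"al": 0, "cl": 1, "dl": 2, "bl": 3, "ah": 4, "ch": 5, "dh": 6, "bh": 7}


def mov_r8_imm8(dst, imm):
    """Encode MOV r8, imm8: opcode B0+reg, followed by the immediate byte."""
    return bytes([0xB0 + REG8[dst], imm & 0xFF])


def mov_r8_r8(dst, src):
    """Encode MOV r/m8, r8 (opcode 88) with a register as the destination."""
    # ModR/M layout: mod (2 bits) | reg (3 bits) | r/m (3 bits).
    # mod = 11 means r/m names a register; for opcode 88 the reg field is the source.
    modrm = (0b11 << 6) | (REG8[src] << 3) | REG8[dst]
    return bytes([0x88, modrm])


def as_hex(code):
    """Render each byte as two hex digits."""
    return " ".join(f"{b:02x}" for b in code)


def as_bits(code):
    """Render each byte as eight 0/1 characters."""
    return " ".join(f"{b:08b}" for b in code)


for code in (mov_r8_imm8("al", 5), mov_r8_r8("bl", "al")):
    print(as_hex(code), "=", as_bits(code))
# b0 05 = 10110000 00000101
# 88 c3 = 10001000 11000011
```

So the literal 5 is simply the second byte of the instruction, while the choice of register is folded into the opcode byte (first form) or the ModR/M byte (second form).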

mike

Unfortunately, x86 encoding is complex and irregular, and understanding it is hard work. The best "quick start" on the encoding is a set of HTML pages at sandpile.org (it's terse, but pretty thorough).

First: http://sandpile.org/x86/opc_enc.htm - the "instruction encodings" table shows the dozen or so ways in which instructions are coded. The white cells in each row represent the mandatory bytes in the instruction; the following grey cells are there (or not there) based on various fields appearing earlier in the opcode. You should look at the rows starting with a white "0Fh", as well as the first row. At the bottom of the same page are the bitfields appearing in various "extended" opcode fields - for now you can ignore all but the "modrm/sib" row (the first row).

Notice that for all but the first row (which is 1-byte opcodes), a "mod r/m" byte must follow the opcode (for the 1-byte opcodes, it depends on the instruction). This encodes the arguments for most 2-argument instructions. The table at http://sandpile.org/x86/opc_rm.htm has the meanings: one of the arguments must be a register, the other argument can be a register or a memory indirection (the "reg" field encodes the register, the "mod" and "r/m" fields encode the other argument). There's usually also a "direction" bit elsewhere in the opcode indicating the order of the arguments. The opcode also indicates whether we're manipulating, e.g., AL, AX, EAX or RAX (i.e. different sizes), or one of the extended registers, which is why each 3-bit field is listed as referring to many different registers.
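To make the three fields concrete, here is a small Python sketch that splits a ModR/M byte into "mod", "reg" and "r/m". The function name split_modrm and the two name tables are just for illustration; the tables list the standard 8-bit and 16-bit register numbering.

```python
def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, rm) bit fields."""
    mod = (byte >> 6) & 0b11   # top two bits
    reg = (byte >> 3) & 0b111  # middle three bits
    rm = byte & 0b111          # bottom three bits
    return mod, reg, rm


# The same 3-bit number names a different register depending on operand size.
REG8_NAMES = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
REG16_NAMES = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]

mod, reg, rm = split_modrm(0xC3)
print(f"mod={mod:02b} reg={reg:03b} rm={rm:03b}")
print("8-bit operands: ", REG8_NAMES[reg], REG8_NAMES[rm])
print("16-bit operands:", REG16_NAMES[reg], REG16_NAMES[rm])
```

For the byte c3 this gives mod=11, reg=000, rm=011, i.e. AL and BL for an 8-bit opcode, or AX and BX for a 16-bit one.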

In modrm, if the "mod" bits are "11", then the "r/m" field also refers to a register. Otherwise it usually refers to a memory address constructed by adding the named register to an (optional) displacement appearing after the modrm byte (with 32-bit addressing this constant is 0, 1, or 4 bytes long depending on the "mod" bits). The exception is when the "r/m" bits are "100" (i.e. 0x4), which would otherwise name "SP" - in this case, the memory argument is described by an additional "sib" byte which immediately follows the modrm byte (any modrm displacement appears after the sib). For the encoding of SIB, look at http://sandpile.org/x86/opc_sib.htm, or click through from the modrm page. Note that with 16-bit addressing, the default in real mode, there is no sib byte: the "r/m" field selects from a different table of base/index register combinations, and the displacement is 0, 1, or 2 bytes.
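The rules in this paragraph can be sketched in Python as follows. This assumes 32-bit addressing only (16-bit addressing uses a different table), and describe_modrm32 is a hypothetical helper name. It looks only at the modrm byte, so it does not handle the case where the sib byte itself asks for an extra displacement.

```python
def describe_modrm32(modrm):
    """Classify a ModR/M byte, assuming 32-bit addressing."""
    mod = (modrm >> 6) & 0b11
    rm = modrm & 0b111

    if mod == 0b11:
        # r/m names a register: no memory access, no SIB, no displacement.
        return {"operand": "register", "sib": False, "disp_bytes": 0}

    # r/m = 100 means a SIB byte follows the ModR/M byte.
    has_sib = rm == 0b100

    if mod == 0b01:
        disp_bytes = 1
    elif mod == 0b10:
        disp_bytes = 4
    elif rm == 0b101:
        # Special case: mod = 00 with r/m = 101 is a bare 32-bit displacement.
        disp_bytes = 4
    else:
        disp_bytes = 0

    return {"operand": "memory", "sib": has_sib, "disp_bytes": disp_bytes}


for byte in (0xC3, 0x03, 0x44, 0x80, 0x05):
    print(f"{byte:02x}", describe_modrm32(byte))
```

A decoder needs exactly this information to know how many bytes to consume after the modrm byte before the next instruction starts.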

Finally, to understand where the direction and size come from, look at some opcodes: http://sandpile.org/x86/opc_1.htm. The first four entries (opcodes 00 to 03) are all "ADD", with the arguments in two different orders, and being of two different widths. So in this case, the bottom two bits of the opcode encode the direction and width: bit 0 selects byte versus full-size operands, and bit 1 selects which argument is the destination.
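As a small sketch of that last point, the following Python function reads the two low bits of the first four ADD opcodes. The name add_form is invented for this example, and it deliberately rejects any opcode outside 00 to 03 rather than guess at other instructions.

```python
def add_form(opcode):
    """Describe the operand form of ADD opcodes 0x00 through 0x03."""
    if opcode not in (0x00, 0x01, 0x02, 0x03):
        raise ValueError("this sketch only covers ADD opcodes 00-03")

    wide = opcode & 0b01            # bit 0: 0 = 8-bit, 1 = 16/32-bit operands
    to_reg = (opcode >> 1) & 0b01   # bit 1: 0 = r/m is destination, 1 = reg is

    size = "16/32" if wide else "8"
    if to_reg:
        return f"ADD r{size}, r/m{size}"
    return f"ADD r/m{size}, r{size}"


for op in range(4):
    print(f"{op:02x}: {add_form(op)}")
# 00: ADD r/m8, r8
# 01: ADD r/m16/32, r16/32
# 02: ADD r8, r/m8
# 03: ADD r16/32, r/m16/32
```

The same pair of bits shows up in many of the older two-operand instructions, which is why mov bl, al can use opcode 88 with AL in the "reg" field.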