What is PC-relative addressing and how can I use it in MASM?

I'm following Jack Crenshaw's compiler tutorial (If you look at my profile, that's what all my questions are about lol) and it just got to the point where variables are introduced. He comments that the 68k requires everything to be "position-independent" which means it's "PC-relative". I get that PC is the program counter, and on x86 it's EIP. But he uses syntax like MOVE X(PC),D0 where X is a variable name. I've read a little ahead and it says nothing later about declaring a variable in .data. How does this work? To make this work in x86, what would I replace X(PC) with in MOV EAX, X(PC)?

To be honest I'm not even sure this is supposed to output working code yet, but up to this point it has and I've added code to my compiler that adds the appropriate headers etc and a batch file to assemble, link and run the result.

There are 2 answers

harold On BEST ANSWER

Here's a short overview of what a statically allocated global variable (which is what this question is about) really is and what to do about them.

What is a variable anyway

To the machine, there is no such thing as a variable. It never hears about them, it never cares about them, it just has no concept of them. They're just a convention to assign a consistent meaning to a particular location in RAM (in the case of virtual memory, a position in your address space).

Where you actually put a variable is sort of up to you, but within reason. If you're going to write to it (and you probably are), it had better be in a writable location, which means: the address of that variable should fall within a memory area that is allocated and writable. The .data section is just another convention for that. You don't have to call it that, you don't even need a separate section (you could make your .text section writable and allocate your globals there, if you really wanted), you could even use OS functions like VirtualAllocEx (or equivalent) to allocate memory at a fixed position and use that (but don't do that). It's up to you. But the .data section is a convenient place to put them.

"Allocating" the variables is just a matter of choosing an address such that the variable doesn't overlap with any other variable. That's not hard, just lay them out sequentially: start a pointer var_ptr at the beginning of wherever you're going to put them (so the VA of your .data section, or 0 if you're using a linker), and then for every variable v:

  • the location l of v is align(var_ptr, round_up_to_power_of_2(sizeof(v)))
  • set var_ptr to l + sizeof(v)
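The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names mirror the pseudocode, and the variable names and sizes in the example are made up.

```python
def round_up_to_power_of_2(n):
    """Smallest power of two that is >= n (n must be at least 1)."""
    p = 1
    while p < n:
        p *= 2
    return p

def align(addr, alignment):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)

def layout(variables, base=0):
    """variables: list of (name, size_in_bytes), in declaration order.
    Returns ({name: address}, first_free_address)."""
    addresses = {}
    var_ptr = base
    for name, size in variables:
        loc = align(var_ptr, round_up_to_power_of_2(size))
        addresses[name] = loc
        var_ptr = loc + size
    return addresses, var_ptr

# Hypothetical variables: a byte, a dword, another byte, a qword.
addrs, end = layout([("flag", 1), ("count", 4), ("ch", 1), ("total", 8)])
print(addrs, end)
```

With base=0 the addresses are offsets into the section, which is what you want when a linker decides the final position.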

As a minor variation, you could skip the alignment (most compiler textbooks do that, but in real life you should align). x86 usually lets you get away with that.

As a bigger variation, you could try to "fill the holes" left by the alignments. The simplest way to fill at least most holes is to just sort the variables biggest-first (that fills all holes if all sizes are powers of two). While that may save some space (though not necessarily any, because sections are aligned themselves), it never saves much. Under the usual alignment rules the "just lay them out sequentially"-algorithm will, at worst, waste nearly half the space it uses on holes. The pattern that leads to that is an alternating sequence of the smallest type and the biggest type. And let's be honest, that wouldn't really happen - and even if it did, that's not all that bad.
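To get a feel for the numbers, here is a small Python sketch comparing the alternating worst case against biggest-first ordering. It assumes every size is a power of two and that each variable is aligned to its own size; the sizes are invented for the example.

```python
def layout_end(sizes, base=0):
    """Lay out variables sequentially, each aligned to its own
    (power-of-two) size, and return the first free offset."""
    ptr = base
    for size in sizes:
        ptr = (ptr + size - 1) & ~(size - 1)  # align up to this variable's size
        ptr += size
    return ptr

sizes = [1, 8, 1, 8, 1, 8]                    # alternating smallest / biggest type
useful = sum(sizes)                           # bytes that actually hold data
in_order = layout_end(sizes)                  # declaration order, with holes
biggest_first = layout_end(sorted(sizes, reverse=True))

print(useful, in_order, biggest_first)
```

In declaration order the section needs 48 bytes for 27 bytes of data, so 21 bytes are holes; sorted biggest-first it needs exactly 27.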

Then, you have to make sure that the .data segment is big enough to hold all variables, and that the initial contents match what the variables were initialized with.

But you don't even have to do any of this. You can use variable declarations in the assembly code (you know how to do this), and then the assembler/linker (they typically both play a role in this) will do all of this for you (and, of course, it will also do the replacement of variable names by variable addresses).
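If the compiler writes assembly text, "letting the assembler do it" just means emitting a declaration per variable. The Python sketch below shows one way a code generator might do that for 32-bit MASM. The function name, the `_<id>` label scheme and the DWORD-only assumption are illustrative choices, and the output is a fragment: it still needs whatever header and entry-point boilerplate your build already adds.

```python
def emit_masm(globals_, loads):
    """globals_: list of (node_id, initial_value) for 32-bit variables.
    loads: node ids whose value should be loaded into EAX.
    Returns a MASM source fragment as a string."""
    lines = [".data"]
    for node_id, init in globals_:
        # One DWORD per variable; the assembler/linker picks the address.
        lines.append(f"_{node_id} DWORD {init}")
    lines.append(".code")
    for node_id in loads:
        # Refer to the variable by label; no PC-relative syntax is needed.
        lines.append(f"    mov eax, dword ptr [_{node_id}]")
    return "\n".join(lines)

src = emit_masm([(17, 0), (23, 42)], [23])
print(src)
```

For the 68k line MOVE X(PC),D0 from the tutorial, the 32-bit x86 equivalent in this scheme is simply a mov from the label.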

How to use a variable

It depends. If you're using an assembler/linker, just refer to the label that you gave the variable. The label, of course, does not have to match the name in the source code, it can be any legal unique name (for example, you could use the AST node ID of the declaration with an underscore in front of it).

So loading a variable could look like this:

mov eax, dword ptr [variablelabel]

Or, on x64, perhaps this (NASM syntax; MASM's ml64 already encodes a plain label reference like the one above as RIP-relative)

mov eax, dword [rel variablelabel]

Which would emit a rip-relative address. If you do that, you don't have to care about the current value of RIP or where the variable is allocated, the assembler/linker will take care of it. On x64, using a RIP-relative address like that is common, for several reasons:

  • it allows the .data segment to be somewhere that isn't the first 4GB (or 2GB) of address space, as long as it's close to the .text segment
  • it's shorter than an instruction with an absolute 64bit address
  • only the accumulator forms of mov even take an absolute 64bit address, for example mov rax,[imm64] and mov [imm64],rax
  • you get relocations for free

If you're not using an assembler and/or linker, it becomes (at least to some extent) your own job to replace variable names by whatever address you allocated for them (if you're using a linker but no assembler, you'd make relocation data but you wouldn't yourself decide on the absolute addresses of variables).

When you're using absolute addresses, you can "put them in" while emitting instructions (provided you've already allocated the variables). When you're using RIP-relative addresses, you can only put them in once you've decided where the code will be: emit code with the offsets set to 0, do some bookkeeping, decide where the code goes, then go back and replace the 0's with the real offsets. That is a non-trivial problem in itself, unless you take the naive approach and don't care about branch-size optimization; in that case you know the address of an instruction at the time you emit it, and therefore the offset of a variable relative to RIP. The RIP-relative offset itself is easy to calculate: subtract the address of the byte immediately after the current instruction (the value RIP has while that instruction executes) from the VA (virtual address) of the variable.
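The offset calculation and the "emit zeros, patch later" bookkeeping can be sketched in Python as follows. The addresses, the symbol name and the fixup tuple layout are assumptions made for the example; the only encoding used is mov eax,[rip+disp32], which is the bytes 8B 05 followed by a little-endian 32-bit displacement.

```python
import struct

def rip_relative_disp(instr_va, instr_len, target_va):
    """Displacement = target VA minus the VA just past the instruction."""
    disp = target_va - (instr_va + instr_len)
    if not -2**31 <= disp < 2**31:
        raise ValueError("target is out of disp32 range")
    return disp

def encode_mov_eax_riprel(instr_va, target_va):
    """Encode mov eax,[rip+disp32] (8B 05 disp32, 6 bytes long)."""
    disp = rip_relative_disp(instr_va, 6, target_va)
    return bytes([0x8B, 0x05]) + struct.pack("<i", disp)

def apply_fixups(code, fixups, code_va, symbols):
    """Patch placeholder displacements once code_va is known.
    fixups: (offset of disp32 in code, offset of next instruction, symbol)."""
    for disp_off, next_off, name in fixups:
        disp = symbols[name] - (code_va + next_off)
        struct.pack_into("<i", code, disp_off, disp)

# Direct encoding when the instruction address is already known.
direct = encode_mov_eax_riprel(0x401000, 0x403000)

# Same instruction emitted with a zero placeholder and patched afterwards.
buf = bytearray(b"\x8b\x05\x00\x00\x00\x00")
apply_fixups(buf, [(2, 6, "x")], 0x401000, {"x": 0x403000})

print(direct.hex(), buf.hex())
```

Both routes produce the same six bytes, with a displacement of 0x403000 - 0x401006 = 0x1FFA.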

But that's not all

You may want to make some variables non-writable, to the point that any attempt to write to them in "funny ways that the compiler can't detect" will fail. That can be accomplished by putting them in a read-only section, typically called .rdata (but the name is irrelevant really, what matters is whether the "writable" flag of the section is set in the PE header). This isn't done often, though it is sometimes used for string or array constants (which aren't properly variables).

What is done regularly is putting zero-initialized variables in their own section, a section that takes no space in the executable file but is instead simply zeroed out when loaded. Putting zero-initialized variables there may save some space in the executable. This section is commonly called .bss (not short for bullsh*t section), but as always, the name is irrelevant.
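As a rough sketch, a compiler could pick the section per variable along these lines. The function name and its parameters are hypothetical, and the section names are only the conventional ones mentioned above.

```python
def choose_section(writable, initial_bytes):
    """Pick a section for a global.
    writable: False for constants that must never be written.
    initial_bytes: the initial contents as bytes, or None if uninitialized."""
    if not writable:
        return ".rdata"                      # read-only data
    if initial_bytes is None or not any(initial_bytes):
        return ".bss"                        # all zero: takes no file space
    return ".data"                           # writable, non-zero initial value

print(choose_section(True, None))
print(choose_section(True, (42).to_bytes(4, "little")))
print(choose_section(False, b"hello\x00"))
```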

More

Most compiler textbooks deal with this subject to varying degrees, though usually not in much detail, because when you get right down to it: static variables aren't hard. Certainly not compared to most other aspects of compilation. Also, some aspects are very platform specific, such as the details around the sections and how things actually end up in an executable.

Some sources/useful things (I've found all of these useful while working on compilers):

Martin Rosenau On

Many processors support PC-Relative or Absolute addressing.

On 32-bit x86 machines, however, there is the following restriction:

  • Jumps and Calls are always PC-Relative (unless register-based)
  • Other addresses are always Absolute (unless register-based)

C compilers that can do PC-Relative addressing will implement this the following way:

  CALL x
x:
  ; Now address "x" is on the stack
  POP EDI
  ; Now EDI contains address of "x"
  ; Now we can do (pseudo-)PC-Relative addressing:
  MOV EAX,[EDI+1234]

This is used if the address of the code in memory is not known at compile/link time (e.g. for dynamic libraries (shared objects) under Linux), so the address of a variable (here located at address "x+1234") is not known yet.
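The reason this works is plain arithmetic: the distance between the label and the variable is fixed at link time, so it survives loading at any base address. A small Python check, with made-up offsets and load addresses:

```python
X_OFFSET = 0x1005               # hypothetical offset of label x within the module
VAR_OFFSET = X_OFFSET + 1234    # the variable sits 1234 bytes after x

def address_used_by_mov(load_base):
    """Address computed by MOV EAX,[EDI+1234] when the module is at load_base."""
    edi = load_base + X_OFFSET  # value that POP EDI yields at run time
    return edi + 1234

# (computed address, real address of the variable) for three load addresses
results = [(address_used_by_mov(base), base + VAR_OFFSET)
           for base in (0x00400000, 0x10000000, 0x7F000000)]
print(results)
```

Each pair holds the same value twice, wherever the loader places the module.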