How are global pointer variables stored in memory?


Suppose we have this simple code:

int* q = new int(13);

int main() {
    return 0;
}

Clearly, q is a global, initialized variable. From this answer, we would expect q to be stored in the initialized data segment (.data) within the program file. But q is a pointer, so its value (an address in the heap) is only determined at run time. So what value is stored in the data segment within the program file?

My attempt:
My guess is that the compiler allocates space for q (typically 8 bytes for a 64-bit address) in the data segment, with no meaningful value. It then puts initialization code in the text segment, to run before main, which initializes q at run time. Something like this in assembly:

     ....
     mov  edi, 4
     call operator new(unsigned long)
     mov  DWORD PTR [rax], 13  // rax: 64 bit address (pointer value)

     // offset : q variable offset in data segment, calculated by compiler
     mov  QWORD PTR [ds+offset], rax // store address in data segment
     ....
main:
     ....

Any idea?


There are 2 answers

Answer by Dietrich Epp (best answer)

Yes, that is essentially how it works.

Note that in ELF, .data, .bss, and .text are actually sections, not segments. You can look at the assembly yourself by running your compiler:

c++ -S -O2 test.cpp

You will typically see a main function, and some kind of initialization code outside that function. The program entry point (part of your C++ runtime) will call the initialization code and then call main. The initialization code is also responsible for running things like constructors.
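To see the effect from the C++ side, here is a minimal sketch. The names events, Tracer, tracer and globals_ready are made up for this illustration and are not part of the question's code. It relies only on the language rule that dynamically initialized globals in one translation unit are initialized in definition order, before main starts.

```cpp
#include <string>
#include <vector>

// Hypothetical log used only for this illustration.
// Globals in one translation unit are initialized in definition order,
// so the log exists before anything below writes to it.
std::vector<std::string> events;

struct Tracer {
    explicit Tracer(const char* name) { events.push_back(name); }
};

int* q = new int(13);           // dynamic initializer: runs before main
Tracer tracer("tracer built");  // this constructor also runs before main

// Returns true if both initializers above have already run.
bool globals_ready() {
    return q != nullptr && *q == 13 && events.size() == 1;
}
```

Calling globals_ready() as the first statement of main should return true, since the startup code has already run both initializers by then.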

Answer by Peter Cordes

int *q will go in the .bss, not the .data section, since it's only initialized at run-time by a non-constant initializer (so this is only legal in C++, not in C). There's no need to have 8 bytes in the executable's data segment for it.
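A small side-by-side sketch of the three cases may help. The section names in the comments describe what GCC and Clang typically do on ELF targets, which is an assumption about the toolchain and not something the C++ standard guarantees. The names a, z and sum_of_globals are hypothetical.

```cpp
// Typical placement on ELF targets with GCC or Clang
// (toolchain behaviour, not a language guarantee):

int a = 13;            // constant initializer: the value 13 is stored in the file (.data)
int z;                 // zero-initialized: .bss, takes no space in the file
int* q = new int(13);  // non-constant initializer: .bss, set by startup code

// Hypothetical helper that reads all three globals after startup.
int sum_of_globals() {
    return a + z + *q;
}
```

Compiling this with the -S option and looking at the directives before each label is one way to check where your own compiler puts each variable.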

The compiler arranges for the initializer function to be run by putting its address into an array of initializers that the CRT (C Run-Time) startup code calls before calling main.

On the Godbolt compiler explorer, you can see the init function's asm without all the noise of directives. Notice that the addressing mode is just a simple RIP-relative access to q. The linker fills in the right offset from RIP at this point, since that's a link-time constant even though the .text and .bss sections end up in separate segments.

Godbolt's compiler-noise filtering isn't ideal for us. Some of the directives are relevant, but many of them aren't. Below is a hand-chosen mix of gcc6.2 -O3 asm output with Godbolt's "filter directives" option unchecked, for just the int* q = new int(13); statement. (There's no need to compile a main at the same time; we're not linking an executable.)

# gcc6.2 -O3 output
_GLOBAL__sub_I_q:      # "sub" presumably stands for subroutine
    sub     rsp, 8           # align the stack for calling another function
    mov     edi, 4           # 4 bytes
    call    operator new(unsigned long)   # this is the demangled name, like from objdump -dC
    mov     DWORD PTR [rax], 13
    mov     QWORD PTR q[rip], rax      # clang uses the equivalent `[rip + q]`
    add     rsp, 8
    ret

    .globl  q
    .bss
q:
    .zero   8      # reserve 8 bytes in the BSS

There's no reference to the base of the ELF data (or any other) segment.

Also definitely no segment-register overrides. ELF segments have nothing to do with x86 segments. (And the default segment register for this is DS anyway, so the compiler doesn't need to emit [ds:rip+q] or anything. Some disassemblers may be explicit and show DS even though there was no segment-override prefix on the instruction, though.)


This is how the compiler arranges for it to be called before main():

    # the "aw" sets options / flags for this section to tell the linker about it.
    .section        .init_array,"aw"
    .align 8
    .quad   _GLOBAL__sub_I_q       # this assembles to the absolute address of the function.

The CRT start code has a loop that knows the size of the .init_array section and uses a memory-indirect call instruction on each function-pointer in turn.
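As a rough model of that loop (not the actual CRT code), the idea can be written in plain C++. The names InitFn, init_table, run_initializers, init_q and init_counter are made up for illustration. In a real program the table is the .init_array section assembled by the linker, and the loop lives in the startup code.

```cpp
#include <cstddef>

using InitFn = void (*)();  // each table entry is a pointer to a void() function

int* q = nullptr;           // stands in for the zeroed 8 bytes in .bss
int counter = 0;

void init_q()       { q = new int(13); }  // plays the role of _GLOBAL__sub_I_q
void init_counter() { counter = 42; }     // a second, hypothetical initializer

// Stand-in for the contents of .init_array: one function pointer per entry.
const InitFn init_table[] = { init_q, init_counter };
const std::size_t init_count = sizeof(init_table) / sizeof(init_table[0]);

// Model of the startup loop: an indirect call through each pointer, in order.
void run_initializers(const InitFn* begin, const InitFn* end) {
    for (const InitFn* fn = begin; fn != end; ++fn) {
        (*fn)();
    }
}
```

Calling run_initializers(init_table, init_table + init_count) before using q mirrors what happens before main in the real program: q starts out as zero and only holds a heap address once its init function has run.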

The .init_array section is marked writeable, so it goes into the data segment. I'm not sure what writes it. Maybe the CRT code marks it as already-done by zeroing the pointers after calling them?


There's a similar mechanism in Linux for running initializers in dynamic libraries, which is done by the ELF interpreter while doing dynamic linking. This is why you can call printf() or other glibc stdio functions from _start in a dynamically-linked binary created from hand-written asm, and why that fails in a statically linked binary if you don't call the right init functions. (See this Q&A for more about building static or dynamic binaries that define their own _start or just main(), with or without libc).