How should I get gcc to realign the stack pointer to a 16-byte boundary on the way in to a function?

I'm trying to get an existing JIT working on Windows x86_64 using mingw64.

I'm getting segfaults when the JIT calls back into precompiled code and that code calls Windows APIs. Aligned move instructions such as movaps inside the Windows API implementations are executed with %rsp not a multiple of 16, i.e. the stack isn't aligned to a 16-byte boundary.

Thread 1 hit Catchpoint 2 (signal SIGSEGV), 0x00007fff5865142d in KERNELBASE!FindFirstFileA () from C:\WINDOWS\System32\KernelBase.dll
1: x/i $pc
=> 0x7fff5865142d <KERNELBASE!FindFirstFileA+125>:      movaps 0x60(%rsp),%xmm0
2: /x $rsp = 0xd8edd8
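
The $rsp value in the gdb output can be checked by hand. movaps with a memory operand needs a 16-byte-aligned address, and since the displacement 0x60 is itself a multiple of 16, the access is aligned only if %rsp is. A minimal sketch of that arithmetic (the helper name misalignment16 is made up for illustration):

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes an address sits above the
 * previous 16-byte boundary. A result of 0 means it is aligned. */
static unsigned misalignment16(uintptr_t addr)
{
    return (unsigned)(addr & 0xF);
}
```

For the faulting frame, misalignment16(0xd8edd8 + 0x60) is 8, so the stack pointer is 8 bytes off a 16-byte boundary.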

In what I was expecting to be a quick workaround, I thought I would get gcc to force a realignment of the stack on the way into the precompiled functions that are called by the JIT code and ultimately call Windows API functions.

The gcc docs for the force_align_arg_pointer attribute:

On x86 targets, the force_align_arg_pointer attribute may be applied to individual function definitions, generating an alternate prologue and epilogue that realigns the run-time stack if necessary. This supports mixing legacy codes that run with a 4-byte aligned stack with modern codes that keep a 16-byte stack for SSE compatibility.

However, adding __attribute__((force_align_arg_pointer)) to the function specifiers had no effect on the output assembly.
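
For reference, this is roughly how the attribute is applied. It is a minimal sketch in which jit_callback is a hypothetical stand-in for one of the precompiled functions the JIT calls, and the REALIGN_STACK macro is an illustrative guard rather than anything from the original code:

```c
#include <stdio.h>

/* The attribute is x86-specific, so only apply it on x86 targets. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define REALIGN_STACK __attribute__((force_align_arg_pointer))
#else
#define REALIGN_STACK
#endif

/* Hypothetical entry point called from JIT-generated code. */
REALIGN_STACK int jit_callback(const char *name)
{
    printf("hello from %s\n", name);
    return 0;
}
```

Comparing the output of gcc -S with and without the attribute is the quickest way to see whether the prologue actually changed.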

I also tried -mpreferred-stack-boundary=4, which explicitly requests 2**4 == 16 byte alignment for all functions:

-mpreferred-stack-boundary=num
    Attempt to keep the stack boundary aligned to a 2 raised to num byte boundary.

This also had no effect.

In fact, the first thing I found that did affect the output assembly was -mpreferred-stack-boundary=3 (which should keep the stack aligned to an 8-byte boundary).

That resulted in this difference:

@@ -46,8 +59,15 @@
        .def    foo;    .scl    2;      .type   32;     .endef
        .seh_proc       foo
 foo:
+       pushq   %rbp
+       .seh_pushreg    %rbp
+       movq    %rsp, %rbp
+       .seh_setframe   %rbp, 0
+       andq    $-16, %rsp
        .seh_endprologue
        leaq    .LC0(%rip), %rcx
+       movq    %rbp, %rsp
+       popq    %rbp
        jmp     printf
        .seh_endproc
        .def    __main; .scl    2;      .type   32;     .endef

Strangely, this actually inserts andq $-16, %rsp (aligning the stack pointer to a multiple of 16), even though we asked for 8-byte alignment.
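
The effect of andq $-16, %rsp is to clear the low four bits of the stack pointer, rounding it down to a multiple of 16. A minimal sketch of the same arithmetic in C (align_down16 is an illustrative name, not something from the JIT or gcc):

```c
#include <stdint.h>

/* Round a stack pointer value down to the nearest multiple of 16. */
static uintptr_t align_down16(uintptr_t sp)
{
    /* -16 is ...11110000 in two's complement, the same mask as ~15. */
    return sp & ~(uintptr_t)15;
}
```

Applied to the faulting value from the gdb session, align_down16(0xd8edd8) yields 0xd8edd0.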

What am I misunderstanding about these options or the cases they work in?

The version of gcc is MSYS2 mingw64's 10.2.0:

$ gcc --version
gcc.exe (Rev4, Built by MSYS2 project) 10.2.0

Answer by amonakov:

The correct workaround would be -mincoming-stack-boundary=3: you should be telling the compiler that the function it compiles may be called with an under-aligned stack (hence "incoming" rather than "preferred": you don't need to raise the preferred alignment above the default).
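
The option takes the base-2 logarithm of the alignment, so 3 tells the compiler to assume only 2**3 == 8 bytes of alignment on entry. A minimal sketch of how this might be used follows. The file name callbacks.c and the function local_is_aligned are hypothetical, and the compile line is an assumption about the build, not something taken from the question:

```c
/* Assumed build line (x86 gcc only):
 *   gcc -O2 -mincoming-stack-boundary=3 -c callbacks.c
 */
#include <stdint.h>

/* Hypothetical callback: reports whether a local that requests
 * 16-byte alignment actually received it. */
int local_is_aligned(void)
{
    volatile __attribute__((aligned(16))) char buf[16] = {0};
    return ((uintptr_t)&buf[0] & 0xF) == 0;
}
```

If realignment is in effect, the function should return 1 even when the caller enters with %rsp only 8-byte aligned.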

As to why the attribute doesn't work, it seems you've found a compiler backend bug specific to the 64-bit Microsoft ABI. The attribute works as you would expect when targeting Linux, but there's some special-casing for Microsoft (and Apple) ABIs in the backend, and it's possible the code does not align with the intended behavior:

  /* 64-bit MS ABI seem to require stack alignment to be always 16,
     except for function prologues, leaf functions and when the defult
     incoming stack boundary is overriden at command line or via
     force_align_arg_pointer attribute.

     Darwin's ABI specifies 128b alignment for both 32 and  64 bit variants
     at call sites, including profile function calls.
 */
  if (((TARGET_64BIT_MS_ABI || TARGET_MACHO)
        && crtl->preferred_stack_boundary < 128)
      && (!crtl->is_leaf || cfun->calls_alloca != 0
          || ix86_current_function_calls_tls_descriptor
          || (TARGET_MACHO && crtl->profile)
          || ix86_incoming_stack_boundary < 128))
    {
      crtl->preferred_stack_boundary = 128;
      crtl->stack_alignment_needed = 128;
    }

(Note how the comment refers to the attribute, but the code evidently does not work that way.)
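
To make that mismatch concrete, the quoted condition can be transcribed into a standalone predicate. This is only a model of the snippet above, with made-up struct and function names; it says nothing about the rest of GCC's prologue logic:

```c
#include <stdbool.h>

/* Simplified stand-in for the state read by the quoted condition.
 * This is not GCC's real data structure. */
struct frame_model {
    bool ms_abi_64;      /* TARGET_64BIT_MS_ABI */
    bool macho;          /* TARGET_MACHO */
    bool is_leaf;        /* crtl->is_leaf */
    bool calls_alloca;   /* cfun->calls_alloca != 0 */
    bool calls_tls_desc; /* ix86_current_function_calls_tls_descriptor */
    bool profile;        /* crtl->profile */
    unsigned preferred;  /* crtl->preferred_stack_boundary, in bits */
    unsigned incoming;   /* ix86_incoming_stack_boundary, in bits */
};

/* True when the quoted condition would force both boundaries to 128. */
static bool forces_128(const struct frame_model *f)
{
    return ((f->ms_abi_64 || f->macho) && f->preferred < 128)
        && (!f->is_leaf || f->calls_alloca || f->calls_tls_desc
            || (f->macho && f->profile)
            || f->incoming < 128);
}
```

In this model, a leaf function on the 64-bit MS ABI with preferred and incoming boundaries of 64 bits still returns true. A lowered incoming boundary appears in the condition as a reason to raise the alignment, whereas the comment lists it among the exceptions.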