Is this inline-asm approach for stack switching ok?

408 views Asked by At

For some functions, I need to switch the stack so that the original stack remains unmodified. For that purpose, I have written two macros as shown below.

#define SAVE_STACK()    __asm__ __volatile__ ( "mov %%rsp, %0; mov %1, %%rsp" : \
"=m" (saved_sp) : "m" (temp_sp) );
#define RESTORE_STACK() __asm__ __volatile__ ( "mov %0, %%rsp" : \
"=m" (saved_sp) );

Here temp_sp and saved_sp are thread local variables. temp_sp points to the makeshift stack that we use. For a function, whose original stack I want unmodified, I place SAVE_STACK at the beginning and RESTORE_STACK at bottom. For example, like this.

int some_func(int param1, int param2)
{
 int a, b, r;
 SAVE_STACK();
 // Function Body here
 .....................
 RESTORE_STACK();
 return r;
}

Now my question is whether this approach is fine. On x86 (64bit), the local variables and parameters are accessed through the rbp register and rsp is accordingly subtracted in function prologue and not touched until in function epilogue where it is added to bring it back to the original value. Therefore, I see no problem here.

I am not sure, if this is correct in the presence of context switches and signals though. (On Linux). Also I'm not sure if this is correct if the function is inlined or if tail call optimization (where jmp instead of call is used) is applied. Do you see any problem or side effects with this approach?

2

There are 2 answers

2
FrankH. On BEST ANSWER

With the code that you've shown above, I can think of the following breakage:

  1. On x86/x64, GCC will "deco" your function with prologues/epilogues if it sees fit, and you can't stop it from doing that (like on ARM, where __attribute__((__naked__)) forces code creation without prologues/epilogues, aka without stackframe setup).
    That might end up allocating stack / creating references to stack memory locations before you switch the stack. Even worse if, again, due to the compiler's choice, such an address is put into a nonvolatile register before you switch the stack, it might alias to two locations (the stackpointer-relative one that you changed and the other-reg-relative one that is the same).

  2. Again, on x86/x64, the ABI suggests an optimization for leaf functions (the "red zone") where no stackframe is allocated yet 128 Bytes of stack "below" the end are usable by the function. Unless your memory buffer takes this into account, overruns might occur that you're not expecting.

  3. Signals are handled on alternate stacks (see sigaltstack()) and doing your own stack switching might make your code uncallable from within signal handlers. It'll definitely make it non-reentrant, and depending on where/how you retrieve the "stack location" will also definitely make it non-threadsafe.

In general, if you want to run a specific piece of code on a different stack, why not either:

  • run it in a different thread (every thread gets a different stack) ?
  • trigger e.g. SIGUSR1 and run your code in a signal handler (which you can configure to use a different stack) ?
  • run it via makecontext() / swapcontext() (see the example in the manpage) ?

Edit:

Since you say "you want to compare the memory of two processes", again, there's different methods for that, in particular external process tracing - attach a "debugger" (that can be a process you write yourself that uses ptrace() to control what you want to monitor, and have it handle e.g. breakpoints / checkpoints on behalf of those you trace, to perform the validations you need). That'd be more flexible as well because it doesn't require to change the code you inspect.

0
Peter Cordes On

-fomit-frame-pointer is on by default. Unless you plan to compile with optimization disabled, the assumption that functions don't touch RSP except in prologue/epilogue is super broken.

Even if you did use -O3 -fno-omit-frame-pointer, compilers will still move RSP around in some cases, although they won't use it to access args and locals. e.g. alloc / C99 VLA, or even calling a function that has more than 6 args (or more precisely, one with args that don't fit in registers), will all move RSP. (Calling a function might just use mov stores, depending on strategy chosen by the compiler.)

Also, "shrink wrap" optimization where a function delays saving call-preserved regs until after a possible early-out could mean your stack-switch happens before the compiler is ready to save/restore. And your restore might happen too late or too early. (This was mentioned in comments by ams.)