Transparently replace file mapping with anonymous


I am doing checkpoint and restore with CRIU. After restore, my application wakes up with some threads whose stacks are mmapped from files on disk (CRIU doesn't do this by default; it is a custom optimization). Later on, I want to transparently replace each such mapping with anonymous memory: allocate a new region, copy the contents over, and finally call mremap to move it to the original address.
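For reference, the replacement step described above might look roughly like the following C sketch. `replace_with_anon` is a hypothetical helper name, not something CRIU provides. It assumes a page-aligned, readable range, ignores stack-specific mapping flags, and does nothing about concurrent writers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: replace whatever is mapped at [addr, addr + len)
 * with private anonymous memory holding the same bytes.
 * addr and len must be page-aligned and the range must be readable.
 * NOT safe while another thread may write to the range.
 * Returns 0 on success, -1 on error (errno is set). */
int replace_with_anon(void *addr, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0 || len == 0 ||
        (uintptr_t)addr % (uintptr_t)ps != 0 || len % (size_t)ps != 0) {
        errno = EINVAL;
        return -1;
    }

    /* 1. Allocate the new anonymous region somewhere else. */
    void *tmp = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (tmp == MAP_FAILED)
        return -1;

    /* 2. Copy the current contents of the file mapping. */
    memcpy(tmp, addr, len);

    /* 3. Move the copy onto the original address. With MREMAP_FIXED the
     *    kernel unmaps the old mapping at addr as part of the same call. */
    if (mremap(tmp, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, addr)
            == MAP_FAILED) {
        int saved = errno;
        munmap(tmp, len);
        errno = saved;
        return -1;
    }
    return 0;
}
```

Any write that lands in the old mapping between steps 2 and 3 is lost, which is the race the rest of the question is about.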

However, there's a glitch in this approach - if the threads start mutating the stack while I copy it over I could break the application. Ideally, I would trap it using userfaultfd but it's not possible to register on a file-mapped memory region. Even if I introduced some mutex to those threads there's no way to tell that the thread is really parked and won't mutate its stack until I wake it up.

I am thinking of using mprotect to make the stacks read-only and handling SIGSEGV. Or is there a better approach? ptrace on the process itself?


There are 2 answers

John Bollinger (best answer)

The only alternative I have come up with that I would trust is for the main thread to use ptrace to force the others to stop, and then to resume them when that is safe. You seem to already be aware of this option, so I will not go into details. The main objective here is to preemptively suspend the activity of the affected threads while their stacks are being copied, which seems far less risky than approaches that do otherwise.

The alternative presented in the question is to use mprotect to trap the threads' attempts to modify data on their stacks while the copy is being made. I guess the idea is to have a lighter touch, allowing threads to proceed as long as they can do so without modifying their stacks, but I don't think that's plausible or viable. Among other things:

  • it seems unlikely in general that any thread will be able to do much meaningful work without modifying its stack, so it seems doubtful that there is much gain available in practice.

  • as I observed in comments, both C and POSIX specify that a program has undefined behavior if a signal handler for SIGSEGV returns normally. Usually, program termination is the only viable alternative, but a sufficiently prepared program might in some cases longjmp() or siglongjmp() out of the handler instead. That could give you a vector for recovery, but only to whatever extent you are prepared to mediate it with special tooling, and only to the extent supported by such tooling.

    It is not safe to assume that the trap handler installed by the kernel will have the effect of retrying the failed instruction of your userspace program in the event that a handler for a segfault returns normally. That ranks very high among the implications of the userspace behavior being undefined. If you did observe that effect with a particular combination of hardware and software then that would be no basis for relying on the same thing for different combinations.

thejh

That premise seems a bit weird to me, I don't really get why you'd have the stacks file-mapped after such a CRIU operation... but anyway:

First off: There is one type of file mapping that userfaultfd does work with, which is shmem/tmpfs. But I don't know whether that helps in your case. If not:

You can't register the file mapping with userfaultfd, but you can register the new anonymous mapping with userfaultfd. This means that one thing you could do would be to first replace the stack with the new mapping, then copy the data over from the file when you know the old mapping is no longer used.
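A rough C sketch of that idea, assuming Linux with userfaultfd available to the process: register the new anonymous region for missing-page faults, then fill pages with `UFFDIO_COPY` once the source data is known to be stable. The helper names `uffd_register_anon` and `uffd_fill` are made up for illustration, and the thread that would read fault events from the descriptor is left out.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1 /* flag added in Linux 5.11 */
#endif

/* Hypothetical helper: open a userfaultfd and register the anonymous
 * range [addr, addr + len) for missing-page faults.
 * Returns the descriptor, or -1 if userfaultfd is unavailable. */
int uffd_register_anon(void *addr, size_t len)
{
    /* Try the unprivileged-friendly mode first; older kernels reject the
     * flag, so fall back to the plain call. */
    int uffd = (int)syscall(SYS_userfaultfd,
                            O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
    if (uffd < 0)
        uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0)
        return -1;

    struct uffdio_api api = { .api = UFFD_API, .features = 0 };
    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)addr, .len = len },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };
    if (ioctl(uffd, UFFDIO_API, &api) < 0 ||
        ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
        close(uffd);
        return -1;
    }
    return uffd;
}

/* Hypothetical helper: populate len bytes at dst (inside the registered
 * range, page-aligned) from src. Threads blocked on a fault in that
 * range are woken by the kernel. Returns 0 on success, -1 on error. */
int uffd_fill(int uffd, void *dst, const void *src, size_t len)
{
    struct uffdio_copy copy = {
        .dst = (uintptr_t)dst,
        .src = (uintptr_t)src,
        .len = len,
        .mode = 0,
    };
    return ioctl(uffd, UFFDIO_COPY, &copy) < 0 ? -1 : 0;
}
```

Until a page has been filled, any thread touching it blocks in the kernel, so a page that is never filled means a thread that never resumes.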

You probably don't want to do exactly this, because then you'd have to block for as long as it takes to copy the entire stack. There are two optimizations you could consider:

  1. You could try to stop the thread and figure out the thread's current stack pointer; any memory below the stack pointer, beyond the red zone the ABI reserves there (128 bytes on amd64), doesn't need to be copied at all, so you only have to register the currently used part of the stack with userfaultfd. (Probably a good way to do this would be to send a signal to the thread and let the signal handler take care of this.) If your threads typically have relatively little stack usage and only use lots of stack memory for short moments, this is probably all you need?
  2. You could copy the file contents into anonymous memory area A ahead of time while letting the kernel monitor which of the file mapping pages have been written to. Then after you replace the file mapping with a new anonymous mapping B with userfaultfd, you can ask the kernel which parts of the file mapping have been written to, copy all those parts into mapping A again, and then mremap() mapping A over the file mapping. This probably only makes sense if your stacks are typically pretty big. To figure out which parts of a file mapping have changed, you can use the kernel's Soft-Dirty interface, using bit 55 in /proc/[pid]/pagemap and /proc/[pid]/clear_refs.
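The soft-dirty bookkeeping from the second option can be sketched in C as follows. This assumes a kernel built with `CONFIG_MEM_SOFT_DIRTY`; the helper names are hypothetical, and the sketch inspects the calling process through `/proc/self` rather than an arbitrary pid.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit 55 of a pagemap entry is the soft-dirty flag. */
#define PM_SOFT_DIRTY (1ULL << 55)

/* Decode the soft-dirty flag from one 64-bit pagemap entry. */
int entry_is_soft_dirty(uint64_t entry)
{
    return (entry & PM_SOFT_DIRTY) != 0;
}

/* Reset soft-dirty tracking for every page of the calling process by
 * writing "4" to clear_refs. Returns 0 on success, -1 on error. */
int clear_soft_dirty(void)
{
    int fd = open("/proc/self/clear_refs", O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, "4", 1);
    close(fd);
    return n == 1 ? 0 : -1;
}

/* Returns 1 if the page containing addr is marked soft-dirty (written
 * since the last clear), 0 if it is not, and -1 if pagemap could not
 * be read. */
int page_soft_dirty(const void *addr)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0)
        return -1;

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;

    /* pagemap holds one 64-bit entry per virtual page, indexed by
     * virtual page number. */
    uint64_t entry = 0;
    off_t off = (off_t)((uintptr_t)addr / (uintptr_t)ps) * (off_t)sizeof(entry);
    ssize_t n = pread(fd, &entry, sizeof(entry), off);
    close(fd);
    if (n != (ssize_t)sizeof(entry))
        return -1;
    return entry_is_soft_dirty(entry);
}
```

A caller would run `clear_soft_dirty()` just before the ahead-of-time copy into A, then walk the stack range page by page with `page_soft_dirty()` and re-copy only the pages that report 1. On kernels without soft-dirty support the bit simply never gets set, so it would be prudent to verify support before relying on it.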