(This question isn't intended to be specific to VirtualBox or x86 per se, but since they're the best examples I'm aware of, I'll be referencing them and asking how VBox handles some scenarios. If you're aware of other solutions that aren't used by VBox, do consider mentioning them as well.)
I've read how VirtualBox does software virtualization, and I don't understand the following.
Before executing ring 0 code, CSAM [Code Scanning and Analysis Manager] scans it recursively to discover problematic instructions. PATM [Patch Manager] then performs in-situ patching, i.e. it replaces the instruction with a jump to hypervisor memory where an integrated code generator has placed a more suitable implementation. In reality, this is a very complex task as there are lots of odd situations to be discovered and handled correctly. So, with its current complexity, one could argue that PATM is an advanced in-situ recompiler.
Consider the following sample instruction sequence in ring-0 code:
call foo
foo:
mov EAX, 1234
mov EDX, [ESP]
cmp EDX, EAX
jne bar
call do_something_special_if_return_address_was_1234
bar:
...
The callee here is testing to see if the return address to the caller is 1234
, and if it is, it does something special. Clearly patching will change the return address, so we need to be able to handle it.
VirtualBox's documentation says it discovers "problematic" instructions and patches them in-place, but I don't really understand how this can work, for 2 reasons:
It appears that any instruction that exposes the instruction pointer is "problematic", of which
call
is probably the most common (and extremely so). Does this mean VirtualBox must analyze, and potentially patch, everycall
instruction it sees in ring 0? Doesn't this make performance drop off a cliff? How do they handle this with high performance? (The cases they mention in their documentation are quite obscure, so I'm confused why they didn't mention such a common instruction if it occurs. And if it isn't a problem, I don't understand why.)If the instruction stream happens to be modified (e.g. dynamically loading/unloading kernel modules), VirtualBox must dynamically detect this and garbage-collect unreachable recompiled instructions. Otherwise, it will have a memory leak. But this means every
mov
instruction (andpush
instruction, and everything else that writes to memory) will now have to be analyzed and potentially patched, possibly repeatedly, because it may be modifying code that was patched. This would seem to essentially degenerate all guest ring-0 code into near-full software emulation (since the target of a move is not known during recompilation), which would make the virtualization cost skyrocket, and yet this is not the impression I get from reading the documentation. Is this not a problem? How is this handled efficiently?
Please note that I am not asking about hardware-assisted virtualization like Intel VT or AMD-V and I am not interested in reading about those. I'm well aware that they avoid these problems altogether, but my question is about pure software virtualization.
At least for
QEMU
, it seems like the answer is that even in the translated code, there is a separate emulated "stack" that is set up with the same values as the code would have when running natively, and this "stack" is the one read by the emulated code, which sees the same values as if it were running natively.This means that emulated code can't be translated to use
call
,ret
, or any other stack-using instructions directly, since those would not use the emulated stack. Therefore, those calls are replaced by jumps to various bits of thunk code that does the right thing in terms of calling the equivalent translated code.Details for QEMU
The OP's (reasonable) assumption seems to be that
call
andret
instructions would appear in the translated binary, and the stack would reflect the addresses of the dynamically translated code. What actually happens (in QEMU) is that thecall
andret
instructions are removed and replaced with control flow that doesn't use the stack, and values on the stack are set to have the same values as they would in native code.That is, the OP's mental model is that the result of the code translation is somewhat like the native code, with some patches and modifications. At least in the case of QEMU, that's not really the case—every basic block is heavily translated via the Tiny Code Generator (TCG), first to an intermediate representation and then to the target architecture (even if the source and destination archs are the same, as in my case). This deck has a good overview of a lot of the technical details, including an overview of TCG as shown below.
The resulting code is usually nothing like the input code, and typically increases in size by about 3 times. Registers are often used fairly sparingly and you often see back-to-back redundant sequences. Particularly relevant to this question, is that essentially all control flow instructions are quite differently, so
ret
andcall
instructions in the native code pretty much never translate into plaincall
orret
in the translated code.An example: First, some C code with a
return_address()
call that simply returns the return address, and amain()
that prints this function:The
noinline
here is important since otherwisegcc
just may just inline the function and hardcode the address directly into the assembly without making acall
or accessing the stack at all!With
gcc -g -O1 -march=native
this compiles to:Note that
return_address()
returns[rsp]
just like the OP's example. Themain()
function sticks it inrdx
, whereprintf
will read it from.We expect the caller's return address as seen by
return_address
to be the instruction following the call,0x400559
:... and indeed that's what we see when we run it natively:
Lets try it in QEMU:
It works! Note that QEMU is by default translating all code and putting it far away from the usual native location (as we will see shortly), so we don't need any special instructions to trigger translation.
How does that work under the covers? We can use the
-d in_asm,out_asm
option with QEMU to see what it is making of this code.First, the code leading up to the call (the
IN
portion is the native code, and theOUT
is what QEMU translates it to—sorry for AT&T syntax, I can't figure out how to change that in QEMU):A key part is here:
You can see that it is actually manually putting the return address from the native code onto the "stack" (which seems generally to be accessed with
rbp
). Following that, note that there is nocall
instruction toreturn_address
. Rather, we have:In most of the code,
r14
seems to be a pointer to some internal QEMU data structure (i.e., not used to hold values from the emulated program). The above shoves0x400546
(which is the address of thereturn_address
function in native code) into a field of the structure pointed to byr14
, sticks0x7f177ad8a690
inrax
, and jumps to0x557c9cef8196
. This last address appears all over the place in the generated code (but its definition does not) and seems to be some kind of internal dispatch or thunk method. Presumably, it uses either the native address or, more likely, the mystery value shoved inrax
to dispatch to the translatedreturn_address
method, which looks like this:The first bit of code seems to set up the user "stack" in
ebp
(getting it fromr14 + 0x20
, which is probably the emulated machine state structure) and ultimately reads from the "stack" (the linemov 0x0(%rbp),%rbx
) and stores that away into the area pointed to byr14
(mov %rbx,0x80(%r14)
).Finally, it gets to
jmpq 0x55c131ebe196
, which transfers to the QEMU epilogue routine:Note that I use the word "stack" in quotes above. This is because this "stack" is the emulation of the stack as seen by the emulated program, not the true stack pointed to by
rsp
. The true stack pointed to byrsp
is controlled by QEMU to implement the emulated control flow, and the emulated code doesn't access it directly.Some Things Can Change
We see above that "stack" contents as seen by the emulated process are the same under QEMU, but details of the stack do change. For example, the address of the stack looks different under emulation than natively (i.e., the value of
rsp
rather than the stuff pointed to by[rsp]
).This function:
normally returns addresses like
0x7fffad33c100
but returns addresses like0x40007ffd00
under QEMU. This should not be a problem, though, because no valid program should depend on the exact absolute value of the stack address. Not only is it generally not defined and unpredictable, but on recent operating systems it really is designed to be totally unpredictable due to stack ASLR (Linux and Windows both implement this). The program above returns a different address each time I run it natively (but the same address under QEMU).Self Modifying Code
You also mentioned a concern about when the instruction stream is modified, and gave an example of loading a kernel module. First, at least for QEMU, code is only translated "on demand". Functions that could be called, but aren't in some particular run, are never translated (you can try it with a function that is called conditionally depending on
argc
). So in general, loading new code into the kernel, or into a process in user-mode emulation, is handled by the same mechanism: the code will simply be translated the first time it is called.If the code is actually self modifying—i.e., the process writes to its own code—then something has to be done, since without help QEMU would keep using the old translation. So, to detect self-modifying code without penalizing every write to memory, the native code lives in pages with R+X permissions only. The consequence is that a write raises a GP fault which QEMU handles by noting that the code has modified itself, invalidating the translation, and so on. Plenty of details can be found in this thread and elsewhere.
It's a reasonable mechanism, and I expect other code-translation VMs do something similar.
Note that in the case of self-modifying code, the "garbage collection" problem is simple: the emulator is informed about the SMC event as described above, and since it has to re-translate at this point, it throws away the old translation.