Working with Intel's VMX and ARM's virtualization extensions, I have noticed that a feature which would be very useful when implementing hypervisors is missing.
In a hypervisor, it is often necessary to trap a guest behavior purely for tracing purposes (that is, the instruction can be executed normally by the guest, but we need to do something first - for instance logging).
To be more precise, take the following example: on an Intel hypervisor I implemented some time ago (with Windows 7 as guest), I needed to log every modification of a Windows kernel structure. To accomplish this, I found the physical address of the kernel structure and removed the guest's write permission for the corresponding EPT page. Thus, whenever the guest tried to write to (modify) the structure, an EPT violation would occur, resulting in a hypervisor trap.
On each such EPT violation I would then proceed with one of the following strategies:
Strategy 1:
- Activate the monitor-trap-flag
- Temporarily grant the guest permissions to write the region (EPT modification)
- VMRESUME => the guest will execute the instruction and VMEXIT immediately after, due to MTF being activated
- On this next VMEXIT, deactivate MTF, forbid the guest to write the structure again (EPT modification + invalidation) and VMRESUME once more
Strategy 2:
- Emulate the instruction that wants to write the structure. This implied writing an emulator (which is more than a disassembler).
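To make strategy 1 concrete, here is a toy Python model of its control flow. It is only a sketch: `ToyVcpu`, `on_ept_violation`, `on_mtf_exit` and `guest_write` are made-up names, the set of writable pages stands in for EPT write permissions, and a boolean stands in for the monitor trap flag. No real VMX interface is involved.

```python
from dataclasses import dataclass, field
from typing import Optional

PAGE_SIZE = 4096


@dataclass
class ToyVcpu:
    """Toy stand-in for one virtual CPU (hypothetical, not a VMX structure)."""
    writable_pages: set = field(default_factory=set)  # models EPT write permission
    mtf_enabled: bool = False                         # models the monitor trap flag
    pending_addr: Optional[int] = None                # address being single-stepped
    memory: dict = field(default_factory=dict)        # guest physical memory
    log: list = field(default_factory=list)           # trace of (address, new value)


def on_ept_violation(vcpu, addr):
    """First exit: the write was blocked, so allow it for one instruction."""
    vcpu.pending_addr = addr
    vcpu.writable_pages.add(addr // PAGE_SIZE)  # temporarily grant write access
    vcpu.mtf_enabled = True                     # ask for an exit after one instruction


def on_mtf_exit(vcpu):
    """Second exit: the instruction has run, so trace it and re-protect."""
    addr = vcpu.pending_addr
    vcpu.log.append((addr, vcpu.memory[addr]))      # the new value is now in memory
    vcpu.writable_pages.discard(addr // PAGE_SIZE)  # real code would also invalidate EPT caches
    vcpu.mtf_enabled = False
    vcpu.pending_addr = None


def guest_write(vcpu, addr, value):
    """Model one guest store together with the exits it causes."""
    if addr // PAGE_SIZE not in vcpu.writable_pages:
        on_ept_violation(vcpu, addr)
    vcpu.memory[addr] = value  # the guest instruction executes after VMRESUME
    if vcpu.mtf_enabled:
        on_mtf_exit(vcpu)
```

Calling `guest_write(vcpu, 0x1008, 0xAB)` on a fresh `ToyVcpu()` goes through both handlers and leaves the page protected again. The model has a single CPU, so it says nothing about the multi-processor problem.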
As you can see, both of these strategies are a bit complicated even without multi-processing awareness. Concerning strategy 1, if Windows were to be virtualized on multiple processors, we would also have to send IPIs to the other cores to pause them while we handle the EPT violation. Plus, this is a specific example, which implies a specific strategy. Another example of tracing might be logging and/or modifying the parameters of a kernel function whenever it is called. In this case we might need a different strategy.
I guess it's time to get to the point. My dilemma is the following. A simple way to avoid complicated programming strategies whenever we need to trace a guest behavior would be for virtualization technologies to let us choose dynamically whether an instruction trap occurs before OR AFTER the instruction executes.
Even before writing my first hypervisor (on Intel), I was almost sure VMX would offer such a feature. It seemed to me an obvious feature for any virtualization technology on any platform, so I was surprised (and a bit frustrated) when I found out that it's actually not there: not on Intel, not on ARM (as I recently found out) and most probably not on other platforms. Thus, my question is actually: WHY? Why don't hardware virtualization "designers" implement such a feature? I'm sure it has been thought of before, so the only possible answer seems to be that implementing it in hardware would be very difficult or not even possible, although I fail to see why this would be true. Is that the case?
Thanks in advance for your answers :)
EDIT
Although I haven't made it clear, I would also like to point out that there are dozens of cases where the programmer wants to trap some guest behavior WITH THE INTENTION OF CHANGING ITS EFFECT (thus implying more than tracing), but in which this kind of feature would still be very useful.
Take for instance the following example. Let us assume that I want my hypervisor to control the communication of the guest with a memory-mapped device (or even entirely emulate one - a very common requirement for today's hypervisors). Most of the time we will tell the guest that the device is memory-mapped at address A and hook reads from and writes to that address. When a trapped instruction tries to read or write the region at address A, we are currently forced to disassemble and emulate it. If the hardware offered us the possibility to let an instruction execute with temporary r/w permission and trap immediately after it, emulating the instruction would become unnecessary, since we could let it execute and "add in" the desired effect afterwards.
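As an illustration of the write case only, here is a toy Python model of what the wished-for "trap after" mode would allow. Everything in it is hypothetical: no current hardware offers `after_write_trap`, `ToyUart` is an invented one-register device, and the register is assumed to be one byte wide so that the access size never has to be decoded.

```python
class ToyUart:
    """Invented write-only device with a single data register."""

    def __init__(self):
        self.transmitted = []

    def write_data(self, byte):
        self.transmitted.append(byte & 0xFF)


class ToyHypervisor:
    """Models the hypothetical trap-after-execution mode for one MMIO address."""

    def __init__(self, device, mmio_addr):
        self.device = device
        self.mmio_addr = mmio_addr  # "address A" from the text above
        self.backing = {}           # ordinary RAM placed behind guest addresses

    def guest_store(self, addr, value):
        """Model a guest store: the CPU performs it, then traps if hooked."""
        self.backing[addr] = value       # the instruction itself runs unmodified
        if addr == self.mmio_addr:
            self.after_write_trap(addr)  # hypothetical exit after the instruction

    def after_write_trap(self, addr):
        # No disassembly needed: the stored value is read back from memory
        # and handed to the device model as the "added-in" effect.
        self.device.write_data(self.backing[addr])
```

For example, storing the bytes of `b"hi"` to the hooked address one at a time forwards each of them to the device model, while stores to other addresses only touch the backing memory. Reads are not covered by this sketch.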
You only get to trap before the instruction because that's what the hardware provides. In theory, the VM could react to your trap in the hypervisor and actually do something about it (change any memory-bound arguments behind your back), so the alternative isn't very useful.
Sorry man, that's just the way it is.
Years later, it occurred to me that if you are writing the hypervisor you can single-step the instruction rather than emulate it.