Is it guaranteed that all syscall parameters can be read at the sys_exit tracepoint?
The sysdig driver is a kernel module that captures syscalls using kernel static tracepoints. In this project, some system call parameters are read at the sys_enter tracepoint, and others are read at sys_exit (the return value of course, and contents in userspace, to avoid page faults).
Why not read all parameters at sys_exit? Is this because some parameters may not be available at sys_exit?
Yes... and no, we need to distinguish parameters from registers. Linux syscalls should preserve all general purpose userspace registers, except the register used for the return value (and on some architectures also a second register to indicate if an error occurred). However, this does not mean that the input parameters of the syscall cannot change between entry and exit: if a register holds the value of a pointer to some data, while the register itself does not change, the data it points to could very well change.
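To make the distinction between a register value and the data it points to concrete, here is a minimal user space sketch. It calls poll() on a pipe that already has data in it and records the revents field of the struct pollfd before and after the call. The helper name observe_poll and the struct poll_observation type are made up for this illustration; they are not part of sysdig or the kernel.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical record of what could be read at "entry" and at "exit". */
struct poll_observation {
    short revents_at_entry; /* pointed-to data before the syscall */
    short revents_at_exit;  /* pointed-to data after the syscall */
};

/* Returns 0 on success, -1 if setup or poll() did not go as expected. */
int observe_poll(struct poll_observation *obs)
{
    int fds[2];
    int ret;

    if (pipe(fds) != 0)
        return -1;

    /* Put one byte in the pipe so the read end is reported as readable. */
    if (write(fds[1], "x", 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    struct pollfd pfd = { .fd = fds[0], .events = POLLIN, .revents = 0 };

    obs->revents_at_entry = pfd.revents;

    /* The same pointer value (&pfd) is the argument before and after the
     * call, but the kernel writes to pfd.revents through it. */
    ret = poll(&pfd, 1, 0);

    obs->revents_at_exit = pfd.revents;

    close(fds[0]);
    close(fds[1]);
    return ret == 1 ? 0 : -1;
}
```

The argument passed to poll() is the same address throughout, yet a tracer that dereferenced it only at exit would see revents already filled in by the kernel rather than the value the program passed in.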
Looking at the code for the static tracepoint sys_exit, you can see that only the syscall number (id) and its return value (ret) are traced. See the note at the bottom of my answer for more.

Yes, I would say that ensuring the correctness of the traced parameters is the main reason why tracing only at the exit would be a bad idea. Even if you get the values of the registers, you cannot know the real parameters at syscall exit. Even if a syscall per se is guaranteed to save and restore the state of user registers, the syscall itself can alter the data that is being passed as an argument. For example, the recvmsg syscall takes a pointer to a struct msghdr in memory which is used both as an input and an output parameter; the poll syscall does the same with a pointer to struct pollfd. Furthermore, another thread or program could very well have modified the memory of the program while it was making a syscall, therefore altering the data.

Under specific circumstances a syscall can also take a very long time before returning (think for example of a sleep, or a blocking read on your terminal, an accept on a listening socket, etc). If you only trace at the exit, you will have very incorrect timing information, and most importantly you will have to wait a long time before any meaningful information can be captured, even though that information is already available at the entry point.

Note on sys_exit tracepoint

Although you could technically extract the values of the saved registers of the current task, I am not entirely sure about the semantics of doing so while in the sys_exit tracepoint. I searched for some documentation on this specific case, but had no luck, and kernel code is, well... complex.

The chain of calls to reach the exit hook should be:
entry_INT80_32 (for x86 int 0x80)
do_int80_syscall_32() (for x86 int 0x80)
syscall_exit_to_user_mode()
syscall_exit_to_user_mode_prepare()
syscall_exit_work()
trace_sys_exit()
If a deadly signal is delivered to a process during a syscall, while the actual process will never reach the exit of the syscall (i.e. no value is ever returned to user space), the tracepoint will still be hit. When a signal delivery of this kind happens, a special internal return value is used, like -ERESTARTSYS (see here). This value is not an actual syscall return value (it is not returned to user space), but rather it is only meant to be used by the kernel. So it looks like the sys_exit tracepoint is being hit with the special -ERESTARTSYS if a deadly signal is received by the process. This does not happen, for example, in the case of SIGSTOP + SIGCONT. Take this with a grain of salt though, since I was not able to find proper documentation for this.