Can a Linux process/thread terminate without passing through do_exit()?


To verify the behavior of third-party software that is distributed only as a binary, I'm implementing a kernel module that keeps track of each child this software creates and terminates.

The target binary is written in Go and is heavily multithreaded. The kernel module I wrote installs hooks on the kernel functions _do_fork() and do_exit() to keep track of each process/thread this binary creates and terminates.

The LKM works, more or less.

Under some conditions, however, I see a scenario I'm not able to explain. It seems that a process/thread can terminate without passing through do_exit().

The evidence I collected with printk() shows the process creation but not the process termination.

I'm aware that printk() can be slow, and I'm also aware that messages can be lost in such situations.

To prevent message loss due to a slow console (this particular setup uses a serial tty at 115200 baud), I switched to a quicker console and collected the messages using netconsole.

This setup seems to confirm that a process can terminate without passing through the do_exit() function.

But because I couldn't be sure that no messages were lost in the printk() infrastructure, I repeated the same test, replacing printk() with trace_printk(), which should be a leaner alternative to printk().

The result was the same: occasionally I see processes that do not pass through do_exit(), and when I check whether the PID is still running, I find that it is not.

Also note that I placed my hook at the very first instruction of do_exit(), to rule out the possibility that the flow terminates inside a function called from it.

My question is then the following:

Can a Linux process terminate without its flow passing through the do_exit() function?

If so, can someone give me a hint about what this scenario could be?

There are 2 answers

Alessandro (BEST ANSWER)

After a long debug session, I'm finally able to answer my own question.

That's not all; I'm also able to explain why I saw the strange behavior I described in my scenario.

Let's start from the beginning. While monitoring a heavily multithreaded application, I observed rare cases where a PID suddenly stopped existing without my ever seeing its flow pass through the Linux kernel's do_exit() function.

Hence my original question:

Can a Linux process terminate without passing through the do_exit() function?

To the best of my current knowledge, which I would by now consider reasonably extensive, a Linux process cannot end its execution without passing through the do_exit() function.

But this answer contradicts my observations, and the problem that led me to this question is still there.

Someone here suggested that the strange behavior I saw was due to my observations being somehow wrong, implying that my method was inaccurate, and my conclusions with it.

My observations were correct: the process I watched did not pass through do_exit(), yet it terminated.

To explain this phenomenon, I want to raise another question that I think people searching the internet may find useful:

Can two processes share the same PID?

If you had asked me this a month ago, I would surely have answered: "definitely not, two processes cannot share the same PID." Linux is more complex, though.

There's a situation in which, in a Linux system, two different processes can share the same PID!

https://elixir.bootlin.com/linux/v4.19.20/source/fs/exec.c#L1141

Surprisingly, this behavior does not harm anyone; when this happens, one of these two processes is a zombie.

updated to correct an error

The circumstances of this duplicate PID are more intricate than I described previously. When a thread of a multithreaded process calls execve() to run a new program, the kernel must first discard the old execution context: it calls flush_old_exec(), which in turn calls de_thread(). de_thread() kills every other thread in the thread group, so that only the calling thread survives. If the calling thread is not the thread group leader, it waits for the old leader to become a zombie and then takes over the leader's PID. As the comment in the linked source notes, the old leader also keeps using that PID until release_task() is called, so for a short time two tasks carry the same PID.

end of the update

That is what I was seeing: the PID I was monitoring never passed through do_exit() because, by the time the corresponding thread terminated, it no longer had the PID it started with; it had its leader's.

For people who know the Linux kernel's mechanics very well, this is nothing to be surprised by; this behavior is intended and hasn't changed since 2.6.17. The current 5.10.3 still behaves this way.
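The PID change can also be observed from user space, without a kernel module. The following is only a minimal sketch, assuming Linux and Python 3.8+ (for threading.get_native_id()); the names BEFORE, AFTER and exec_from_thread are made up for this illustration and have nothing to do with the original LKM. A child interpreter starts a non-leader thread, the thread prints its PID and TID, then calls execve(); the program that runs after the exec prints the IDs it now has.

```python
import subprocess
import sys

# Program run after execve(): report the PID and TID it now runs under.
AFTER = "import os, threading; print('after', os.getpid(), threading.get_native_id())"

# Program run first: a non-leader thread reports its IDs, then calls execve().
BEFORE = f"""
import os, sys, threading

def worker():
    print('before', os.getpid(), threading.get_native_id(), flush=True)
    os.execv(sys.executable, [sys.executable, '-c', {AFTER!r}])

t = threading.Thread(target=worker)
t.start()
t.join()  # the leader blocks here until the exec kills it
"""


def exec_from_thread():
    """Run BEFORE in a child interpreter and parse the two report lines."""
    out = subprocess.run([sys.executable, "-c", BEFORE],
                         capture_output=True, text=True, check=True).stdout
    report = {}
    for line in out.splitlines():
        tag, pid, tid = line.split()
        report[tag] = (int(pid), int(tid))  # (PID, TID) for each phase
    return report


if __name__ == "__main__":
    r = exec_from_thread()
    print("before exec: pid=%d tid=%d" % r["before"])
    print("after exec:  pid=%d tid=%d" % r["after"])
```

Before the exec the worker's TID differs from the process PID; after the exec the surviving task reports the leader's PID as its own TID. A tracker keyed on the TID seen at creation time would therefore never see an exit for that number.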

I hope this is useful to people searching the internet. I'd also like to add that it answers the following questions:

  • Question: Can a Linux process/thread terminate without passing through do_exit()? Answer: No. do_exit() is the only way a process can end its execution, whether intentionally or unintentionally.
  • Question: Can two processes share the same PID? Answer: Normally not. There are some rare cases in which two schedulable entities have the same PID.
  • Question: Does the Linux kernel have scenarios where a process changes its PID? Answer: Yes, there is at least one scenario where a process changes its PID.

Basile Starynkevitch

Can a Linux process terminate without its flow passing through the do_exit() function?

Probably not, but you should study the source code of the Linux kernel to be sure. Ask on KernelNewbies. Kernel threads and udev or systemd related things (or perhaps modprobe or the older hotplug) are probable exceptions. If your /sbin/init (pid 1) terminates, which should not happen, strange things will happen.

The LKM works, more or less.

What does that mean? How could a kernel module half-work?


And in real life, it does sometimes happen that your Linux kernel panics or crashes (and it could happen with your LKM, if it has not been peer-reviewed by the Linux kernel community). In such a case, there is no longer any notion of processes, since they are an abstraction provided by a living Linux kernel.

See also dmesg(1), strace(1), proc(5), syscalls(2), ptrace(2), clone(2), fork(2), execve(2), waitpid(2), elf(5), credentials(7), pthreads(7)

Look also inside the source code of your libc, e.g. GNU libc or musl-libc

Of course, see Linux From Scratch and Advanced Linux Programming

And verifying if the PID is currently running,

This can be done in user land with /proc/, or by using kill(2) with signal 0 (and maybe also pidfd_send_signal(2)...)
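As a rough illustration of the two user-land checks, here is a small sketch, assuming Linux with /proc mounted. The helper names pid_exists and proc_entry_exists are invented for this example. Note that a zombie that has not been reaped yet still counts as existing for both checks.

```python
import os
import subprocess
import sys


def pid_exists(pid: int) -> bool:
    """Probe a PID with signal 0, which checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:  # ESRCH: no such process
        return False
    except PermissionError:     # EPERM: it exists, but we may not signal it
        return True
    return True


def proc_entry_exists(pid: int) -> bool:
    """Check for the /proc/<pid> directory (requires /proc to be mounted)."""
    return os.path.isdir(f"/proc/{pid}")


if __name__ == "__main__":
    # Start a short-lived child and reap it, so its PID is really gone.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()
    print("self :", pid_exists(os.getpid()), proc_entry_exists(os.getpid()))
    print("child:", pid_exists(child.pid), proc_entry_exists(child.pid))
```

Both checks only say whether some task currently holds that number; they cannot tell whether it is the same task that was created under it earlier.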

PS. I still don't understand why you need to write a kernel module or change the kernel code. My intuition would be to avoid doing that when possible.