htop CPU bar red, 100% kernel time

I found some similar topics, but none of them offered a helpful solution. Since I have more information to provide, I am opening this question.

My PyTorch script frequently gets stuck on a training server. htop shows only one green CPU bar, while the other active cores are almost 100% red. According to htop's F1 help screen, red means kernel time.

Whenever these 100% red CPU bars appear, training gets stuck and GPU utilization drops to 0%. The weird thing is that this only happens on two of the servers I use. It never happens on my (less powerful) PC, and it never happens on another powerful server.

strace shows that when the problem occurs, the process makes many calls like this:

futex(0x55bbb0e82db0, FUTEX_WAKE_PRIVATE, 1) = 0

Is there any explanation of what the problem is and how to avoid it? Is there any further information I should provide?

wstcegg On BEST ANSWER

I solved the problem and found the likely cause.

  1. High CPU usage means the CPU is busy rather than waiting, so disk I/O is unlikely to be the limit.

  2. Low GPU utilization means the GPU is not being fed data fast enough.

  3. That leaves RAM as the most likely bottleneck in my case.

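As mentioned in the GitHub issue linked below, when multiple processes access the same Python object, each access changes the object's ref-count. In fork mode the worker processes share the parent's memory pages copy-on-write, so every ref-count write makes the kernel allocate and copy a page, which slows down the whole system.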
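To make the ref-count point concrete, here is a minimal sketch in plain CPython (the function name `refcount_delta` is hypothetical and only for illustration). It shows that merely holding one more reference to an object writes to that object's header, even though the data itself is never modified. In a forked worker, a write like this is what would dirty a shared page.

```python
import sys


def refcount_delta():
    """Return how much an object's ref-count grows when one more reference is taken."""
    item = object()                 # stands in for one dataset entry
    before = sys.getrefcount(item)
    refs = []
    refs.append(item)               # take a reference; the object's data is untouched
    after = sys.getrefcount(item)
    return after - before


print(refcount_delta())
```

The absolute value reported by `sys.getrefcount` depends on the interpreter version, so the sketch only looks at the difference.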

This system-level behavior cannot be detected by Python memory profiling libraries such as Memray [https://github.com/bloomberg/memray], but it might be detectable with system-level memory tools such as Valgrind [https://valgrind.org/].

https://github.com/pytorch/pytorch/issues/13246#issuecomment-905703662

The final solution is to reduce how often the forked processes access Python objects.
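One possible way to do that, sketched here under assumptions, is to keep large collections in a few flat buffers instead of many small Python objects. The `PackedStrings` class below is a hypothetical helper, not part of PyTorch and not necessarily what I used. It stores a list of strings as one `bytes` object plus an offset table, so a worker that indexes it touches two container objects rather than one ref-counted object per item.

```python
from array import array


class PackedStrings:
    """Hypothetical helper: many strings held in one bytes buffer plus offsets."""

    def __init__(self, items):
        encoded = [s.encode("utf-8") for s in items]
        # offsets[i] .. offsets[i + 1] is the byte range of item i
        self._offsets = array("Q", [0])
        for chunk in encoded:
            self._offsets.append(self._offsets[-1] + len(chunk))
        self._data = b"".join(encoded)

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        start, end = self._offsets[index], self._offsets[index + 1]
        # Slicing builds a fresh string owned by the calling process
        return self._data[start:end].decode("utf-8")


# Example: file names a dataset might otherwise keep in a plain list
paths = PackedStrings(["img_%05d.png" % i for i in range(1000)])
print(len(paths), paths[0], paths[-1])
```

A dataset could build such a container once in the parent process and index it from the workers. Whether this removes the stall on a given server is something to measure, not something this sketch guarantees.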