The application I am working on recently migrated from embedding Python 2.7 to embedding Python 3.8.
We noticed a significant slowdown when calling Py_EndInterpreter in Python 3.8 when many sub-interpreters are in use.
Looking at the CPU usage I can see that essentially all the time is spent doing garbage collection, via the call path Py_EndInterpreter -> PyImport_Cleanup -> _PyGC_CollectNoFail -> collect.
About 99% of the CPU time is spent in the collect() function invoked by _PyGC_CollectNoFail.
With 500 sub-interpreters alive, individual calls to Py_EndInterpreter take up to 2 seconds, for a total of roughly 3 minutes to end all 500 sub-interpreters.
By comparison, in Python 2.7 each call to Py_EndInterpreter takes 1-2 ms regardless of how many sub-interpreters are alive, for a total of ~500 ms to close them all.
When using few sub-interpreters (fewer than 20), performance is almost identical between Python 2.7 and 3.8.
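
For context, here is a minimal sketch of what our create/teardown cycle looks like (a simplified repro, not our actual application code; NUM_INTERP and the timing code are illustrative):

#include <Python.h>
#include <stdio.h>
#include <time.h>

#define NUM_INTERP 500  /* illustrative count */

int main(void)
{
    PyThreadState *sub[NUM_INTERP];
    PyThreadState *main_tstate;

    Py_Initialize();
    main_tstate = PyThreadState_Get();

    /* Create all sub-interpreters up front; each Py_NewInterpreter()
       call leaves the new interpreter's thread state current. */
    for (int i = 0; i < NUM_INTERP; i++) {
        sub[i] = Py_NewInterpreter();
    }

    /* Tear them down one by one, timing each Py_EndInterpreter call.
       Under 3.8 the time here is dominated by _PyGC_CollectNoFail. */
    for (int i = 0; i < NUM_INTERP; i++) {
        clock_t start = clock();
        PyThreadState_Swap(sub[i]);  /* tstate must be current */
        Py_EndInterpreter(sub[i]);
        double ms = 1000.0 * (clock() - start) / CLOCKS_PER_SEC;
        printf("Py_EndInterpreter #%d took %.1f ms\n", i, ms);
    }

    PyThreadState_Swap(main_tstate);
    Py_Finalize();
    return 0;
}

This is the pattern that takes up to 2 seconds per call under 3.8 with 500 interpreters alive, versus 1-2 ms under 2.7.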
I tried looking for other applications that use many sub-interpreters, but it seems to be a very rare use case and I could not find anyone else reporting the same issue.
Is anyone else using many sub-interpreters and running into similar trouble?
It seems that my current options are:
- Take the performance hit...
- Leak a bunch of memory by never calling Py_EndInterpreter (sketched after this list)
- Fundamentally change how my application embeds python and not use sub-interpreters
- ??
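
To make option 2 concrete, this is roughly what it would look like (a sketch only; abandon_subinterpreter is a hypothetical helper, not a CPython API):

/* Hypothetical helper illustrating option 2: abandon a sub-interpreter
 * without tearing it down. Everything the interpreter allocated
 * (modules, caches, etc.) is leaked for the life of the process,
 * but the _PyGC_CollectNoFail cost is avoided entirely. */
static void
abandon_subinterpreter(PyThreadState *tstate, PyThreadState *main_tstate)
{
    /* Deliberately do NOT call Py_EndInterpreter(tstate). */
    PyThreadState_Swap(main_tstate);  /* just switch back to the main interpreter */
    /* tstate and its whole interpreter state remain allocated (leaked). */
}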