RSS memory usage from concurrent.futures


I have a simple script that attempts to stress the concurrent.futures library as follows:

#! /usr/bin/python

import psutil
import gc
import os
from concurrent.futures import ThreadPoolExecutor

WORKERS=2**10

def run():
        def x(y):
                pass

        with ThreadPoolExecutor(max_workers=WORKERS) as pool:
                for _ in pool.map(x, [i for i in range(WORKERS)]):
                        pass

if __name__ == '__main__':
        print('%d objects' % len(gc.get_objects()))
        print('RSS: %s kB' % (psutil.Process(os.getpid()).get_memory_info().rss / 2**10))
        run()
        print('%d objects' % len(gc.get_objects()))
        print('RSS: %s kB' % (psutil.Process(os.getpid()).get_memory_info().rss / 2**10))

This produces the following output on a 2-core Linux machine running Python 2.7:

# time ./test.py
7048 objects
RSS: 11968 kB
6749 objects
RSS: 23256 kB

real    0m1.077s
user    0m0.875s
sys     0m0.316s

Although this is a bit of a contrived example, I'm struggling to understand why the RSS increases in this situation and what the allocated memory is being used for.

Linux should handle forked memory fairly well with copy-on-write, but since CPython is reference-counted, portions of the inherited memory would not be truly read-only because the reference counts need to be updated. Considering how small the reference-count overhead is, the 12 MB increase is surprising to me. If, instead of using ThreadPoolExecutor, I spawn daemon threads with the threading library, the RSS only increases by about 4 MB.
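For reference, the daemon-thread comparison looks roughly like this (a sketch, since that code isn't shown above; x() is the same no-op callable as in the script):

import threading

WORKERS = 2**10  # same as in the script above

def run_threads():
    def x(y):
        pass

    threads = []
    for i in range(WORKERS):
        t = threading.Thread(target=x, args=(i,))
        t.daemon = True
        t.start()
        threads.append(t)
    for t in threads:
        t.join()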

At this point it is unclear to me whether to suspect the CPython allocator or the glibc allocator, but my understanding is that the latter should handle this flavor of concurrency and be able to reuse arenas for allocations across the spawned threads.
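One way to probe the glibc side is something like this (a sketch, assuming a glibc system; malloc_stats() and malloc_trim() are libc functions called via ctypes, not Python APIs):

import ctypes

libc = ctypes.CDLL("libc.so.6")

def dump_malloc_stats():
    # glibc prints per-arena allocation statistics to stderr
    libc.malloc_stats()

def trim_heap():
    # ask glibc to release free heap pages back to the kernel (returns 1 if any were released)
    return libc.malloc_trim(0)

Running the script with the MALLOC_ARENA_MAX environment variable set to 1 would be another way to check whether per-thread arenas account for the growth.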

I'm using the backported concurrent.futures 3.0.3 under Python 2.7.9 with glibc 2.4 on a 4.1 kernel. Any advice or hints on how to investigate this further would be greatly appreciated.


There are 2 answers

o11c (best answer)

Most memory allocators don't return all their memory to the OS.

Try calling run() twice and checking the RSS before/after the second time.
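For example, modifying the question's __main__ block along these lines (a sketch; rss_kb() is just a hypothetical helper around the same psutil call used in the question):

def rss_kb():
    # same RSS measurement the question already uses, in kB
    return psutil.Process(os.getpid()).get_memory_info().rss / 2**10

if __name__ == '__main__':
    print('start:     %s kB' % rss_kb())
    run()
    print('after 1st: %s kB' % rss_kb())
    run()
    print('after 2nd: %s kB' % rss_kb())

If the second run() barely moves the RSS, the memory is being retained and reused by the allocator rather than leaked on every call.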

(That said, ludicrous numbers of threads are generally not a good idea)

MrGoodKat

I suggest you read this answer: https://stackoverflow.com/a/1718522/5632150

As that answer explains, the number of threads worth spawning depends on whether your threads perform any I/O. If they do, there are ways to tune the count for this problem; if they don't, I usually use MAX_THREADS = N_CORES + 1.
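For example (a sketch; multiprocessing.cpu_count() is a portable way to get the core count on Python 2.7, and the work function here is just a placeholder):

import multiprocessing
from concurrent.futures import ThreadPoolExecutor

def work(n):
    # placeholder for a CPU-bound task
    return n * n

N_CORES = multiprocessing.cpu_count()
MAX_THREADS = N_CORES + 1  # little benefit in more threads than this for CPU-bound work

with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
    results = list(pool.map(work, range(100)))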

I'm not sure, but are you trying to spawn 1024 threads on one core?