os.sched_getaffinity(0) vs os.cpu_count()


So, I know the difference between the two methods in the title, but not the practical implications.

From what I understand: if you use more workers (NUM_WORKERS) than there are cores actually available, you face big performance drops because your OS constantly switches back and forth trying to keep everything running in parallel. I don't know how true this is, but I read it here on SO somewhere from someone smarter than me.

And in the docs for os.cpu_count() it says:

Return the number of CPUs in the system. Returns None if undetermined. This number is not equivalent to the number of CPUs the current process can use. The number of usable CPUs can be obtained with len(os.sched_getaffinity(0))

So, I'm trying to work out what the "system" refers to if there can be more CPUs usable by a process than there are in the "system".
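To see the two values side by side, here is a quick check (on Linux, running this snippet under e.g. taskset -c 0-3 restricts the affinity mask and makes the two numbers diverge):

import os

print(os.cpu_count())                # CPUs the OS knows about ("the system")
print(len(os.sched_getaffinity(0)))  # CPUs this process is allowed to run on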

I just want to safely and efficiently implement multiprocessing.Pool functionality. So here is my question, summarized:

What are the practical implications of:

NUM_WORKERS = os.cpu_count() - 1
# vs.
NUM_WORKERS = len(os.sched_getaffinity(0)) - 1

The -1 is because I've found that my system is a lot less laggy if I leave one core free while data is being processed, so I can keep working.


There are 3 answers

Booboo (accepted answer)

If you had tasks that were purely 100% CPU-bound, i.e. did nothing but calculations, then clearly nothing would/could be gained by having a process pool size greater than the number of CPUs available on your computer. But what if there is a mix of I/O thrown in, whereby a process relinquishes the CPU while waiting for an I/O operation to complete (or, for example, for a URL to be returned from a website, which takes a relatively long time)? To me it's not clear that in this scenario you couldn't achieve improved throughput with a process pool size that exceeds os.cpu_count().

Update

Here is code to demonstrate the point. This code, which would probably be best served by using threading, is using processes. I have 8 cores on my desktop. The program simply retrieves 54 URLs concurrently (or, in this case, in parallel). The program is passed an argument: the size of the pool to use. Unfortunately, there is initial overhead just to create additional processes, so the savings begin to fall off if you create too many. But if the task were long-running and had a lot of I/O, then the overhead of creating the processes would be worth it in the end:

from concurrent.futures import ProcessPoolExecutor, as_completed
import requests
from timing import time_it  # local helper module; time_it is a decorator that prints a function's elapsed time

def get_url(url):
    resp = requests.get(url, headers={'user-agent': 'my-app/0.0.1'})
    return resp.text


@time_it
def main(poolsize):
    urls = [
        'https://ibm.com',
        'https://microsoft.com',
        'https://google.com',
    ] * 18  # 54 URLs in total: the same 3 sites repeated 18 times
    with ProcessPoolExecutor(poolsize) as executor:
        futures = {executor.submit(get_url, url): url for url in urls}
        for future in as_completed(futures):
            text = future.result()
            url = futures[future]
            print(url, text[0:80])
            print('-' * 100)

if __name__ == '__main__':
    import sys
    main(int(sys.argv[1]))

8 processes (the number of cores I have):

func: main args: [(8,), {}] took: 2.316840410232544 sec.

16 processes:

func: main args: [(16,), {}] took: 1.7964842319488525 sec.

24 processes:

func: main args: [(24,), {}] took: 2.2560818195343018 sec.
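Since the workload is almost pure I/O, the same experiment can also be sketched with threads instead of processes. This assumes the get_url function, the urls list and the time_it decorator from above are in scope; main_threads is just an illustrative name:

from concurrent.futures import ThreadPoolExecutor, as_completed

@time_it
def main_threads(poolsize):
    # Threads skip the process start-up cost entirely; the GIL is
    # released while requests.get() blocks on the network, so the
    # downloads still overlap.
    with ThreadPoolExecutor(poolsize) as executor:
        futures = {executor.submit(get_url, url): url for url in urls}
        for future in as_completed(futures):
            print(futures[future], future.result()[:80])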
Frank Yellin

The implementation of multiprocessing.pool uses

if processes is None:
    processes = os.cpu_count() or 1

Not sure if that answers your question, but at least it's a datapoint.
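For what it's worth, a helper that prefers the affinity-aware count where it exists could look like the sketch below. os.sched_getaffinity is only available on some platforms (notably Linux), and default_workers is an illustrative name, not anything multiprocessing itself provides:

import os

def default_workers():
    # Prefer the CPUs this process may actually use; fall back to
    # os.cpu_count() on platforms without sched_getaffinity
    # (e.g. macOS and Windows).
    try:
        return len(os.sched_getaffinity(0)) or 1
    except AttributeError:
        return os.cpu_count() or 1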

Darkonaut

These two functions are very different. Note that without the len() call, NUM_WORKERS = os.sched_getaffinity(0) - 1 would instantly fail with a TypeError, because you would be trying to subtract an integer from a set. While os.cpu_count() tells you how many cores the system has, os.sched_getaffinity(pid) tells you on which cores a certain thread/process is allowed to run.


os.cpu_count()

os.cpu_count() shows the number of available cores as known to the OS (virtual cores). If your CPU uses simultaneous multithreading (e.g. Intel's Hyper-Threading), you most likely have half this number of physical cores. Whether it makes sense to use more processes than you have physical cores, or even more than virtual cores, depends very much on what you are doing. The tighter the computational loop (little diversity in instructions, few cache misses, ...), the less likely you are to benefit from more used cores (by using more worker processes), and the more likely you are to experience performance degradation.
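There is no stdlib call for the physical count, but if the third-party psutil package is installed, a quick check looks like this:

import os
import psutil  # third-party: pip install psutil

print(os.cpu_count())                   # logical (virtual) cores
print(psutil.cpu_count(logical=False))  # physical cores (None if undetermined)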

Obviously it also depends on what else your system is running, because your OS tries to give every thread in the system (as the actual execution unit of a process) a fair share of run-time on the available cores. So no generalization is possible in terms of how many workers you should use. But if, for instance, you have a tight loop and your system is otherwise idle, a good starting point for optimizing is

os.cpu_count() // 2 # same as mp.cpu_count() // 2 

...and increasing from there.
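A minimal way to probe that empirically is to time a fixed batch of work across increasing pool sizes (a sketch; busy_work is just a stand-in for your real task):

import multiprocessing as mp
import os
import time

def busy_work(n):
    # tight dummy loop standing in for a real CPU-bound task
    s = 0
    for i in range(n):
        s += i
    return s

if __name__ == '__main__':
    for poolsize in range(os.cpu_count() // 2, os.cpu_count() + 1):
        start = time.perf_counter()
        with mp.Pool(poolsize) as pool:
            pool.map(busy_work, [10_000_000] * 32)
        print(f"poolsize={poolsize}: {time.perf_counter() - start:.2f} sec")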

As @Frank Yellin already mentioned, multiprocessing.Pool uses os.cpu_count() as the default number of workers.

os.sched_getaffinity(pid)

Return the set of CPUs the process with PID pid (or the current process if zero) is restricted to.

Now core/CPU/processor affinity is about which concrete (virtual) cores your thread (within your worker process) is allowed to run on. Your OS gives every core an id, from 0 to (number of cores - 1), and changing the affinity allows restricting ("pinning") the actual core(s) a certain thread is allowed to run on at all.

At least on Linux, I found this to mean that if none of the allowed cores is currently available, the thread of a child process won't run, even if other, non-allowed cores would be idle. So "affinity" is a bit misleading here.

The goal when fiddling with affinity is to minimize cache invalidations from context switches and core migrations. Your OS usually has better insight here and already tries to keep caches "hot" with its scheduling policy, so unless you know what you're doing, you can't expect easy gains from interfering.

By default the affinity is set to all cores, and for multiprocessing.Pool it doesn't make much sense to bother changing that, at least not if your system is otherwise idle.

Note that although the docs speak of "process" here, setting affinity is really a per-thread matter. So, for example, setting affinity in a "child" thread for the "current process if zero" does not change the affinity of the main thread or of other threads within the process. But child threads inherit their affinity from the main thread, and child processes (through their main thread) inherit affinity from the parent process's main thread. This affects all possible start methods ("spawn", "fork", "forkserver"). The example below demonstrates this, and shows how to modify affinity when using multiprocessing.Pool.

import multiprocessing as mp
import threading
import os


def _location():
    return f"{mp.current_process().name} {threading.current_thread().name}"


def thread_foo():
    print(f"{_location()}, affinity before change: {os.sched_getaffinity(0)}")
    os.sched_setaffinity(0, {4})
    print(f"{_location()}, affinity after change: {os.sched_getaffinity(0)}")


def foo(_, iterations=200e6):

    print(f"{_location()}, affinity before thread_foo:"
          f" {os.sched_getaffinity(0)}")

    for _ in range(int(iterations)):  # some dummy computation
        pass

    t = threading.Thread(target=thread_foo)
    t.start()
    t.join()

    print(f"{_location()}, affinity before exit is unchanged: "
          f"{os.sched_getaffinity(0)}")

    return _


if __name__ == '__main__':

    mp.set_start_method("spawn")  # alternatives on Unix: "fork", "forkserver"

    # for current process, exclude cores 0,1 from affinity-mask
    print(f"parent affinity before change: {os.sched_getaffinity(0)}")
    excluded_cores = {0, 1}
    os.sched_setaffinity(0, os.sched_getaffinity(0).difference(excluded_cores))
    print(f"parent affinity after change: {os.sched_getaffinity(0)}")

    with mp.Pool(2) as pool:
        pool.map(foo, range(5))

Output:

parent affinity before change: {0, 1, 2, 3, 4, 5, 6, 7}
parent affinity after change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 MainThread, affinity before thread_foo: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 MainThread, affinity before thread_foo: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 Thread-1, affinity before change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 Thread-1, affinity after change: {4}
SpawnPoolWorker-1 MainThread, affinity before exit is unchanged: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 MainThread, affinity before thread_foo: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-1, affinity before change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-1, affinity after change: {4}
SpawnPoolWorker-2 MainThread, affinity before exit is unchanged: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 MainThread, affinity before thread_foo: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-2, affinity before change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-2, affinity after change: {4}
SpawnPoolWorker-2 MainThread, affinity before exit is unchanged: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 MainThread, affinity before thread_foo: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 Thread-2, affinity before change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 Thread-2, affinity after change: {4}
SpawnPoolWorker-1 MainThread, affinity before exit is unchanged: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-3, affinity before change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-3, affinity after change: {4}
SpawnPoolWorker-2 MainThread, affinity before exit is unchanged: {2, 3, 4, 5, 6, 7}