I'm confused about how multiple launches of the same Python command bind to cores on a NUMA Xeon machine. I read that the `OMP_NUM_THREADS` env var sets the number of threads launched for a `numactl` process. So if I ran `numactl --physcpubind=4-7 --membind=0 python -u test.py` with `OMP_NUM_THREADS=4` on a hyper-threaded (HT) machine (lscpu output below), it would limit this `numactl` process to 4 threads.
But since the machine has HT, it's not clear to me whether `4-7` above means 4 physical or 4 logical cores. How do I find which of the NUMA-node-0 CPUs in `0-23,96-119` are physical and which are logical? Are `96-119` all logical, or are they interspersed? If `4-7` are all physical cores, then with HT on only 2 physical cores would be needed for 4 threads, so what happens to the other 2? And where does the OpenMP library get invoked in binding threads to physical cores?
(From my limited understanding, I could just launch `python main.py` in an `sh` shell 20 times with different `numactl` bindings and `OMP_NUM_THREADS` would still apply to each, even though I didn't explicitly use an MPI library anywhere. Is that correct?)
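For concreteness, here is a sketch of the launch pattern I mean (`main.py` is just my script name; the loop below only prints the commands rather than executing them, and the memory binding is kept at node 0 for simplicity):

```shell
# Print the 20 hypothetical launch commands, each pinned to its own
# 4-CPU range via numactl (echoed here instead of executed).
for i in $(seq 0 19); do
  lo=$((i * 4)); hi=$((lo + 3))
  echo "OMP_NUM_THREADS=4 numactl --physcpubind=$lo-$hi --membind=0 python main.py"
done
```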
```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              192
On-line CPU(s) list: 0-191
Thread(s) per core:  2
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz
Stepping:            7
Frequency boost:     enabled
CPU MHz:             1000.026
CPU max MHz:         2301,0000
CPU min MHz:         1000,0000
BogoMIPS:            4600.00
L1d cache:           3 MiB
L1i cache:           3 MiB
L2 cache:            96 MiB
L3 cache:            143 MiB
NUMA node0 CPU(s):   0-23,96-119
NUMA node1 CPU(s):   24-47,120-143
NUMA node2 CPU(s):   48-71,144-167
NUMA node3 CPU(s):   72-95,168-191
```
`numactl` does not launch threads. It controls the NUMA policy of processes or shared memory. However, OpenMP runtimes may adapt the number of threads created for a parallel region based on the environment set by `numactl` (although AFAIK this behaviour is undefined by the standard). You should use the environment variable `OMP_NUM_THREADS` to set the number of threads. You can check the OpenMP configuration using the environment variable `OMP_DISPLAY_ENV`.
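For example, to make the runtime report what it actually picked up (a sketch; `test.py` stands for any OpenMP-using program, as in the question):

```shell
# With OMP_DISPLAY_ENV=TRUE, the OpenMP runtime prints its effective
# configuration (spec version, OMP_NUM_THREADS, OMP_PROC_BIND, ...)
# to stderr when it initializes.
export OMP_NUM_THREADS=4
export OMP_DISPLAY_ENV=TRUE
# then launch, e.g.: numactl --physcpubind=4-7 --membind=0 python -u test.py
```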
This is a bit complex. Physical IDs are the ones available in `/proc/cpuinfo`. They are not guaranteed to stay the same over time (e.g. they can change when the machine is restarted) nor to be "intuitive" (i.e. to follow rules like being contiguous for threads/cores that are close to each other). One should avoid hard-coding them manually; a BIOS or kernel update, for example, might lead to logical cores being enumerated in a different order. You can use the great tool `hwloc` to convert well-defined, deterministic logical IDs to physical ones. Here, you cannot be entirely sure that 0 and 96 are two threads sharing the same core (although this is probably true for your processor: it looks like the kernel enumerated one logical core from each physical core as CPUs 0..95, then 96..191 for the other logical core of each physical core). The other common possibility is for Linux to enumerate both logical cores of each physical core consecutively, making logical cores 2n and 2n+1 share a physical core.

`--physcpubind` of `numactl` accepts the physical CPU numbers shown in the "processor" field of `/proc/cpuinfo`, according to the documentation. Thus, `4-7` here should be interpreted as physical thread IDs. Two thread IDs can refer to the same physical core (which is always the case on Intel processors with hyper-threading enabled).

AFAIK, this depends on the OpenMP runtime used (e.g. GOMP, IOMP). The initialization of the OpenMP runtime is often done lazily, when the first parallel section is encountered. For the binding, some runtimes read `/proc/cpuinfo` manually while others use `hwloc`. If you want deterministic bindings, you should use the `OMP_PLACES` and `OMP_PROC_BIND` environment variables to tell the runtime to bind threads using a custom user-defined method rather than the default one. If you want to be safe and portable, use the following configuration (using Bash):
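A minimal sketch of such a configuration (the place list `{4},{5},{6},{7}` is an assumption chosen to mirror the `--physcpubind=4-7` range from the question; substitute the CPUs you actually want):

```shell
# Pin 4 OpenMP threads, one per place; each place is a single logical CPU.
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=TRUE
export OMP_PLACES='{4},{5},{6},{7}'
# then launch, e.g.: numactl --membind=0 python -u test.py
```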
The OpenMP threads will be scheduled on the OpenMP places. The configuration above sets up the OpenMP runtime so that the 4 threads are statically mapped onto 4 different fixed cores.
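As a cross-check of the ID-numbering discussion above, Linux also exposes the sibling relationship directly in sysfs (a sketch; the output depends on your machine):

```shell
# For each logical CPU, print which hardware threads share its physical
# core, e.g. "cpu0: 0,96" if the enumeration guess above is right.
for f in /sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list; do
  [ -r "$f" ] || continue
  cpu=${f#/sys/devices/system/cpu/}
  printf '%s: %s\n' "${cpu%%/*}" "$(cat "$f")"
done
```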