Confused about OMP_NUM_THREADS and numactl NUMA-cores bindings


I'm confused about how multiple launches of the same Python command bind to cores on a NUMA Xeon machine.

I read that OMP_NUM_THREADS env var sets the number of threads launched for a numactl process. So if I ran numactl --physcpubind=4-7 --membind=0 python -u test.py with OMP_NUM_THREADS=4 on a hyper-threaded (HT) machine (lscpu output below), it'd limit this process to 4 threads. But since the machine has HT, it's not clear to me whether 4-7 above means 4 physical or 4 logical cores.

  • How to find which of the numa-node-0 cores in 0-23,96-119 are physical and which ones logical? Are 96-119 all logical or are they interspersed?

  • If 4-7 are all physical cores, then with HT on there would be only 2 physical cores needed, so what happens to the other 2?

  • Where is OpenMP library getting invoked in binding threads to physical cores?

(From my limited understanding, I could just launch python main.py in a sh shell 20 times with different numactl bindings and OMP_NUM_THREADS would still apply to each launch, even though I didn't explicitly use an MPI library anywhere. Is that correct?)

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          192
On-line CPU(s) list:             0-191
Thread(s) per core:              2
Core(s) per socket:              48
Socket(s):                       2
NUMA node(s):                    4
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz
Stepping:                        7
Frequency boost:                 enabled
CPU MHz:                         1000.026
CPU max MHz:                     2301.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        4600.00
L1d cache:                       3 MiB
L1i cache:                       3 MiB
L2 cache:                        96 MiB
L3 cache:                        143 MiB
NUMA node0 CPU(s):               0-23,96-119
NUMA node1 CPU(s):               24-47,120-143
NUMA node2 CPU(s):               48-71,144-167
NUMA node3 CPU(s):               72-95,168-191

1 Answer

Answered by Jérôme Richard

I read that OMP_NUM_THREADS env var sets the number of threads launched for a numactl process.

numactl does not launch threads. It controls the NUMA policy of processes or shared memory. However, OpenMP runtimes may adapt the number of threads created for a parallel region based on the environment set up by numactl (although AFAIK this behaviour is not defined by the standard). You should use the environment variable OMP_NUM_THREADS to set the number of threads. You can check the OpenMP configuration by setting the environment variable OMP_DISPLAY_ENV.
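To see that OMP_NUM_THREADS is just an ordinary environment variable that each launched process reads independently, here is a minimal sketch: a parent sets the variable and a child process observes its own copy. This is illustrative only; no OpenMP runtime is involved.

```python
import os
import subprocess
import sys

# OMP_NUM_THREADS is plain environment state: every process you launch
# (e.g. 20 separate "numactl ... python main.py" invocations) reads its
# own copy, with no MPI or other coordination needed.
env = dict(os.environ, OMP_NUM_THREADS="4")
out = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"],
    env=env, capture_output=True, text=True,
)
print(out.stdout.strip())  # -> 4
```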

How to find which of the numa-node-0 cores in 0-23,96-119 are physical and which ones logical? Are 96-119 all logical or are they interspersed?

This is a bit complex. Physical IDs are the ones listed in /proc/cpuinfo. They are not guaranteed to stay the same over time (eg. they can change when the machine is restarted) nor to be "intuitive" (ie. following rules like being contiguous for threads/cores close to each other). One should avoid hard-coding them manually: a BIOS or kernel update, for example, might cause logical cores to be enumerated in a different order.

You can use the great tool hwloc to convert well-defined, deterministic logical IDs to physical ones. Here, you cannot be entirely sure that 0 and 96 are two hardware threads sharing the same core, although this is probably true for your processor: it looks like the kernel enumerated one logical core from each physical core as 0..95, then the other logical core of each physical core as 96..191. The other common possibility is for Linux to enumerate both logical cores of each physical core consecutively, making logical cores 2n and 2n+1 share a physical core.
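One way to check which enumeration scheme your kernel used, without guessing, is to read the CPU topology that Linux exposes through sysfs. A minimal sketch (assumes a Linux system with /sys mounted):

```python
import glob

# Map each logical CPU to the hyper-thread siblings it shares a physical
# core with, as reported by the Linux kernel via sysfs.
siblings = {}
for path in sorted(glob.glob(
        "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list")):
    cpu = int(path.split("/")[-3][3:])      # ".../cpu17/..." -> 17
    with open(path) as f:
        siblings[cpu] = f.read().strip()    # e.g. "0,96": CPUs 0 and 96 share a core

if siblings:
    for cpu in sorted(siblings)[:4]:
        print(f"cpu{cpu}: shares a core with {siblings[cpu]}")
else:
    print("sysfs CPU topology not available on this system")
```

On the machine in the question, cpu0 reporting siblings "0,96" would confirm that 96-119 are the second hardware threads of the physical cores behind 0-23.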

If 4-7 are all physical cores, then with HT on there would be only 2 physical cores needed, so what happens to the other 2?

--physcpubind of numactl accepts physical CPU numbers as shown in the "processor" fields of /proc/cpuinfo, according to the documentation. Thus, 4-7 here should be interpreted as physical thread IDs. Two thread IDs can refer to the same physical core (which is always the case on Intel processors with hyper-threading enabled).
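You can verify what --physcpubind actually restricted by inspecting the affinity mask the process inherits; OpenMP runtimes typically place their threads only within this mask. A small Linux-only sketch, to be run under numactl:

```python
import os

# The set of logical CPUs this process may run on. Under
# "numactl --physcpubind=4-7 ..." this would print 4 CPUs: [4, 5, 6, 7].
allowed = sorted(os.sched_getaffinity(0))
print(f"{len(allowed)} logical CPUs allowed: {allowed}")
```

If 4-7 happen to be two hyper-thread pairs, the mask still contains 4 entries; they simply map onto 2 physical cores, and 4 OpenMP threads will time-share those 2 cores.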

Where is OpenMP library getting invoked in binding threads to physical cores?

AFAIK, this depends on the implementation of the OpenMP runtime used (eg. GOMP, IOMP, etc.). The initialization of the OpenMP runtime is often done lazily, when the first parallel section is encountered. For the binding, some runtimes read /proc/cpuinfo manually while others use hwloc. If you want deterministic bindings, then you should use the OMP_PLACES and OMP_PROC_BIND environment variables to tell the runtime to bind threads using a custom, user-defined method rather than the default one.


If you want to be safe and portable, use the following configuration (using Bash):

export OMP_NUM_THREADS=4
export OMP_PROC_BIND=TRUE
export OMP_PLACES="{$(hwloc-calc --physical-output --sep "},{" --intersect PU core:all.pu:0)}"

The OpenMP threads will be scheduled on the OpenMP places. The above configuration configures the OpenMP runtime so that 4 threads are statically mapped to 4 different fixed cores.
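As for the question's parenthetical: yes, launching many independently bound processes is just building N separate command lines, each with its own numactl binding and its own OMP_NUM_THREADS; no MPI is needed. A hypothetical dry-run sketch (the script name main.py, worker count, and CPU ranges are illustrative assumptions):

```python
# Build (but do not execute) numactl command lines for independent workers,
# each pinned to its own block of 4 logical CPUs on NUMA node 0.
def worker_commands(n_workers=4, cpus_per_worker=4, node=0):
    cmds = []
    for i in range(n_workers):
        lo = i * cpus_per_worker
        hi = lo + cpus_per_worker - 1
        cmds.append([
            "numactl", f"--physcpubind={lo}-{hi}", f"--membind={node}",
            "python", "-u", "main.py",
        ])
    return cmds

for cmd in worker_commands():
    # In a real launcher each worker would also get OMP_NUM_THREADS=4
    # (and OMP_PROC_BIND/OMP_PLACES) in its environment before exec.
    print(" ".join(cmd))
```

Each printed line corresponds to one shell invocation; the second worker, for example, gets --physcpubind=4-7, matching the command in the question.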