Understanding why numactl fails with --membind=1 or 3 when lscpu shows 4 NUMA nodes


I've been trying to figure out why this numactl command fails, but it looks like maybe I don't fully understand the way numactl or OMP_NUM_THREADS works.

I'm trying to run one instance of a script main.py bound to 4 CPUs of NUMA node 1 using numactl --physcpubind=24-27 --membind=1 python -u main.py, since lscpu shows CPUs 24-27 belonging to NUMA node 1.

But I get the following error.

libnuma: Warning: node argument 1 is out of range
<1> is invalid

If I use --membind=3 I get the same error, but it runs when I use --membind=2.

Questions:

1. For NUMA node 0, are all of the CPUs 0-23 in 0-23,96-119 physical cores, or are only some of 0-23 physical cores, given that there are 2 threads per core? How do I know which of `0-23,96-119` are physical cores and which are the 2nd threads?

2. Am I binding the physical cores to the nodes correctly? Why does the command above fail?

3. Which 2 NUMA nodes are on socket 0 and which are on socket 1?

Outputs:

lscpu:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          192
On-line CPU(s) list:             0-191
Thread(s) per core:              2
Core(s) per socket:              48
Socket(s):                       2
NUMA node(s):                    4
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz
Stepping:                        7
Frequency boost:                 enabled
CPU MHz:                         1000.026
CPU max MHz:                     2301,0000
CPU min MHz:                     1000,0000
BogoMIPS:                        4600.00
L1d cache:                       3 MiB
L1i cache:                       3 MiB
L2 cache:                        96 MiB
L3 cache:                        143 MiB
NUMA node0 CPU(s):               0-23,96-119
NUMA node1 CPU(s):               24-47,120-143
NUMA node2 CPU(s):               48-71,144-167
NUMA node3 CPU(s):               72-95,168-191

numactl --hardware:

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 0 size: 64106 MB
node 0 free: 28478 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167
node 2 size: 64478 MB
node 2 free: 45446 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3 
  0:  10  21  21  21 
  1:  21  10  21  21 
  2:  21  21  10  21 
  3:  21  21  21  10 

1 Answer

Answered by Gilles

The issue here is that some of your NUMA nodes aren't populated with any memory. You can see that in the output of the numactl --hardware command, which shows a size of 0 MB for the memory on nodes 1 and 3. Therefore, trying to bind memory to these nodes is a lost battle...
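As a rough sketch of a possible workaround (assuming it's acceptable for your run to use memory that is remote to those cores), you could keep the CPU binding on node 1 but take the memory from a node that is actually populated, for example node 0, or only prefer a node instead of strictly binding to it:

numactl --physcpubind=24-27 --membind=0 python -u main.py     # strict: allocate only on node 0
numactl --physcpubind=24-27 --preferred=0 python -u main.py   # softer: prefer node 0, fall back elsewhere if it fills up

That only works around the symptom, though; the root cause of nodes 1 and 3 reporting 0 MB still needs investigating.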

Just a side note: 9242 CPUs are normally (well, AFAIK) only available with soldered-on memory modules, so it is very unlikely that there are missing memory DIMMs on your machine. So either there's something very wrong at the hardware level on your machine, or there's a layer of virtualization of some sort which hides part of the memory from you. Either way, the configuration is very wrong and needs to be investigated further.

EDIT: Answering the extra questions

  1. Physical core vs. HW thread numbering: when hyperthreading is enabled, there's no separate numbering of the physical cores anymore. All CPUs seen by the OS are actually HW threads. Simply put, in your case, physical core 0 is seen as the 2 logical cores 0 and 96, physical core 1 is seen as logical cores 1 and 97, and so on... (see the topology commands after this list)

  2. Numactl failure: already answered

  3. NUMA node numbering: generally speaking, it depends on the BIOS of the machine. There are 2 main options for numbering when you have N physical sockets on a node with P cores each. These 2 options are the following (the naming is mine, I'm not sure there's an official one):

    1. Spreading:
      • Socket 0: cores 0, N, 2N, 3N, ..., (P-1)N
      • Socket 1: cores 1, N+1, 2N+1, ..., (P-1)N+1
      • ...
      • Socket N-1: cores N-1, 2N-1, ..., PN-1
    2. Linear:
      • Socket 0: cores 0, 1, ..., P-1
      • Socket 1: cores P, P+1, ..., 2P-1
      • ...
      • Socket N-1: cores (N-1)P, ..., NP-1

    And if hyperthreading is activated, you just add P cores per socket, and number them so that cores numbered C and C+PN are actually the 2 HW threads of the same physical core.

    In your case here, you are seeing linear numbering: cores 0-47 (plus their HW-thread siblings 96-143) are on socket 0, i.e. NUMA nodes 0 and 1, and cores 48-95 (plus siblings 144-191) are on socket 1, i.e. NUMA nodes 2 and 3.

  4. numactl --physcpubind=0-3: this restricts the set of logical cores the command you launch is allowed to be scheduled on to the list passed as parameter, namely cores 0, 1, 2 and 3. But that doesn't force the code you launch to use more than one core at a time. For OpenMP codes, you still need to set the OMP_NUM_THREADS environment variable for that (see the combined example after this list).

  5. OMP_NUM_THREADS and HT: OMP_NUM_THREADS only tells how many threads to launch; it doesn't care about cores, be they physical or logical.

  6. Distances reported by numactl: I'm not too sure of the exact meaning / accuracy of the values reported, but here is how I interpret them when I need to: to me they correspond to relative memory access latencies. I don't know whether the values are measured or just guessed, or whether they are cycles or nanoseconds, but here is what it says:

    • Cores from NUMA node 0 have an access latency of 10 to memory attached to NUMA node 0 and of 21 to memory attached to all other NUMA nodes
    • Cores from NUMA node 1 have an access latency of 10 to memory attached to NUMA node 1 and of 21 to memory attached to all other NUMA nodes
    • etc.
      But the crucial point is that accessing distant memory takes 2.1 times longer than accessing local memory.
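To make questions 1 and 3 concrete, here is a small sketch of how you can inspect the HW-thread pairing and the socket layout directly under Linux (the example output values are what I'd expect from your lscpu, so treat them as assumptions to verify):

lscpu -e=CPU,NODE,SOCKET,CORE
# one line per logical CPU; two lines sharing the same CORE id are the 2 HW threads of one physical core

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
# sibling pair of CPU 0, presumably "0,96" on your machine

cat /sys/devices/system/cpu/cpu24/topology/physical_package_id
# socket id that CPU 24 (NUMA node 1) belongs to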
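And for questions 4 and 5, a minimal sketch of combining the two mechanisms, assuming main.py runs OpenMP-parallel code and using --membind=2 since that node actually has memory (4 threads is just an example matching the 4 bound CPUs):

OMP_NUM_THREADS=4 numactl --physcpubind=24-27 --membind=2 python -u main.py
# numactl limits where the process may run and allocate memory; OMP_NUM_THREADS tells the OpenMP runtime how many threads to start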