SLURM: see how many cores per node, and how many cores per job

I have searched Google and read the documentation.

My local cluster is using SLURM. I want to check the following things: How many cores does each node have? How many cores has each job in the queue reserved?

Any advice would be much appreciated!

There are 3 answers

Bub Espinja On BEST ANSWER

To see the details of all the nodes, you can use:

scontrol show node

For a specific node:

scontrol show node "nodename"
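The scontrol output is verbose, so it can help to pull out just the CPU count per node. The following is a minimal sketch, assuming the one-line-per-node output of scontrol -o show node and its NodeName= and CPUTot= fields; the helper name node_cpu_totals is made up for illustration.

```shell
# Print "<node> <total CPUs>" for every node record read from stdin.
# Expects one record per line, as produced by `scontrol -o show node`.
node_cpu_totals() {
  awk '{
    name = ""; cpus = ""
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^NodeName=/) name = substr($i, 10)  # text after "NodeName="
      if ($i ~ /^CPUTot=/)   cpus = substr($i, 8)   # text after "CPUTot="
    }
    if (name != "" && cpus != "") print name, cpus
  }'
}

# On a cluster (not run here): scontrol -o show node | node_cpu_totals
```

Reading from stdin keeps the helper usable on saved output as well as on a live cluster.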

For the number of cores allocated to (or requested by) each job, use the format specifier %C, for instance:

squeue -o"%.7i %.9P %.8j %.8u %.2t %.10M %.6D %C"

More information about the format specifiers is in the squeue man page, under the -o, --format option.
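If the per-job list is long, the %C column can be totalled per user. This is only a sketch: it assumes input lines of the form "user cpus", such as squeue -h -o "%u %C" would print (-h suppresses the header), and cpus_per_user is a hypothetical helper name.

```shell
# Sum the CPU counts per user from "user cpus" lines on stdin.
# Output is sorted by user name so it is stable across runs.
cpus_per_user() {
  awk '{ total[$1] += $2 } END { for (u in total) print u, total[u] }' | sort
}

# On a cluster (not run here): squeue -h -o "%u %C" | cpus_per_user
```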

damienfrancois On

You can get most information about the nodes in the cluster with the sinfo command, for instance with:

sinfo --Node --long

This gives condensed information about, among other things, the partition, node state, number of sockets, cores, threads, memory, disk and features. It is slightly easier to read than the output of scontrol show nodes.
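Because SLURM's CPU count may refer to hardware threads rather than physical cores, the sockets/cores/threads figures can be multiplied out explicitly. The sketch below assumes input lines of the form "node sockets cores_per_socket threads_per_core", which sinfo -N -h -o "%n %X %Y %Z" should produce; the helper name core_counts is an illustration, and the values reflect whatever is configured for the cluster.

```shell
# For each "node sockets cores_per_socket threads_per_core" line on stdin,
# print "<node> <physical cores> <hardware threads>".
core_counts() {
  awk '{ print $1, $2 * $3, $2 * $3 * $4 }'
}

# On a cluster (not run here):
#   sinfo -N -h -o "%n %X %Y %Z" | sort -u | core_counts
```

The sort -u in the usage line is there because a node that belongs to several partitions can be listed more than once.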

As for the number of CPUs for each job, see @Sergio Iserte's answer.

See the sinfo man page for details.

jimh On

To build on @damienfrancois's answer:

I found that sinfo was the most useful, but with different command arguments. If you just want to know the cores per node, memory per node, availability, and how much of each node is currently free, do the following.

For quick node status: sinfo -o "%n %e %m %a %c %C"

Output looks like:

HOSTNAMES FREE_MEM MEMORY AVAIL CPUS CPUS(A/I/O/T)
m-4-06 301585 950000 up 96 88/8/0/96
m-4-07 654944 950000 up 72 71/1/0/72
m-4-09 628696 950000 up 72 49/23/0/72
c-0-02 36741 115000 up 24 24/0/0/24
c-0-03 47512 115000 up 24 24/0/0/24
m-2-01 699025 950000 up 72 72/0/0/72

HOSTNAMES lists the nodes of the cluster. If you want to submit to a specific node, this is the name to request.

FREE_MEM tells you how much memory that node has free in MB.

MEMORY tells you the total memory of that node, i.e. what is available when it is unused, in MB.

AVAIL tells you whether the node's partition is up or not (useful if you are having issues).

CPUS tells you the total number of CPUs on that node, regardless of how many are in use.

CPUS(A/I/O/T) tells you the number of allocated/idle/other/total CPUs. Allocated CPUs are currently in use by jobs and therefore unavailable. Idle CPUs are immediately available for use. Other means they could be down or in some other unavailable state, and total just repeats the total number of CPUs.

More details on the output of this command and how to format it can be found in the sinfo man page, under the -o, --format option.
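To find nodes with free capacity, the idle count can be split out of the A/I/O/T field described above. A minimal sketch, assuming input lines of the form "hostname A/I/O/T" as printed by sinfo -h -o "%n %C"; the helper name idle_cpus and its optional minimum argument are made up for illustration.

```shell
# Read "hostname allocated/idle/other/total" lines on stdin and print
# "<hostname> <idle CPUs>" for nodes with at least min_idle idle CPUs.
# min_idle defaults to 1.
idle_cpus() {
  min_idle="${1:-1}"
  awk -v min="$min_idle" '{
    split($2, c, "/")   # c[1]=allocated c[2]=idle c[3]=other c[4]=total
    if (c[2] + 0 >= min + 0) print $1, c[2]
  }'
}

# On a cluster (not run here): sinfo -h -o "%n %C" | idle_cpus 8
```

Fed the sample rows from the output above, idle_cpus 8 would keep m-4-06 (8 idle) and m-4-09 (23 idle) and drop the fully allocated nodes.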