mpirun: ORTE daemon has unexpectedly failed


I'm on a fresh install of a Slurm (version 20.11.9) cluster with 4 nodes on CentOS 8 Stream, connected over Mellanox InfiniBand. The Mellanox drivers were built from this ISO: https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/ (version 5.8-2.0.3.0-LTS for RHEL/Rocky 8.6).

I added support for my kernel with mlnx_add_kernel_support.sh.

Everything seems to be set up correctly:

  • both the openibd and opensmd services are running without apparent errors
  • ibstat returns output that looks correct
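
The two checks above can be scripted roughly as follows. This is only a sketch: the report helper is a hypothetical name introduced here, and it looks at exit statuses only, so it does not replace reading the ibstat output (port State and Physical state) by eye.

```shell
# Hypothetical helper: run a command quietly and print its name with ok/FAILED,
# based only on the command's exit status.
report() {
    name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "$name: ok"
    else
        echo "$name: FAILED"
    fi
}

# Service names are the ones mentioned above.
report openibd systemctl is-active openibd
report opensmd systemctl is-active opensmd
# ibstat exits non-zero if it cannot query the adapter.
report ibstat ibstat
```

Running this on each node makes it easier to compare them, since a node that differs from the others stands out in three lines of output.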

I compiled Open MPI 4.1.1 (./configure --disable-io-ompio --enable-mpi-thread-multiple --without-openib --without-verbs --with-ucx=$ucx --with-hwloc=/usr -enable-shared --prefix $sw) with icc v20, UCX 1.11.2 and gcc 9 (from gcc-toolset). The same kind of build works on another running cluster on CentOS 7.

When I run mpirun hostname on a single machine, it works.

But if I do the same with 2 nodes in an interactive job (srun --nodes=2 --ntasks-per-node=1 --pty bash -i), it fails:

[1019]user@node01:~ $ mpirun hostname
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

Here is the log with more verbosity (mpirun -debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 hostname) : https://pastebin.com/680azqpa

Short version :

[node01:2220381] [[63667,0],0] plm:slurm: final top-level argv:
        srun --ntasks-per-node=1 --kill-on-bad-exit --mpi=none --nodes=1 --nodelist=node02 --ntasks=1 orted -mca orte_debug_daemons "1" -mca ess "slurm" -mca ess_base_jobid "4172480512" -mca ess_base_vpid "1" -mca ess_base_num_procs "2" -mca orte_node_regex "node[2:01-02]@0(2)" -mca orte_hnp_uri "4172480512.0;tcp://<ip_node_01>:51187" --mca plm_base_verbose "5" -mca oob_base_verbose "10" -mca rml_base_verbose "10"
srun: error: Unable to create step for job 18: Requested node configuration is not available
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).

What checks can I run? How can I find out where the problem is?
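
One candidate check that takes Open MPI out of the picture is to re-run by hand, from inside the same interactive allocation, the step request that mpirun generated in the log above, with hostname in place of orted and its arguments. A minimal sketch: the step_cmd helper is a hypothetical name introduced here, and the srun flags are copied from the log rather than chosen independently.

```shell
# Hypothetical helper: print the srun step request seen in the log,
# with "hostname" substituted for "orted ..." so that only Slurm is exercised.
step_cmd() {
    node=$1   # target node, e.g. node02 as in the log
    printf 'srun --ntasks-per-node=1 --kill-on-bad-exit --mpi=none --nodes=1 --nodelist=%s --ntasks=1 hostname\n' "$node"
}

# Print the command for inspection, then run the printed line by hand
# in the interactive shell obtained from srun --pty.
step_cmd node02
```

If the printed command fails with the same "Unable to create step" error, that would point at Slurm step creation inside the allocation rather than at the network path between the nodes.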
