openMPI/mpich2 doesn't run on multiple nodes


I am trying to install openMPI and mpich2 on a multi-node cluster, and in both cases I am having trouble running on multiple machines. Using mpich2, I am able to run on a specific host from the head node, but if I try to run something from a compute node on a different node, I get:

HYDU_sock_connect (utils/sock/sock.c:172): unable to connect from "destination_node" to "parent_node" (No route to host)
[proxy:0:0@destination_node] main (pm/pmiserv/pmip.c:189): unable to connect to server parent_node at port 56411 (check for firewalls!)

If I try to use SGE to set up a job, I get similar errors.

On the other hand, if I try to use openMPI to run jobs, I am not able to run on any remote machine, even from the head node. I get:

ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).

The machines are connected to each other: I can ping and ssh without a password from any of them to any other, and MPI_LIB and PATH are set correctly on all machines.


There is 1 answer

Answer by Wesley Bland

This usually happens when you haven't set up a hostfile or passed the list of hosts on the command line.

For MPICH, you do this by passing the flag -host on the command line, followed by a list of hosts (host1,host2,host3,etc.).

mpiexec -host host1,host2,host3 -n 3 <executable>

You can also put these in a file:

host1
host2
host3

Then you pass that file on the command line like so:

mpiexec -f <hostfile> -n 3 <executable>
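As a rough sketch of the hostfile approach, the snippet below writes the three placeholder host names into a file and prints the matching MPICH command. The file name hosts.txt and the program name ./my_mpi_program are assumptions for illustration, and the command is only echoed, not executed.

```shell
#!/bin/sh
# Write one host name per line (host1..host3 are placeholders).
hostfile=hosts.txt
printf '%s\n' host1 host2 host3 > "$hostfile"

# Count the hosts so that -n matches the number of lines in the file.
nprocs=$(wc -l < "$hostfile" | tr -d ' ')

# Print the command instead of running it; drop "echo" on a real cluster.
echo mpiexec -f "$hostfile" -n "$nprocs" ./my_mpi_program
```

Replace the placeholder names with the real node names, one per line, before running the command for real.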

Similarly, with Open MPI, you would use:

mpiexec --host host1,host2,host3 -n 3 <executable>

and

mpiexec --hostfile hostfile -n 3 <executable>
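An Open MPI hostfile can also state how many processes each host should take, using a slots=N entry after the host name. The sketch below assumes two slots per host; the file name ompi_hosts.txt, the slot count, and ./my_mpi_program are illustrative assumptions, and the command is only echoed, not executed.

```shell
#!/bin/sh
# Build an Open MPI style hostfile: "<host> slots=<n>" on each line.
hostfile=ompi_hosts.txt
slots=2                      # assumed number of processes per host
: > "$hostfile"              # start from an empty file
for h in host1 host2 host3; do
    printf '%s slots=%s\n' "$h" "$slots" >> "$hostfile"
done

# Total process count = number of hosts * slots per host.
total=$(( $(wc -l < "$hostfile") * slots ))

# Print the command instead of running it; drop "echo" on a real cluster.
echo mpiexec --hostfile "$hostfile" -n "$total" ./my_mpi_program
```

Set the slot count to the number of processes each node should actually run, which is often its core count.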

You can get more information at these links: