openmpi: MPI_Recv hangs for specific numbers of processes


I am running an HPC benchmark (IOR - http://sourceforge.net/projects/ior-sio/) on Lustre. I compiled IOR from source and am running it with OpenMPI 1.5.3.

The problem is that it hangs when the number of processes (-np) is less than 6, which is odd. Stripping away everything else I do around it, the actual command I run comes down to this:

/usr/lib64/openmpi/bin/mpirun --machinefile mpi_hosts --bynode -np 16 /path/to/IOR -F -u -t 1m -b 16g -i 1 -o /my/file/system/out_file

Attaching GDB to the process shows that it hangs in MPI_Recv:

#0  0x00007f3ac49e95fe in mlx4_poll_cq () from /usr/lib64/libmlx4-m-rdmav2.so
#1  0x00007f3ac6ce0918 in ?? () from /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so
#2  0x000000385a6f0d5a in opal_progress () from /usr/lib64/openmpi/lib/libmpi.so.1
#3  0x00007f3ac7511e05 in ?? () from /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so
#4  0x000000385a666cac in PMPI_Recv () from /usr/lib64/openmpi/lib/libmpi.so.1
#5  0x0000000000404bd7 in CountTasksPerNode (numTasks=16, comm=0x628a80) at IOR.c:526
#6  0x0000000000407d53 in SetupTests (argc=11, argv=0x7fffe61fa2a8) at IOR.c:1402
#7  0x0000000000402e96 in main (argc=11, argv=0x7fffe61fa2a8) at IOR.c:134

This problem happens only when -np is 2/3/4/5. It works for 1/6/7/8/16 etc.

I can't reproduce this problem with simple commands such as date or ls, so I am not sure whether it is a problem with my environment or with the IOR binary I compiled (the latter seems very unlikely, because the same thing happens with an older/stable IOR binary too).

The same behaviour is also observed when using OpenMPI 1.4.3 instead of OpenMPI 1.5.3.

I have also tried using various numbers of hosts (--machinefile argument), and the same behaviour is observed for the above-mentioned -np values. The source line it hangs on is this:

MPI_Recv(hostname, MAX_STR, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &status);
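
For context, my understanding of the surrounding pattern in CountTasksPerNode is roughly the following (a paraphrased sketch, not the exact IOR source; the function and buffer names in the sketch are my own simplification): every non-zero rank sends its hostname to rank 0, and rank 0 receives them one by one with MPI_ANY_SOURCE, which is the receive the backtrace is stuck in.

#include <mpi.h>
#include <unistd.h>

#define MAX_STR 256

/* Paraphrased sketch of the hostname exchange around IOR.c:526 */
static void CountTasksPerNodeSketch(int numTasks, MPI_Comm comm)
{
    char localhost[MAX_STR], hostname[MAX_STR];
    int rank;

    MPI_Comm_rank(comm, &rank);
    gethostname(localhost, MAX_STR);

    if (rank == 0) {
        /* collect one hostname from every other rank */
        for (int i = 1; i < numTasks; i++) {
            MPI_Status status;
            MPI_Recv(hostname, MAX_STR, MPI_CHAR, MPI_ANY_SOURCE,
                     MPI_ANY_TAG, comm, &status);
            /* ... compare against localhost and count tasks per node ... */
        }
    } else {
        /* every other rank sends its hostname to rank 0 */
        MPI_Send(localhost, MAX_STR, MPI_CHAR, 0, 0, comm);
    }
}

int main(int argc, char **argv)
{
    int numTasks;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numTasks);
    CountTasksPerNodeSketch(numTasks, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}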

Basically I am looking for clues as to why MPI_Recv() might hang when -np is 2/3/4/5. Please let me know if other information is needed. Thanks.


1 Answer

wolfPack88

First thing that comes to mind: MPI_Recv is a blocking receive and will wait until a matching MPI_Send is posted. However, if the message being sent is small enough (i.e., it fits in the internal "eager" buffer that MPI sets aside for such transfers), the matching send effectively completes immediately and the sender carries on through the code. At higher core counts you may be sending less data with each MPI_Send/MPI_Recv pair, so everything fits in that buffer and the run continues on its way. At lower core counts there is too much data to fit in the buffer, and MPI_Recv hangs because an appropriate MPI_Send never completed to get the information there.

A quick and easy way to test this hypothesis: decrease the problem size substantially. Does it still hang at those core counts? If not, that is further evidence for the hypothesis, and you will need to provide more code so we can see what the issue is.
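
To make the buffering point concrete, here is a minimal toy program (my own sketch, unrelated to IOR; the 5-second sleep and the message sizes are arbitrary choices) showing the difference: rank 0 delays posting its receive, so a small MPI_Send on rank 1 typically returns almost immediately via the eager path, while a large one blocks until the receive is finally posted.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size;
    int count = (argc > 1) ? atoi(argv[1]) : 1024;   /* message size in bytes */
    char *buf = malloc(count);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0)
            fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 1) {
        /* time how long the blocking send takes to return */
        double t0 = MPI_Wtime();
        MPI_Send(buf, count, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        printf("MPI_Send of %d bytes returned after %.3f s\n",
               count, MPI_Wtime() - t0);
    } else if (rank == 0) {
        sleep(5);                       /* delay posting the receive */
        MPI_Recv(buf, count, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    free(buf);
    return 0;
}

Run it as, e.g., "mpirun -np 2 ./eager_test 1024" and then with something like 10000000: the small send should report near-zero time, while the large one should take roughly as long as the receiver's sleep.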