I am a new user of MVAPICH2, and I ran into trouble when getting started with it.
First, I believe I installed it successfully with the following:
    ./configure --disable-fortran --enable-cuda
    make -j 4
    make install
There were no errors.
But when I tried to run the cpi example from the examples directory, I ran into the following:
- I can connect to the nodes gpu-cluster-1 and gpu-cluster-4 through ssh without a password; 
- I ran the cpi example separately on gpu-cluster-1 and gpu-cluster-4 using mpirun_rsh, and it worked fine: 
 run@gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 gpu-cluster-1 gpu-cluster-1 ./cpi
 Process 0 of 2 is on gpu-cluster-1
 Process 1 of 2 is on gpu-cluster-1
 pi is approximately 3.1415926544231318, Error is 0.0000000008333387
 wall clock time = 0.000089
 run@gpu-cluster-4:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 gpu-cluster-4 gpu-cluster-4 ./cpi
 Process 0 of 2 is on gpu-cluster-4
 Process 1 of 2 is on gpu-cluster-4
 pi is approximately 3.1415926544231318, Error is 0.0000000008333387
 wall clock time = 0.000134
- I ran the cpi example across both gpu-cluster-1 and gpu-cluster-4 using mpiexec, and it worked fine: 
 run@gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpiexec -np 2 -f hostfile ./cpi
 Process 0 of 2 is on gpu-cluster-1
 Process 1 of 2 is on gpu-cluster-4
 pi is approximately 3.1415926544231318, Error is 0.0000000008333387
 wall clock time = 0.000352
 The hostfile contains two lines: gpu-cluster-1 and gpu-cluster-4.
- But when I ran the cpi example across both gpu-cluster-1 and gpu-cluster-4 using mpirun_rsh, a problem appeared: 
 run@gpu-cluster-1:~/mvapich2-2.1rc1/examples$ mpirun_rsh -ssh -np 2 -hostfile hostfile ./cpi
 Process 1 of 2 is on gpu-cluster-4
 ----------------- It hung here and did not continue ------------------------
 After a long time, I pressed Ctrl + C, and it printed this:
 ^C[gpu-cluster-1:mpirun_rsh][signal_processor] Caught signal 2, killing job
 run@gpu-cluster-1:~/mvapich2-2.1rc1/examples$ [gpu-cluster-4:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 6. MPI process died?
 [gpu-cluster-4:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 6. MPI process died?
 [gpu-cluster-4:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?
 [gpu-cluster-4:mpispawn_1][report_error] connect() failed: Connection refused (111)
 I have been stuck on this for a long time. Could you help me resolve this problem?
Here is the code of the cpi example:
#include "mpi.h"
#include <stdio.h>
#include <math.h>
double f(double);
double f(double a)
{
    return (4.0 / (1.0 + a*a));
}
int main(int argc,char *argv[])
{
    int    n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int    namelen;
    char   processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    MPI_Get_processor_name(processor_name,&namelen);
    fprintf(stdout,"Process %d of %d is on %s\n",
            myid, numprocs, processor_name);
    fflush(stdout);
    n = 10000;          /* default # of rectangles */
    if (myid == 0)
        startwtime = MPI_Wtime();
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    h   = 1.0 / (double) n;
    sum = 0.0;
    /* A slightly better approach starts from large i and works back */
    for (i = myid + 1; i <= n; i += numprocs)
    {
        x = h * ((double)i - 0.5);
        sum += f(x);
    }
    mypi = h * sum;
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) {
        endwtime = MPI_Wtime();
        printf("pi is approximately %.16f, Error is %.16f\n",
               pi, fabs(pi - PI25DT));
        printf("wall clock time = %f\n", endwtime-startwtime);         
        fflush(stdout);
    }
    MPI_Finalize();
    return 0;
}