What's the relationship between Sun Grid Engine (SGE) process number and OpenMPI process number?


When running MPI applications on an SGE cluster, I have to specify two process counts: one for SGE itself and one for Open MPI. For example:

qrsh -pe <pe_name> <number1> mpirun -np <number2> ./program

What are the meanings of number1 and number2 in the command? What's the relationship between them?

If I need 128 (for number2) processes for my MPI application, and I assign 16 to number1, what would happen?

edit:

The following is the PE configuration:

pe_name           impl
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   NONE
stop_proc_args    NONE
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

Answer by Hristo Iliev:

The answer depends on how the <pe_name> parallel environment (PE) is configured. In general, -pe <pe_name> <number1> requests <number1> slots in the <pe_name> PE. A PE can be configured to provide a fixed number of slots per node, to fill up the available slots on one node before moving to the next, to always allocate all slots on the same node, and so on. A slot in SGE usually corresponds to a CPU core, but it is entirely up to the SGE administrator to decide whether that is the case.

-np <number2> tells Open MPI how many processes to launch within the MPI job. In many cases this number should equal the number of SGE slots requested. If Open MPI was built with SGE integration, it automatically obtains the total number of slots granted by the batch system, and explicitly specifying the number of processes is necessary only in some special cases.
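
As a minimal sketch of keeping the two numbers in sync: inside a parallel job, SGE exports the granted slot count in the NSLOTS environment variable, so the launch line can reuse it instead of hard-coding <number2>. The helper name mpi_np and the fallback of 1 outside SGE are assumptions made for illustration, and the command is only printed here, not executed.

```shell
#!/bin/sh
# Hypothetical helper: choose the MPI process count from the SGE grant.
# NSLOTS is set by SGE inside a parallel job; the fallback of 1 is an
# assumption so the sketch also works outside the batch system.
mpi_np() {
    echo "${NSLOTS:-1}"
}

# Print the launch command instead of running it, so the sketch can be
# tried on a machine without SGE or Open MPI installed.
echo "mpirun -np $(mpi_np) ./program"
```

With an SGE-aware Open MPI build, the -np option can usually be left out altogether, as described above.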

Again, it all depends on how SGE is configured. Without details about your cluster, e.g. the output of qconf -sp <pe_name>, you won't get a very concrete answer.
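
To pick out the relevant fields from that output, something like the following could be used. It is only a sketch: pe_field is a hypothetical helper name, and the sample text is copied from the PE configuration in the question rather than read from a live cluster.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one field from
# `qconf -sp`-style output supplied on standard input.
pe_field() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Sample input taken from the question; single quotes keep
# $round_robin literal instead of expanding it as a shell variable.
pe_config='slots             999
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE'

# Show how slots are distributed across hosts and whether SGE
# controls the remotely started tasks.
printf '%s\n' "$pe_config" | pe_field allocation_rule
printf '%s\n' "$pe_config" | pe_field control_slaves
```

On a cluster the same helper would be fed from the real command, e.g. qconf -sp <pe_name> | pe_field allocation_rule.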