When running MPI applications on an SGE cluster, I have to specify two process counts: one for SGE itself and one for Open MPI. For example:
qrsh -pe <pe_name> <number1> mpirun -np <number2> ./program
What do number1 and number2 mean in this command, and how are they related?
If I need 128 processes (number2) for my MPI application and I assign 16 to number1, what would happen?
edit:
The following is the PE configuration:
pe_name            impl
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
The answer depends on how the <pe_name> parallel environment (PE) is configured. In general, -pe <pe_name> <number1> requests <number1> slots in the <pe_name> PE. A PE can be configured to provide a fixed number of slots per node, to fill up the available slots on one node before moving to the next, to always allocate all slots on the same node, and so on. A slot in SGE usually corresponds to a CPU core, but it is entirely up to the SGE administrator to decide whether that is the case.
-np <number2> tells Open MPI how many processes to launch within the MPI job. In many cases this number should be equal to the number of SGE slots requested. If Open MPI was built with SGE integration, it automatically gets the total number of granted slots from the batch system, and explicitly specifying the number of processes is only necessary in some special cases.
Again, it all depends on how SGE is configured. Without details about your cluster, e.g. the output of qconf -sp <pe_name>, you won't get a very concrete answer.
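
One way to keep the two numbers in sync is to derive the process count from what SGE actually granted instead of typing it twice. The sketch below assumes the job runs inside a parallel environment, where SGE sets NSLOTS to the number of granted slots and PE_HOSTFILE to a file listing the granted hosts with the slot count in the second column. The helper names count_slots and pick_np are made up for illustration, and the fallback to 1 outside SGE is an assumption, not something SGE or Open MPI does.

```shell
#!/bin/sh
# Sketch: pick the MPI process count from the slots SGE granted.
# count_slots and pick_np are hypothetical helper names.

# Sum the slot column (field 2) of a PE hostfile.
# Each line looks like: <hostname> <slots> <queue> <processor range>
count_slots() {
    awk '{ total += $2 } END { print total + 0 }' "$1"
}

# Prefer NSLOTS, fall back to summing PE_HOSTFILE,
# and default to 1 when run outside SGE (assumption).
pick_np() {
    if [ -n "${NSLOTS:-}" ]; then
        echo "$NSLOTS"
    elif [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
        count_slots "$PE_HOSTFILE"
    else
        echo 1
    fi
}

# Intended use inside a job script submitted with -pe <pe_name> <number1>:
#   mpirun -np "$(pick_np)" ./program
```

With this approach <number2> always follows <number1>, so a mismatch such as requesting 16 slots but launching 128 processes cannot happen by accident.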