PBS/TORQUE: how do I submit a parallel job on multiple nodes?

So, right now I'm submitting jobs to a cluster with qsub, but they always seem to run on a single node. This is how I currently run them:

#PBS -l walltime=10
#PBS -l nodes=4:gpus=2
#PBS -r n
#PBS -N test

range_0_total=$(seq 0 $(expr $total - 1))    # $total = number of copies to launch

for i in $range_0_total
do
    $PATH_TO_JOB_EXEC/job_executable &    # launch a copy in the background
done
wait

I would be incredibly grateful if you could tell me if I'm doing something wrong, or if it's just that my test tasks are too small.

1 Answer

chuck:

With your approach, you would need to have your for loop go through all of the entries in the file pointed to by $PBS_NODEFILE, and then inside your loop you would need "ssh $i $PATH_TO_JOB_EXEC/job_executable &".
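For example, a minimal sketch of that loop (assuming job_executable is reachable at the same path on every node, e.g. over a shared filesystem) could look like:

# Read one allocated node name per line from the node file
# and start one copy of the program on each of them.
while read -r node
do
    # -n keeps ssh from consuming the remaining lines of $PBS_NODEFILE
    ssh -n "$node" $PATH_TO_JOB_EXEC/job_executable &
done < "$PBS_NODEFILE"
wait    # block until every remote copy has finished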

The other, easier way to do this would be to replace the for loop and wait with:

pbsdsh $PATH_TO_JOB_EXEC/job_executable

This would run a copy of your program on each core assigned to your job. If you need to modify this behavior, take a look at the options available in the pbsdsh man page.
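As a rough sketch, the whole submission script from the question could then shrink to something like the following (keeping the original #PBS directives; $PATH_TO_JOB_EXEC is assumed to be set or exported beforehand, e.g. via qsub -v):

#!/bin/bash
#PBS -l walltime=10
#PBS -l nodes=4:gpus=2
#PBS -r n
#PBS -N test

# Launch one copy of the program on every core assigned to the job;
# pbsdsh waits for all of the tasks to finish before returning.
pbsdsh $PATH_TO_JOB_EXEC/job_executable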