I want to be able to schedule a job on multiple nodes, with one process per node. I also want that process to use threads that use all of the cores available on that node. I know that "ppn" is used for scheduling PBS jobs, so I tried it with the Univa scheduler. The colon delimiter doesn't work, so I used two '-l' flags. I attempted
qsub -cwd -j y -l nodes=4 -l ppn=1 -N hellonodes mpirunscript.sh
This gives
Unable to run job: unknown resource "ppn".
Exiting.
In the man page of qsub it states
complex(5) describes how a list of available resources and their associated valid value specifiers can be obtained.
Unfortunately, no such documentation exists on the cluster I am using. However, I found one here. Eventually I discovered that to get the list of settable resource values, I needed to run
qconf -sc
This printed the following (abbreviated):
#name          shortcut  type    relop  requestable  consumable  default  urgency
#---------------------------------------------------------------------------------
...
cpu            cpu       DOUBLE  >=     YES          NO          0        0
...
m_numa_nodes   nodes     INT     <=     YES          NO          0        0
m_socket       socket    INT     <=     YES          NO          0        0
m_thread       thread    INT     <=     YES          NO          0        0
...
num_proc       p         INT     ==     YES          NO          0        0
...
slots          s         INT     <=     YES          YES         1        1000
...
"ppn" (processes per node for PBS) was not listed, nor was anything similar that I could find. Can anyone tell me if this is possible, and if so, how?
Since it is a parallel job, you need to request a parallel environment (PE) with -pe. The admin first has to create a parallel environment that fulfills your requirements. Once created, it is persistent and can be reused for this type of parallel job. See: http://www.gridengine.eu/mangridengine/htmlman5/sge_pe.html
To create a parallel environment: qconf -ap mype
To list all PEs: qconf -spl
Then attach the PE to your queue: qconf -mq all.q (in the case of all.q) and add mype to its pe_list.
The important setting is allocation_rule. Set it to 1, which means one process per compute host.
Set slots to a high value (such as the number of cores in your cluster). It is a limit shared by all jobs using this parallel environment.
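As a rough sketch of the settings described above, the following writes a PE definition to a file. The file name mype.conf, the slot count 9999, and every field value other than allocation_rule are assumptions for illustration. The exact set of fields differs between Grid Engine versions, so compare it with the output of qconf -sp for an existing PE on your cluster before using it.

```shell
#!/bin/sh
# Write a hypothetical PE definition to mype.conf (the file name is an assumption).
# allocation_rule 1 asks the scheduler to place exactly one slot per host.
# slots is the total available to all jobs using this PE; 9999 is a placeholder.
# control_slaves and job_is_first_task are typical values for tightly
# integrated MPI jobs and may need adjusting for your MPI setup.
cat > mype.conf <<'EOF'
pe_name            mype
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    1
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
EOF
echo "wrote mype.conf"
```

An admin could then load the file with qconf -Ap mype.conf instead of editing the PE interactively with qconf -ap mype.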
Then you or your users can submit the job: qsub -pe mype 8 myscript.sh
You then get 8 compute nodes for this job, with 1 slot each. qstat -g t shows you where the slots were placed.
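For the threading part of the question, here is a minimal sketch of what myscript.sh might look like. It relies on the NSLOTS and PE_HOSTFILE variables that Grid Engine sets for parallel jobs. The helper names cores_on_node and hosts_in_pe and the program name ./my_hybrid_program are hypothetical. It also assumes an OpenMP-style program, nodes with the same core count, and an MPI build with Grid Engine support so that mpirun finds the allocated hosts by itself.

```shell
#!/bin/sh
# Hypothetical job script for a PE with allocation_rule 1 (one slot per host).

# Number of online logical cores on the node running this script.
cores_on_node() { getconf _NPROCESSORS_ONLN; }

# Distinct host names from a PE hostfile (host name is the first column).
hosts_in_pe() { awk '{print $1}' "$1" | sort -u; }

# Only launch when running inside a Grid Engine parallel job.
if [ -n "${PE_HOSTFILE:-}" ]; then
    # One thread per core; assumes every node matches the node running this script.
    OMP_NUM_THREADS=$(cores_on_node)
    export OMP_NUM_THREADS

    echo "Allocated $NSLOTS slot(s) on these hosts:"
    hosts_in_pe "$PE_HOSTFILE"

    # One MPI process per slot. How the environment reaches remote ranks
    # depends on the MPI implementation; check its mpirun documentation.
    mpirun -np "$NSLOTS" ./my_hybrid_program
fi
```

Because the PE hands out one slot per host, NSLOTS equals the number of nodes, so mpirun starts one process on each node and each process can spawn as many threads as that node has cores.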
Does this help?
Daniel