I've been using a PBS-managed computing cluster at my school for a few years. A few months ago I ran into this problem, and the cluster administrators were never able to figure it out. When I submit a batch of jobs, they all queue and some start running immediately. My impression is that the jobs that ought to stay queued for lack of free resources instead die almost immediately. It happens intermittently and seems to depend on how many nodes are available to me at the time: if I submit, say, 10 jobs, the first two might run, the next three fail, and the remaining five run.
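For context, each of these is a single-core job; the submission looks roughly like the simplified sketch below, reconstructed from the qstat output further down. The walltime and node request may actually sit as #PBS directives inside the generated per-job script, and the mkuuid value and the script under run/tmp/ come from my wrapper.

# Simplified sketch of one submission; values match the qstat -f output
# further down. The real wrapper generates a per-job script under run/tmp/
# and a fresh mkuuid for each job.
cd /PATH/TOSCRIPT/run
qsub -l nodes=1:ppn=1 \
     -l walltime=50:00:00 \
     -l var=mkuuid:1e94a3e50dd44803bab2d3a7c2286ee2 \
     -v KRB5CCNAME \
     /PATH/TOSCRIPT/run/tmp/montec_1059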
No stdout or stderr files are created for the failed jobs; the jobs that do run create both. When a job dies I get an email from PBS, which I've included below with identifying information removed. Exit status -9 means "could not create/open stdout and stderr files", but I don't know how to fix that when the failure is this intermittent.
PBS Job Id: 11335.pearl.hpcc.XXX.edu
Job Name: mc1055
Exec host: m09/5
Aborted by PBS Server
Job cannot be executed
See Administrator for help
Exit_status=-9
resources_used.cput=00:00:00
resources_used.vmem=0kb
resources_used.walltime=00:00:02
resources_used.mem=0kb
resources_used.energy_used=0
req_information.task_count.0=1
req_information.lprocs.0=1
req_information.thread_usage_policy.0=allowthreads
req_information.hostlist.0=m09:ppn=1
req_information.task_usage.0.task.0={"task":{"cpu_list":"9","mem_list":"0","cores":0,"threads":1,"host":"m09"}}
Error_Path: CLUSTERNAME.hpcc.XXX.edu:/PATH/TOSCRIPT/run/mc1055.e11335
Output_Path: CLUSTERNAME.hpcc.XXX.edu:/PATH/TOSCRIPT/run/mc1055.o11335
I also looked at qstat -f on a job right as it failed; the output is below. If I don't catch it immediately, the record disappears from qstat entirely.
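To catch it before it is purged I poll in a loop, roughly like this simplified sketch (the script path is the one from submit_args; the real submission goes through my usual wrapper):

#!/bin/bash
# Submit one job, then keep overwriting a snapshot of its qstat -f record
# every second until the record is purged; the last snapshot survives.
JOBID=$(qsub /PATH/TOSCRIPT/run/tmp/montec_1059)
while OUT=$(qstat -f "$JOBID" 2>/dev/null); do   # loop ends once the record is gone
    printf '%s\n' "$OUT" > "qstat_${JOBID}.txt"
    sleep 1
done

The snapshot I caught for one of the failed jobs: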
Job Id: 11339.pearl.hpcc.XXX.edu
Job_Name = mc1059
Job_Owner = [email protected]
resources_used.cput = 00:00:00
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
resources_used.mem = 0kb
resources_used.energy_used = 0
job_state = C
queue = default
server = CLUSTERNAME.hpcc.XXX.edu
Account_Name = ADVISOR
Checkpoint = u
ctime = Mon Jan 4 20:02:25 2021
Error_Path = CLUSTERNAME.hpcc.XXX.edu:/PATH/TOSCRIPT/mc1059.e11339
exec_host = m09/9
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon Jan 4 20:03:14 2021
Output_Path = CLUSTERNAME.hpcc.XXX.edu:/PATH/TOSCRIPT/mc1059.o11339
Priority = 0
qtime = Mon Jan 4 20:02:25 2021
Rerunable = True
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 50:00:00
Resource_List.var = mkuuid:1e94a3e50dd44803bab2d3a7c2286ee2
Resource_List.nodect = 1
session_id = 0
Variable_List = PBS_O_QUEUE=largeq,
PBS_O_HOME=/PATH,PBS_O_LOGNAME=USERNAME,
PBS_O_PATH=lots of things
PBS_O_MAIL=/var/spool/mail/USERNAME,PBS_O_SHELL=/bin/bash,
PBS_O_LANG=en_US,KRB5CCNAME=FILE:/tmp/krb5cc_404112_hd4Yty,
PBS_O_WORKDIR=/PATH/TOSCRIPT/run,
PBS_O_HOST=CLUSTERNAME.hpcc.XXX.edu,
PBS_O_SERVER=CLUSTERNAME.hpcc.XXX.edu
euser = USERNAME
egroup = physics
queue_type = E
etime = Mon Jan 4 20:02:25 2021
exit_status = -9
submit_args = -l var=mkuuid:1e94a3e50dd44803bab2d3a7c2286ee2 -v KRB5CCNAME
/PATH/TOSCRIPT/run/tmp/montec_1059
start_time = Mon Jan 4 20:03:14 2021
start_count = 1
fault_tolerant = False
comp_time = Mon Jan 4 20:03:14 2021
job_radix = 0
total_runtime = 7.218811
submit_host = CLUSTERNAME.hpcc.XXX.edu
init_work_dir = /PATH/TOSCRIPT/run
request_version = 1
req_information.task_count.0 = 1
req_information.lprocs.0 = 1
req_information.thread_usage_policy.0 = allowthreads
req_information.hostlist.0 = m09:ppn=1
req_information.task_usage.0.task.0.cpu_list = 5
req_information.task_usage.0.task.0.mem_list = 1
req_information.task_usage.0.task.0.cores = 0
req_information.task_usage.0.task.0.threads = 1
req_information.task_usage.0.task.0.host = m09
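Since exit status -9 points at stdout/stderr file creation, the only check I can think of trying next (my own idea, nothing the admins suggested, and I'm not sure whether PBS writes these files straight to the final path or spools them somewhere on the node first) is to grab an interactive session on the node that just aborted a job and see whether the run directory is even reachable and writable at that moment. A rough sketch, with the node name and path taken from the output above:

# Request an interactive shell on the same node (m09) that just aborted a job.
qsub -I -l nodes=m09:ppn=1 -l walltime=00:10:00

# Inside the interactive session, check the directory the .o/.e files
# should end up in:
df -h /PATH/TOSCRIPT/run
touch /PATH/TOSCRIPT/run/pbs_write_test && echo writable || echo "NOT writable"
rm -f /PATH/TOSCRIPT/run/pbs_write_test

Is there anything else I can check, or any way to make the failure less intermittent so the admins can reproduce it?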