PBS jobs queue and then sometimes exit immediately


I've been using a PBS-managed computing cluster at my school for a few years. A few months ago I ran into this problem, and the admins were never able to figure it out. When I submit a batch of jobs, some run immediately, but the jobs that should stay queued because resources aren't available die almost immediately instead. It happens intermittently and seems to depend on how many nodes I can use at one time: I might submit, say, 10 jobs, and the first two run, the next three fail, then the next five run.
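
For context, each job is an independent run submitted with its own qsub call, roughly like the loop below (the index range, resource line, and script path here are simplified placeholders, not the real values):

    # Rough sketch of the submission pattern: one independent qsub per job.
    # Index range and script path are placeholders for the actual values.
    for i in $(seq 1055 1064); do
        qsub -N "mc${i}" -l nodes=1:ppn=1,walltime=50:00:00 \
             -v KRB5CCNAME "run/tmp/montec_${i}"
    done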

No stdout or stderr files are created for the failed jobs; the jobs that do run create them as usual. I get an email when a job dies, which I've included below with some identifying information removed. Exit status -9 means "Could not create/open stdout stderr files", but I don't know how to fix that, since the problem is so intermittent.

PBS Job Id: 11335.pearl.hpcc.XXX.edu
Job Name:   mc1055
Exec host:  m09/5
Aborted by PBS Server
Job cannot be executed
See Administrator for help
Exit_status=-9
resources_used.cput=00:00:00
resources_used.vmem=0kb
resources_used.walltime=00:00:02
resources_used.mem=0kb
resources_used.energy_used=0
req_information.task_count.0=1
req_information.lprocs.0=1
req_information.thread_usage_policy.0=allowthreads
req_information.hostlist.0=m09:ppn=1
req_information.task_usage.0.task.0={"task":{"cpu_list":"9","mem_list":"0","cores":0,"threads":1,"host":"m09"}}
Error_Path: CLUSTERNAME.hpcc.XXX.edu:/PATH/TOSCRIPT/run/mc1055.e11335
Output_Path: CLUSTERNAME.hpcc.XXX.edu:/PATH/TOSCRIPT/run/mc1055.o11335
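
Since -9 points at the stdout/stderr spooling, one possible test is to take the usual output path out of the picture, either by keeping the files on the execution host with -k or by pointing -o/-e at a directory the compute nodes can definitely write to. A sketch (paths and script name are placeholders):

    # Keep stdout/stderr on the execution host instead of spooling them back
    # to the submit-side Output_Path/Error_Path (script path is a placeholder).
    qsub -k oe -N mc1055 -l nodes=1:ppn=1,walltime=50:00:00 run/tmp/montec_1055

    # Or set the output/error paths explicitly to a known-writable location
    # (/scratch/USERNAME is a placeholder for whatever directory works here).
    qsub -o /scratch/USERNAME/mc1055.out -e /scratch/USERNAME/mc1055.err \
         run/tmp/montec_1055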

I also ran qstat -f right as a job failed; the output is below. If I don't catch it right away, the job disappears from qstat entirely.

Job Id: 11339.pearl.hpcc.XXX.edu
    Job_Name = mc1059
    Job_Owner = [email protected]
    resources_used.cput = 00:00:00
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    resources_used.mem = 0kb
    resources_used.energy_used = 0
    job_state = C
    queue = default
    server = CLUSTERNAME.hpcc.XXX.edu
    Account_Name = ADVISOR
    Checkpoint = u
    ctime = Mon Jan  4 20:02:25 2021
    Error_Path = CLUSTERNAME.hpcc.XXX.edu/PATH/TOSCRIPT/mc1059.e11339
    exec_host = m09/9
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Jan  4 20:03:14 2021
    Output_Path = CLUSTERNAME.hpcc.XXX.edu/PATH/TOSCRIPT/mc1059.o11339
    Priority = 0
    qtime = Mon Jan  4 20:02:25 2021
    Rerunable = True
    Resource_List.nodes = 1:ppn=1
    Resource_List.walltime = 50:00:00
    Resource_List.var = mkuuid:1e94a3e50dd44803bab2d3a7c2286ee2
    Resource_List.nodect = 1
    session_id = 0
    Variable_List = PBS_O_QUEUE=largeq,
    PBS_O_HOME=/PATH,PBS_O_LOGNAME=USERNAME,
    PBS_O_PATH=lots of things
    PBS_O_MAIL=/var/spool/mail/USERNAME,PBS_O_SHELL=/bin/bash,
    PBS_O_LANG=en_US,KRB5CCNAME=FILE:/tmp/krb5cc_404112_hd4Yty,
    PBS_O_WORKDIR=/PATH/TOSCRIPT/run,
    PBS_O_HOST=CLUSTERNAME.hpcc.XXX.edu,
    PBS_O_SERVER=CLUSTERNAME.hpcc.XXX.edu
    euser = USERNAME
    egroup = physics
    queue_type = E
    etime = Mon Jan  4 20:02:25 2021
    exit_status = -9
    submit_args = -l var=mkuuid:1e94a3e50dd44803bab2d3a7c2286ee2 -v KRB5CCNAME
     /PATH/TOSCRIPT/run/tmp/montec_1059
    start_time = Mon Jan  4 20:03:14 2021
    start_count = 1
    fault_tolerant = False
    comp_time = Mon Jan  4 20:03:14 2021
    job_radix = 0
    total_runtime = 7.218811
    submit_host = CLUSTERNAME.hpcc.XXX.edu
    init_work_dir = /PATH/TOSCRIPT/run
    request_version = 1
    req_information.task_count.0 = 1
    req_information.lprocs.0 = 1
    req_information.thread_usage_policy.0 = allowthreads
    req_information.hostlist.0 = m09:ppn=1
    req_information.task_usage.0.task.0.cpu_list = 5
    req_information.task_usage.0.task.0.mem_list = 1
    req_information.task_usage.0.task.0.cores = 0
    req_information.task_usage.0.task.0.threads = 1
    req_information.task_usage.0.task.0.host = m09
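
To catch that output before the completed job is purged from qstat, a simple polling loop right after qsub works (sketch only; job-id handling is simplified):

    # Sketch: poll qstat -f immediately after submission and save each snapshot,
    # so the record is captured before the completed job drops out of qstat.
    jobid=$(qsub -v KRB5CCNAME run/tmp/montec_1059)
    for i in $(seq 1 60); do
        qstat -f "$jobid" >> "qstat_${jobid}.log" 2>&1 || break
        sleep 1
    done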
