All slurm jobs fail silently with exit code 0:53


All my slurm jobs fail with exit code 0:53 within two seconds of starting.

When I look at the job details with scontrol show jobid <JOBID>, nothing in the output looks suspicious.

When I look at the files that stdout and stderr write to, there is nothing there.

I couldn't find anything on the listed signal 53.


There are 3 answers

0
Cornelius Roemer On

It turns out that the directory slurm was supposed to write the stdout and stderr files to didn't exist.

In my submit.sh script, the relevant lines were:

#SBATCH --output=log/%j.out                 # where to store the output ( %j is the JOBID )
#SBATCH --error=log/%j.err                  # where to store error messages

The log directory didn't exist in the working directory from which I was submitting the job. Once I created the directory, slurm jobs no longer failed with 0:53.
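One way to guard against this is to create the log directories before submitting. The following is only a sketch, assuming the script is called submit.sh as above and that the --output/--error paths contain no % patterns in their directory part. The helper name ensure_log_dirs is made up for illustration. Relative paths are created relative to the directory you run it from, which should be the directory you submit from.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of slurm): create the parent directory of every
# "#SBATCH --output=..." and "#SBATCH --error=..." path found in a batch script.
ensure_log_dirs() {
    local script="$1" path
    # Extract the value after --output= or --error=, up to the first whitespace,
    # so trailing "# comments" on the #SBATCH line are ignored.
    sed -nE 's/^#SBATCH[[:space:]]+--(output|error)=([^[:space:]]*).*/\2/p' "$script" |
    while IFS= read -r path; do
        # log/%j.out -> log ; the %j in the file name is left for slurm to expand.
        mkdir -p -- "$(dirname -- "$path")"
    done
}

# Assumed usage, from the directory you submit from:
#   ensure_log_dirs submit.sh && sbatch submit.sh
```

For a single fixed directory, running mkdir -p log before sbatch submit.sh does the same thing.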

My slurm version is 22.05.2. Per this answer, from version 23.02 onwards slurm no longer fails silently when the output directory doesn't exist. This seems to have been reported in this issue.

1
PassiveAggressiveSalad On

I wanted to add that while this error has happened to me if the directory does not exist, the same thing happens if you exceed your quota.

0
Brain Damage On

I had the same issue as the OP. In my case the log directory existed but was on a read-only filesystem. To quote the ZIH HPC Compendium:

When redirecting [stdout] and stderr into a file using --output= and [--error=], make sure the target path is writeable on the compute nodes, i.e., it may not point to a read-only mounted filesystem like /projects.

https://compendium.hpc.tu-dresden.de/jobs_and_resources/slurm/
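The causes mentioned in this thread (missing directory, exhausted quota, read-only filesystem) all come down to the log directory not being writable. A rough way to check is to try writing a small file there. This is a sketch only: the helper name can_write_logs is made up, and to reflect what the job actually sees it has to run on a compute node (for example inside an interactive job), because a path can be writable on the login node and read-only on the compute nodes. A tiny write may also not trigger every quota limit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 if a small file can be created and written in
# the given directory, non-zero (with a message on stderr) otherwise.
can_write_logs() {
    local dir="$1" probe
    [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
    # Creating the file fails on a read-only filesystem or without permission.
    probe="$(mktemp "$dir/.write_probe.XXXXXX" 2>/dev/null)" || {
        echo "cannot create files in: $dir" >&2
        return 1
    }
    # Also write some data: creating an empty file can succeed even when
    # writing to it would not.
    if ! printf 'probe\n' 2>/dev/null > "$probe"; then
        rm -f -- "$probe"
        echo "cannot write data in: $dir" >&2
        return 1
    fi
    rm -f -- "$probe"
}

# Assumed usage:
#   can_write_logs log || echo "fix the log directory before submitting"
```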