slurmstepd: error: Exceeded step memory limit at some point


I am running a code on Bluehive. The code has a parameter N. If N is small, the code runs perfectly fine, but for slightly larger N (e.g. N=10) the code runs for hours and at the end I get the following error message:

slurmstepd: error: Exceeded step memory limit at some point.

The batch file which I am submitting has the following code:

#!/bin/bash
#SBATCH -o log.%a.txt -t 3-01:01:00   # per-array-task log file; time limit 3-01:01:00 (days-hh:mm:ss)
#SBATCH --mem-per-cpu=1gb             # 1 GB of memory per allocated CPU
#SBATCH -c 4                          # 4 CPUs per task
#SBATCH --gres=gpu:1                  # one GPU
#SBATCH -J Ankani                     # job name
#SBATCH -a 1-2                        # job array with task IDs 1 and 2

python run.py $SLURM_ARRAY_TASK_ID

I am assigning enough memory for the code, but I am still getting the error

"slurmstepd: error: Exceeded step memory limit at some point."

Can somebody help?

1 Answer

Answered by user10089632 (accepted answer):

However, I will note that the memory limit described by "step memory limit" in this error message is not necessarily related to the RSS of your process. This limit is provided and enforced by the cgroup plugin, and memory cgroups track not only the RSS of tasks in your job but also file cache, mmap'd pages, etc. If I had to guess, you are hitting the memory limit due to page cache. In that case, you might be able to just ignore this error, since hitting the limit here probably just triggered memory reclaim, which freed cached pages (this shouldn't be a fatal error).
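To see how much of the accounted memory is page cache rather than process RSS, you can inspect the memory cgroup that Slurm created for the step. Below is a minimal sketch, assuming cgroup v1 and a typical Slurm hierarchy under /sys/fs/cgroup/memory; the exact path (the uid/job/step components, shown here with a hypothetical step_0) varies by site configuration.

```python
import os

# Hypothetical path: adjust the uid/job/step components to match your site's
# Slurm cgroup layout. Under cgroup v1 the step cgroup often lives at
#   /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/step_<stepid>/
STAT_FILE = "/sys/fs/cgroup/memory/slurm/uid_{uid}/job_{job}/step_0/memory.stat".format(
    uid=os.getuid(), job=os.environ.get("SLURM_JOB_ID", "0"))

def cgroup_memory_breakdown(path=STAT_FILE):
    """Return selected counters (in bytes) from a cgroup v1 memory.stat file."""
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    # 'rss' is anonymous process memory; 'cache' is page cache charged to the
    # cgroup; 'mapped_file' is file-backed mmap'd memory.
    return {k: stats[k] for k in ("rss", "cache", "mapped_file") if k in stats}

if __name__ == "__main__":
    for name, nbytes in cgroup_memory_breakdown().items():
        print(f"{name}: {nbytes / 1024**2:.1f} MiB")
```

If the "cache" counter dominates, the limit is being hit by cached file pages rather than by the memory your process actually allocates.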

If you'd like to avoid the error, and you're only writing out data and don't want it cached, then you could try playing with posix_fadvise(2) using POSIX_FADV_DONTNEED, which hints to the VM that you aren't going to read the pages you're writing out again.
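In Python this hint is available through os.posix_fadvise (Linux, Python 3.3+). A minimal sketch, assuming the job writes a large output file that it never reads back (the filename and data below are made up for illustration):

```python
import os

# Hypothetical output file; in the real job this would be whatever run.py writes.
path = "results.bin"

with open(path, "wb") as f:
    f.write(os.urandom(64 * 1024 * 1024))  # stand-in for the job's real output
    f.flush()
    os.fsync(f.fileno())  # ensure the pages have been written back to disk
    # Tell the kernel we will not read these pages again, so it can drop them
    # from the page cache instead of keeping them charged to the step's cgroup.
    os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
```

Note that POSIX_FADV_DONTNEED only drops pages that have already been written back, which is why the fsync comes before the advice call.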

Here is the source of this text