I am running a code on Bluehive that takes a parameter N. For small N the code runs perfectly fine, but for slightly larger N (e.g. N=10) it runs for hours and ends with the following error message:
slurmstepd: error: Exceeded step memory limit at some point.
The batch file I am submitting is:
#!/bin/bash
#SBATCH -o log.%a.txt -t 3-01:01:00
#SBATCH --mem-per-cpu=1gb
#SBATCH -c 4
#SBATCH --gres=gpu:1
#SBATCH -J Ankani
#SBATCH -a 1-2
python run.py $SLURM_ARRAY_TASK_ID
I believe I am requesting enough memory for the code (4 CPUs at 1 GB per CPU, i.e. 4 GB per array task), but I still get the error
"slurmstepd: error: Exceeded step memory limit at some point."
Can somebody help?
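A first check is to see how much memory the process itself uses as N grows, by logging the peak RSS from inside run.py. What follows is only a minimal sketch: the helper names peak_rss_mb and log_peak_rss are hypothetical, the list allocation is a stand-in for the real work, and it assumes Linux, where ru_maxrss is reported in kilobytes.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in MB.

    Assumption: Linux, where ru_maxrss is in kilobytes
    (macOS reports bytes instead).
    """
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak_kb / 1024.0


def log_peak_rss(label):
    """Print the peak RSS so far to stderr, tagged with a label."""
    print(f"[mem] {label}: peak RSS = {peak_rss_mb():.1f} MB", file=sys.stderr)


if __name__ == "__main__":
    log_peak_rss("start")
    # Stand-in for the real computation in run.py (hypothetical workload).
    data = [0.0] * 5_000_000
    log_peak_rss("after allocation")
```

Calling log_peak_rss at a few points in run.py for increasing N should show whether the process is actually approaching the 4 GB requested per task.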
However, I will note that the memory limit described by "step memory limit" in this error message is not necessarily related to the RSS of your process. This limit is provided and enforced by the cgroup plugin, and memory cgroups
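To see why the cgroup figure can differ from RSS, it can help to look at the cgroup's own accounting, which tracks page cache (file-backed memory) as well as anonymous memory. The sketch below only parses the text of a memory.stat file and splits it into the two parts. The function names are hypothetical, and the location of memory.stat for a Slurm job step depends on the cgroup version and the cluster configuration, so the path is left as a command-line argument. Whether page cache explains this particular job's error is an assumption that the numbers would need to confirm.

```python
import sys


def parse_memory_stat(text):
    """Parse 'key value' lines from a cgroup memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def summarize(stats):
    """Return anonymous memory and page cache in MB.

    cgroup v1 uses the keys 'rss' and 'cache';
    cgroup v2 uses 'anon' and 'file'.
    """
    rss = stats.get("rss", stats.get("anon", 0))
    cache = stats.get("cache", stats.get("file", 0))
    return {"rss_mb": rss / 2**20, "cache_mb": cache / 2**20}


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python cgroup_stat.py /path/to/memory.stat
    # (the path is cluster-specific and is an assumption here)
    with open(sys.argv[1]) as fh:
        print(summarize(parse_memory_stat(fh.read())))
```

If cache_mb is large compared with rss_mb, the job is probably reading or writing a lot of file data, and the limit was reached through cached pages, not through the process's own allocations.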