How to stop a compute node with SLURM?

Question

Asked by FenryrMKIII on 09 April 2021 at 07:38

I am using SLURM on AWS to manage jobs as part of AWS ParallelCluster. I have two questions:

1. When I use scancel *jobid* to cancel a job, the associated node(s) do not stop. How can I make them stop?
2. When starting out, I made the mistake of not making my script executable, so sbatch *script.sh* was accepted but the compute node did nothing. How can I detect this kind of behaviour and handle it properly? Is the proper approach to, for example, stop the idle node after some time and record that in a log? How can I achieve that?

There is 1 answer

**boofla** · Accepted Answer · 2021-04-09T12:18:18+00:00

Check out this page in the docs: https://docs.aws.amazon.com/parallelcluster/latest/ug/autoscaling.html

The bottom line is that instances that have had no jobs for longer than scaledown_idletime (the default is 10 minutes) get scaled down (terminated) by the cluster, automagically. So a node is not stopped the moment you scancel its job; it is terminated once it has sat idle for that long.
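
If you want to see which nodes are currently idle (and so are candidates for scale-down), a minimal sketch along these lines may help. It assumes `sinfo` is on the PATH of the head node and that a state of exactly `idle` means "running with no job"; states carrying suffix flags such as `idle~` are deliberately ignored here. The function names `parse_idle_nodes` and `query_idle_nodes` are made up for illustration and are not part of SLURM or ParallelCluster.

```python
import subprocess
from typing import List


def parse_idle_nodes(sinfo_output: str) -> List[str]:
    """Return the sorted, de-duplicated names of nodes whose state is exactly 'idle'.

    Expects one "<node> <state>" pair per line, as printed by
    `sinfo -h -N -o "%N %T"`.
    """
    idle = set()
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, state = parts
        if state == "idle":
            # A node is listed once per partition with -N, hence the set.
            idle.add(name)
    return sorted(idle)


def query_idle_nodes() -> List[str]:
    """Run sinfo and return the idle nodes (assumes sinfo is installed)."""
    result = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%N %T"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_idle_nodes(result.stdout)
```

Calling `query_idle_nodes()` from a cron job and appending the result to a file would be one way to keep the kind of log the question asks about; the scale-down itself is still left to the cluster.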

If 10 minutes is too long, you can tweak the setting in the config file when you build your cluster. Just think about your workload first: you don't want short gaps between jobs to cause a lot of churn while you wait for nodes to die and then get created again shortly after, which is why the default is 10 minutes.
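
To get a feel for that trade-off, here is a rough back-of-the-envelope sketch. It assumes a deliberately simplified model in which a node is terminated whenever the gap between two consecutive jobs is longer than the idle time; the real cluster checks idleness periodically, so actual timings will differ. The function name `count_scale_downs` and the sample gaps are invented for illustration.

```python
def count_scale_downs(gaps_minutes, idletime_minutes=10):
    """Count the gaps between jobs that are long enough to trigger a scale-down.

    Simplified model: a gap strictly longer than the idle time means the node
    is terminated and has to be launched again for the next job.
    """
    return sum(1 for gap in gaps_minutes if gap > idletime_minutes)


# Hypothetical gaps (in minutes) between consecutive jobs on one node.
gaps = [2, 12, 7, 30]

print(count_scale_downs(gaps, idletime_minutes=10))  # 2
print(count_scale_downs(gaps, idletime_minutes=5))   # 3
```

In this made-up workload, dropping the idle time from 10 to 5 minutes adds one extra terminate-and-relaunch cycle, which is the kind of churn worth weighing against the cost of keeping an idle instance around.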
