Automatically cancel slurm jobs if there are insufficient instances on AWS ParallelCluster

Question


Asked by Omar Awile on 27 June 2023 at 08:52 · 157 views

I recently started playing around with AWS ParallelCluster and noticed a problem when I submit a job that requires more instances than are currently available in my region/AZ: the instances that are available are brought up and then sit idle until all the remaining instances become available. This can sometimes take a very long time. Slurm reports the following in /var/log/parallelcluster/slurm_resume.log:

ERROR - Error in CreateFleet request (...): InsufficientInstanceCapacity - We currently do not have sufficient c6i.metal capacity in the Availability Zone you requested (us-east-1a)

The problem is that I still pay for the nodes that are up and waiting. Is there a way to cancel the job after a certain timeout instead, so that I can try again later?


There is 1 answer

**Matt Vaughn** · Accepted Answer · 2023-06-28T11:13:29+00:00

There might be a better solution than canceling the job in the face of limited capacity. ParallelCluster has a hidden capability called "all or nothing instance launching" that you can turn on by editing your cluster configuration.

Enabling this instructs ParallelCluster to launch new instances for a job only if it can get all of the requested instances. If it cannot, no instances are launched, the job does not proceed to a running state, and you do not accrue charges for unused instances. This should prevent the situation you describe above.

Here's a link to an AWS HPC blog article that will tell you all about it and show you how to use it: https://aws.amazon.com/blogs/hpc/minimize-hpc-compute-costs-with-all-or-nothing-instance-launching/
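If you would still prefer the timeout-and-cancel behavior the question asks for, one option is a small watchdog run periodically on the head node (for example from cron). The following is only a minimal sketch, not something ParallelCluster provides: it assumes `squeue` and `scancel` are on the PATH, that jobs waiting for capacity show up in the `PENDING` or `CONFIGURING` state, and that measuring the wait from the job's submission time (`%V`) is acceptable. The function names, the 30-minute timeout, and the `ec2-user` account in the usage note are hypothetical.

```python
import subprocess
import time

# Assumption: jobs waiting for EC2 capacity are in one of these Slurm states.
WAITING_STATES = "PENDING,CONFIGURING"
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"  # timestamp format printed by squeue for %V


def stale_job_ids(squeue_output, now, timeout_s):
    """Return IDs of jobs submitted more than timeout_s seconds before now.

    squeue_output is expected to hold one "<job id> <submit time>" pair per line.
    Lines that do not match that shape are skipped.
    """
    stale = []
    for line in squeue_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        job_id, submitted = parts
        try:
            submit_ts = time.mktime(time.strptime(submitted, TIME_FORMAT))
        except ValueError:
            continue  # e.g. "N/A" or an unexpected timestamp
        if now - submit_ts > timeout_s:
            stale.append(job_id)
    return stale


def cancel_stale_jobs(user, timeout_s):
    """Cancel the user's waiting jobs that are older than timeout_s seconds."""
    result = subprocess.run(
        ["squeue", "--noheader", "--user", user,
         "--states", WAITING_STATES, "--format", "%i %V"],
        check=True, capture_output=True, text=True,
    )
    job_ids = stale_job_ids(result.stdout, time.time(), timeout_s)
    if job_ids:
        subprocess.run(["scancel", *job_ids], check=True)
    return job_ids
```

Calling `cancel_stale_jobs("ec2-user", 30 * 60)` every few minutes would cancel any of that user's jobs that have been waiting for more than half an hour. Note that the age is measured from submission, so a job that queued behind other work for legitimate reasons would be cancelled too; adjust the timeout or the state filter to suit your workload.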
