So, I have an issue with my Slurm GPU queue that occasionally leads to job starvation.
Basically, I have many nodes with 1, 2, or 3 GPUs, and only two nodes with 4 GPUs. The situation is as follows:
- User A submits a 4-GPU job
- Slurm assigns one of the 4-GPU nodes to User A's job
- Users B, C, and D submit 1-GPU jobs, which all get allocated to the second 4-GPU node
- User E submits a 4-GPU job; it sits PENDING since no node has 4 free GPUs
- Users F, G, H, I, etc. submit 1-GPU jobs, which get allocated to that second 4-GPU node as soon as any of the jobs from users B, C, or D finishes
- More users keep submitting jobs and the 4-GPU node stays busy with these 1-GPU jobs
- User E's 4-GPU job waits FOREVER, as 4 GPUs are never free on that node at the same time
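For context, the jobs above are plain batch submissions; they look roughly like this (the GRES-style requests and the script names are just illustrative):

```
# 4-GPU jobs (users A and E)
sbatch --gres=gpu:4 big_job.sh

# 1-GPU jobs (users B, C, D, F, G, ...)
sbatch --gres=gpu:1 small_job.sh
```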
Note that I have set the weight of the 1-GPU nodes to 1, the 2-GPU nodes to 2, the 3-GPU nodes to 3, and the 4-GPU nodes to 4, so that jobs land on any available 1-GPU node first, then the 2-GPU nodes, then the 3-GPU nodes, and only lastly the 4-GPU nodes.
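For reference, the relevant part of my slurm.conf looks roughly like this (the hostnames, node counts per group, and CPU/memory figures are simplified placeholders; only the Gres and Weight values reflect the actual setup):

```
# slurm.conf (excerpt) -- placeholder hostnames; Weight mirrors the GPU count
NodeName=gpu1-[01-10] Gres=gpu:1 Weight=1 CPUs=16 RealMemory=64000
NodeName=gpu2-[01-06] Gres=gpu:2 Weight=2 CPUs=32 RealMemory=128000
NodeName=gpu3-[01-06] Gres=gpu:3 Weight=3 CPUs=32 RealMemory=128000
NodeName=gpu4-[01-02] Gres=gpu:4 Weight=4 CPUs=64 RealMemory=256000
PartitionName=gpu Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```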
Any suggestions for eliminating or reducing this starvation automatically? I have jobs that wait for weeks!