Slurm jobs getting starved


So, I have an issue with my Slurm GPU queue that leads to job starvation every now and then.

Basically, I have many nodes with 1 GPU, 2 GPUs, or 3 GPUs, and only two nodes with 4 GPUs. The situation is as follows:

  1. User A submits a 4 GPU job
  2. Slurm assigns one 4 GPU node to User A's job
  3. Users B, C and D submit 1 GPU jobs and all get allocated to the second 4 GPU node
  4. User E submits a 4 GPU job; it stays PENDING since there are no resources to fulfill it
  5. Users F, G, H, I, etc. submit 1 GPU jobs, which get allocated to the 4 GPU node as soon as any of the jobs of users B, C or D finishes
  6. More users keep submitting jobs and the 4 GPU node stays busy with these 1 GPU jobs
  7. User E's 4 GPU job stays waiting FOREVER, as the 4 GPUs are never free at the same time

Note that I have set the weight of the 1 GPU nodes to 1, the 2 GPU nodes to 2, the 3 GPU nodes to 3, and the 4 GPU nodes to 4, so that jobs are preferentially placed on any available 1 GPU node, then on a 2 GPU node, then a 3 GPU node, and only lastly on a 4 GPU node.
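
For reference, the weighting looks roughly like this in my slurm.conf (node names and ranges below are placeholders; only the Weight values reflect my actual setup):

```
# Placeholder node names/ranges; Weight mirrors the GPU count so that
# lower-weight (fewer-GPU) nodes are preferred by the scheduler.
NodeName=gpu1g-[01-10] Gres=gpu:1 Weight=1
NodeName=gpu2g-[01-06] Gres=gpu:2 Weight=2
NodeName=gpu3g-[01-04] Gres=gpu:3 Weight=3
NodeName=gpu4g-[01-02] Gres=gpu:4 Weight=4
```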

Any suggestions to eliminate or reduce starvation here (automatically)? I have jobs that wait for weeks!

