So, I have an issue with my Slurm GPU queue that occasionally leads to job starvation.
Basically, I have many nodes with 1, 2, or 3 GPUs, and only two nodes with 4 GPUs. The situation is as follows:
- User A submits a 4-GPU job
- Slurm assigns one of the 4-GPU nodes to User A's job
- Users B, C, and D submit 1-GPU jobs, which all get allocated to the second 4-GPU node
- User E submits a 4-GPU job; it sits PENDING since no node has 4 free GPUs
- Users F, G, H, I, etc. submit 1-GPU jobs, which get allocated to that second 4-GPU node as soon as any of the jobs from users B, C, or D finishes
- More users keep submitting jobs and the 4-GPU node stays busy with these 1-GPU jobs
- User E's 4-GPU job waits FOREVER, as 4 GPUs are never free on that node at the same time
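For context, the jobs above are plain batch submissions; they look roughly like this (the GRES-style requests and the script names are just illustrative):

```
# 4-GPU jobs (users A and E)
sbatch --gres=gpu:4 big_job.sh

# 1-GPU jobs (users B, C, D, F, G, ...)
sbatch --gres=gpu:1 small_job.sh
```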
Note that I have set the weight of the 1-GPU nodes to 1, the 2-GPU nodes to 2, the 3-GPU nodes to 3, and the 4-GPU nodes to 4, so that jobs land on any available 1-GPU node first, then the 2-GPU nodes, then the 3-GPU nodes, and only lastly the 4-GPU nodes.
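For reference, the relevant part of my slurm.conf looks roughly like this (the hostnames, node counts per group, and CPU/memory figures are simplified placeholders; only the Gres and Weight values reflect the actual setup):

```
# slurm.conf (excerpt) -- placeholder hostnames; Weight mirrors the GPU count
NodeName=gpu1-[01-10] Gres=gpu:1 Weight=1 CPUs=16 RealMemory=64000
NodeName=gpu2-[01-06] Gres=gpu:2 Weight=2 CPUs=32 RealMemory=128000
NodeName=gpu3-[01-06] Gres=gpu:3 Weight=3 CPUs=32 RealMemory=128000
NodeName=gpu4-[01-02] Gres=gpu:4 Weight=4 CPUs=64 RealMemory=256000
PartitionName=gpu Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```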
Any suggestions for eliminating or reducing this starvation automatically? I have jobs that wait for weeks!