Ticket 15149

Summary: --gpus --ntasks not working as expected
Product: Slurm
Reporter: Kaylea Nelson <kaylea.nelson>
Component: Scheduling
Assignee: Scott Hilton <scott>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 22.05.2   
Hardware: Linux   
OS: Linux   
Site: Yale

Description Kaylea Nelson 2022-10-11 13:53:52 MDT
We have a number of users seeing unexpected resource scheduling when they use "--gpus=4 --ntasks=1", resulting in wasted GPU cycles.

If they don't specify "-N 1", there is a reasonable chance that Slurm spreads their job across multiple nodes and multiple tasks despite the --ntasks=1 request. We have plenty of nodes with 4 GPUs, so that is not a limitation.

Is this working as expected? Is --ntasks just a suggestion, so that we need to explicitly specify -N 1?

Thanks,
Kaylea
Comment 2 Scott Hilton 2022-10-11 17:03:08 MDT
Kaylea,

Slurm will allocate multiple nodes to satisfy the GPU requirement, even if you asked for only 1 task.

If you want it to only use 1 node you should specify -N1.
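
As a minimal sketch of that workaround in a batch script, assuming a site where single nodes with 4 GPUs exist: the #SBATCH lines mirror the options discussed in this ticket, with --nodes=1 being the long form of -N 1. The report_nodes helper is hypothetical and only illustrates checking the allocation from inside the job through the SLURM_JOB_NUM_NODES environment variable.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --gpus=4
#SBATCH --nodes=1   # long form of -N 1: keep all four GPUs on a single node

# Hypothetical helper: print how many nodes the allocation spans.
# Slurm sets SLURM_JOB_NUM_NODES inside a job; outside a job the
# fallback "unknown" is printed instead.
report_nodes() {
    echo "nodes allocated: ${SLURM_JOB_NUM_NODES:-unknown}"
}

report_nodes
```

With --nodes=1 in place, the helper would be expected to report a single node; a larger number in the job output would indicate the job spanned nodes.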

-Scott
Comment 3 Kaylea Nelson 2022-10-12 08:17:16 MDT
Understood. We will update our documentation accordingly.
Comment 4 Scott Hilton 2022-10-12 11:26:21 MDT
Kaylea,

After talking with some others, it seems that this is a bug, and the behavior will change in 22.05.5.

In 22.05.5, Slurm should properly limit jobs with --ntasks=1 to 1 node.

If no such node is available, this error will be given:
>srun: error: Unable to allocate resources: Requested node configuration is not available

-Scott
Comment 5 Kaylea Nelson 2022-10-12 11:44:09 MDT
That's great news! I look forward to the fix. In the meantime, we will encourage the users to use -N 1.
Comment 6 Scott Hilton 2022-10-12 11:56:00 MDT
Glad I could help.