Ticket 15149 - --gpus --ntasks not working as expected
Summary: --gpus --ntasks not working as expected
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 22.05.2
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Scott Hilton
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-10-11 13:53 MDT by Kaylea Nelson
Modified: 2022-10-12 11:56 MDT

See Also:
Site: Yale
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Kaylea Nelson 2022-10-11 13:53:52 MDT
We have a number of users seeing unexpected resource scheduling when they use "--gpus=4 --ntasks=1", resulting in wasted GPU cycles.

If they don't specify "-N 1", there is a reasonable chance that Slurm spreads their job across multiple nodes and multiple tasks despite the --ntasks=1 request. We have plenty of nodes with 4 GPUs, so that is not a limitation.

Is this working as expected? Is --ntasks just a suggestion, so that we need to explicitly specify -N 1?

Thanks,
Kaylea
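For reference, a minimal sketch of the submission pattern described above (the command after the options is a placeholder; actual placement depends on the cluster's node configuration):

```shell
# On Slurm 22.05.2, this request can unexpectedly span multiple nodes:
# 4 GPUs are requested, but only 1 task, and Slurm may satisfy the GPU
# count by allocating GPUs on several nodes.
srun --gpus=4 --ntasks=1 nvidia-smi -L

# Adding -N 1 pins the allocation to a single node:
srun --gpus=4 --ntasks=1 -N 1 nvidia-smi -L
```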
Comment 2 Scott Hilton 2022-10-11 17:03:08 MDT
Kaylea,

Slurm will allocate multiple nodes to satisfy the GPU requirement, even if you asked for only 1 task.

If you want it to use only 1 node, you should specify -N1.

-Scott
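The suggested workaround, sketched as a batch script (the job step command is a placeholder):

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --gpus=4
#SBATCH -N 1            # constrain the allocation to a single node so all 4 GPUs are local

# Single task, with all 4 GPUs visible on one node.
srun ./my_gpu_app       # placeholder for the user's application
```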
Comment 3 Kaylea Nelson 2022-10-12 08:17:16 MDT
Understood. We will update our documentation accordingly.
Comment 4 Scott Hilton 2022-10-12 11:26:21 MDT
Kaylea,

After talking with some others, it seems that this is a bug and will change in 22.05.5.

In 22.05.5, Slurm should properly limit jobs with --ntasks=1 to 1 node.

If such a node is not available, this error will be given:
>srun: error: Unable to allocate resources: Requested node configuration is not available

-Scott
Comment 5 Kaylea Nelson 2022-10-12 11:44:09 MDT
That's great news! I look forward to the fix. In the meantime, we will encourage the users to use -N 1.
Comment 6 Scott Hilton 2022-10-12 11:56:00 MDT
Glad I could help.