Ticket 1543 - srun with --ntasks-per-node
Summary: srun with --ntasks-per-node
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 14.11.4
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: David Bigagli
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-03-18 04:15 MDT by Will French
Modified: 2015-03-18 11:23 MDT
2 users

See Also:
Site: Vanderbilt



Description Will French 2015-03-18 04:15:03 MDT
When I submit a multi-node job with --ntasks-per-node, SLURM complains when I then try to launch a job step with srun and --ntasks-per-node. For example:

[frenchwr@vmps08 ~]$ cat test.slurm 
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=8

echo "testing nodes=1, ntasks=1"
srun --nodes=1 --ntasks=1 hostname
echo "testing nodes=1, ntasks=8"
srun --nodes=1 --ntasks=8 hostname
echo "testing nodes=1, ntasks-per-node=1"
srun --nodes=1 --ntasks-per-node=1 hostname
[frenchwr@vmps08 ~]$ cat slurm-862190.out 
testing nodes=1, ntasks=1
vmp368
testing nodes=1, ntasks=8
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
testing nodes=1, ntasks-per-node=1
srun: error: Unable to create job step: More processors requested than permitted 


[frenchwr@vmps08 ~]$ salloc --nodes=3 --ntasks-per-node=8
salloc: Pending job allocation 862174
salloc: job 862174 queued and waiting for resources
salloc: job 862174 has been allocated resources
salloc: Granted job allocation 862174

[frenchwr@vmp368 ~]$ srun --nodes=1 --ntasks=1 hostname
vmp368

[frenchwr@vmp368 ~]$ srun --nodes=1 --ntasks=8 hostname
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
[frenchwr@vmp368 ~]$ srun --nodes=1 --ntasks-per-node=1 hostname
srun: error: Unable to create job step: More processors requested than permitted




I see the same behavior if I request ntasks=24 with sbatch or salloc. Is this a limitation of srun or a bug? I can envision a lot of scenarios where a user may want to control how many processes are being launched within a job step and on which nodes (e.g. performance benchmarks).
Comment 1 David Bigagli 2015-03-18 09:54:12 MDT
Hi,
  the allocation --nodes=3 --ntasks-per-node=8 has 24 tasks. The request
srun --nodes=1 --ntasks-per-node=1 inherits the number of tasks from the
allocation, which is 24, and since you don't have 24 tasks allocated on any
one node the srun request fails. On the other hand, srun --nodes=1 --ntasks=8
(or in general any ntasks >= 1 and <= 8) succeeds because it overrides the
allocated number of tasks. The --ntasks option is the one to use when
specifying fewer tasks than those allocated.

Let me know if this answers your question.

David
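The arithmetic behind this explanation can be sketched without a cluster (the numbers below restate the ticket's allocation; the variable names are illustrative, not Slurm internals):

```shell
#!/bin/sh
# The allocation --nodes=3 --ntasks-per-node=8 carries ntasks = 3 * 8 = 24.
nodes=3
ntasks_per_node=8
alloc_ntasks=$((nodes * ntasks_per_node))
echo "allocation ntasks = $alloc_ntasks"

# A step launched with --ntasks-per-node but without --ntasks inherits the
# allocation-wide task count (24). Confined to --nodes=1, which has only
# 8 task slots, the step asks for more processors than permitted, hence
# "srun: error: Unable to create job step: More processors requested than permitted".
step_nodes=1
step_slots=$((step_nodes * ntasks_per_node))
if [ "$alloc_ntasks" -gt "$step_slots" ]; then
    echo "step would fail: $alloc_ntasks tasks > $step_slots slots"
fi
```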
Comment 2 Will French 2015-03-18 11:21:15 MDT
(In reply to David Bigagli from comment #1)

Got it. I knew --ntasks would override the value of --ntasks-per-node, but I did not realize that ntasks is set for an allocation even when you do not include the --ntasks option, and that this value then overrides any --ntasks-per-node value specified at the job-step level. That's not super intuitive, but I think I've got a good handle on it now. It also looks like --ntasks does a good job of load balancing the tasks across the nodes in an allocation, which is nice.

Thanks for the response. You can close this ticket.
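Putting the advice from this thread together, a corrected job script might look like the sketch below. It assumes the same 3-node, 8-tasks-per-node allocation from the description and requires a Slurm cluster to run; the load-balanced placement noted in the comments is typical behavior, not guaranteed.

```shell
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=8    # allocation carries ntasks = 3 * 8 = 24

# Size each step with --ntasks (not --ntasks-per-node) so the step does
# not inherit the allocation-wide count of 24 tasks.
srun --nodes=1 --ntasks=1 hostname   # one task on one node
srun --nodes=1 --ntasks=8 hostname   # fills one node's eight slots
srun --ntasks=3 hostname             # typically spread one task per node
srun --ntasks=24 hostname            # the whole allocation
```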
Comment 3 David Bigagli 2015-03-18 11:23:07 MDT
Excellent!

David