Ticket 1543

Summary: srun with --ntasks-per-node
Product: Slurm Reporter: Will French <will>
Component: Other    Assignee: David Bigagli <david>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: brian, da
Version: 14.11.4   
Hardware: Linux   
OS: Linux   
Site: Vanderbilt

Description Will French 2015-03-18 04:15:03 MDT
When I submit a multi-node job with --ntasks-per-node, SLURM complains when I launch a job step with srun and --ntasks-per-node. For example:

[frenchwr@vmps08 ~]$ cat test.slurm 
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=8

echo "testing nodes=1, ntasks=1"
srun --nodes=1 --ntasks=1 hostname
echo "testing nodes=1, ntasks=8"
srun --nodes=1 --ntasks=8 hostname
echo "testing nodes=1, ntasks-per-node=1"
srun --nodes=1 --ntasks-per-node=1 hostname
[frenchwr@vmps08 ~]$ cat slurm-862190.out 
testing nodes=1, ntasks=1
vmp368
testing nodes=1, ntasks=8
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
testing nodes=1, ntasks-per-node=1
srun: error: Unable to create job step: More processors requested than permitted 


[frenchwr@vmps08 ~]$ salloc --nodes=3 --ntasks-per-node=8
salloc: Pending job allocation 862174
salloc: job 862174 queued and waiting for resources
salloc: job 862174 has been allocated resources
salloc: Granted job allocation 862174

[frenchwr@vmp368 ~]$ srun --nodes=1 --ntasks=1 hostname
vmp368

[frenchwr@vmp368 ~]$ srun --nodes=1 --ntasks=8 hostname
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
[frenchwr@vmp368 ~]$ srun --nodes=1 --ntasks-per-node=1 hostname
srun: error: Unable to create job step: More processors requested than permitted
I see the same behavior if I request ntasks=24 with sbatch or salloc. Is this a limitation of srun or a bug? I can envision a lot of scenarios where a user may want to control how many processes are being launched within a job step and on which nodes (e.g. performance benchmarks).
Comment 1 David Bigagli 2015-03-18 09:54:12 MDT
Hi,
  the allocation --nodes=3 --ntasks-per-node=8 has 24 tasks. The request
srun --nodes=1 --ntasks-per-node=1 inherits the task count from the allocation,
which is 24, and since you don't have 24 tasks allocated on any one node, the
srun request fails. On the other hand, srun --nodes=1 --ntasks=8 (or in general
any ntasks >= 1 and <= 8) succeeds because the --ntasks option overrides the
allocation's task count. --ntasks is the option to use when you want fewer
tasks than the allocation provides.

Let me know if this answers your question.

David
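To summarize the distinction in script form, here is a minimal sketch of the batch script above with the failing and succeeding job-step requests annotated (the specific srun lines beyond those in the original report are illustrative):

```shell
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=8    # allocation carries ntasks = 3 * 8 = 24

# Inherits ntasks=24 from the allocation; 24 tasks cannot fit on one
# node's 8 allocated slots, so this step fails:
#   srun --nodes=1 --ntasks-per-node=1 hostname

# An explicit --ntasks overrides the allocation's task count, so any
# value from 1 to 8 fits on a single node and succeeds:
srun --nodes=1 --ntasks=1 hostname
srun --nodes=1 --ntasks=8 hostname
```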
Comment 2 Will French 2015-03-18 11:21:15 MDT
(In reply to David Bigagli from comment #1)
> [...]

Got it. I knew --ntasks would override --ntasks-per-node, but I did not realize that ntasks is set on the allocation even when the --ntasks option is omitted, and that this value then overrides any --ntasks-per-node value specified at the job-step level. That's not super intuitive, but I think I've got a good handle on it now. It also looks like --ntasks does a good job of load balancing tasks across the nodes in an allocation, which is nice.

Thanks for the response. You can close this ticket.
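For the record, the load balancing mentioned above can be seen inside the same salloc session; with an explicit --ntasks and no --nodes restriction, srun spreads the tasks across the allocated nodes. A hedged sketch (hostnames are illustrative, not from the original report):

```shell
# Inside the 3-node, 24-task allocation from salloc above:
$ srun --ntasks=3 hostname
vmp368
vmp369
vmp370
```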
Comment 3 David Bigagli 2015-03-18 11:23:07 MDT
Excellent!

David