When I submit a multi-node job and specify --ntasks-per-node, SLURM complains when I try launching a job step with srun and --ntasks-per-node. For example:

[frenchwr@vmps08 ~]$ cat test.slurm
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=8

echo "testing nodes=1, ntasks=1"
srun --nodes=1 --ntasks=1 hostname
echo "testing nodes=1, ntasks=8"
srun --nodes=1 --ntasks=8 hostname
echo "testing nodes=1, ntasks-per-node=1"
srun --nodes=1 --ntasks-per-node=1 hostname

[frenchwr@vmps08 ~]$ cat slurm-862190.out
testing nodes=1, ntasks=1
vmp368
testing nodes=1, ntasks=8
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
testing nodes=1, ntasks-per-node=1
srun: error: Unable to create job step: More processors requested than permitted

The same thing happens inside an interactive allocation:

[frenchwr@vmps08 ~]$ salloc --nodes=3 --ntasks-per-node=8
salloc: Pending job allocation 862174
salloc: job 862174 queued and waiting for resources
salloc: job 862174 has been allocated resources
salloc: Granted job allocation 862174
[frenchwr@vmp368 ~]$ srun --nodes=1 --ntasks=1 hostname
vmp368
[frenchwr@vmp368 ~]$ srun --nodes=1 --ntasks=8 hostname
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
vmp368
[frenchwr@vmp368 ~]$ srun --nodes=1 --ntasks-per-node=1 hostname
srun: error: Unable to create job step: More processors requested than permitted

I see the same behavior if I request --ntasks=24 with sbatch or salloc. Is this a limitation of srun or a bug? I can envision many scenarios where a user would want to control how many processes are launched within a job step and on which nodes (e.g. performance benchmarks).
Hi,

The allocation --nodes=3 --ntasks-per-node=8 carries 24 tasks. The request srun --nodes=1 --ntasks-per-node=1 inherits the task count from the allocation, which is 24, and since you do not have 24 tasks allocated on a single node, the srun request fails. On the other hand, srun --nodes=1 --ntasks=8 (or in general any --ntasks between 1 and 8) succeeds because you override the allocation's task count with the --ntasks option. --ntasks is the option to use when a job step needs fewer tasks than the allocation provides.

Let me know if this answers your question.

David
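For concreteness, that advice can be sketched as a corrected batch script. This is an untested sketch that assumes the same 3-node, 8-tasks-per-node allocation as the report; it needs a live SLURM cluster to actually run:

```shell
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=8    # the allocation carries 3 * 8 = 24 tasks

# A job step launched without --ntasks inherits all 24 tasks from the
# allocation, so "--nodes=1 --ntasks-per-node=1" cannot be satisfied on
# a single 8-task node. Override the step's task count explicitly:
srun --nodes=1 --ntasks=1 hostname   # one task on one node
srun --nodes=1 --ntasks=8 hostname   # eight tasks, still one node
```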
(In reply to David Bigagli from comment #1)
> Hi,
> the allocation --nodes=3 --ntasks-per-node=8 has 24 tasks. The request
> srun --nodes=1 --ntasks-per-node=1 uses the number of tasks from the
> allocation which is 24 and since you don't have 24 tasks allocated on
> one node the srun request fails. On the other hand the srun --nodes=1
> --ntasks=8 or in general ntasks >=1 and <= 8 succeeds because you
> overwrite the allocated number of tasks using the ntasks option. The
> ntasks option is the one to use when specifying a number of tasks less
> than those allocated.
>
> Let me know if this answers your question.
>
> David

Got it. I knew --ntasks would override --ntasks-per-node, but I did not realize that a task count is set for the allocation even when the --ntasks option is omitted, and that this value then overrides any --ntasks-per-node value specified at the job-step level. That's not especially intuitive, but I have a good handle on it now. It also looks like --ntasks does a good job of load balancing the tasks across the nodes in an allocation, which is nice. Thanks for the response. You can close this ticket.
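The load balancing mentioned above comes down to srun spreading the step's task count as evenly as possible across the allocated nodes (the exact placement depends on the site's configuration and the --distribution option). A minimal sketch of that arithmetic, with node and task counts chosen purely for illustration:

```shell
# Illustrative arithmetic only, not a SLURM API: how 12 tasks
# (e.g. srun --ntasks=12) spread evenly over a 3-node allocation.
ntasks=12   # step task count
nodes=3     # nodes in the allocation
base=$((ntasks / nodes))    # minimum tasks every node receives
extra=$((ntasks % nodes))   # the first $extra nodes get one more task
for i in 0 1 2; do
  if [ "$i" -lt "$extra" ]; then
    echo "node$i: $((base + 1)) tasks"
  else
    echo "node$i: $base tasks"
  fi
done
# → node0: 4 tasks
# → node1: 4 tasks
# → node2: 4 tasks
```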
Excellent! David