| Summary: | ntasks-per-node error | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Martins Innus <minnus> |
| Component: | User Commands | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 3 - Medium Impact | Priority: | --- |
| Version: | 16.05.3 | CC: | alex, tim |
| Hardware: | Linux | OS: | Linux |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5897, https://bugs.schedmd.com/show_bug.cgi?id=5977, https://bugs.schedmd.com/show_bug.cgi?id=8251 | | |
| Site: | University of Buffalo (SUNY) | | |
| Attachments: | batch script, script2, output2, error2 | | |
Description (Martins Innus, 2016-09-13 08:55:13 MDT)
Comment (Alejandro Sanchez)

Martins, could you please show the whole batch script, including any srun requests inside it? Since it's an srun error, that will make it easier to reproduce. In any case, we are able to reproduce something similar with a simpler request:

    $ salloc --ntasks-per-node=8 -n 8
    salloc: Granted job allocation 20004
    srun: Warning: can't honor --ntasks-per-node set to 8 which doesn't match
    the requested tasks 1 with the number of requested nodes 1.
    Ignoring --ntasks-per-node.

    $ scontrol show config | grep Salloc
    SallocDefaultCommand = srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --gres=craynetwork:0 --mpi=none $SHELL

So there's definitely an issue going on there.
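For reference, a minimal sketch of the same conflict inside a batch job (an illustration, not the reporter's attached script; the file name and `hostname` payload are placeholders). It assumes the step inherits SLURM_NTASKS_PER_NODE from the allocation, as in the salloc reproduction above:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8

# The allocation exports SLURM_NTASKS_PER_NODE=8. This step asks for a
# single task on a single node, so srun cannot honor the inherited
# per-node task count and prints the warning quoted above.
srun -n1 -N1 hostname
```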
Comment (Martins Innus)

Created attachment 3495 [details]
batch script
OK, attached. This is a much simplified script from the original report, but it still shows the same problem. It just runs an MPI hello world.

Comment (Alejandro Sanchez)

I see that in your script you have:

    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=12

and then:

    NPROCS=`expr $SLURM_NTASKS_PER_NODE \* $SLURM_NNODES`
    srun --ntasks-per-node=$NPROCS ./helloworld

Why are you overriding --ntasks-per-node from 12 to 12*2? Maybe I'm wrong, but it sounds a bit strange to me.
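For what it's worth, a short sketch of what that arithmetic evaluates to under the directives above (`./helloworld` is the binary from the attached script; the suggested fix line is an assumption, anticipating the resolution below). With SLURM_NTASKS_PER_NODE=12 and SLURM_NNODES=2, NPROCS is 24, which cannot fit as a per-node count:

```bash
# Inside the allocation: SLURM_NTASKS_PER_NODE=12, SLURM_NNODES=2
NPROCS=`expr $SLURM_NTASKS_PER_NODE \* $SLURM_NNODES`   # 12 * 2 = 24

# 24 tasks *per node* over-requests a 12-per-node allocation and
# triggers the warning; 24 was presumably meant as the *total*:
srun --ntasks=$NPROCS ./helloworld
```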
Comment (Martins Innus)

Created attachment 3499 [details]
script2

Created attachment 3500 [details]
output2

Created attachment 3501 [details]
error2
Comment 17 (Martins Innus)

OK, sorry. In attempting to simplify the script, there was an error. I have uploaded a new script and the corresponding output and error.

The easiest way to reproduce the error is to have one part of the job script with an srun invocation that uses fewer cores than the overall job script requests. We think we are seeing this same error with other combinations of nodes and cores, but this was the easiest case to simplify.

Let me know if it is actually an error in the job script and we can work with the user to fix it.

Thanks

Martins

Comment (Alejandro Sanchez)

(In reply to Martins Innus from comment #17)
> The easiest way to reproduce the error is to have one part of the job script
> that has an srun invocation that uses fewer cores than the overall job
> script.

Yes, we also managed to reproduce it locally this way.

> Let me know if it is actually an error in the job script and we can work
> with the user to fix it.

We have prepared a patch for this; once it is pushed we'll get back to you.

Comment (Alejandro Sanchez)

Martins, the following commit silences the warning you see when the number of tasks is less than the given/inherited number of tasks per node:

https://github.com/SchedMD/slurm/commit/daacf5afee9

Anyhow, as the documentation states, --ntasks-per-node is "meant to be used with the --ntasks option". Slurm has to figure out how many tasks can run in an allocation based on what the allocation requests, and it does so from whatever it is given. Slurm always wants to fill an allocation, so ntasks is ALWAYS inherited from the environment when you are inside one. Any time you are in an allocation, a job step will therefore default to whatever the allocation has for tasks. If you expect a certain number of tasks, you should ask for it explicitly. The options you specified only tell Slurm how to lay out tasks, not how many to run. Slurm will default to filling the allocated resources unless told otherwise.

Comment 32 (Martins Innus)

Alejandro,

OK. So if we have a multistep job with multiple sruns that require different ntasks-per-node values, we need to use --ntasks for the srun? Like this:

    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=12
    #SBATCH --ntasks=24

    # This inherits
    srun ./foo.exe

    # This needs all new params
    srun --nodes=2 --ntasks=12 --ntasks-per-node=6 ./bar.exe

    # End sbatch

Thanks for the clarification.

Martins

Comment (Alejandro Sanchez)

(In reply to Martins Innus from comment #32)
> # This inherits
> srun ./foo.exe

Slurm will default to filling the allocated resources unless told otherwise, so this first srun will try to fill 24 tasks across the 2 nodes.

> # This needs all new params
> srun --nodes=2 --ntasks=12 --ntasks-per-node=6 ./bar.exe

Exactly. If you want to consume less than what is allocated to the job, you have to tell Slurm explicitly, as you do in this second srun.

No problem. If you don't have any more questions, let me know if we can close the bug. Thanks.

Comment (Martins Innus)

OK, thanks! I will ask the researcher to resubmit his job and confirm that it works.

Comment (Alejandro Sanchez)

Hi Martins, any progress with this? Thanks.

Comment (Alejandro Sanchez)

Marking as resolved/timedout. Please reopen if any issue is encountered with the customer feedback.
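As a final illustration of the inheritance behavior described above, a minimal sketch (with `hostname` standing in for the real binaries, and assuming the standard SLURM_NTASKS/SLURM_NNODES environment variables) that makes each step's task layout visible by counting tasks per node:

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=12
#SBATCH --ntasks=24

echo "allocation: SLURM_NTASKS=$SLURM_NTASKS SLURM_NNODES=$SLURM_NNODES"

# Inherits the allocation: 24 tasks, 12 per node, so each of the two
# node hostnames should appear 12 times.
srun hostname | sort | uniq -c

# Explicitly sized step: 12 tasks, 6 per node, so each hostname should
# appear 6 times.
srun --nodes=2 --ntasks=12 --ntasks-per-node=6 hostname | sort | uniq -c
```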