Please consider the following simple test script:

#!/bin/bash
#SBATCH --ntasks 30
#SBATCH --partition linlarge
#SBATCH --exclusive

for i in {1..4}; do
    echo $i
    srun -n 1 -c 1 sleep 150 &
done
wait

The nodes in our partition each have 28 cores, therefore 2 nodes are allocated:

JOBID    PARTITION  NAME      USER     ST  TIME  NODES  NODELIST(REASON)
677521   linlarge   slurmtes  ngib740  R   0:03  2      gisath[034,036]

However, only 2 processes get started. Looking at "sacct" I can see that 28 cores (AllocCPUS) were allocated per job step:

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
677521       slurmtest+   linlarge    default         56    RUNNING      0:0
677521.batch      batch               default         28    RUNNING      0:0
677521.0          sleep               default         28    RUNNING      0:0
677521.1          sleep               default         28    RUNNING      0:0

The SLURM output shows the following warnings:

cpu-bind=MASK - gisath034, task 0 0 [7913]: mask 0xfffffff set
1 2 3 4
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
cpu-bind=MASK - gisath034, task 0 0 [7975]: mask 0x1 set
cpu-bind=MASK - gisath036, task 0 0 [20700]: mask 0x1 set

If I change the script and add --exclusive to the srun parameters, I still get the same "srun: Warning: ..." messages, but the behaviour is back to what I expect and one core is allocated per task:

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
677522       slurmtest+   linlarge    default         56    RUNNING      0:0
677522.batch      batch               default         28    RUNNING      0:0
677522.0          sleep               default          1    RUNNING      0:0
677522.1          sleep               default          1    RUNNING      0:0
677522.2          sleep               default          1    RUNNING      0:0
677522.3          sleep               default          1    RUNNING      0:0

Without the #SBATCH --exclusive (and without srun --exclusive) I get yet another allocation, 15 CPUs per task:

677523       slurmtest+   linlarge    default         30    RUNNING      0:0
677523.batch      batch               default         15    RUNNING      0:0
677523.0          sleep               default         15    RUNNING      0:0
677523.1          sleep               default         15    RUNNING      0:0

I'm trying to understand what is happening here and why more than 1 CPU gets allocated to the tasks started with srun. I'm pretty sure we did not see this behavior with SLURM 20.02. Can you please advise whether this is an expected change or a bug in 20.11? Do we need additional configuration settings with SLURM 20.11? Thank you.

Our partition is set up as follows:

PartitionName=linlarge Nodes=gisath[009-352] MaxTime=INFINITE State=UP OverSubscribe=FORCE:1 PriorityTier=30 QoS=linlarge

The nodes like this:

NodeName=gisath[017-352] CPUs=28 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 RealMemory=128000 State=UNKNOWN Weight=1 Feature=athena,broadwell,rhel7,lmem GRES=fv:1

The select type:

SelectType=select/cons_tres   # for gpu
SelectTypeParameters=CR_Core,CR_Pack_Nodes

Let me know if you need more information. Thanks.
Possibly related to the same issue: if I use the following script, SLURM will start 28 tasks on the first node but only a single task on the second node allocated to the job:

#!/bin/bash
#SBATCH --ntasks 40
#SBATCH --partition linlarge
#SBATCH --exclusive

for i in {1..40}; do
    echo $i
    srun -n 1 -c 1 --exclusive sleep 150 &
done
wait

JOBID    PARTITION  NAME      USER     ST  TIME  NODES  NODELIST(REASON)
677547   linlarge   slurmtes  ngib740  R   2:56  2      gisath[018,020]

cat slurm-677547.out

cpu-bind=MASK - gisath018, task 0 0 [14335]: mask 0xfffffff set
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
[the warning above is printed once for each of the 40 srun calls; the repeats are trimmed here]
cpu-bind=MASK - gisath018, task 0 0 [14584]: mask 0x2 set
cpu-bind=MASK - gisath018, task 0 0 [14591]: mask 0x4 set
cpu-bind=MASK - gisath020, task 0 0 [25583]: mask 0x1 set
cpu-bind=MASK - gisath018, task 0 0 [14600]: mask 0x4000 set
cpu-bind=MASK - gisath018, task 0 0 [14608]: mask 0x8000 set
cpu-bind=MASK - gisath018, task 0 0 [14616]: mask 0x20 set
cpu-bind=MASK - gisath018, task 0 0 [14624]: mask 0x40 set
cpu-bind=MASK - gisath018, task 0 0 [14631]: mask 0x20000 set
cpu-bind=MASK - gisath018, task 0 0 [14640]: mask 0x10000 set
cpu-bind=MASK - gisath018, task 0 0 [14647]: mask 0x40000 set
cpu-bind=MASK - gisath018, task 0 0 [14657]: mask 0x1 set
cpu-bind=MASK - gisath018, task 0 0 [14665]: mask 0x80000 set
cpu-bind=MASK - gisath018, task 0 0 [14673]: mask 0x100 set
cpu-bind=MASK - gisath018, task 0 0 [14681]: mask 0x400 set
cpu-bind=MASK - gisath018, task 0 0 [14689]: mask 0x1000 set
cpu-bind=MASK - gisath018, task 0 0 [14697]: mask 0x10 set
cpu-bind=MASK - gisath018, task 0 0 [14705]: mask 0x8 set
cpu-bind=MASK - gisath018, task 0 0 [14714]: mask 0x800 set
cpu-bind=MASK - gisath018, task 0 0 [14722]: mask 0x100000 set
cpu-bind=MASK - gisath018, task 0 0 [14730]: mask 0x80 set
cpu-bind=MASK - gisath018, task 0 0 [14738]: mask 0x2000 set
cpu-bind=MASK - gisath018, task 0 0 [14746]: mask 0x200000 set
cpu-bind=MASK - gisath018, task 0 0 [14754]: mask 0x400000 set
cpu-bind=MASK - gisath018, task 0 0 [14762]: mask 0x200 set
cpu-bind=MASK - gisath018, task 0 0 [14770]: mask 0x4000000 set
cpu-bind=MASK - gisath018, task 0 0 [14778]: mask 0x8000000 set
cpu-bind=MASK - gisath018, task 0 0 [14786]: mask 0x1000000 set
cpu-bind=MASK - gisath018, task 0 0 [14792]: mask 0x800000 set
cpu-bind=MASK - gisath018, task 0 0 [14795]: mask 0x2000000 set
srun: Job 677547 step creation temporarily disabled, retrying (Requested nodes are busy)
[the message above keeps repeating while the remaining steps wait for free resources; the repeats are trimmed here]
Hi Patrick,

First, to answer your question about this warning:

> srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1

srun requested 2 nodes implicitly - the job allocation is 2 nodes, and srun didn't specify a node count, so it uses the node count of the job allocation. But srun also requested 1 CPU and 1 task, and you can't run 1 process on 2 nodes. You can specify the number of nodes (-N1) to silence this warning.

Now to answer your main question: there were some changes to srun in 20.11. From our RELEASE_NOTES document:
https://github.com/SchedMD/slurm/blob/slurm-20-11-5-1/RELEASE_NOTES

 -- By default, a step started with srun will be granted exclusive (or
    non-overlapping) access to the resources assigned to that step. No other
    parallel step will be allowed to run on the same resources at the same
    time. This replaces one facet of the '--exclusive' option's behavior, but
    does not imply the '--exact' option described below. To get the previous
    default behavior - which allowed parallel steps to share all resources -
    use the new srun '--overlap' option.
 -- In conjunction to this non-overlapping step allocation behavior being the
    new default, there is an additional new option for step management
    '--exact', which will allow a step access to only those resources
    requested by the step. This is the second half of the '--exclusive'
    behavior. Otherwise, by default all non-gres resources on each node in
    the allocation will be used by the step, making it so no other parallel
    step will have access to those resources unless both steps have specified
    '--overlap'.

You can find more background about this change here:
https://bugs.schedmd.com/show_bug.cgi?id=10383#c63

In other words, the default behavior for srun is:

* exclusive access to the resources it requests (srun --exclusive)
* all the resources of the job on the node (srun --whole)

These can be overridden by:

* srun --overlap (steps can overlap each other)
* srun --exact (use only exactly the resources requested)

Here's an example. I have 8 cores and 2 threads per core on my nodes.

#!/bin/bash
#SBATCH -n20
set -x
srun --exact -N1 -c1 -n1 whereami
printf "\n\n\n"
srun -N1 -c1 -n1 whereami
printf "\n\n\n"
srun -N2 -n2 whereami | sort

("whereami" is a simple program we wrote that just displays CPU masks. You can also display the masks the way you are already doing, but I can share this program with you if you want.)

$ sbatch 11275.batch
Submitted batch job 202
$ cat slurm-202.out
+ srun --exact -N1 -c1 -n1 whereami
0000 n1-1 - Cpus_allowed: 0101  Cpus_allowed_list: 0,8
+ printf '\n\n\n'
+ srun -N1 -c1 -n1 whereami
0000 n1-1 - Cpus_allowed: 1f1f  Cpus_allowed_list: 0-4,8-12
+ printf '\n\n\n'
+ srun -N2 -n2 whereami
+ sort
0000 n1-1 - Cpus_allowed: 1f1f  Cpus_allowed_list: 0-4,8-12
0001 n1-2 - Cpus_allowed: 1f1f  Cpus_allowed_list: 0-4,8-12

My first step uses --exact, so it gets exactly the CPUs it asked for. My second step doesn't use --exact, so it is given all the resources on the node - this is what you're seeing. My third step uses all the resources in the job, just to show what the CPU bindings are for the entire job across the 2 nodes.

Basically, unless --exact is specified, --cpus-per-task is ignored in a job step. I admit this was surprising to me, even though I knew about this change to srun - I thought that if I specified --cpus-per-task, slurmctld would give me exactly what I asked for. I will look into changing the behavior so that if --cpus-per-task is explicitly requested by the user, it implies --exact. But if it turns out that we don't want to do that, then I will at least write a documentation patch for the srun man page to clarify that --cpus-per-task requires --exact.

Can you run your tests with --exact and let me know if it does what you expect?
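For readers without access to SchedMD's "whereami" helper, a rough shell stand-in (a hypothetical sketch, not the actual tool - the output format differs, and the script name whereami.sh is my own) is to print the task rank, host name, and the CPU affinity that Linux reports for the process:

#!/bin/bash
# whereami.sh - hypothetical stand-in: show which CPUs this step task is bound to.
# SLURM_PROCID is set by srun for each task; /proc/self/status is Linux-specific.
printf '%s %s - ' "${SLURM_PROCID:-?}" "$(hostname -s)"
grep -E '^Cpus_allowed(_list)?:' /proc/self/status | tr -s '\n\t' ' '
echo

Running it inside a step, e.g. `srun --exact -N1 -c1 -n1 ./whereami.sh`, shows the per-step binding much like the cpu-bind=MASK lines already quoted in this ticket.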
> If I change the script and add --exclusive to the srun parameters I still get these same "srun: Warning: ..." but the behaviour is back to expected and one core is allocated per task:

This happens because explicitly requesting --exclusive implicitly sets --exact.
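For reference, a sketch of the first test script from this ticket adapted to the 20.11 step semantics (anticipating the --exact tests in the following comments; partition and sleep duration are taken from the original report):

#!/bin/bash
#SBATCH --ntasks 30
#SBATCH --partition linlarge
#SBATCH --exclusive

for i in {1..4}; do
    echo $i
    # --exact limits each step to the 1 task / 1 CPU it asks for, so the four
    # steps can run in parallel instead of one step holding all 28 cores
    srun -n 1 -c 1 --exact sleep 150 &
done
wait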
Hello - thank you for the explanation. I just did a quick test, but I believe I still see an issue when going across nodes. The following script will request two desktop nodes, each configured with 3 CPUs:

#!/bin/bash
#SBATCH --ntasks 6
#SBATCH --partition desktop

for i in {1..6}; do
    echo $i
    srun -N1 -n1 -c1 --exact sleep 150 &
done
wait

SLURM assigns two nodes:

JOBID    PARTITION  NAME      USER     ST  TIME  NODES  NODELIST(REASON)
1380409  desktop    slurmtes  ngib740  R   2:49  2      giswlx[100-101]

The output file, however, shows that only 1 CPU is used on the second node (giswlx101):

cpu-bind=MASK - giswlx100, task 0 0 [11655]: mask 0x7 set
1 2 3 4 5 6
cpu-bind=MASK - giswlx100, task 0 0 [11752]: mask 0x1 set
cpu-bind=MASK - giswlx100, task 0 0 [11764]: mask 0x4 set
cpu-bind=MASK - giswlx100, task 0 0 [11767]: mask 0x2 set
cpu-bind=MASK - giswlx101, task 0 0 [25074]: mask 0x1 set
srun: Job 1380409 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1380409 step creation temporarily disabled, retrying (Requested nodes are busy)

If I use "srun -N1 -n1 -c1 --exact --overlap sleep 150 &", it is even worse, with SLURM starting more tasks (5) on giswlx100 than there are configured CPUs:

cpu-bind=MASK - giswlx100, task 0 0 [16047]: mask 0x7 set
1 2 3 4 5 6
cpu-bind=MASK - giswlx100, task 0 0 [16125]: mask 0x4 set
cpu-bind=MASK - giswlx100, task 0 0 [16138]: mask 0x1 set
cpu-bind=MASK - giswlx100, task 0 0 [16144]: mask 0x1 set
cpu-bind=MASK - giswlx100, task 0 0 [16154]: mask 0x2 set
cpu-bind=MASK - giswlx100, task 0 0 [16155]: mask 0x2 set
cpu-bind=MASK - giswlx101, task 0 0 [26888]: mask 0x1 set

So am I still missing something in the configuration here? Please advise.

I agree that if --cpus-per-task is specified it would make sense to imply --exact, or at least to issue a warning in the log file that the option is ignored.
(In reply to Patrick from comment #26)
> Hello - thank you for the explanation;
> I just did a quick test but I believe I still see some issue when going
> across nodes. The following script will request two desktops with 3 CPU's
> configured each:
>
> #!/bin/bash
> #SBATCH --ntasks 6
> #SBATCH --partition desktop
>
> for i in {1..6}; do
>     echo $i
>     srun -N1 -n1 -c1 --exact sleep 150 &
> done
>
> wait

Actually, this job doesn't request two nodes with 3 CPUs each. This job only requests 6 tasks on partition "desktop". The tasks don't have to be distributed evenly on two nodes, and they don't even have to be on two nodes. If Slurm can fit the 6 tasks on one node, then it will try to do that, since "block" is the default distribution for tasks across nodes with select/cons_res. You can read more about the distribution in the sbatch/srun/salloc man pages (search for the -m, --distribution option).

To request exactly 3 tasks on 2 nodes, use:

sbatch --ntasks-per-node=3 -N2

So it looks like what actually happened is that 5 tasks were on one node and 1 task was on another node.

> If I use "srun -N1 -n1 -c1 --exact --overlap sleep 150 &", then it is even
> worse with SLURM starting more tasks (5) on giswlx100 than there are
> configured CPU's:

--overlap allows steps to share CPUs.
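A sketch of the even-distribution variant described above, assuming the same 3-CPU "desktop" nodes, with the node/task distribution pinned in the batch directives rather than left to the default block distribution:

#!/bin/bash
#SBATCH --ntasks 6
#SBATCH --ntasks-per-node 3
#SBATCH --nodes 2
#SBATCH --partition desktop

for i in {1..6}; do
    echo $i
    # with 3 tasks placed on each of the 2 nodes, every step can get its own CPU
    srun -N1 -n1 -c1 --exact sleep 150 &
done
wait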
Marshall - the "desktop" nodes are all configured with 3 CPUs:

NodeName=giswlx100 Arch=x86_64 CoresPerSocket=3 CPUAlloc=0 CPUTot=3 CPULoad=2.01

This means that asking for 6 tasks requests 2 hosts with 3 CPUs each. The main issue here is that I'm asking for 6 tasks but can only start (srun) 4 at the same time (SLURM output file):

1 2 3 4 5 6
cpu-bind=MASK - giswlx100, task 0 0 [11752]: mask 0x1 set
cpu-bind=MASK - giswlx100, task 0 0 [11764]: mask 0x4 set
cpu-bind=MASK - giswlx100, task 0 0 [11767]: mask 0x2 set
cpu-bind=MASK - giswlx101, task 0 0 [25074]: mask 0x1 set
srun: Job 1380409 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1380409 step creation temporarily disabled, retrying (Requested nodes are busy)

To illustrate this better, here's a test on 3 nodes:

#!/bin/bash
#SBATCH --ntasks 9
#SBATCH --ntasks-per-node 3
#SBATCH --partition desktop

for i in {1..9}; do
    echo $i
    srun -N1 -n1 -c1 --exact sleep 30 &
done
wait

SLURM will use 3 CPUs/tasks on the first assigned host (giswlx100), but only a single task on each of the 2 other hosts:

cpu-bind=MASK - giswlx100, task 0 0 [32493]: mask 0x1 set
cpu-bind=MASK - giswlx100, task 0 0 [32498]: mask 0x4 set
cpu-bind=MASK - giswlx100, task 0 0 [32504]: mask 0x2 set
cpu-bind=MASK - giswlx101, task 0 0 [13746]: mask 0x1 set
cpu-bind=MASK - giswlx102, task 0 0 [27852]: mask 0x1 set
srun: Job 1388346 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1388346 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 1388346
srun: Job 1388346 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1388346 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1388346 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 1388346
srun: Job 1388346 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 1388346
srun: Job 1388346 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 1388346
cpu-bind=MASK - giswlx100, task 0 0 [321]: mask 0x1 set
cpu-bind=MASK - giswlx100, task 0 0 [338]: mask 0x2 set
cpu-bind=MASK - giswlx102, task 0 0 [28906]: mask 0x1 set
cpu-bind=MASK - giswlx101, task 0 0 [14921]: mask 0x1 set
Patrick,

Thanks for the clarification. I can reproduce what you're seeing in your latest comment, and I agree it seems like a bug. But it's different from the original issue you reported about --cpus-per-task, so I've created bug 11357 to handle this and added you to CC. Let's continue our conversation over there, and we'll leave this bug to handle --cpus-per-task implying --exact.
Patrick,

We've pushed a fix for --cpus-per-task (and --threads-per-core, which had the same issue) to imply --exact. We pushed this to the master branch since it is a change in behavior, and documented the behavior change in NEWS, RELEASE_NOTES, and the srun man page. So this will be in 21.08 when it is released (this August).

If you want this for 20.11, you should be able to cherry-pick the patch (just src/srun/libsrun/launch.c, since the changes to the other files are only documentation changes).

Closing this as resolved/fixed in 21.08.
I realized I never made the commit hash public. Here's the commit with the change: https://github.com/SchedMD/slurm/commit/e01e884f3c294 (Sorry about the extra email.)
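For anyone backporting this to a local 20.11 tree, one possible way (a sketch, not a tested procedure) to apply only the code portion of that commit is:

# assumed layout: a git checkout of your 20.11 Slurm source, with the SchedMD
# GitHub repository available as a remote named "schedmd" (a name chosen here)
git fetch schedmd
# take only the src/srun/libsrun/launch.c part of the fix; the rest of the
# commit is documentation (NEWS, RELEASE_NOTES, man page) changes
git show e01e884f3c294 -- src/srun/libsrun/launch.c | git apply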
Hi Patrick,

I just wanted to give you some updates on this.

It turned out that, due to a bug, -c/--cpus-per-task and --threads-per-core did NOT imply --exact in 21.08.0 through 21.08.4. In 21.08.5, we fixed that bug so that they imply --exact properly. However, we discovered that this broke MPI programs pretty badly, since mpirun can't work like it needs to when --exact is specified. So in 21.08.6+, we reverted -c/--cpus-per-task and --threads-per-core implying --exact.

However, in 22.05 we are going to make -c/--cpus-per-task imply --exact again, but we are also going to change it so that srun does NOT inherit any -c specified by salloc/sbatch. This will give us the best of both worlds: it won't break MPI programs, and it also fixes the issue you highlighted in this ticket (where `srun -cX ...` gives you the job's whole allocation when it really doesn't make sense for it to).

What this means for you: in 21.08.6+, make sure users specify --exact when doing `srun -c ...` to get only the CPUs you expect. In 22.05, you can get rid of the extra --exact.

See bug 13351 comment 76 for more details.

Thanks!
-Michael
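Concretely, based on the description above (a sketch, reusing the sleep test from this ticket), the per-step srun call inside a batch loop would look like:

# Slurm 21.08.6 and later 21.08 releases: spell out --exact so that -c1 is honored
srun -n1 -c1 --exact sleep 150 &

# Slurm 22.05+: -c/--cpus-per-task implies --exact again, so --exact can be dropped
srun -n1 -c1 sleep 150 &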
Thank you for the update, Michael - we'll keep that in mind when we update to the latest SLURM release.