(Filing as site = Goodyear since they reported this.)

Coming from bug 11275 comment 28, which is a clear reproducer (I can reproduce this as well):

Marshall - the "desktops" are all configured with 3 CPUs:

NodeName=giswlx100 Arch=x86_64 CoresPerSocket=3
CPUAlloc=0 CPUTot=3 CPULoad=2.01

This means that asking for 6 tasks requests 2 hosts with 3 CPUs each. The main issue here is that I'm asking for 6 tasks but can only start (srun) 4 at the same time:

(SLURM output file)
1
2
3
4
5
6
cpu-bind=MASK - giswlx100, task 0 0 [11752]: mask 0x1 set
cpu-bind=MASK - giswlx100, task 0 0 [11764]: mask 0x4 set
cpu-bind=MASK - giswlx100, task 0 0 [11767]: mask 0x2 set
cpu-bind=MASK - giswlx101, task 0 0 [25074]: mask 0x1 set
srun: Job 1380409 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1380409 step creation temporarily disabled, retrying (Requested nodes are busy)

To illustrate this better here's a test on 3 nodes:

#!/bin/bash
#SBATCH --ntasks 9
#SBATCH --ntasks-per-node 3
#SBATCH --partition desktop
for i in {1..9}; do
    echo $i
    srun -N1 -n1 -c1 --exact sleep 30 &
done
wait

SLURM will use 3 CPUs / tasks on the first assigned host (giswlx100) but only a single task on each of the 2 other hosts:

cpu-bind=MASK - giswlx100, task 0 0 [32493]: mask 0x1 set
cpu-bind=MASK - giswlx100, task 0 0 [32498]: mask 0x4 set
cpu-bind=MASK - giswlx100, task 0 0 [32504]: mask 0x2 set
cpu-bind=MASK - giswlx101, task 0 0 [13746]: mask 0x1 set
cpu-bind=MASK - giswlx102, task 0 0 [27852]: mask 0x1 set
srun: Job 1388346 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1388346 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 1388346
srun: Job 1388346 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1388346 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1388346 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 1388346
srun: Job 1388346 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 1388346
srun: Job 1388346 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 1388346
cpu-bind=MASK - giswlx100, task 0 0 [321]: mask 0x1 set
cpu-bind=MASK - giswlx100, task 0 0 [338]: mask 0x2 set
cpu-bind=MASK - giswlx102, task 0 0 [28906]: mask 0x1 set
cpu-bind=MASK - giswlx101, task 0 0 [14921]: mask 0x1 set

* We have 9 tasks (3 per node) available to run steps
* Step 0 is allocated to node 0
* Step 1 is allocated to node 1
* Step 2 is allocated to node 2
* Step 3 is allocated to node 0
* Step 4 is allocated to node 0
* Step 5 (the 6th step) doesn't run. slurmctld tries to allocate it to node 0, but all the CPUs there are in use.

I don't know why slurmctld doesn't allocate the remaining steps on nodes 1 and 2, and I also don't know why the distribution of tasks begins cyclic but ends block. I'm pretty sure I've worked on a bug dealing with this before, so I'll have to hunt it down.
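For anyone checking their own job output for the same symptom, here is a rough sketch that tallies how many steps were bound on each node from the cpu-bind lines. The function name count_steps_per_node and the variable first_wave are made up for illustration, and the sketch assumes the "cpu-bind=MASK - <node>, task ..." line format shown in this report; the sample input is the first wave of steps from the 3-node test.

```shell
#!/bin/bash
# Hypothetical helper: count cpu-bind lines per node, reading the job
# output on stdin. Assumes the node name is the third whitespace-separated
# field, followed by a comma, as in the logs from this report.
count_steps_per_node() {
    awk '/^cpu-bind=MASK - / {
             node = $3            # third field is "<node>,"
             sub(/,$/, "", node)  # drop the trailing comma
             count[node]++
         }
         END { for (n in count) print n, count[n] }' | sort
}

# First wave of steps from the 3-node test (copied from the output above).
first_wave='cpu-bind=MASK - giswlx100, task 0 0 [32493]: mask 0x1 set
cpu-bind=MASK - giswlx100, task 0 0 [32498]: mask 0x4 set
cpu-bind=MASK - giswlx100, task 0 0 [32504]: mask 0x2 set
cpu-bind=MASK - giswlx101, task 0 0 [13746]: mask 0x1 set
cpu-bind=MASK - giswlx102, task 0 0 [27852]: mask 0x1 set'

printf '%s\n' "$first_wave" | count_steps_per_node
```

On that sample it reports 3 steps on giswlx100 and 1 each on giswlx101 and giswlx102, i.e. 5 concurrent steps where the 9 allocated CPUs should allow 9.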
Patrick,

We've fixed this issue in commit 6a2c99edbf96e. It will be in Slurm 20.11.7 onward.

Thanks for reporting this! Closing this bug as resolved/fixed.