Ticket 11357

Summary: slurmctld doesn't start all possible steps when tasks are spread across nodes
Product: Slurm
Reporter: Marshall Garey <marshall>
Component: slurmctld
Assignee: Marshall Garey <marshall>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
CC: lyeager, phock, tripiana
Version: 20.11.5
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=11687
Site: Goodyear
Version Fixed: 20.11.7 21.08.0pre1

Description Marshall Garey 2021-04-13 10:18:31 MDT
(Filing as site = Goodyear since they reported this.)

Coming from bug 11275 comment 28, which includes a clear reproducer (I can reproduce this as well):

Marshall - the "desktops" are all configured with 3 CPUs:
NodeName=giswlx100 Arch=x86_64 CoresPerSocket=3
   CPUAlloc=0 CPUTot=3 CPULoad=2.01

This means that asking for 6 tasks requests 2 hosts with 3 CPUs each.
The main issue here is that I'm asking for 6 tasks but can only start (srun) 4 of them at the same time:
(SLURM output file)
1
2
3
4
5
6
cpu-bind=MASK - giswlx100, task  0  0 [11752]: mask 0x1 set
cpu-bind=MASK - giswlx100, task  0  0 [11764]: mask 0x4 set
cpu-bind=MASK - giswlx100, task  0  0 [11767]: mask 0x2 set
cpu-bind=MASK - giswlx101, task  0  0 [25074]: mask 0x1 set
srun: Job 1380409 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1380409 step creation temporarily disabled, retrying (Requested nodes are busy)
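
For reference, here is a minimal sketch of what the 6-task reproducer above presumably looks like. It is inferred from the 3-node test below; the partition name, the --ntasks-per-node value, and the sleep duration are assumptions rather than details taken from the original report:

#!/bin/bash
#SBATCH --ntasks 6
#SBATCH --ntasks-per-node 3
#SBATCH --partition desktop

# Launch 6 single-task, single-CPU steps in the background; with
# 2 nodes x 3 CPUs each, all 6 should be able to run concurrently.
for i in {1..6}; do
    echo $i
    srun -N1 -n1 -c1 --exact sleep 30 &
done

wait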

To illustrate this better, here's a test on 3 nodes:

#!/bin/bash
#SBATCH --ntasks 9
#SBATCH --ntasks-per-node 3
#SBATCH --partition desktop

# Launch 9 single-task, single-CPU steps in the background.
for i in {1..9}; do
    echo $i
    srun -N1 -n1 -c1 --exact sleep 30 &
done

wait

Slurm will use 3 CPUs / tasks on the first assigned host (giswlx100) but only a single task on each of the other 2 hosts:

cpu-bind=MASK - giswlx100, task  0  0 [32493]: mask 0x1 set
cpu-bind=MASK - giswlx100, task  0  0 [32498]: mask 0x4 set
cpu-bind=MASK - giswlx100, task  0  0 [32504]: mask 0x2 set
cpu-bind=MASK - giswlx101, task  0  0 [13746]: mask 0x1 set
cpu-bind=MASK - giswlx102, task  0  0 [27852]: mask 0x1 set
srun: Job 1388346 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1388346 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 1388346
srun: Job 1388346 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1388346 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 1388346 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 1388346
srun: Job 1388346 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 1388346
srun: Job 1388346 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 1388346
cpu-bind=MASK - giswlx100, task  0  0 [321]: mask 0x1 set
cpu-bind=MASK - giswlx100, task  0  0 [338]: mask 0x2 set
cpu-bind=MASK - giswlx102, task  0  0 [28906]: mask 0x1 set
cpu-bind=MASK - giswlx101, task  0  0 [14921]: mask 0x1 set



* We have 9 tasks (3 per node) available to run steps
* Step 0 is allocated to node 0
* Step 1 is allocated to node 1
* Step 2 is allocated to node 2
* Step 3 is allocated to node 0
* Step 4 is allocated to node 0
* Step 5 (the 6th step) doesn't run. slurmctld tries to allocate it to node 0, but all of that node's CPUs are in use. I don't know why slurmctld doesn't allocate the remaining steps on nodes 1 and 2, and I also don't know why the distribution of tasks starts out cyclic but ends up block. I'm pretty sure I've worked on a bug dealing with this before, so I'll have to hunt it down.
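
As an aside for anyone reproducing this, here is a small sketch of one way to watch where the steps land while the job is running. The job ID is taken from the log above, and the grep filter is only illustrative:

# List running steps and their nodes every 2 seconds; squeue -s prints
# one line per step, including its NODELIST.
watch -n 2 'squeue -s | grep 1388346'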
Comment 17 Marshall Garey 2021-04-29 10:59:32 MDT
Patrick,

We've fixed this issue in commit 6a2c99edbf96e. It will be in Slurm 20.11.7 onward. Thanks for reporting this!
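
For anyone checking whether their installation already has this fix, a small sketch, assuming access to the cluster's client commands and, optionally, a checked-out Slurm source tree:

# The fix ships in 20.11.7 and 21.08.0pre1 (per the comment above), so the
# reported version is usually enough to tell:
srun --version
# In a Slurm git checkout, check whether the fix commit is an ancestor of HEAD:
git merge-base --is-ancestor 6a2c99edbf96e HEAD && echo "fix is present"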

Closing this bug as resolved/fixed.