Ticket 14608

Summary: Idle nodes, pending jobs, start time constantly pushed
Product: Slurm
Reporter: Kilian Cavalotti <kilian>
Component: Scheduling
Assignee: Carlos Tripiana Montes <tripiana>
Status: RESOLVED TIMEDOUT
Severity: 3 - Medium Impact
Priority: ---
CC: bart
Version: 21.08.8
Hardware: Linux
OS: Linux
Site: Stanford
Attachments: slurmctld logs with BackfillMap,Backfill
slurmctld debug3 logs with +Accrue,BackfillMap,Backfill,Priority

Description Kilian Cavalotti 2022-07-25 16:05:51 MDT
Created attachment 26017 [details]
slurmctld logs with BackfillMap,Backfill

Hi SchedMD,

Here's yet another one of those "nodes are idle, jobs are pending" kind of question :)

We have a situation where, at times (especially when some nodes are drained, though it's not clear whether that's a determining factor), jobs stay pending for extended periods with their expected start time basically set to now(), while nodes are idle. It seems like the backfill scheduler is evaluating them but not starting them.

Here's an example, where you can see the start time being constantly pushed, and set to now()+1s:

-- 8< -------------------------------------------------------------
$ for i in {1..5}; do sleep 10; date; squeue -j 58574370 -O jobid,state,starttime,schednodes,reason | column -t; done
Mon Jul 25 14:55:37 PDT 2022
JOBID     STATE    START_TIME           SCHEDNODES  REASON
58574370  PENDING  2022-07-25T14:55:37  (null)      Priority
Mon Jul 25 14:55:47 PDT 2022
JOBID     STATE    START_TIME  SCHEDNODES  REASON
58574370  PENDING  N/A         (null)      Priority
Mon Jul 25 14:55:58 PDT 2022
JOBID     STATE    START_TIME           SCHEDNODES  REASON
58574370  PENDING  2022-07-25T14:55:59  (null)      Priority
Mon Jul 25 14:56:09 PDT 2022
JOBID     STATE    START_TIME           SCHEDNODES  REASON
58574370  PENDING  2022-07-25T14:56:09  (null)      Priority
Mon Jul 25 14:56:19 PDT 2022
JOBID     STATE    START_TIME           SCHEDNODES  REASON
58574370  PENDING  2022-07-25T14:56:19  (null)      Priority
-- 8< -------------------------------------------------------------


Details about the job:
-- 8< -------------------------------------------------------------
# scontrol -dd show job 58574370
JobId=58574370 JobName=runstickleback
   UserId=maxczap(332832) GroupId=jergold(250244) MCS_label=N/A
   Priority=244505 Nice=0 Account=jergold QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2022-07-25T13:37:10 EligibleTime=2022-07-25T13:37:10
   AccrueTime=2022-07-25T13:37:31
   StartTime=2022-07-25T14:56:01 EndTime=2022-07-25T16:56:01 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-07-25T14:55:56 Scheduler=Backfill:*
   Partition=normal AllocNode:Sid=sh02-01n57:350
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=3 NumTasks=1 CPUs/Task=3 ReqB:S:C:T=0:0:*:*
   TRES=cpu=3,mem=12G,node=1,billing=6
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=3 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/users/maxczap/repos/sticklebackms/analysis/cluster/runstickleback.sh
   WorkDir=/home/users/maxczap/repos/sticklebackms
   StdErr=/home/users/maxczap/repos/sticklebackms/slurm-58574370.out
   StdIn=/dev/null
   StdOut=/home/users/maxczap/repos/sticklebackms/slurm-58574370.out
   Power=
   MailUser=maxczap MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT
-- 8< -------------------------------------------------------------

State of the partition:
-- 8< -------------------------------------------------------------
$ sinfo -p normal
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 7-00:00:00      3  drng@ sh02-01n[48-49,52]
normal*      up 7-00:00:00      1  drain sh02-01n16
normal*      up 7-00:00:00     87    mix sh02-01n[01,03,05-15,17-20,22,24-31,33-34,36-43,45-47,50-51],sh02-08n[37-41,50,52,54-56],sh02-10n[41-42,44,47-48],sh03-01n[01-02,25,27-30,39-42,44-49,51-52,54-58,60,62-63,66-68,70]
normal*      up 7-00:00:00     34  alloc sh02-01n[02,04,21,23,32,35,44,53-56],sh02-08n[42-43,45-49,51],sh02-10n[43,45-46],sh03-01n[07,16,18,26,43,50,53,59,61,64-65,69]
normal*      up 7-00:00:00     29   idle sh02-08n[44,53],sh03-01n[03-06,08-15,17,19-24,31-38]
-- 8< -------------------------------------------------------------

State of the queue:
-- 8< -------------------------------------------------------------
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          58572860    normal a1658779 suhasrao PD       0:00      1 (Dependency)
          58581793    normal     cron nrapstin PD       0:00      1 (BeginTime)
          58556204    normal cron-dai      rgm PD       0:00      1 (BeginTime)
          58557588    normal Backup / drinnenb PD       0:00      1 (BeginTime)
          58088703    normal     cron  slmeier PD       0:00      1 (BeginTime)
          58575044    normal dmm1m4_s shenglun PD       0:00      1 (MaxCpuPerAccount)
  37073695_[3-100]    normal    t25vq  wenwang PD       0:00      1 (MaxCpuPerAccount)
  36835732_[9-100]    normal    t25vq  wenwang PD       0:00      1 (MaxCpuPerAccount)
 37073694_[11-100]    normal    t25vq  wenwang PD       0:00      1 (MaxCpuPerAccount)
 36761483_[12-100]    normal    t25vq  wenwang PD       0:00      1 (MaxCpuPerAccount)
36401554_[20,23-24    normal    t25vq  wenwang PD       0:00      1 (MaxCpuPerAccount)
36835731_[6-9,15-1    normal    t25vq  wenwang PD       0:00      1 (MaxCpuPerAccount)
          58574370    normal runstick  maxczap PD       0:00      1 (Priority)
          58573380    normal sys/dash   earlew PD       0:00      1 (Priority)
          58572736    normal norm_coi   mhchin PD       0:00      1 (Priority)
          58572737    normal norm_coi   mhchin PD       0:00      1 (Priority)
          58572738    normal norm_coi   mhchin PD       0:00      1 (Priority)
          58572739    normal norm_coi   mhchin PD       0:00      1 (Priority)
          58572740    normal norm_coi   mhchin PD       0:00      1 (Priority)
[...]
-- 8< -------------------------------------------------------------


So that job (58574370) is pretty much the first runnable job in line (after the Dependency, BeginTime and MaxCpuPerAccount ones). I'd expect it to be backfilled pretty quickly, but instead it stayed pending for hours (as did all the other jobs below it in the queue).

The backfill scheduler evaluates it during each cycle, but doesn't start it:
-- 8< -------------------------------------------------------------
Jul 25 14:51:28 sh03-sl01.int slurmctld[28748]: sched/backfill: _dump_job_test: Test JobId=58574370 at 2022-07-25T14:51:28 on sh02-01n[01-15,17-56],sh02-08n[37-56],sh02-10n[41-48],sh03-01n[01-70]
-- 8< -------------------------------------------------------------


Here are our current scheduler parameters:
-- 8< -------------------------------------------------------------
SchedulerParameters=\
default_queue_depth=1000,\
partition_job_depth=100,\
sched_min_interval=5000000,\
build_queue_timeout=6000000,\
max_sched_time=10,\
bf_continue,\
bf_interval=60,\
bf_max_time=800,\
bf_yield_interval=1000000,\
bf_yield_sleep=1000000,\
bf_window=10080,\
bf_resolution=1200,\
bf_max_job_test=25000,\
bf_max_job_part=5000,\
bf_max_job_user=500,\
bf_max_job_user_part=100,\
bf_max_job_array_resv=2,\
bf_job_part_count_reserve=10,\
bf_min_prio_reserve=100000,\
bf_min_age_reserve=21600,\
nohold_on_prolog_fail,\
bf_busy_nodes,\
pack_serial_at_end,\
preempt_youngest_first,\
batch_sched_delay=10,\
max_array_tasks=1000,\
jobs_per_user_per_hour=1000,\
max_switch_wait=28800,\
max_rpc_cnt=128
-- 8< -------------------------------------------------------------


I'm attaching the full slurmctld log, with setdebugflags +BackfillMap,Backfill, and I'm happy to provide additional logs.

If you have any idea as to why this is happening, that would be very much appreciated!

Thanks,
--
Kilian
Comment 1 Kilian Cavalotti 2022-07-25 16:12:16 MDT
Just a quick comment to add that there's no reservation in the system, and no preemption set up for the `normal` partition.

Cheers,
--
Kilian
Comment 2 Kilian Cavalotti 2022-07-25 16:26:07 MDT
And one more to add that the controller logs this in debug2:

sched/backfill: _attempt_backfill: BACKFILL: JobId=58574370 has a prio_reserve of 100000
sched/backfill: _attempt_backfill: BACKFILL: test for JobId=58574370 Prio=240290 Partition=normal
debug2: sched/backfill: _attempt_backfill: entering _try_sched for JobId=58574370.
debug2: select/cons_tres: select_p_job_test: evaluating JobId=58574370
sched/backfill: _attempt_backfill: BACKFILL: JobId=58574370 StartTime set but no backfill reservation created.

The job priority (240290) seems higher than prio_reserve (100000), so it should create a backfill reservation, right?

Or even better, start the job right away, since resources are available in the partition.

Cheers,
--
Kilian
Comment 3 Carlos Tripiana Montes 2022-07-26 00:06:30 MDT
Hi Kilian,

It seems like the attached log file doesn't include those debug details. I can't spot any messages like: "has a prio_reserve of" in it.

If possible, set the log level to debug3 and DebugFlags to +Accrue,BackfillMap,Backfill,Priority, but only for the time needed to reproduce the issue and collect the logs. Please revert to the production log level right after.
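
For example, something like this (a sketch only; "info" as the production level is an assumption, use whatever you normally run with):
-- 8< -------------------------------------------------------------
# Raise logging temporarily while reproducing:
scontrol setdebug debug3
scontrol setdebugflags +Accrue,BackfillMap,Backfill,Priority
# ... reproduce the issue and save the slurmctld logs ...
# Then revert right away:
scontrol setdebugflags -Accrue,BackfillMap,Backfill,Priority
scontrol setdebug info
-- 8< -------------------------------------------------------------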

I'll try to give you an answer after looking at the logs.

Thanks,
Carlos.
Comment 4 Kilian Cavalotti 2022-07-26 09:18:25 MDT
Created attachment 26033 [details]
slurmctld debug3 logs with +Accrue,BackfillMap,Backfill,Priority

Hi Carlos, 

(In reply to Carlos Tripiana Montes from comment #3)
> It seems like the attached log file doesn't include those debug details. I
> can't spot any messages like: "has a prio_reserve of" in it.

Yes, that was the default logging level; I only enabled debug2 later on. But I didn't spot anything unusual at the default logging level either.

> If possible, set the log level to debug3 and DebugFlags to
> +Accrue,BackfillMap,Backfill,Priority, but only for the time needed to
> reproduce the issue and collect the logs. Please revert to the production
> log level right after.

All right, please find the slurmctld logs attached, covering a 10-minute period at debug3 with +Accrue,BackfillMap,Backfill,Priority enabled.

You can take a look at job 58623543, for instance:
-- 8< --------------------------------------------------------------
$ for i in {1..5}; do sleep 10; date; squeue -j 58623543 -O jobid,state,starttime,schednodes,reason | column -t; done
Tue Jul 26 07:44:36 PDT 2022
JOBID     STATE    START_TIME  SCHEDNODES  REASON
58623543  PENDING  N/A         (null)      Priority
Tue Jul 26 07:44:46 PDT 2022
JOBID     STATE    START_TIME           SCHEDNODES  REASON
58623543  PENDING  2022-07-26T07:44:46  (null)      Priority
Tue Jul 26 07:44:56 PDT 2022
JOBID     STATE    START_TIME           SCHEDNODES  REASON
58623543  PENDING  2022-07-26T07:44:59  (null)      Priority
Tue Jul 26 07:45:09 PDT 2022
JOBID     STATE    START_TIME           SCHEDNODES  REASON
58623543  PENDING  2022-07-26T07:45:09  (null)      Priority
Tue Jul 26 07:45:19 PDT 2022
JOBID     STATE    START_TIME           SCHEDNODES  REASON
58623543  PENDING  2022-07-26T07:45:19  (null)      Priority
-- 8< --------------------------------------------------------------

It's one of the top jobs in line, and there are 20 idle nodes in the partition:
-- 8< --------------------------------------------------------------
# sinfo -p normal
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 7-00:00:00      3  drng@ sh02-01n[48-49,52]
normal*      up 7-00:00:00      1  drain sh02-01n16
normal*      up 7-00:00:00     89    mix sh02-01n[03,05,09-15,17-20,24-35,38-43,45-47,50-51],sh02-08n[37,39-40,42-46,53,55],sh02-10n[42,44,46-48],sh03-01n[03,09,11,13-14,16,21-25,28,39,41-47,49,51-53,56-58,60-70]
normal*      up 7-00:00:00     41  alloc sh02-01n[01-02,04,06-08,21-23,36-37,44,53-56],sh02-08n[38,41,47,49-52],sh02-10n[41,43,45],sh03-01n[01-02,04-07,10,15,17-18,20,26,31,35,50]
normal*      up 7-00:00:00     20   idle sh02-08n[48,54,56],sh03-01n[08,12,19,27,29-30,32-34,36-38,40,48,54-55,59]
-- 8< --------------------------------------------------------------


Here's what I see in the logs for that job:
-- 8< --------------------------------------------------------------
sched/backfill: _attempt_backfill: BACKFILL: JobId=58623543 has a prio_reserve of 100000
sched/backfill: _attempt_backfill: BACKFILL: test for JobId=58623543 Prio=144462 Partition=normal
debug2: sched/backfill: _attempt_backfill: entering _try_sched for JobId=58623543.
sched/backfill: _dump_job_test: Test JobId=58623543 at 2022-07-26T08:00:05 on sh02-01n[01-15,17-56],sh02-08n[37-56],sh02-10n[41-48],sh03-01n[01-34,36-70]
debug2: select/cons_tres: select_p_job_test: evaluating JobId=58623543
debug3: select/cons_tres: _verify_node_state: Not considering node sh02-01n03, free_mem < min_mem (5115 < 6400) for JobId=58623543
debug3: select/cons_tres: _verify_node_state: Not considering node sh02-01n05, free_mem < min_mem (5115 < 6400) for JobId=58623543
debug3: select/cons_tres: _verify_node_state: Not considering node sh02-01n09, free_mem < min_mem (5115 < 6400) for JobId=58623543
debug3: select/cons_tres: _verify_node_state: Not considering node sh02-01n10, free_mem < min_mem (5115 < 6400) for JobId=58623543
debug3: select/cons_tres: _verify_node_state: Not considering node sh02-01n12, free_mem < min_mem (5115 < 6400) for JobId=58623543
debug3: select/cons_tres: _verify_node_state: Not considering node sh02-01n13, free_mem < min_mem (5115 < 6400) for JobId=58623543
debug3: select/cons_tres: _verify_node_state: Not considering node sh02-01n17, free_mem < min_mem (5115 < 6400) for JobId=58623543
debug3: select/cons_tres: _verify_node_state: Not considering node sh02-01n18, free_mem < min_mem (5115 < 6400) for JobId=58623543
debug3: select/cons_tres: _verify_node_state: Not considering node sh02-01n19, free_mem < min_mem (5115 < 6400) for JobId=58623543
debug3: select/cons_tres: _verify_node_state: Not considering node sh02-01n20, free_mem < min_mem (5115 < 6400) for JobId=58623543
debug3: select/cons_tres: _verify_node_state: Not considering node sh02-01n29, free_mem < min_mem (5115 < 6400) for JobId=58623543
debug3: select/cons_tres: _verify_node_state: Not considering node sh02-01n30, free_mem < min_mem (0 < 6400) for JobId=58623543
debug3: select/cons_tres: _verify_node_state: Not considering node sh02-01n33, free_mem < min_mem (5115 < 6400) for JobId=58623543
debug3: select/cons_tres: _verify_node_state: Not considering node sh02-01n44, free_mem < min_mem (5120 < 6400) for JobId=58623543
sched/backfill: _attempt_backfill: BACKFILL: JobId=58623543 StartTime set but no backfill reservation created.
-- 8< --------------------------------------------------------------

It looks like the scheduler only tests the first few nodes in the list (sh02-01n[01-15,17-56],sh02-08n[37-56],sh02-10n[41-48],sh03-01n[01-34,36-70]), stops after sh02-01n44, and misses the last nodes in the list, which are idle?
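
In case it helps, here's the kind of check I'm using to see the reported free memory on the idle nodes (illustrative command, output omitted):
-- 8< --------------------------------------------------------------
$ sinfo -p normal -t idle -N -O nodelist,statelong,freemem,memory
-- 8< --------------------------------------------------------------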

> I'll try to give you the answer after looking at the logs.

Thank you!

Cheers,
--
Kilian
Comment 5 Kilian Cavalotti 2022-07-27 17:47:36 MDT
Hi!

I wanted to add that we're now also seeing this impact preemption between partitions: preemptable jobs are not preempted, and jobs stay pending for no apparent reason.

Here's an example with two overlapping partitions (engreitz, and the preemptable owners partition).

All the nodes in the partition are occupied:
-- 8< -------------------------------------------------
# sinfo -p engreitz
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
engreitz     up 7-00:00:00      2    mix sh03-04n[21,24]
engreitz     up 7-00:00:00      2  alloc sh03-04n[22-23]
-- 8< -------------------------------------------------

One job is running, and one job is pending:
-- 8< -------------------------------------------------
# squeue -p engreitz
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          58726681  engreitz     sdev  rosaxma PD       0:00      1 (Resources)
          58695937  engreitz R-jupyte    kmguo  R    6:02:21      1 sh03-04n21
-- 8< -------------------------------------------------

The other nodes are allocated with preemptable jobs:
-- 8< -------------------------------------------------
# squeue -w sh03-04n[21-24]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          58695937  engreitz R-jupyte    kmguo  R    6:04:14      1 sh03-04n21
          58678002    owners run_subm   johnjp  R    7:21:46      1 sh03-04n22
       58716303_30    owners  MR_prot  yleguen  R    2:48:57      1 sh03-04n23
       58716303_31    owners  MR_prot  yleguen  R    2:48:57      1 sh03-04n23
          58699167    owners owners_p   zhouy7  R    5:08:10      1 sh03-04n23
          58699169    owners owners_p   zhouy7  R    5:08:10      1 sh03-04n23
          58699171    owners owners_p   zhouy7  R    5:08:10      1 sh03-04n23
          58699173    owners owners_p   zhouy7  R    5:08:10      1 sh03-04n23
          58699174    owners owners_p   zhouy7  R    5:08:10      1 sh03-04n24
          58184287    owners dlbo-80-   tatang  R    4:15:01      1 sh03-04n21
       58581982_28    owners 8x8_tp-0 xunger08  R      30:24      3 sh03-03n[60,65],sh03-04n21
       57934472_19    owners 8x8-tp-0 xunger08  R    2:18:57      3 sh03-03n30,sh03-04n23,sh03-05n72
       58496656_48    owners 8x8-tp-0 xunger08  R    2:38:26      3 sh03-03n45,sh03-04n23,sh03-07n09
        58390824_7    owners 8x8_tp-0 xunger08  R    3:22:39      2 sh03-04n[21,23]
        58467266_6    owners 8x8-tp-0  wenwang  R    1:41:03      2 sh03-04n[23-24]
       58467265_28    owners 8x8-tp-0  wenwang  R    1:41:03      2 sh03-04n[03,21]
        58467267_4    owners 8x8-tp-0  wenwang  R    1:41:03      1 sh03-04n24
       58467267_13    owners 8x8-tp-0  wenwang  R    1:41:03      1 sh03-04n24
       58467267_32    owners 8x8-tp-0  wenwang  R    1:41:03      1 sh03-04n24
       58466440_27    owners 8x8-tp-0  wenwang  R    3:22:35      2 sh03-04n[24,36]
-- 8< -------------------------------------------------


Yet those jobs are not preempted, since no node seems to have enough free resources for the pending job (58726681), although obviously they would if those "owners" jobs were preempted.
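
For completeness, here's how I'd double-check the preemption settings involved (illustrative commands, output omitted):
-- 8< -------------------------------------------------
# scontrol show config | grep -i preempt
# scontrol show partition owners | grep -i preempt
-- 8< -------------------------------------------------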

Here are the slurmctld logs in debug3:
-- 8< -------------------------------------------------
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-01n[01-60,65-72],sh02-02n[01-03,10-12,17-27,33-72],sh02-03n[01-24],sh02-04n[17-24,37-46,49-70],sh02-05n[01-12,15-16,23-24,71-72]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-03n[25-72],sh02-06n[01-36,53,55-72],sh02-07n[01-20,23-26,29-56,59-72],sh02-08n[01-16,21-56,62-72],sh02-09n[01-12,25-44,49-52,57-72],sh02-10n[01-04,09-10,17-60],sh02-16n[12-17],sh02-17n[01-44]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-01n[61-64],sh02-02n[04-08,13-16,28-32],sh02-04n[01-16,25-36,47-48],sh02-05n[13-14,17-22],sh02-05n[29-64,69-70]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh03-07n[01-16]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-06n[37-52],sh02-07n[21-22,27-28,57-58],sh02-08n[17-20,57-61],sh02-09n[13-24,45-48,53-56],sh02-10n[05-08,11-16,61-72],sh02-16n[18-19],sh02-17n[45-48]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-02n09,sh02-04n[71-72],sh02-05n[65-68]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-06n54) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-12n02,sh02-13n09) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-16n07,sh02-16n11) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh03-11n[17-21]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh03-12n[18-24],sh03-13n[18-24],sh03-14n[17-24],sh03-15n[17-24],sh03-16n[17-24]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-12n03) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-12n01) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh03-12n17,sh03-13n17) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-15n12,sh02-16n04,sh02-16n10) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-12n12,sh02-13n[06-07],sh02-14n[07-08],sh02-13n15,sh02-15n[07-08]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-16n[05-06,08-09]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-15n09) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-12n04,sh02-14n[05-06]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-12n05,sh02-13n08) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-15n[05-06]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-13n[12-14],sh02-14n11,sh02-15n[01-02]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-15n[10-11,13],sh02-16n02) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh03-12n[01-16],sh03-13n[03-09,12-15],sh03-14n[01-02]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh03-15n[01-15]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh03-13n[01-02,10-11]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh03-13n16,sh03-16n[01,03-04]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-12n[06-07]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-12n[08-11],sh02-13n[10-11],sh02-14n[09,13-15],sh02-15n04) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh03-11n01,sh03-16n08) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh03-15n16,sh03-16n[02,06-07]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-14n[10,12]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-12n[13-17],sh02-13n[01-04]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-15n03) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-13n05,sh02-16n03) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh02-16n01) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: _build_node_list: JobId=58726681 matched 0 nodes (sh03-14n[03,05,07,09,11,13,15]) due to job partition or features
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug3: _pick_best_nodes: JobId=58726681 idle_nodes 17 share_nodes 1397
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug2: select/cons_tres: select_p_job_test: evaluating JobId=58726681
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug3: select/cons_tres: _verify_node_state: Not considering node sh03-04n21, free_mem < min_mem (18624 < 32768) for JobId=58726681
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug3: select/cons_tres: _verify_node_state: Not considering node sh03-04n22, free_mem < min_mem (0 < 32768) for JobId=58726681
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug3: select/cons_tres: _verify_node_state: Not considering node sh03-04n23, free_mem < min_mem (11008 < 32768) for JobId=58726681
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug3: select/cons_tres: _verify_node_state: Not considering node sh03-04n22, free_mem < min_mem (0 < 32768) for JobId=58726681
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug3: select/cons_tres: _verify_node_state: Not considering node sh03-04n23, free_mem < min_mem (11008 < 32768) for JobId=58726681
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug3: select/cons_tres: _verify_node_state: Not considering node sh03-04n22, free_mem < min_mem (0 < 32768) for JobId=58726681
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug3: select/cons_tres: _verify_node_state: Not considering node sh03-04n23, free_mem < min_mem (27008 < 32768) for JobId=58726681
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug3: select/cons_tres: _verify_node_state: Not considering node sh03-04n21, free_mem < min_mem (18624 < 32768) for JobId=58726681
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug3: select/cons_tres: _verify_node_state: Not considering node sh03-04n22, free_mem < min_mem (0 < 32768) for JobId=58726681
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug3: select/cons_tres: _verify_node_state: Not considering node sh03-04n23, free_mem < min_mem (27008 < 32768) for JobId=58726681
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug3: select/cons_tres: _verify_node_state: Not considering node sh03-04n21, free_mem < min_mem (18624 < 32768) for JobId=58726681
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug3: select/cons_tres: _verify_node_state: Not considering node sh03-04n22, free_mem < min_mem (0 < 32768) for JobId=58726681
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug3: select/cons_tres: _verify_node_state: Not considering node sh03-04n23, free_mem < min_mem (27008 < 32768) for JobId=58726681
Jul 27 16:38:57 sh03-sl01.int slurmctld[28748]: debug3: sched: JobId=58726681. State=PENDING. Reason=Resources. Priority=784934. Partition=engreitz.
-- 8< -------------------------------------------------


I don't really want to raise this to Sev2, but it's starting to have a serious impact on the utilization of our resources :\

Any help would be much appreciated!

Thanks,
--
Kilian
Comment 6 Carlos Tripiana Montes 2022-07-28 03:41:58 MDT
I think the parameter triggering the "StartTime set but no backfill reservation created" message might be "bf_min_age_reserve=21600", which is 6 hours.

See [1], and see how it works in [2] (L1955-L1970) and [3] (L2688-L2699). Its default value is 0. You could test whether setting it back to the default clears your issue, and then decide on the smallest value that fits your needs; a sketch of such a test change is below.
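
Something along these lines (untested sketch, assuming a live reconfigure is acceptable on your controller; adjust to your config management):
-- 8< -------------------------------------------------------------
# In slurm.conf, drop bf_min_age_reserve=21600 from SchedulerParameters
# (or set bf_min_age_reserve=0, the default), then on the controller:
scontrol reconfigure
# Confirm the live value afterwards:
scontrol show config | grep -o 'bf_min_age_reserve=[0-9]*'
-- 8< -------------------------------------------------------------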

Regards,
Carlos.

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_bf_min_age_reserve=#
[2] https://github.com/SchedMD/slurm/blob/d9857162bc5d4a4412c29efc5c2fd5bc27d23bb5/src/plugins/sched/backfill/backfill.c#L1955
[3] https://github.com/SchedMD/slurm/blob/d9857162bc5d4a4412c29efc5c2fd5bc27d23bb5/src/plugins/sched/backfill/backfill.c#L2688
Comment 7 Kilian Cavalotti 2022-07-28 11:07:33 MDT
Hi Carlos, 

(In reply to Carlos Tripiana Montes from comment #6)
> I think the parameter triggering the "StartTime set but no backfill
> reservation created" message might be "bf_min_age_reserve=21600", which is
> 6 hours.

Thanks for the pointer. I'll give it a shot, and set it back to 0, but I have a couple questions/remarks about this:

1. bf_min_age_reserve wouldn't prevent the main scheduler from preempting jobs, like 58726681 in comment #5, right?

2. some of those constantly deferred jobs have eventually started before they reached the 6-hour mark in the queue. The job I took as an example in comment #4, 58623543, started after staying pending for 4h30:

# sacct -Xj 58623543 -o submit,start,reserved
             Submit               Start   Reserved
------------------- ------------------- ----------
2022-07-26T03:56:34 2022-07-26T08:28:18   04:31:44

3. some of those constantly deferred jobs also stay pending for more than 6 hours (of course, I can't find any example right now, but that's definitely something I noted before...)


Cheers,
--
Kilian
Comment 8 Carlos Tripiana Montes 2022-07-29 01:12:57 MDT
> 1. bf_min_age_reserve wouldn't prevent the main scheduler from preempting jobs, like 58726681 in comment #5, right?

I'm not sure I understand your concern here. For a job to get preempted, it needs to be running already. In the main scheduler, bf_min_age_reserve only plays a role in the _top_priority function, which skips the job if "pend_time < bf_min_age_reserve". So _top_priority will never return a job that has been pending for less time than bf_min_age_reserve. This can lead the main scheduler to skip a job even if it is the highest priority in the queue.

As a side effect, it might skip a job that, had it started, would have preempted another job. That's not to say bf_min_age_reserve literally has an effect on preemption, but in this specific scenario there would be a relationship between them.
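
As a rough way to check where a given job stands relative to that threshold (an approximation based on the job's eligible time, not the exact internal pend_time computation; the job id is just an example):
-- 8< -------------------------------------------------------------
# Approximate a job's pending age and compare it to bf_min_age_reserve:
job=58574370
elig=$(squeue -h -j "$job" -O EligibleTime | tr -d ' ')
echo "pend_time ~ $(( $(date +%s) - $(date -d "$elig" +%s) ))s vs 21600s"
-- 8< -------------------------------------------------------------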


> 2. some of those constantly deferred jobs have eventually started before they reached the 6-hour mark in the queue.

Let me check how the reservation is calculated/stored, because right now it doesn't match bf_min_age_reserve. The only explanation I can think of is some other logic executing before bf_min_age_reserve comes into play.

If you have time, increase the debug level with the same flags again and capture the whole life of such a job. Revert the log level, then filter the job's lifespan out of the logs with a grep or similar and send us the filtered logs, so we can see why the job started before bf_min_age_reserve elapsed. That would speed up finding the reason.
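
For instance (illustrative; adjust the log path and job id):
-- 8< -------------------------------------------------------------
grep -F 'JobId=58623543' /var/log/slurmctld.log > job-58623543-lifespan.log
-- 8< -------------------------------------------------------------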


> 3. some of those constantly deferred jobs also stay pending for more than 6 hours (of course, I can't find any example right now, but that's definitely something I noted before...)

Well, this seems more reasonable to me than 2. right now. Being denied a reservation by bf_min_age_reserve doesn't guarantee that getting one after 6 hours will make the job run in exactly 6 hours. You also need to consider that bf_busy_nodes makes backfilling less efficient, and bf_job_part_count_reserve=10 reserves resources for the top 10 priority jobs. So, all in all, it's not unlikely that some jobs would need more than 6h to run.

Cheers,
Carlos.
Comment 9 Kilian Cavalotti 2022-07-29 09:29:02 MDT
Hi Carlos,

(In reply to Carlos Tripiana Montes from comment #8)
> > 1. bf_min_age_reserve wouldn't prevent the main scheduler from preempting jobs, like 58726681 in comment #5, right?
> 
> I'm not sure if I'm understanding your concern here. 

What I meant is that in comment #5, job 58726681 is pending, has no reserved resources (SchedNodes is null), and its StartTime is always set to now()+1s. And this job should be able to preempt all of the jobs running in the `owners` partition on sh03-04n[21-24] (58678002, 58716303_30, etc). But it doesn't, and its start time is constantly pushed back. 

This is the behavior we're concerned about (jobs staying pending despite idle nodes or preemptable jobs), and I'm not sure the backfill scheduler is even involved in the latter case, hence my question about bf_min_age_reserve's relevance.

> > 2. some of those constantly deferred jobs have eventually started before they reached the 6-hour mark in the queue.
> 
> Let me check how the reservation is calculated/stored, because right now it
> doesn't match bf_min_age_reserve. The only explanation I can think of is
> some other logic executing before bf_min_age_reserve comes into play.
> 
> If you have time, increase the debug level with the same flags again and
> capture the whole life of such a job. Revert the log level, then filter the
> job's lifespan out of the logs with a grep or similar and send us the
> filtered logs, so we can see why the job started before bf_min_age_reserve
> elapsed. That would speed up finding the reason.

OK. I'm trying to find other examples, but there don't seem to be any right now.

Which brings me to another aspect of this behavior: it's not constant, and is definitely not the norm. Things usually work great, but we've seen the problem happening several times in the past, although never for long enough or with enough of an impact to be worth reporting. But that last occurrence was more visible.

I believe this happens mostly when some (but not all) nodes have been drained. We usually drain sets of nodes to deploy OS updates in a rolling fashion, and when we do that, we see cases where jobs stay pending while there are idle (not drained!) nodes in the partition. This seems to be particularly the case when nodes come back from being drained: they stay idle and the scheduler doesn't select them to run jobs.

Now, of course, we've reached the end of our draining process, all our nodes have been updated, and I can't find any more case of this happening.

I know that's not much to go on for debugging and identifying the root cause, and I don't have hard evidence (logs) to provide, but this has been an ongoing thing. I will try to find a way to reproduce the problem more specifically.

> > 3. some of those constantly deferred jobs also stay pending for more than 6 hours (of course, I can't find any example right now, but that's definitely something I noted before...)
> 
> Well, this seems more reasonable to me than 2. right now. Being denied a
> reservation by bf_min_age_reserve doesn't guarantee that getting one after
> 6 hours will make the job run in exactly 6 hours. You also need to consider
> that bf_busy_nodes makes backfilling less efficient, and
> bf_job_part_count_reserve=10 reserves resources for the top 10 priority
> jobs. So, all in all, it's not unlikely that some jobs would need more than
> 6h to run.

Yes, that makes total sense and there are definitely other parameters that could further delay the start of a job.

Thanks!
--
Kilian
Comment 10 Carlos Tripiana Montes 2022-07-31 04:54:53 MDT
Hi Kilian,

> What I meant is that in comment #5, job 58726681 is pending, has no reserved
> resources (SchedNodes is null), and its StartTime is always set to now()+1s.
> And this job should be able to preempt all of the jobs running in the
> `owners` partition on sh03-04n[21-24] (58678002, 58716303_30, etc). But it
> doesn't, and its start time is constantly pushed back. 
> 
> This is the behavior we're concerned about (jobs staying pending despite
> idle nodes or preemptable jobs), and I'm not sure the backfill scheduler is
> even involved in the latter case, hence my question about
> bf_min_age_reserve's relevance.

As noted in my comment #8, bf_min_age_reserve plays a role in the main scheduler too, not only in the backfill scheduler: the main scheduler skips the job if "pend_time < bf_min_age_reserve". Such a job won't be eligible to run until it has been pending for that long. This is enforced no matter how many free nodes or preemptable jobs there are at that point in time.


> I know that's not much to try to debug and identify the root cause, and I
> don't have hard evidence (logs) to provide, but this has been an ongoing
> thing. But I will try to find a way to reproduce the problem more
> specifically.

If you find one of those with the right debug level/flags enabled, please provide us with the logs so we can investigate why.

Thanks.
Comment 11 Kilian Cavalotti 2022-08-01 12:27:14 MDT
(In reply to Carlos Tripiana Montes from comment #10)
> As noted in my comment #8, bf_min_age_reserve plays a role in the main
> scheduler too, not only in the backfill scheduler: the main scheduler skips
> the job if "pend_time < bf_min_age_reserve". Such a job won't be eligible to
> run until it has been pending for that long. This is enforced no matter how
> many free nodes or preemptable jobs there are at that point in time.

Mmh, then I'm confused, because what we observe in practice doesn't really match that description. It would mean that *no* job can start before it has spent at least bf_min_age_reserve pending in the queue, and that's clearly not the case:

$ scontrol show config | grep -Po 'bf_min_age_reserve=([0-9]+)'
bf_min_age_reserve=3600

$ time srun -p normal -t 0:2:0 echo ok
srun: job 59318458 queued and waiting for resources
srun: job 59318458 has been allocated resources
ok

real    0m32.106s
user    0m0.030s
sys     0m0.010s

And this definitely involves the backfill scheduler: 
-- 8< ------------------------------------------------------------
2022-08-01T11:18:35-07:00 sh03-sl01 slurmctld[855]: sched: _slurm_rpc_allocate_resources JobId=59318458 NodeList=(null) usec=33060
2022-08-01T11:19:05-07:00 sh03-sl01 slurmctld[855]: sched/backfill: _start_job: Started JobId=59318458 in normal on sh02-01n56
2022-08-01T11:19:06-07:00 sh03-sl01 slurmctld: slurmctld: prolog_running_decr: Configuration for JobId=59318458 is complete
2022-08-01T11:19:06-07:00 sh03-sl01 slurmctld[855]: prolog_running_decr: Configuration for JobId=59318458 is complete
-- 8< ------------------------------------------------------------


And with the main scheduler:
-- 8< ------------------------------------------------------------
$ time srun -p normal echo ok
srun: job 59318669 queued and waiting for resources
srun: job 59318669 has been allocated resources
ok

real    0m13.166s
user    0m0.027s
sys     0m0.014s
-- 8< ------------------------------------------------------------

logs (to show the job has not been backfilled):
-- 8< ------------------------------------------------------------
2022-08-01T11:22:51-07:00 sh03-sl01 slurmctld[855]: sched: _slurm_rpc_allocate_resources JobId=59318669 NodeList=(null) usec=26125
2022-08-01T11:23:01-07:00 sh03-sl01 slurmctld[855]: sched: Allocate JobId=59318669 NodeList=sh02-01n56 #CPUs=1 Partition=normal
2022-08-01T11:23:02-07:00 sh03-sl01 slurmctld: slurmctld: prolog_running_decr: Configuration for JobId=59318669 is complete
2022-08-01T11:23:02-07:00 sh03-sl01 slurmctld[855]: prolog_running_decr: Configuration for JobId=59318669 is complete
-- 8< ------------------------------------------------------------

So I have a hard time reconciling those observations with your statement that "bf_min_age_reserve plays a role in the main scheduler too, not only in the backfill scheduler" and that a job "won't be eligible to run until it has been pending for that long". We definitely see jobs becoming eligible and starting without having spent bf_min_age_reserve waiting in the queue.

Cheers,
--
Kilian
Comment 12 Carlos Tripiana Montes 2022-08-01 23:46:50 MDT
Yes, that looks like point 2 from comments #7 and #8. To answer you properly, I think I need debug logs (with the flags requested last time) covering a couple of reproducer jobs like that.

I'm looking at the source, and the "pend_time < bf_min_age_reserve" check is a real thing in the places I pointed you to, and its effect is clearly aligned with what's stated in our docs. But there's quite a lot of code before reaching this point, and I think there are other paths the code can take before reaching this condition.

So, to know the real code path and give you a detailed answer, I need the logs to see how your other Slurm config params come into play.

Cheers,
Carlos.
Comment 13 Carlos Tripiana Montes 2022-08-11 01:27:16 MDT
Kilian,

We need to close this issue as timed out. Whenever possible, just reopen it and provide the requested information so we can continue our investigation.

Thanks.