Created attachment 40117 [details] slurm.conf

I'm seeing a situation where I have a partition with available resources but the scheduler is not picking up the top jobs to be scheduled. For example:

[root@holy-slurm02 ~]# showq -p huce_cascade -o

SUMMARY OF JOBS FOR QUEUE: <huce_cascade>

ACTIVE JOBS--------------------
JOBID JOBNAME USERNAME STATE CORE GPU REMAINING STARTTIME
====================================================================================
57893229 test_box yc2965 Running 200 0 694:55:45 Fri Nov 15 21:59:15
59200246 test_box yc2965 Running 200 0 851:39:19 Fri Nov 22 10:42:49
60111397 test_box yc2965 Running 200 0 20:43:06 Fri Nov 29 11:46:36
60459128 interactiv kchiama Running 48 0 493:13:08 Mon Dec 2 13:36:38
61556763 .fasrcood/ kuang Running 48 0 2297:19:43 Sun Dec 8 05:43:13
61568765 0 kuang Running 48 0 0:00:15 Sun Dec 8 08:23:45
61569129 0 kuang Running 48 0 0:03:58 Sun Dec 8 08:27:28
61569346 0 kuang Running 48 0 0:07:07 Sun Dec 8 08:30:37
61569616 0 kuang Running 48 0 0:09:04 Sun Dec 8 08:32:34
61569803 0 kuang Running 48 0 0:21:48 Sun Dec 8 08:45:18
61569841 0 kuang Running 48 0 0:26:52 Sun Dec 8 08:50:22
61569882 0 kuang Running 48 0 0:46:30 Sun Dec 8 09:10:00
61569925 0 kuang Running 48 0 1:18:37 Sun Dec 8 09:42:07
61569962 0 kuang Running 48 0 1:43:28 Sun Dec 8 10:06:58
61570007 0 kuang Running 48 0 1:43:53 Sun Dec 8 10:07:23
61570072 0 kuang Running 48 0 1:43:57 Sun Dec 8 10:07:27
61570110 0 kuang Running 48 0 1:43:58 Sun Dec 8 10:07:28
61570151 0 kuang Running 48 0 1:44:04 Sun Dec 8 10:07:34
61570211 0 kuang Running 48 0 1:44:23 Sun Dec 8 10:07:53
61570253 0 kuang Running 48 0 1:44:26 Sun Dec 8 10:07:56
61570312 0 kuang Running 48 0 1:44:27 Sun Dec 8 10:07:57
61570354 0 kuang Running 48 0 1:44:47 Sun Dec 8 10:08:17
61570393 0 kuang Running 48 0 1:44:57 Sun Dec 8 10:08:27
61570452 0 kuang Running 48 0 1:44:58 Sun Dec 8 10:08:28
61570492 0 kuang Running 48 0 1:44:59 Sun Dec 8 10:08:29
61570534 0 kuang Running 48 0 1:46:04 Sun Dec 8 10:09:34
61570581 0 kuang Running 48 0 1:49:26 Sun Dec 8 10:12:56
61570630 0 kuang Running 48 0 1:55:00 Sun Dec 8 10:18:30
61570675 0 kuang Running 48 0 2:12:09 Sun Dec 8 10:35:39
61570718 0 kuang Running 48 0 2:13:05 Sun Dec 8 10:36:35
61570768 0 kuang Running 48 0 2:24:22 Sun Dec 8 10:47:52
61570818 0 kuang Running 48 0 2:40:32 Sun Dec 8 11:04:02
61570861 0 kuang Running 48 0 2:49:18 Sun Dec 8 11:12:48
61570920 0 kuang Running 48 0 2:52:07 Sun Dec 8 11:15:37
61570961 0 kuang Running 48 0 3:27:25 Sun Dec 8 11:50:55
61571020 0 kuang Running 48 0 4:57:26 Sun Dec 8 13:20:56
61571064 0 kuang Running 48 0 5:04:45 Sun Dec 8 13:28:15
61571119 0 kuang Running 48 0 6:28:53 Sun Dec 8 14:52:23
61571191 0 kuang Running 48 0 6:44:38 Sun Dec 8 15:08:08
61571306 0 kuang Running 48 0 7:37:52 Sun Dec 8 16:01:22
61571403 0 kuang Running 48 0 8:21:00 Sun Dec 8 16:44:30
61571497 0 kuang Running 48 0 8:27:30 Sun Dec 8 16:51:00
61571595 0 kuang Running 48 0 8:27:45 Sun Dec 8 16:51:15
61571688 0 kuang Running 48 0 9:27:22 Sun Dec 8 17:50:52
61571783 0 kuang Running 48 0 9:37:18 Sun Dec 8 18:00:48
61571983 0 kuang Running 48 0 12:26:47 Sun Dec 8 20:50:17
61572071 0 kuang Running 48 0 25:34:00 Mon Dec 9 09:57:30
61572175 0 kuang Running 48 0 35:16:38 Mon Dec 9 19:40:08
61572277 0 kuang Running 48 0 35:18:56 Mon Dec 9 19:42:26
61572341 0 kuang Running 48 0 35:30:09 Mon Dec 9 19:53:39
61572393 0 kuang Running 48 0 35:58:11 Mon Dec 9 20:21:41
61572467 0 kuang Running 48 0 36:08:22 Mon Dec 9 20:31:52
61572541 0 kuang Running 48 0 36:12:29 Mon Dec 9 20:35:59
61572623 0 kuang Running 48 0 36:37:30 Mon Dec 9 21:01:00
61572701 0 kuang Running 48 0 37:11:10 Mon Dec 9 21:34:40
61572754 0 kuang Running 48 0 37:19:42 Mon Dec 9 21:43:12
61572828 0 kuang Running 48 0 37:27:10 Mon Dec 9 21:50:40
61572901 0 kuang Running 48 0 38:37:01 Mon Dec 9 23:00:31
61572967 0 kuang Running 48 0 38:37:30 Mon Dec 9 23:01:00
61573027 0 kuang Running 48 0 38:38:09 Mon Dec 9 23:01:39
61658109 run_region thackray Running 96 0 69:08:17 Mon Dec 9 21:31:47
61761058 interactiv kchiama Running 48 0 657:51:53 Mon Dec 9 10:15:23
62104854 0.GJ_SO4.r ssong33 Running 32 0 145:34:13 Wed Dec 11 19:57:43
62138548 .fasrcood/ dingma Running 1 0 293:16:59 Thu Dec 12 06:40:29
62151211 .fasrcood/ samdur Running 1 0 358:10:57 Thu Dec 12 10:34:27
62158242 .fasrcood/ amatthews Running 1 0 47:30:10 Thu Dec 12 11:53:40
62158983 .fasrcood/ amatthews Running 1 0 47:53:15 Thu Dec 12 12:16:45

67 active jobs : 3516 of 6672 cores ( 52.70 %): 0 of 0 gpus ( 0.00 %): 78 of 139 nodes ( 56.12 %)

WAITING JOBS------------------------
JOBID JOBNAME USERNAME STATE CORE GPU WCLIMIT QUEUETIME
====================================================================================
62151213 combine_ob mhe Waiting 32 0 48:00:00 Thu Dec 12 10:32:30
62158795 run_imi.sh lestrada Waiting 8 0 96:00:00 Thu Dec 12 12:01:16

Total Jobs: 69   Active Jobs: 67   Idle Jobs: 2   Blocked Jobs: 0

[root@holy-slurm02 ~]# scontrol show job 62151213
JobId=62151213 JobName=combine_obs
   UserId=mhe(63926) GroupId=jacob_lab(40114) MCS_label=N/A
   Priority=266374 Nice=0 Account=jacob_lab QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2024-12-12T10:32:30 EligibleTime=2024-12-12T10:32:30
   AccrueTime=2024-12-12T10:32:30
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-12-12T10:54:29 Scheduler=Main
   Partition=huce_ice,huce_cascade,seas_compute,sapphire AllocNode:Sid=holylogin08:2951731
   ReqNodeList=(null) ExcNodeList=(null) NodeList=
   StepMgrEnabled=Yes
   NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=32,mem=20G,node=1,billing=37
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryNode=20G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/n/home12/mhe/process_obspack/run.job
   WorkDir=/n/home12/mhe/process_obspack
   StdErr=/n/home12/mhe/process_obspack/slurm-62151213.out
   StdIn=/dev/null
   StdOut=/n/home12/mhe/process_obspack/slurm-62151213.out
   TresPerTask=cpu=32

There is certainly space there to fit those jobs on those vacant nodes. In my scheduler configuration I have set:

SchedulerParameters=\
default_queue_depth=1500,\
partition_job_depth=10,\
bf_continue,\
bf_interval=30,\
bf_resolution=600,\
bf_window=11520,\
bf_max_job_test=300000,\
bf_max_job_start=1000,\
bf_max_job_user_part=10,\
bf_ignore_newly_avail_nodes,\
bf_one_resv_per_job,\
enable_user_top,\
pack_serial_at_end,\
nohold_on_prolog_fail,\
permit_job_expansion,\
requeue_setup_env_fail,\
max_rpc_cnt=16

With partition_job_depth set, the scheduler should pick up those jobs regardless of their low priority, because they are the top priority in the queue. Even without that, the backfill loop should also pick up those jobs because of bf_max_job_user_part. I will note that this is being exacerbated by higher cluster activity right now, so the scheduler is pretty busy. Still, even under that load I expect it to pick up the top jobs in each partition, even if the backfill doesn't penetrate the full stack. I've included my configs. Here is a dump of sdiag right now.

[root@holy-slurm02 ~]# sdiag
*******************************************************
sdiag output at Thu Dec 12 12:31:32 2024 (1734024692)
Data since      Thu Dec 12 12:15:58 2024 (1734023758)
*******************************************************
Server thread count:  12
RPC queue enabled:    0
Agent queue size:     0
Agent count:          2
Agent thread count:   6
DBD Agent queue size: 0

Jobs submitted: 2325
Jobs started:   734
Jobs completed: 1191
Jobs canceled:  22
Jobs failed:    0

Job states ts:  Thu Dec 12 12:31:19 2024 (1734024679)
Jobs pending:   33332
Jobs running:   27281

Main schedule statistics (microseconds):
    Last cycle:   1380037
    Max cycle:    1755574
    Total cycles: 413
    Mean cycle:   860399
    Mean depth cycle:  9
    Cycles per minute: 27
    Last queue length: 29104

Main scheduler exit:
    End of job queue: 0
    Hit default_queue_depth: 0
    Hit sched_max_job_start: 0
    Blocked on licenses: 0
    Hit max_rpc_cnt: 299
    Timeout (max_sched_time): 114

Backfilling stats
    Total backfilled jobs (since last slurm start): 221
    Total backfilled jobs (since last stats cycle start): 221
    Total backfilled heterogeneous job components: 0
    Total cycles: 14
    Last cycle when: Thu Dec 12 12:30:49 2024 (1734024649)
    Last cycle: 14806760
    Max cycle:  29604732
    Mean cycle: 23233447
    Last depth cycle: 27
    Last depth cycle (try sched): 21
    Depth Mean: 97
    Depth Mean (try depth): 45
    Last queue length: 29045
    Queue length mean: 28651
    Last table size: 33
    Mean table size: 35

Backfill exit
    End of job queue: 0
    Hit bf_max_job_start: 0
    Hit bf_max_job_test: 0
    System state changed: 0
    Hit table size limit (bf_node_space_size): 0
    Timeout (bf_max_time): 14

Latency for 1000 calls to gettimeofday(): 15 microseconds

Remote Procedure Call statistics by message type
    REQUEST_PARTITION_INFO ( 2009) count:14000 ave_time:12255 total_time:171582486
    REQUEST_JOB_INFO_SINGLE ( 2021) count:6759 ave_time:238306 total_time:1610710332
    REQUEST_FED_INFO ( 2049) count:6750 ave_time:90 total_time:612678
    REQUEST_NODE_INFO ( 2007) count:5551 ave_time:507832 total_time:2818979934
    REQUEST_STEP_COMPLETE ( 5016) count:2669 ave_time:945011 total_time:2522236823
    REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:2663 ave_time:2041707 total_time:5437066581
    MESSAGE_EPILOG_COMPLETE ( 6012) count:2663 ave_time:1544223 total_time:4112267063
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:1608 ave_time:2207661 total_time:3549919341
    REQUEST_NODE_INFO_SINGLE ( 2040) count:1261 ave_time:302957 total_time:382029879
    REQUEST_SUBMIT_BATCH_JOB ( 4003) count:887 ave_time:2785814 total_time:2471017268
    REQUEST_COMPLETE_PROLOG ( 6018) count:735 ave_time:1196834 total_time:879673485
    REQUEST_JOB_INFO ( 2003) count:465 ave_time:1888868 total_time:878323867
    REQUEST_DBD_RELAY ( 1028) count:100 ave_time:247 total_time:24711
    REQUEST_SHARE_INFO ( 2022) count:32 ave_time:60298 total_time:1929562
    REQUEST_KILL_JOB ( 5032) count:24 ave_time:2719635 total_time:65271258
    REQUEST_STATS_INFO ( 2035) count:16 ave_time:164 total_time:2630
    REQUEST_JOB_USER_INFO ( 2039) count:14 ave_time:189756 total_time:2656597
    REQUEST_CANCEL_JOB_STEP ( 5005) count:13 ave_time:1453062 total_time:18889815
    REQUEST_JOB_READY ( 4019) count:9 ave_time:582902 total_time:5246121
    REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:8 ave_time:1552123 total_time:12416986
    REQUEST_RESOURCE_ALLOCATION ( 4001) count:5 ave_time:1717682 total_time:8588410
    REQUEST_JOB_ALLOCATION_INFO ( 4014) count:3 ave_time:520916 total_time:1562749
    REQUEST_BUILD_INFO ( 2001) count:2 ave_time:2149 total_time:4299
    REQUEST_CRONTAB ( 2200) count:2 ave_time:518835 total_time:1037670
    REQUEST_UPDATE_NODE ( 3002) count:2 ave_time:645294 total_time:1290589
    REQUEST_UPDATE_JOB ( 3001) count:1 ave_time:5953332 total_time:5953332
    REQUEST_JOB_STEP_INFO ( 2005) count:1 ave_time:1684285 total_time:1684285

While this isn't breaking scheduling, since high-priority jobs are still being scheduled, very low-priority jobs are not starting even when they are at the top of the queue. This leads me to think that the scheduling loops are not getting down to these low-priority jobs.
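For reference, a quick way to watch the "Main scheduler exit" counters and mean depth above as they evolve (a sketch using standard tools; any watch interval works):

watch -n 30 'sdiag | grep -A 6 "Main scheduler exit:"'
sdiag | grep -E "Mean depth cycle|Last queue length"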
Created attachment 40118 [details] topology.conf
Would it be possible to gather logging with these debug flags enabled and upload those logs?

> Backfill,SelectType

Also, if 62151213 is no longer in the queue, then please highlight one of the jobs you think should be scheduled.
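If it helps, one way to toggle these at runtime (a sketch; run as SlurmUser/root, and remove the flags the same way afterwards):

scontrol setdebugflags +Backfill      # extra backfill scheduler logging in slurmctld.log
scontrol setdebugflags +SelectType    # extra select plugin logging
scontrol show config | grep DebugFlags   # confirm what is currently set
scontrol setdebugflags -Backfill      # turn it back off later
scontrol setdebugflags -SelectType

Setting DebugFlags=Backfill,SelectType in slurm.conf followed by "scontrol reconfigure" is the persistent alternative.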
Sure. 62151213 eventually scheduled; I think the backfill eventually got deep enough. I've attached a graph of our backfill loop penetration. Here is another example:

[root@holy8a24507 ~]# scontrol show job 62164818
JobId=62164818 JobName=t1errgrid_theta1_0.8_theta2_3.2
   UserId=bliang(60349) GroupId=janson_lab(403769) MCS_label=N/A
   Priority=9165110 Nice=0 Account=janson_lab QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2024-12-12T13:13:39 EligibleTime=2024-12-12T13:13:39
   AccrueTime=2024-12-12T13:13:39
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-12-12T14:21:59 Scheduler=Main
   Partition=janson,janson_cascade,janson_bigmem,sapphire,shared,seas_compute,serial_requeue AllocNode:Sid=holylogin07:672391
   ReqNodeList=(null) ExcNodeList=(null) NodeList=
   StepMgrEnabled=Yes
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=60G,node=1,billing=4
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=60G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/n/home08/bliang/single_pc_power/batch_t1err_grid.sh
   WorkDir=/n/home08/bliang/single_pc_power
   StdErr=/n/home08/bliang/single_pc_power/err/err_t1errgrid_theta1_0.8_theta2_3.2.stdout.txt
   StdIn=/dev/null
   StdOut=/n/home08/bliang/single_pc_power/output/out_t1errgrid_theta1_0.8_theta2_3.2.stdout.txt

SUMMARY OF JOBS FOR QUEUE: <seas_compute>

ACTIVE JOBS--------------------
JOBID JOBNAME USERNAME STATE CORE GPU REMAINING STARTTIME
====================================================================================
62092030 ded_adj jholey Running 112 0 26:44:11 Wed Dec 11 17:13:19
62099746 M2GMI_tavg tmooring Running 1 0 64:38:16 Thu Dec 12 07:07:24
62136569 M2GMI_tavg tmooring Running 1 0 70:19:50 Thu Dec 12 12:48:58
62138358 ded_adj jholey Running 112 0 40:01:35 Thu Dec 12 06:30:43
62140447 biome_Dese luciagordon Running 1 0 63:17:31 Thu Dec 12 07:46:39

5 active jobs : 227 of 5584 cores ( 4.07 %): 0 of 0 gpus ( 0.00 %): 4 of 93 nodes ( 4.30 %)

WAITING JOBS------------------------
JOBID JOBNAME USERNAME STATE CORE GPU WCLIMIT QUEUETIME
====================================================================================
62164818 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:13:39
62164821 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:13:41
62164831 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:13:44
62164837 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:13:46
62164850 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:13:54
62164864 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:13:56
62164868 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:14:00
62164872 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:14:05
62164877 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:14:13
62164884 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:14:15
62164888 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:14:18
62164892 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:14:20

Even though it's a multi-partition job, it is at the top of that partition's queue and should go. There are nodes available that it could run on without preempting anything. I've also attached a 10-minute snippet of our logs with Backfill,SelectType on.
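In case it is useful when reading the attached snippet, the relevant lines can be pulled out with something like the following (a sketch; the log path is an assumption and will vary by site):

grep -i backfill /var/log/slurm/slurmctld.log | grep 62164818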
-Paul Edmon-
Created attachment 40119 [details] Picture of overall stats from sdiag
Created attachment 40120 [details] Picture of primary loop stats from sdiag
Created attachment 40121 [details] Picture of backfill stats from sdiag
Created attachment 40122 [details] Log snippet. +Backfill +SelectType logging starts around 14:25
Sorry, I listed the wrong version; we are running 24.05.4.
For reference, here are the jobs pending per partition:

[root@holy8a24507 ~]# /usr/bin/squeue --noheader --state=PD -o %80P | sort | uniq -c
      1 arguelles_delgado
      1 arguelles_delgado,shared
      5 arguelles_delgado,shared,unrestricted
      1 bigmem
      2 conroy
    838 gpu
     28 gpu_requeue
      1 hernquist_ice
      2 holy-smokes-priority,shared,sapphire
    198 hsph
      2 hsph,sapphire,intermediate
    151 hsph,xlin,serial_requeue
      1 huce_bigmem,bigmem,bigmem_intermediate
    115 intermediate
      2 itc_cluster,sapphire,intermediate
      2 janson,janson_bigmem,serial_requeue
    473 janson,janson_cascade,janson_bigmem,sapphire,shared,seas_compute,serial_requeue
      1 kempner
      6 kempner_h100
      5 kempner_requeue
    316 kovac,itc_cluster,serial_requeue
      2 kozinsky_gpu,gpu,seas_gpu
      7 lukin
      1 mweber_gpu
      2 olveczkygpu,gpu
      7 olveczkygpu,gpu,gpu_requeue
      1 remoteviz
    621 sapphire
      7 sapphire,shared
     98 seas_compute
     19 seas_compute,shared
    130 seas_gpu
     32 seas_gpu,gpu
  13729 serial_requeue
    502 serial_requeue,shared,sapphire
    191 serial_requeue,yao,shared,sapphire
   5459 shared
      1 shared,holy-smokes-priority,sapphire
   4627 shared,sapphire,serial_requeue
      1 sompolinsky_gpu
      4 test
     15 unrestricted
     31 yao,shared,sapphire,serial_requeue
partition_job_depth is a limit on how many jobs the main scheduler attempts to schedule, not a guarantee. It's possible that the main scheduler has stopped before even getting to the partition you expect should be getting scheduled.

From your sdiag, the main scheduler has a mean job depth of 9, so it is likely not getting to the jobs you're looking at since you have ~30000 pending jobs:

> Mean depth cycle: 9

A new feature added in 23.11 is to see why the main/backfill schedulers exit. In your case, it looks like the main scheduler is often hitting your max_rpc_cnt limit or hitting the max_sched_time timeout (which defaults to 2 seconds in your case).

> Main scheduler exit:
>   End of job queue: 0
>   Hit default_queue_depth: 0
>   Hit sched_max_job_start: 0
>   Blocked on licenses: 0
>   Hit max_rpc_cnt: 299
>   Timeout (max_sched_time): 114

Notice that you're never hitting your default_queue_depth=1500 either.

To get the main scheduler to attempt to schedule more jobs, you could increase max_rpc_cnt. 16 is fairly low; most sites set it to 150, which prevents deferring scheduling too much while still providing acceptable RPC latency. You can try increasing max_rpc_cnt moderately at first, to something like ~50, in case your RPC load turns out to be especially heavy, although you very well may be able to set 150 without problems. A sketch of the change is shown below.

Is there a particular reason you have max_rpc_cnt=16? If not, I suggest increasing it to let your main scheduler hit a larger job depth.
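For illustration only (a sketch, not an endorsement of a specific value; everything here besides max_rpc_cnt is your existing line), the edit would be the last entry of the SchedulerParameters line already in use, followed by a reconfigure:

SchedulerParameters=\
default_queue_depth=1500,\
partition_job_depth=10,\
bf_continue,\
bf_interval=30,\
bf_resolution=600,\
bf_window=11520,\
bf_max_job_test=300000,\
bf_max_job_start=1000,\
bf_max_job_user_part=10,\
bf_ignore_newly_avail_nodes,\
bf_one_resv_per_job,\
enable_user_top,\
pack_serial_at_end,\
nohold_on_prolog_fail,\
permit_job_expansion,\
requeue_setup_env_fail,\
max_rpc_cnt=50

scontrol reconfigure    # SchedulerParameters changes should be picked up without a slurmctld restart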
We've historically gone with max_rpc_cnt=16 because above that limit the scheduler started becoming unresponsive or started thrashing. We have a bunch of users that query the cluster a lot, and we found that keeping it low was necessary to prevent the scheduler from being overwhelmed while staying responsive to users asking about job state and submitting new work. That said, the last time we experimented with max_rpc_cnt was several years and versions back. It may be time to reevaluate and go to a higher setting. Let me experiment with ours and see if we can go higher.

-Paul Edmon-
Hiking max_rpc_cnt to 64 seems to have helped. I'm going to keep experimenting with it to see how far we can push it, since we obviously want to run as high as we can without the scheduler becoming unusable. Thanks for the assist!

-Paul Edmon-
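One way to judge each step (a sketch, not necessarily how we are doing it): reset the sdiag counters after each max_rpc_cnt change and compare the main scheduler's mean depth and exit reasons once it has run for a while.

sdiag -r    # reset scheduler statistics (needs SlurmUser/root)
sdiag | grep -E "Mean depth cycle|max_rpc_cnt|max_sched_time"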
As a further follow-up, has there been any thought about increasing the fundamental RPC limit of 256? It would be nice if this were a user-adjustable limit. With cores getting faster and high-core-count nodes being common, it may be worth reevaluating that fundamental limit, or at least granting some latitude there.

-Paul Edmon-