Created attachment 40117 [details] slurm.conf

I'm seeing a situation where I have a partition with available resources but the scheduler is not picking up the top jobs to be scheduled. For example:

[root@holy-slurm02 ~]# showq -p huce_cascade -o

SUMMARY OF JOBS FOR QUEUE: <huce_cascade>

ACTIVE JOBS--------------------
JOBID JOBNAME USERNAME STATE CORE GPU REMAINING STARTTIME
====================================================================================
57893229 test_box yc2965 Running 200 0 694:55:45 Fri Nov 15 21:59:15
59200246 test_box yc2965 Running 200 0 851:39:19 Fri Nov 22 10:42:49
60111397 test_box yc2965 Running 200 0 20:43:06 Fri Nov 29 11:46:36
60459128 interactiv kchiama Running 48 0 493:13:08 Mon Dec 2 13:36:38
61556763 .fasrcood/ kuang Running 48 0 2297:19:43 Sun Dec 8 05:43:13
61568765 0 kuang Running 48 0 0:00:15 Sun Dec 8 08:23:45
61569129 0 kuang Running 48 0 0:03:58 Sun Dec 8 08:27:28
61569346 0 kuang Running 48 0 0:07:07 Sun Dec 8 08:30:37
61569616 0 kuang Running 48 0 0:09:04 Sun Dec 8 08:32:34
61569803 0 kuang Running 48 0 0:21:48 Sun Dec 8 08:45:18
61569841 0 kuang Running 48 0 0:26:52 Sun Dec 8 08:50:22
61569882 0 kuang Running 48 0 0:46:30 Sun Dec 8 09:10:00
61569925 0 kuang Running 48 0 1:18:37 Sun Dec 8 09:42:07
61569962 0 kuang Running 48 0 1:43:28 Sun Dec 8 10:06:58
61570007 0 kuang Running 48 0 1:43:53 Sun Dec 8 10:07:23
61570072 0 kuang Running 48 0 1:43:57 Sun Dec 8 10:07:27
61570110 0 kuang Running 48 0 1:43:58 Sun Dec 8 10:07:28
61570151 0 kuang Running 48 0 1:44:04 Sun Dec 8 10:07:34
61570211 0 kuang Running 48 0 1:44:23 Sun Dec 8 10:07:53
61570253 0 kuang Running 48 0 1:44:26 Sun Dec 8 10:07:56
61570312 0 kuang Running 48 0 1:44:27 Sun Dec 8 10:07:57
61570354 0 kuang Running 48 0 1:44:47 Sun Dec 8 10:08:17
61570393 0 kuang Running 48 0 1:44:57 Sun Dec 8 10:08:27
61570452 0 kuang Running 48 0 1:44:58 Sun Dec 8 10:08:28
61570492 0 kuang Running 48 0 1:44:59 Sun Dec 8 10:08:29
61570534 0 kuang Running 48 0 1:46:04 Sun Dec 8 10:09:34
61570581 0 kuang Running 48 0 1:49:26 Sun Dec 8 10:12:56
61570630 0 kuang Running 48 0 1:55:00 Sun Dec 8 10:18:30
61570675 0 kuang Running 48 0 2:12:09 Sun Dec 8 10:35:39
61570718 0 kuang Running 48 0 2:13:05 Sun Dec 8 10:36:35
61570768 0 kuang Running 48 0 2:24:22 Sun Dec 8 10:47:52
61570818 0 kuang Running 48 0 2:40:32 Sun Dec 8 11:04:02
61570861 0 kuang Running 48 0 2:49:18 Sun Dec 8 11:12:48
61570920 0 kuang Running 48 0 2:52:07 Sun Dec 8 11:15:37
61570961 0 kuang Running 48 0 3:27:25 Sun Dec 8 11:50:55
61571020 0 kuang Running 48 0 4:57:26 Sun Dec 8 13:20:56
61571064 0 kuang Running 48 0 5:04:45 Sun Dec 8 13:28:15
61571119 0 kuang Running 48 0 6:28:53 Sun Dec 8 14:52:23
61571191 0 kuang Running 48 0 6:44:38 Sun Dec 8 15:08:08
61571306 0 kuang Running 48 0 7:37:52 Sun Dec 8 16:01:22
61571403 0 kuang Running 48 0 8:21:00 Sun Dec 8 16:44:30
61571497 0 kuang Running 48 0 8:27:30 Sun Dec 8 16:51:00
61571595 0 kuang Running 48 0 8:27:45 Sun Dec 8 16:51:15
61571688 0 kuang Running 48 0 9:27:22 Sun Dec 8 17:50:52
61571783 0 kuang Running 48 0 9:37:18 Sun Dec 8 18:00:48
61571983 0 kuang Running 48 0 12:26:47 Sun Dec 8 20:50:17
61572071 0 kuang Running 48 0 25:34:00 Mon Dec 9 09:57:30
61572175 0 kuang Running 48 0 35:16:38 Mon Dec 9 19:40:08
61572277 0 kuang Running 48 0 35:18:56 Mon Dec 9 19:42:26
61572341 0 kuang Running 48 0 35:30:09 Mon Dec 9 19:53:39
61572393 0 kuang Running 48 0 35:58:11 Mon Dec 9 20:21:41
61572467 0 kuang Running 48 0 36:08:22 Mon Dec 9 20:31:52
61572541 0 kuang Running 48 0 36:12:29 Mon Dec 9 20:35:59
61572623 0 kuang Running 48 0 36:37:30 Mon Dec 9 21:01:00
61572701 0 kuang Running 48 0 37:11:10 Mon Dec 9 21:34:40
61572754 0 kuang Running 48 0 37:19:42 Mon Dec 9 21:43:12
61572828 0 kuang Running 48 0 37:27:10 Mon Dec 9 21:50:40
61572901 0 kuang Running 48 0 38:37:01 Mon Dec 9 23:00:31
61572967 0 kuang Running 48 0 38:37:30 Mon Dec 9 23:01:00
61573027 0 kuang Running 48 0 38:38:09 Mon Dec 9 23:01:39
61658109 run_region thackray Running 96 0 69:08:17 Mon Dec 9 21:31:47
61761058 interactiv kchiama Running 48 0 657:51:53 Mon Dec 9 10:15:23
62104854 0.GJ_SO4.r ssong33 Running 32 0 145:34:13 Wed Dec 11 19:57:43
62138548 .fasrcood/ dingma Running 1 0 293:16:59 Thu Dec 12 06:40:29
62151211 .fasrcood/ samdur Running 1 0 358:10:57 Thu Dec 12 10:34:27
62158242 .fasrcood/ amatthews Running 1 0 47:30:10 Thu Dec 12 11:53:40
62158983 .fasrcood/ amatthews Running 1 0 47:53:15 Thu Dec 12 12:16:45

67 active jobs : 3516 of 6672 cores ( 52.70 %): 0 of 0 gpus ( 0.00 %): 78 of 139 nodes ( 56.12 %)

WAITING JOBS------------------------
JOBID JOBNAME USERNAME STATE CORE GPU WCLIMIT QUEUETIME
====================================================================================
62151213 combine_ob mhe Waiting 32 0 48:00:00 Thu Dec 12 10:32:30
62158795 run_imi.sh lestrada Waiting 8 0 96:00:00 Thu Dec 12 12:01:16

Total Jobs: 69   Active Jobs: 67   Idle Jobs: 2   Blocked Jobs: 0

[root@holy-slurm02 ~]# scontrol show job 62151213
JobId=62151213 JobName=combine_obs
   UserId=mhe(63926) GroupId=jacob_lab(40114) MCS_label=N/A
   Priority=266374 Nice=0 Account=jacob_lab QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2024-12-12T10:32:30 EligibleTime=2024-12-12T10:32:30
   AccrueTime=2024-12-12T10:32:30
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-12-12T10:54:29 Scheduler=Main
   Partition=huce_ice,huce_cascade,seas_compute,sapphire AllocNode:Sid=holylogin08:2951731
   ReqNodeList=(null) ExcNodeList=(null) NodeList=
   StepMgrEnabled=Yes
   NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=32,mem=20G,node=1,billing=37
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryNode=20G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/n/home12/mhe/process_obspack/run.job
   WorkDir=/n/home12/mhe/process_obspack
   StdErr=/n/home12/mhe/process_obspack/slurm-62151213.out
   StdIn=/dev/null
   StdOut=/n/home12/mhe/process_obspack/slurm-62151213.out
   TresPerTask=cpu=32

There is certainly space there to fit those jobs on those vacant nodes. In my scheduler configuration I have set:

SchedulerParameters=\
default_queue_depth=1500,\
partition_job_depth=10,\
bf_continue,\
bf_interval=30,\
bf_resolution=600,\
bf_window=11520,\
bf_max_job_test=300000,\
bf_max_job_start=1000,\
bf_max_job_user_part=10,\
bf_ignore_newly_avail_nodes,\
bf_one_resv_per_job,\
enable_user_top,\
pack_serial_at_end,\
nohold_on_prolog_fail,\
permit_job_expansion,\
requeue_setup_env_fail,\
max_rpc_cnt=16

With partition_job_depth set, the scheduler should pick up those jobs regardless of their low priority, because they are the top priority in the queue. Even without that, the backfill loop should also pick up those jobs because of bf_max_job_user_part. I will note that this is being exacerbated by higher cluster activity right now, so the scheduler is pretty busy. Still, even under that load I expect it to pick up the top jobs in each partition, even if the backfill doesn't penetrate the full stack. I've included my configs. Here is a dump of sdiag right now.

[root@holy-slurm02 ~]# sdiag
*******************************************************
sdiag output at Thu Dec 12 12:31:32 2024 (1734024692)
Data since      Thu Dec 12 12:15:58 2024 (1734023758)
*******************************************************
Server thread count:  12
RPC queue enabled:    0
Agent queue size:     0
Agent count:          2
Agent thread count:   6
DBD Agent queue size: 0

Jobs submitted: 2325
Jobs started:   734
Jobs completed: 1191
Jobs canceled:  22
Jobs failed:    0

Job states ts:  Thu Dec 12 12:31:19 2024 (1734024679)
Jobs pending:   33332
Jobs running:   27281

Main schedule statistics (microseconds):
    Last cycle:   1380037
    Max cycle:    1755574
    Total cycles: 413
    Mean cycle:   860399
    Mean depth cycle:  9
    Cycles per minute: 27
    Last queue length: 29104

Main scheduler exit:
    End of job queue: 0
    Hit default_queue_depth: 0
    Hit sched_max_job_start: 0
    Blocked on licenses: 0
    Hit max_rpc_cnt: 299
    Timeout (max_sched_time): 114

Backfilling stats
    Total backfilled jobs (since last slurm start): 221
    Total backfilled jobs (since last stats cycle start): 221
    Total backfilled heterogeneous job components: 0
    Total cycles: 14
    Last cycle when: Thu Dec 12 12:30:49 2024 (1734024649)
    Last cycle: 14806760
    Max cycle:  29604732
    Mean cycle: 23233447
    Last depth cycle: 27
    Last depth cycle (try sched): 21
    Depth Mean: 97
    Depth Mean (try depth): 45
    Last queue length: 29045
    Queue length mean: 28651
    Last table size: 33
    Mean table size: 35

Backfill exit
    End of job queue: 0
    Hit bf_max_job_start: 0
    Hit bf_max_job_test: 0
    System state changed: 0
    Hit table size limit (bf_node_space_size): 0
    Timeout (bf_max_time): 14

Latency for 1000 calls to gettimeofday(): 15 microseconds

Remote Procedure Call statistics by message type
    REQUEST_PARTITION_INFO ( 2009) count:14000 ave_time:12255 total_time:171582486
    REQUEST_JOB_INFO_SINGLE ( 2021) count:6759 ave_time:238306 total_time:1610710332
    REQUEST_FED_INFO ( 2049) count:6750 ave_time:90 total_time:612678
    REQUEST_NODE_INFO ( 2007) count:5551 ave_time:507832 total_time:2818979934
    REQUEST_STEP_COMPLETE ( 5016) count:2669 ave_time:945011 total_time:2522236823
    REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:2663 ave_time:2041707 total_time:5437066581
    MESSAGE_EPILOG_COMPLETE ( 6012) count:2663 ave_time:1544223 total_time:4112267063
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:1608 ave_time:2207661 total_time:3549919341
    REQUEST_NODE_INFO_SINGLE ( 2040) count:1261 ave_time:302957 total_time:382029879
    REQUEST_SUBMIT_BATCH_JOB ( 4003) count:887 ave_time:2785814 total_time:2471017268
    REQUEST_COMPLETE_PROLOG ( 6018) count:735 ave_time:1196834 total_time:879673485
    REQUEST_JOB_INFO ( 2003) count:465 ave_time:1888868 total_time:878323867
    REQUEST_DBD_RELAY ( 1028) count:100 ave_time:247 total_time:24711
    REQUEST_SHARE_INFO ( 2022) count:32 ave_time:60298 total_time:1929562
    REQUEST_KILL_JOB ( 5032) count:24 ave_time:2719635 total_time:65271258
    REQUEST_STATS_INFO ( 2035) count:16 ave_time:164 total_time:2630
    REQUEST_JOB_USER_INFO ( 2039) count:14 ave_time:189756 total_time:2656597
    REQUEST_CANCEL_JOB_STEP ( 5005) count:13 ave_time:1453062 total_time:18889815
    REQUEST_JOB_READY ( 4019) count:9 ave_time:582902 total_time:5246121
    REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:8 ave_time:1552123 total_time:12416986
    REQUEST_RESOURCE_ALLOCATION ( 4001) count:5 ave_time:1717682 total_time:8588410
    REQUEST_JOB_ALLOCATION_INFO ( 4014) count:3 ave_time:520916 total_time:1562749
    REQUEST_BUILD_INFO ( 2001) count:2 ave_time:2149 total_time:4299
    REQUEST_CRONTAB ( 2200) count:2 ave_time:518835 total_time:1037670
    REQUEST_UPDATE_NODE ( 3002) count:2 ave_time:645294 total_time:1290589
    REQUEST_UPDATE_JOB ( 3001) count:1 ave_time:5953332 total_time:5953332
    REQUEST_JOB_STEP_INFO ( 2005) count:1 ave_time:1684285 total_time:1684285

While this isn't breaking scheduling, since high-priority jobs are still being scheduled, very low-priority jobs are not starting even when they are at the top of the queue. This leads me to think that the scheduling loops are not getting down to these low-priority jobs.
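For reference, a quick way to watch the "Main scheduler exit" counters and mean depth above as they evolve (a sketch using standard tools; any watch interval works):

watch -n 30 'sdiag | grep -A 6 "Main scheduler exit:"'
sdiag | grep -E "Mean depth cycle|Last queue length"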
Created attachment 40118 [details] topology.conf
Would it be possible to gather logging with these debug flags enabled and upload those logs?

> Backfill,SelectType

Also, if 62151213 is no longer in the queue, then please highlight one of the jobs you think should be scheduled.
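If it helps, one way to toggle these at runtime (a sketch; run as SlurmUser/root, and remove the flags the same way afterwards):

scontrol setdebugflags +Backfill      # extra backfill scheduler logging in slurmctld.log
scontrol setdebugflags +SelectType    # extra select plugin logging
scontrol show config | grep DebugFlags   # confirm what is currently set
scontrol setdebugflags -Backfill      # turn it back off later
scontrol setdebugflags -SelectType

Setting DebugFlags=Backfill,SelectType in slurm.conf followed by "scontrol reconfigure" is the persistent alternative.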
Sure. 62151213 eventually scheduled; I think the backfill eventually got deep enough. I've attached a graph of our backfill loop penetration. Here is another example:

[root@holy8a24507 ~]# scontrol show job 62164818
JobId=62164818 JobName=t1errgrid_theta1_0.8_theta2_3.2
   UserId=bliang(60349) GroupId=janson_lab(403769) MCS_label=N/A
   Priority=9165110 Nice=0 Account=janson_lab QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2024-12-12T13:13:39 EligibleTime=2024-12-12T13:13:39
   AccrueTime=2024-12-12T13:13:39
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-12-12T14:21:59 Scheduler=Main
   Partition=janson,janson_cascade,janson_bigmem,sapphire,shared,seas_compute,serial_requeue AllocNode:Sid=holylogin07:672391
   ReqNodeList=(null) ExcNodeList=(null) NodeList=
   StepMgrEnabled=Yes
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=60G,node=1,billing=4
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=60G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/n/home08/bliang/single_pc_power/batch_t1err_grid.sh
   WorkDir=/n/home08/bliang/single_pc_power
   StdErr=/n/home08/bliang/single_pc_power/err/err_t1errgrid_theta1_0.8_theta2_3.2.stdout.txt
   StdIn=/dev/null
   StdOut=/n/home08/bliang/single_pc_power/output/out_t1errgrid_theta1_0.8_theta2_3.2.stdout.txt

SUMMARY OF JOBS FOR QUEUE: <seas_compute>

ACTIVE JOBS--------------------
JOBID JOBNAME USERNAME STATE CORE GPU REMAINING STARTTIME
====================================================================================
62092030 ded_adj jholey Running 112 0 26:44:11 Wed Dec 11 17:13:19
62099746 M2GMI_tavg tmooring Running 1 0 64:38:16 Thu Dec 12 07:07:24
62136569 M2GMI_tavg tmooring Running 1 0 70:19:50 Thu Dec 12 12:48:58
62138358 ded_adj jholey Running 112 0 40:01:35 Thu Dec 12 06:30:43
62140447 biome_Dese luciagordon Running 1 0 63:17:31 Thu Dec 12 07:46:39

5 active jobs : 227 of 5584 cores ( 4.07 %): 0 of 0 gpus ( 0.00 %): 4 of 93 nodes ( 4.30 %)

WAITING JOBS------------------------
JOBID JOBNAME USERNAME STATE CORE GPU WCLIMIT QUEUETIME
====================================================================================
62164818 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:13:39
62164821 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:13:41
62164831 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:13:44
62164837 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:13:46
62164850 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:13:54
62164864 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:13:56
62164868 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:14:00
62164872 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:14:05
62164877 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:14:13
62164884 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:14:15
62164888 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:14:18
62164892 t1errgrid_ bliang Waiting 1 0 12:00:00 Thu Dec 12 13:14:20

Even though it's a multi-partition job, it is at the top of that partition's queue and should go. There are nodes available that it could run on without preempting anything. I've also attached a 10-minute snippet of our logs with Backfill,SelectType on.
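In case it is useful when reading the attached snippet, the relevant lines can be pulled out with something like the following (a sketch; the log path is an assumption and will vary by site):

grep -i backfill /var/log/slurm/slurmctld.log | grep 62164818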
-Paul Edmon-
Created attachment 40119 [details] Picture of overall stats from sdiag
Created attachment 40120 [details] Picture of primary loop stats from sdiag
Created attachment 40121 [details] Picture of backfill stats from sdiag
Created attachment 40122 [details] Log snippet. +Backfill +SelectType logging starts around 14:25
Sorry, I listed the wrong version; we are running 24.05.4.
For reference, here are the jobs pending per partition:

[root@holy8a24507 ~]# /usr/bin/squeue --noheader --state=PD -o %80P | sort | uniq -c
      1 arguelles_delgado
      1 arguelles_delgado,shared
      5 arguelles_delgado,shared,unrestricted
      1 bigmem
      2 conroy
    838 gpu
     28 gpu_requeue
      1 hernquist_ice
      2 holy-smokes-priority,shared,sapphire
    198 hsph
      2 hsph,sapphire,intermediate
    151 hsph,xlin,serial_requeue
      1 huce_bigmem,bigmem,bigmem_intermediate
    115 intermediate
      2 itc_cluster,sapphire,intermediate
      2 janson,janson_bigmem,serial_requeue
    473 janson,janson_cascade,janson_bigmem,sapphire,shared,seas_compute,serial_requeue
      1 kempner
      6 kempner_h100
      5 kempner_requeue
    316 kovac,itc_cluster,serial_requeue
      2 kozinsky_gpu,gpu,seas_gpu
      7 lukin
      1 mweber_gpu
      2 olveczkygpu,gpu
      7 olveczkygpu,gpu,gpu_requeue
      1 remoteviz
    621 sapphire
      7 sapphire,shared
     98 seas_compute
     19 seas_compute,shared
    130 seas_gpu
     32 seas_gpu,gpu
  13729 serial_requeue
    502 serial_requeue,shared,sapphire
    191 serial_requeue,yao,shared,sapphire
   5459 shared
      1 shared,holy-smokes-priority,sapphire
   4627 shared,sapphire,serial_requeue
      1 sompolinsky_gpu
      4 test
     15 unrestricted
     31 yao,shared,sapphire,serial_requeue
partition_job_depth is a limit on how many jobs the main scheduler attempts to schedule, not a guarantee. It's possible that the main scheduler has stopped before even getting to the partition you expect should be getting scheduled.

From your sdiag, the main scheduler has a mean job depth of 9, so it is likely not getting to the jobs you're looking at since you have ~30000 pending jobs:

> Mean depth cycle: 9

A new feature added in 23.11 is to see why the main/backfill schedulers exit. In your case, it looks like the main scheduler is often hitting your max_rpc_cnt limit or hitting the max_sched_time timeout (which defaults to 2 seconds in your case).

> Main scheduler exit:
>   End of job queue: 0
>   Hit default_queue_depth: 0
>   Hit sched_max_job_start: 0
>   Blocked on licenses: 0
>   Hit max_rpc_cnt: 299
>   Timeout (max_sched_time): 114

Notice that you're never hitting your default_queue_depth=1500 either.

To get the main scheduler to attempt to schedule more jobs, you could increase max_rpc_cnt. 16 is fairly low; most sites set it to 150, which prevents deferring scheduling too much while still providing acceptable RPC latency. You can try increasing max_rpc_cnt moderately at first, to something like ~50, in case your RPC load turns out to be especially heavy, although you very well may be able to set 150 without problems. A sketch of the change is shown below.

Is there a particular reason you have max_rpc_cnt=16? If not, I suggest increasing it to let your main scheduler hit a larger job depth.
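For illustration only (a sketch, not an endorsement of a specific value; everything here besides max_rpc_cnt is your existing line), the edit would be the last entry of the SchedulerParameters line already in use, followed by a reconfigure:

SchedulerParameters=\
default_queue_depth=1500,\
partition_job_depth=10,\
bf_continue,\
bf_interval=30,\
bf_resolution=600,\
bf_window=11520,\
bf_max_job_test=300000,\
bf_max_job_start=1000,\
bf_max_job_user_part=10,\
bf_ignore_newly_avail_nodes,\
bf_one_resv_per_job,\
enable_user_top,\
pack_serial_at_end,\
nohold_on_prolog_fail,\
permit_job_expansion,\
requeue_setup_env_fail,\
max_rpc_cnt=50

scontrol reconfigure    # SchedulerParameters changes should be picked up without a slurmctld restart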
We've historically gone with max_rpc_cnt=16 because above that limit the scheduler started becoming unresponsive or started thrashing. We have a bunch of users that query the cluster a lot, and we found that keeping it low was necessary to prevent the scheduler from being overwhelmed while staying responsive to users asking about job state and submitting new work. That said, the last time we experimented with max_rpc_cnt was several years and versions back. It may be time to reevaluate and go to a higher setting. Let me experiment with ours and see if we can go higher.

-Paul Edmon-
Hiking max_rpc_cnt to 64 seems to have helped. I'm going to keep experimenting with it to see how far we can push it, since we obviously want to run as high as we can without the scheduler becoming unusable. Thanks for the assist!

-Paul Edmon-
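One way to judge each step (a sketch, not necessarily how we are doing it): reset the sdiag counters after each max_rpc_cnt change and compare the main scheduler's mean depth and exit reasons once it has run for a while.

sdiag -r    # reset scheduler statistics (needs SlurmUser/root)
sdiag | grep -E "Mean depth cycle|max_rpc_cnt|max_sched_time"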
As a further follow-up, has there been any thought about increasing the fundamental RPC limit of 256? It would be nice if this were a user-adjustable limit. With cores getting faster and high-core-count nodes being common, it may be worth reevaluating that fundamental limit, or at least granting some latitude there.

-Paul Edmon-