For the past couple of days we've noticed that the agent queue size has been abnormally large, and we haven't been able to find a cause. It's been causing a large number of jobs to get stuck in completing state, thereby rendering the scheduler functionally inoperable, as most nodes are in IDLE+COMPLETING. Tonight it caused a full wedge: the system capped out at 256 threads and then locked up entirely. We had previously been operating fine for a couple of months with max_rpc_cnt at 128. While I'm not sure this is the cause, I dropped our max_rpc_cnt to 32 to see if it helps, though the agent queue still remains very large. Below is the sdiag printout I just took (this is with max_rpc_cnt at 32); a sketch of the config change follows the output. I've also attached a graph of the agent queue and thread behavior. This behavior seems to come and go.

[root@holy-slurm02 slurm]# sdiag
*******************************************************
sdiag output at Fri Jan 17 21:26:48 2025 (1737167208)
Data since      Fri Jan 17 21:11:48 2025 (1737166308)
*******************************************************
Server thread count:  2
RPC queue enabled:    0
Agent queue size:     231
Agent count:          82
Agent thread count:   246
DBD Agent queue size: 0

Jobs submitted: 61
Jobs started:   1003
Jobs completed: 899
Jobs canceled:  10
Jobs failed:    0

Job states ts:  Fri Jan 17 21:26:25 2025 (1737167185)
Jobs pending:   8960
Jobs running:   8308

Main schedule statistics (microseconds):
    Last cycle:        124494
    Max cycle:         124494
    Total cycles:      225
    Mean cycle:        78342
    Mean depth cycle:  87
    Cycles per minute: 15
    Last queue length: 4124

Main scheduler exit:
    End of job queue:         225
    Hit default_queue_depth:  0
    Hit sched_max_job_start:  0
    Blocked on licenses:      0
    Hit max_rpc_cnt:          0
    Timeout (max_sched_time): 0

Backfilling stats
    Total backfilled jobs (since last slurm start): 250
    Total backfilled jobs (since last stats cycle start): 250
    Total backfilled heterogeneous job components: 0
    Total cycles: 26
    Last cycle when: Fri Jan 17 21:26:42 2025 (1737167202)
    Last cycle: 3701744
    Max cycle:  4180596
    Mean cycle: 3847068
    Last depth cycle: 4146
    Last depth cycle (try sched): 150
    Depth Mean: 4351
    Depth Mean (try depth): 149
    Last queue length: 4137
    Queue length mean: 4346
    Last table size: 119
    Mean table size: 116

Backfill exit
    End of job queue: 26
    Hit bf_max_job_start: 0
    Hit bf_max_job_test: 0
    System state changed: 0
    Hit table size limit (bf_node_space_size): 0
    Timeout (bf_max_time): 0

Latency for 1000 calls to gettimeofday(): 17 microseconds

Remote Procedure Call statistics by message type
    REQUEST_PARTITION_INFO           ( 2009) count:5571 ave_time:4503   total_time:25091591
    REQUEST_NODE_INFO                ( 2007) count:2788 ave_time:128049 total_time:357001586
    REQUEST_FED_INFO                 ( 2049) count:1778 ave_time:112    total_time:200604
    REQUEST_JOB_INFO_SINGLE          ( 2021) count:1759 ave_time:133832 total_time:235412193
    REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:1670 ave_time:131661 total_time:219874502
    MESSAGE_EPILOG_COMPLETE          ( 6012) count:1642 ave_time:13540  total_time:22232837
    REQUEST_STEP_COMPLETE            ( 5016) count:1634 ave_time:48413  total_time:79107607
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:1627 ave_time:315533 total_time:513373614
    REQUEST_COMPLETE_PROLOG          ( 6018) count:978  ave_time:21294  total_time:20825828
    REQUEST_JOB_INFO                 ( 2003) count:754  ave_time:324072 total_time:244350623
    REQUEST_NODE_INFO_SINGLE         ( 2040) count:369  ave_time:52740  total_time:19461259
    REQUEST_DBD_RELAY                ( 1028) count:72   ave_time:150    total_time:10810
    REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:41   ave_time:269007 total_time:11029290
    REQUEST_JOB_USER_INFO            ( 2039) count:23   ave_time:160139 total_time:3683213
    REQUEST_SHARE_INFO               ( 2022) count:16   ave_time:50626  total_time:810031
    REQUEST_STATS_INFO               ( 2035) count:15   ave_time:380    total_time:5701
    REQUEST_JOB_READY                ( 4019) count:13   ave_time:120    total_time:1570
    REQUEST_KILL_JOB                 ( 5032) count:11   ave_time:100242 total_time:1102668
    REQUEST_RESOURCE_ALLOCATION      ( 4001) count:7    ave_time:480808 total_time:3365659
    REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:1    ave_time:1684   total_time:1684
    REQUEST_JOB_ALLOCATION_INFO      ( 4014) count:1    ave_time:101    total_time:101

Remote Procedure Call statistics by user
    root         (     0) count:14221 ave_time:89250  total_time:1269227812
    ameterez     ( 65882) count:1240  ave_time:171924 total_time:213186050
    ktyssowski   ( 20332) count:513   ave_time:37370  total_time:19170968
    bdelwood     ( 62707) count:492   ave_time:32901  total_time:16187336
    rishii       ( 60427) count:480   ave_time:57741  total_time:27715826
    jjensen      ( 30306) count:479   ave_time:57474  total_time:27530310
    oomoruyi     ( 60299) count:479   ave_time:55698  total_time:26679431
    sbrielle     ( 60563) count:360   ave_time:21458  total_time:7725104
    xyang338     ( 66072) count:354   ave_time:16176  total_time:5726492
    tzeng        ( 64498) count:243   ave_time:64091  total_time:15574306
    amoulana     ( 20818) count:243   ave_time:63890  total_time:15525309
    ylh202       ( 22451) count:243   ave_time:61360  total_time:14910597
    axzhu        ( 60693) count:242   ave_time:68780  total_time:16644810
    rdang        ( 60884) count:242   ave_time:73268  total_time:17730931
    kjia         ( 62052) count:240   ave_time:98220  total_time:23572815
    mcheng1      ( 61588) count:182   ave_time:39220  total_time:7138095
    xlwang       ( 64283) count:118   ave_time:53045  total_time:6259404
    ruzhang      (558295) count:90    ave_time:3462   total_time:311589
    thuonghoang  ( 64556) count:60    ave_time:60682  total_time:3640976
    lestrada     ( 60915) count:56    ave_time:31876  total_time:1785067
    yifanli      ( 66296) count:56    ave_time:47665  total_time:2669240
    wonjung      ( 14455) count:42    ave_time:9473   total_time:397888
    ycchen       ( 60940) count:15    ave_time:72416  total_time:1086242
    pryke        ( 40838) count:9     ave_time:61298  total_time:551685
    ani          ( 64281) count:9     ave_time:364676 total_time:3282087
    jiling       ( 30365) count:9     ave_time:402363 total_time:3621274
    ruichen      ( 65316) count:7     ave_time:24212  total_time:169489
    ssong33      ( 13928) count:6     ave_time:415050 total_time:2490302
    ankitbiswas  ( 65426) count:6     ave_time:17236  total_time:103418
    cmohri       ( 64201) count:6     ave_time:238500 total_time:1431001
    guzman       ( 65986) count:6     ave_time:3165   total_time:18992
    sjelassi     ( 64203) count:5     ave_time:404156 total_time:2020783
    mchase       ( 12043) count:5     ave_time:98967  total_time:494836
    mhwang       ( 61480) count:4     ave_time:479695 total_time:1918781
    treyscott    ( 64602) count:3     ave_time:41304  total_time:123914
    katrinabrown ( 61836) count:3     ave_time:63021  total_time:189063
    xmorgan      ( 56634) count:1     ave_time:121449 total_time:121449
    ebauer       ( 30178) count:1     ave_time:9299   total_time:9299

Pending RPC statistics
    SRUN_TIMEOUT             ( 7002) count:81
    REQUEST_TERMINATE_JOB    ( 6011) count:41
    REQUEST_LAUNCH_PROLOG    ( 6017) count:25
    REQUEST_BATCH_JOB_LAUNCH ( 4005) count:25
    SRUN_JOB_COMPLETE        ( 7004) count:30
    REQUEST_KILL_TIMELIMIT   ( 6009) count:28
    REQUEST_HEALTH_CHECK     ( 1011) count:1

Pending RPCs
     1: SRUN_TIMEOUT             holy7c12108
     2: SRUN_TIMEOUT             holy8a26108
     3: REQUEST_TERMINATE_JOB    holy8a32107
     4: REQUEST_TERMINATE_JOB    holy8a27207
     5: REQUEST_TERMINATE_JOB    holy7c16509
     6: REQUEST_TERMINATE_JOB    holy8a26109
     7: REQUEST_TERMINATE_JOB    holy8a32107
     8: REQUEST_TERMINATE_JOB    holy8a32108
     9: REQUEST_TERMINATE_JOB    holy8a31408
    10: REQUEST_LAUNCH_PROLOG    holy8a26410
    11: REQUEST_BATCH_JOB_LAUNCH holy8a26410
    12: REQUEST_LAUNCH_PROLOG    holy8a26410
    13: REQUEST_BATCH_JOB_LAUNCH holy8a26410
    14: REQUEST_LAUNCH_PROLOG    holy8a26410
    15: REQUEST_BATCH_JOB_LAUNCH holy8a26410
    16: REQUEST_LAUNCH_PROLOG    holy8a26410
    17: REQUEST_BATCH_JOB_LAUNCH holy8a26410
    18: REQUEST_LAUNCH_PROLOG    holy8a26410
    19: REQUEST_BATCH_JOB_LAUNCH holy8a26410
    20: REQUEST_TERMINATE_JOB    holy7c16503
    21: REQUEST_TERMINATE_JOB    holy8a32306
    22: REQUEST_TERMINATE_JOB    holy8a32111
    23: SRUN_JOB_COMPLETE        holygpu8a22603
    24: REQUEST_TERMINATE_JOB    holygpu8a22603
    25: REQUEST_TERMINATE_JOB    holy8a32308
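For reference, the change itself is just the one option in slurm.conf. This is a minimal sketch, not our full SchedulerParameters line; any other options already in use would stay on the same comma-separated list:

    # slurm.conf (excerpt) -- sketch only
    # max_rpc_cnt: once this many slurmctld threads are busy servicing RPCs,
    # scheduling cycles are deferred so the RPC backlog can drain.
    # We previously ran with max_rpc_cnt=128.
    SchedulerParameters=max_rpc_cnt=32

    # Picked up without restarting the controller:
    # scontrol reconfigure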
Created attachment 40442: Graph of agent queue and server threads
Created attachment 40443: Current slurm.conf
Created attachment 40444: Current topology.conf
Looks like lowering max_rpc_cnt to 32 did the trick. Things have been much more stable since. I'm resolving this ticket.
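In case it's useful to anyone watching for a recurrence, a quick plain-shell way to keep an eye on the agent queue and thread counts (nothing Slurm-specific beyond sdiag itself; adjust the interval to taste):

    # Poll the agent/thread counters from sdiag once a minute
    watch -n 60 "sdiag | grep -E 'Agent (queue size|count|thread count)|Server thread count'"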