Ticket 21847 - Abnormally large agent queue size causing wedging
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 24.05.4
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2025-01-17 19:30 MST by Paul Edmon
Modified: 2025-01-20 08:57 MST

See Also:
Site: Harvard University
Version Fixed: None


Attachments
Graph of agent queue and server threads (296.24 KB, image/png) - 2025-01-17 19:32 MST, Paul Edmon
Current slurm.conf (69.32 KB, text/x-matlab) - 2025-01-17 19:33 MST, Paul Edmon
Current topology.conf (4.77 KB, text/x-matlab) - 2025-01-17 19:33 MST, Paul Edmon

Description Paul Edmon 2025-01-17 19:30:53 MST
For the past couple of days we've noticed that the agent queue size has been abnormally large, and we haven't been able to find a cause. It's been causing a large number of jobs to get stuck in the completing state, thereby rendering the scheduler functionally inoperable, as most nodes are in IDLE+COMPLETING. Tonight it caused a full wedge: the system capped out at 256 threads and then locked up completely.

We had previously been operating fine for a couple of months with max_rpc_cnt at 128. While I'm not sure this is the cause, I dropped our max_rpc_cnt to 32 to see if it helps, though the agent queue still remains very large.
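
For reference, here is roughly how that change looks in slurm.conf (a minimal sketch only; our real SchedulerParameters line carries other options, see the attached slurm.conf):

    # slurm.conf excerpt (hypothetical; other SchedulerParameters options elided)
    SchedulerParameters=max_rpc_cnt=32

    # confirm what slurmctld is actually running with, then re-read the config
    scontrol show config | grep SchedulerParameters
    scontrol reconfigure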

Below is the sdiag output I just captured (this is with max_rpc_cnt at 32). I've also attached a graph of the agent queue and thread behavior.

This behavior seems to come and go.

[root@holy-slurm02 slurm]# sdiag
*******************************************************
sdiag output at Fri Jan 17 21:26:48 2025 (1737167208)
Data since      Fri Jan 17 21:11:48 2025 (1737166308)
*******************************************************
Server thread count:  2
RPC queue enabled:    0
Agent queue size:     231
Agent count:          82
Agent thread count:   246
DBD Agent queue size: 0

Jobs submitted: 61
Jobs started:   1003
Jobs completed: 899
Jobs canceled:  10
Jobs failed:    0

Job states ts:  Fri Jan 17 21:26:25 2025 (1737167185)
Jobs pending:   8960
Jobs running:   8308

Main schedule statistics (microseconds):
        Last cycle:   124494
        Max cycle:    124494
        Total cycles: 225
        Mean cycle:   78342
        Mean depth cycle:  87
        Cycles per minute: 15
        Last queue length: 4124

Main scheduler exit:
        End of job queue:225
        Hit default_queue_depth: 0
        Hit sched_max_job_start: 0
        Blocked on licenses: 0
        Hit max_rpc_cnt: 0
        Timeout (max_sched_time): 0

Backfilling stats
        Total backfilled jobs (since last slurm start): 250
        Total backfilled jobs (since last stats cycle start): 250
        Total backfilled heterogeneous job components: 0
        Total cycles: 26
        Last cycle when: Fri Jan 17 21:26:42 2025 (1737167202)
        Last cycle: 3701744
        Max cycle:  4180596
        Mean cycle: 3847068
        Last depth cycle: 4146
        Last depth cycle (try sched): 150
        Depth Mean: 4351
        Depth Mean (try depth): 149
        Last queue length: 4137
        Queue length mean: 4346
        Last table size: 119
        Mean table size: 116

Backfill exit
        End of job queue:26
        Hit bf_max_job_start: 0
        Hit bf_max_job_test: 0
        System state changed: 0
        Hit table size limit (bf_node_space_size): 0
        Timeout (bf_max_time): 0

Latency for 1000 calls to gettimeofday(): 17 microseconds

Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO                  ( 2009) count:5571   ave_time:4503   total_time:25091591
        REQUEST_NODE_INFO                       ( 2007) count:2788   ave_time:128049 total_time:357001586
        REQUEST_FED_INFO                        ( 2049) count:1778   ave_time:112    total_time:200604
        REQUEST_JOB_INFO_SINGLE                 ( 2021) count:1759   ave_time:133832 total_time:235412193
        REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:1670   ave_time:131661 total_time:219874502
        MESSAGE_EPILOG_COMPLETE                 ( 6012) count:1642   ave_time:13540  total_time:22232837
        REQUEST_STEP_COMPLETE                   ( 5016) count:1634   ave_time:48413  total_time:79107607
        MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:1627   ave_time:315533 total_time:513373614
        REQUEST_COMPLETE_PROLOG                 ( 6018) count:978    ave_time:21294  total_time:20825828
        REQUEST_JOB_INFO                        ( 2003) count:754    ave_time:324072 total_time:244350623
        REQUEST_NODE_INFO_SINGLE                ( 2040) count:369    ave_time:52740  total_time:19461259
        REQUEST_DBD_RELAY                       ( 1028) count:72     ave_time:150    total_time:10810
        REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:41     ave_time:269007 total_time:11029290
        REQUEST_JOB_USER_INFO                   ( 2039) count:23     ave_time:160139 total_time:3683213
        REQUEST_SHARE_INFO                      ( 2022) count:16     ave_time:50626  total_time:810031
        REQUEST_STATS_INFO                      ( 2035) count:15     ave_time:380    total_time:5701
        REQUEST_JOB_READY                       ( 4019) count:13     ave_time:120    total_time:1570
        REQUEST_KILL_JOB                        ( 5032) count:11     ave_time:100242 total_time:1102668
        REQUEST_RESOURCE_ALLOCATION             ( 4001) count:7      ave_time:480808 total_time:3365659
        REQUEST_COMPLETE_JOB_ALLOCATION         ( 5017) count:1      ave_time:1684   total_time:1684
        REQUEST_JOB_ALLOCATION_INFO             ( 4014) count:1      ave_time:101    total_time:101

Remote Procedure Call statistics by user
        root            (       0) count:14221  ave_time:89250  total_time:1269227812
        ameterez        (   65882) count:1240   ave_time:171924 total_time:213186050
        ktyssowski      (   20332) count:513    ave_time:37370  total_time:19170968
        bdelwood        (   62707) count:492    ave_time:32901  total_time:16187336
        rishii          (   60427) count:480    ave_time:57741  total_time:27715826
        jjensen         (   30306) count:479    ave_time:57474  total_time:27530310
        oomoruyi        (   60299) count:479    ave_time:55698  total_time:26679431
        sbrielle        (   60563) count:360    ave_time:21458  total_time:7725104
        xyang338        (   66072) count:354    ave_time:16176  total_time:5726492
        tzeng           (   64498) count:243    ave_time:64091  total_time:15574306
        amoulana        (   20818) count:243    ave_time:63890  total_time:15525309
        ylh202          (   22451) count:243    ave_time:61360  total_time:14910597
        axzhu           (   60693) count:242    ave_time:68780  total_time:16644810
        rdang           (   60884) count:242    ave_time:73268  total_time:17730931
        kjia            (   62052) count:240    ave_time:98220  total_time:23572815
        mcheng1         (   61588) count:182    ave_time:39220  total_time:7138095
        xlwang          (   64283) count:118    ave_time:53045  total_time:6259404
        ruzhang         (  558295) count:90     ave_time:3462   total_time:311589
        thuonghoang     (   64556) count:60     ave_time:60682  total_time:3640976
        lestrada        (   60915) count:56     ave_time:31876  total_time:1785067
        yifanli         (   66296) count:56     ave_time:47665  total_time:2669240
        wonjung         (   14455) count:42     ave_time:9473   total_time:397888
        ycchen          (   60940) count:15     ave_time:72416  total_time:1086242
        pryke           (   40838) count:9      ave_time:61298  total_time:551685
        ani             (   64281) count:9      ave_time:364676 total_time:3282087
        jiling          (   30365) count:9      ave_time:402363 total_time:3621274
        ruichen         (   65316) count:7      ave_time:24212  total_time:169489
        ssong33         (   13928) count:6      ave_time:415050 total_time:2490302
        ankitbiswas     (   65426) count:6      ave_time:17236  total_time:103418
        cmohri          (   64201) count:6      ave_time:238500 total_time:1431001
        guzman          (   65986) count:6      ave_time:3165   total_time:18992
        sjelassi        (   64203) count:5      ave_time:404156 total_time:2020783
        mchase          (   12043) count:5      ave_time:98967  total_time:494836
        mhwang          (   61480) count:4      ave_time:479695 total_time:1918781
        treyscott       (   64602) count:3      ave_time:41304  total_time:123914
        katrinabrown    (   61836) count:3      ave_time:63021  total_time:189063
        xmorgan         (   56634) count:1      ave_time:121449 total_time:121449
        ebauer          (   30178) count:1      ave_time:9299   total_time:9299

Pending RPC statistics
        SRUN_TIMEOUT                            ( 7002) count:81
        REQUEST_TERMINATE_JOB                   ( 6011) count:41
        REQUEST_LAUNCH_PROLOG                   ( 6017) count:25
        REQUEST_BATCH_JOB_LAUNCH                ( 4005) count:25
        SRUN_JOB_COMPLETE                       ( 7004) count:30
        REQUEST_KILL_TIMELIMIT                  ( 6009) count:28
        REQUEST_HEALTH_CHECK                    ( 1011) count:1

Pending RPCs
         1: SRUN_TIMEOUT                         holy7c12108
         2: SRUN_TIMEOUT                         holy8a26108
         3: REQUEST_TERMINATE_JOB                holy8a32107
         4: REQUEST_TERMINATE_JOB                holy8a27207
         5: REQUEST_TERMINATE_JOB                holy7c16509
         6: REQUEST_TERMINATE_JOB                holy8a26109
         7: REQUEST_TERMINATE_JOB                holy8a32107
         8: REQUEST_TERMINATE_JOB                holy8a32108
         9: REQUEST_TERMINATE_JOB                holy8a31408
        10: REQUEST_LAUNCH_PROLOG                holy8a26410
        11: REQUEST_BATCH_JOB_LAUNCH             holy8a26410
        12: REQUEST_LAUNCH_PROLOG                holy8a26410
        13: REQUEST_BATCH_JOB_LAUNCH             holy8a26410
        14: REQUEST_LAUNCH_PROLOG                holy8a26410
        15: REQUEST_BATCH_JOB_LAUNCH             holy8a26410
        16: REQUEST_LAUNCH_PROLOG                holy8a26410
        17: REQUEST_BATCH_JOB_LAUNCH             holy8a26410
        18: REQUEST_LAUNCH_PROLOG                holy8a26410
        19: REQUEST_BATCH_JOB_LAUNCH             holy8a26410
        20: REQUEST_TERMINATE_JOB                holy7c16503
        21: REQUEST_TERMINATE_JOB                holy8a32306
        22: REQUEST_TERMINATE_JOB                holy8a32111
        23: SRUN_JOB_COMPLETE                    holygpu8a22603
        24: REQUEST_TERMINATE_JOB                holygpu8a22603
        25: REQUEST_TERMINATE_JOB                holy8a32308
Comment 1 Paul Edmon 2025-01-17 19:32:32 MST
Created attachment 40442
Graph of agent queue and server threads
Comment 2 Paul Edmon 2025-01-17 19:33:32 MST
Created attachment 40443
Current slurm.conf
Comment 3 Paul Edmon 2025-01-17 19:33:47 MST
Created attachment 40444
Current topology.conf
Comment 4 Paul Edmon 2025-01-20 08:57:14 MST
Looks like lowering the max_rpc_cnt to 32 did the trick. Things are much more stable. I'm resolving this ticket.
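
In case this regresses, a quick way to keep an eye on the counters that were pegged during the wedge (just a sketch; the 60-second interval is arbitrary):

    watch -n 60 "sdiag | grep -E 'Server thread count|Agent queue size|Agent thread count'"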