| Summary: | Slurmctld 100% CPU and intermittent connection issues | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | lhuang |
| Component: | slurmctld | Assignee: | Tim McMullan <mcmullan> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 20.11.0 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NY Genome | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm conf, slurmctld logs, sdiag output | | |
Created attachment 19982 [details]
slurm ctld logs
Created attachment 19983 [details]
sdiag output
Can you confirm the version of slurmctld you are running?

The first thing I'm noticing in sdiag is that you have a fairly large number of REQUEST_PARTITION_INFO and REQUEST_NODE_INFO_SINGLE RPCs. These generally come from user commands like squeue and sinfo. If you have users who watch queues or their particular nodes, or who make these calls inside of jobs, it can cause the slurmctld to slow down. I would suggest trying to reduce the number of these queries.

I'm still looking through the slurmctld log to see what information is there; I'll let you know when I have more info!

Thanks!
--Tim

This is the version we are on.

[root@pe2-slurm01 ~]# sinfo -V
slurm 20.11.0

I suspect the large RPC counts might be coming from workflow tools such as Cromwell and Snakemake. They help users submit large numbers of jobs, and they also constantly probe the cluster to see whether the jobs have completed. We will see if we can get the users to tweak the polling parameters.

Thanks!

(In reply to lhuang from comment #4)
> This is the version we are on.
> [root@pe2-slurm01 ~]# sinfo -V
> slurm 20.11.0

Thanks for the confirmation! From the logs it appears that you are running into a couple of bugs that have been fixed or improved upon in later 20.11 releases. I would suggest upgrading to the latest 20.11 release, but note the further behavior change in 20.11.3 regarding srun. These are the errors I believe have seen improvements:

    error: _slurm_rpc_complete_batch_script: Could not find batch step for...
    error: _find_node_record(764): lookup failure for...

> I suspect the large rpc counts might be coming from jobs such as cromwell
> and snakemake. They assist users to submit large amount of jobs and it also
> constantly probes the cluster to see if they have completed. We will see if
> we can get the users to tweak the polling parameters.

This sounds like a promising lead on those RPCs!

Looking at the sdiag output and slurm.conf together, I think the backfill scheduler could use some tuning.
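As an editorial aside on reducing poll-driven RPC load: workflow engines and in-job scripts that call squeue in a tight loop can be throttled with a simple backoff. A minimal sketch, assuming a POSIX shell; the function name, 300-second cap, and the commented squeue loop are illustrative, not from this ticket:

```shell
# Hypothetical exponential-backoff helper for job-status polling.
# Doubling the wait interval (capped at 300 s) means a long-running job
# generates far fewer squeue/sinfo RPCs against slurmctld.
next_interval() {
    cur=$1
    next=$((cur * 2))
    if [ "$next" -gt 300 ]; then
        next=300
    fi
    echo "$next"
}

# Usage sketch (commented out; requires a live Slurm cluster and $JOBID):
# interval=15
# while squeue --noheader -j "$JOBID" | grep -q .; do
#     sleep "$interval"
#     interval=$(next_interval "$interval")
# done
```

The same idea applies to Cromwell and Snakemake, both of which expose configurable polling intervals.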
It's running for a very long time, which again can make user commands appear slow or unresponsive. We can try these changes before an upgrade to see how much difference they make. I'd like to start with:

SchedulerParameters=bf_continue,bf_interval=45,bf_window=20160,bf_resolution=300,bf_yield_interval=1000000,bf_max_job_test=1000,nohold_on_prolog_fail

The notable changes here are (with a little bit of why):

- Addition of bf_interval=45 and removal of bf_max_time=120: a long max time like this can starve out other RPCs. Changing bf_interval to 45 slightly decreases the frequency at which the backfill scheduler runs and limits it to 45 seconds of run time, which should give job submissions and user commands more time to be processed.
- Addition of bf_yield_interval=1000000: this has the backfill scheduler pause briefly more often to let other RPCs get serviced.
- Addition of bf_resolution=300: this coarsens the time resolution used when testing whether each job can run. The trade-off is that it is slightly less precise about when a job can actually start, but it should make getting through a long queue much faster and less problematic.

Sorry about the novella-length response!

--Tim

Hi Tim,

Thanks for the detailed information; it's good to understand what we are changing. Since the change, we've not seen any outages and the cluster is more responsive. We are also seeing much shorter backfill cycles. Here's the latest sdiag. Feel free to close the ticket.
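For reference, the SchedulerParameters change suggested above lives in slurm.conf; a minimal fragment (the SchedulerType line is illustrative context and not part of the suggestion in this ticket):

```
# slurm.conf fragment -- backfill tuning as suggested in this ticket.
# SchedulerType shown only for context (sched/backfill is the default).
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_interval=45,bf_window=20160,bf_resolution=300,bf_yield_interval=1000000,bf_max_job_test=1000,nohold_on_prolog_fail
```

After editing the file, `scontrol reconfigure` should pick up the new SchedulerParameters without restarting slurmctld.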
```
*******************************************************
sdiag output at Thu Jun 17 10:16:08 2021 (1623939368)
Data since      Wed Jun 16 20:36:30 2021 (1623890190)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 4589
Jobs started:   5165
Jobs completed: 4677
Jobs canceled:  37
Jobs failed:    0

Job states ts:  Thu Jun 17 10:15:44 2021 (1623939344)
Jobs pending:   1941
Jobs running:   320

Main schedule statistics (microseconds):
    Last cycle:        142908
    Max cycle:         432444
    Total cycles:      7103
    Mean cycle:        167257
    Mean depth cycle:  91
    Cycles per minute: 8
    Last queue length: 253

Backfilling stats
    Total backfilled jobs (since last slurm start): 1774
    Total backfilled jobs (since last stats cycle start): 1774
    Total backfilled heterogeneous job components: 0
    Total cycles:      1048
    Last cycle when:   Thu Jun 17 10:15:38 2021 (1623939338)
    Last cycle:        707272
    Max cycle:         16825802
    Mean cycle:        1361145
    Last depth cycle:  253
    Last depth cycle (try sched): 253
    Depth Mean:        331
    Depth Mean (try depth): 331
    Last queue length: 253
    Queue length mean: 328
    Last table size:   4
    Mean table size:   2

Latency for 1000 calls to gettimeofday(): 712 microseconds

Remote Procedure Call statistics by message type
    REQUEST_PARTITION_INFO           ( 2009) count:140747 ave_time:500    total_time:70496373
    REQUEST_NODE_INFO_SINGLE         ( 2040) count:125488 ave_time:10226  total_time:1283270778
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:49393  ave_time:14888  total_time:735383074
    REQUEST_NODE_INFO                ( 2007) count:9858   ave_time:17710  total_time:174587734
    REQUEST_STEP_COMPLETE            ( 5016) count:5297   ave_time:188412 total_time:998018906
    MESSAGE_EPILOG_COMPLETE          ( 6012) count:5288   ave_time:692767 total_time:3663354051
    REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:5280   ave_time:304777 total_time:1609222582
    REQUEST_COMPLETE_PROLOG          ( 6018) count:5165   ave_time:150587 total_time:777782389
    REQUEST_JOB_INFO                 ( 2003) count:4938   ave_time:190220 total_time:939306855
    REQUEST_STATS_INFO               ( 2035) count:4919   ave_time:320    total_time:1578565
    REQUEST_UPDATE_NODE              ( 3002) count:4734   ave_time:6474   total_time:30651187
    REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:562    ave_time:7706   total_time:4331231
    REQUEST_FED_INFO                 ( 2049) count:501    ave_time:206    total_time:103627
    REQUEST_JOB_INFO_SINGLE          ( 2021) count:388    ave_time:50136  total_time:19452826
    REQUEST_JOB_USER_INFO            ( 2039) count:111    ave_time:54517  total_time:6051405
    REQUEST_KILL_JOB                 ( 5032) count:37     ave_time:19598  total_time:725148
    REQUEST_JOB_READY                ( 4019) count:21     ave_time:23653  total_time:496720
    REQUEST_JOB_STEP_CREATE          ( 5001) count:15     ave_time:18968  total_time:284533
    REQUEST_CANCEL_JOB_STEP          ( 5005) count:14     ave_time:7087   total_time:99230
    REQUEST_RESOURCE_ALLOCATION      ( 4001) count:13     ave_time:55337  total_time:719388
    REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:10     ave_time:4886   total_time:48863
    REQUEST_HET_JOB_ALLOC_INFO       ( 4027) count:3      ave_time:108276 total_time:324828
    REQUEST_RECONFIGURE              ( 1003) count:1      ave_time:819124 total_time:819124
    REQUEST_JOB_ALLOCATION_INFO      ( 4014) count:1      ave_time:178    total_time:178

Remote Procedure Call statistics by user
    root           (     0) count:360569 ave_time:28513  total_time:10281049961
    zgajic         ( 50121) count:879    ave_time:22058  total_time:19389533
    jrahman        ( 50867) count:558    ave_time:11242  total_time:6273469
    tlanganay      ( 50994) count:261    ave_time:7024   total_time:1833438
    schoudhary     ( 50929) count:209    ave_time:7052   total_time:1474002
    mbyrska-bishop ( 20158) count:104    ave_time:11614  total_time:1207945
    agalianese     ( 50993) count:20     ave_time:14052  total_time:281043
    wliao          (  5042) count:19     ave_time:2858   total_time:54318
    cdurmaz        ( 50800) count:18     ave_time:15311  total_time:275604
    lhuang         ( 20116) count:18     ave_time:62308  total_time:1121557
    rmurray        ( 91015) count:16     ave_time:80080  total_time:1281295
    skasela        ( 50484) count:14     ave_time:11931  total_time:167044
    lkluegel       ( 50879) count:13     ave_time:1796   total_time:23356
    tprieto        ( 50986) count:12     ave_time:29925  total_time:359104
    awidman        ( 50507) count:11     ave_time:85422  total_time:939652
    abasile        ( 20198) count:11     ave_time:6835   total_time:75192
    ylee           ( 20052) count:10     ave_time:30881  total_time:308815
    fgaiti         ( 50293) count:10     ave_time:22775  total_time:227751
    fizzo          ( 50130) count:7      ave_time:3457   total_time:24199
    jmorris        ( 50680) count:4      ave_time:136179 total_time:544719
    dyuan          ( 50990) count:4      ave_time:4759   total_time:19037
    vle            ( 50574) count:3      ave_time:47363  total_time:142091
    smarks         ( 20054) count:3      ave_time:1521   total_time:4565
    pchamely       ( 50807) count:3      ave_time:1454   total_time:4362
    daknowles      ( 30842) count:3      ave_time:280    total_time:840
    ashaiber       ( 50890) count:3      ave_time:2041   total_time:6124
    tlin           ( 91013) count:2      ave_time:10289  total_time:20579
```

Awesome! I'm glad to hear that things are running better now. I'll close this ticket out, but let us know if you encounter any other issues!

Thanks!
--Tim
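As a general note, per-type RPC tables like the one above are easy to rank by cost. A small sketch, assuming the `count:`/`total_time:` layout shown in this sdiag output (the function name and the abridged sample lines are illustrative):

```shell
# Sketch: rank sdiag RPC lines by total_time, largest first.
# Reads sdiag-style lines on stdin; prints "total_time name".
summarize_rpcs() {
    awk '/count:.*total_time:/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^total_time:/) {
                split($i, t, ":")
                print t[2], $1
            }
    }' | sort -rn
}

# Abridged sample taken from the sdiag output in this ticket:
printf '%s\n' \
  'REQUEST_PARTITION_INFO ( 2009) count:140747 ave_time:500 total_time:70496373' \
  'REQUEST_NODE_INFO_SINGLE ( 2040) count:125488 ave_time:10226 total_time:1283270778' \
  | summarize_rpcs
# prints:
# 1283270778 REQUEST_NODE_INFO_SINGLE
# 70496373 REQUEST_PARTITION_INFO
```

Run against the full dump, this makes it obvious which message types (here, node and partition info queries) dominate slurmctld's time.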
Created attachment 19981 [details]
slurm conf

Hi,

Our users have problems issuing simple commands to the cluster, such as squeue and sbatch. The problem seems to be intermittent, and we have around 30-40k jobs; it seems to get worse as more jobs are added to the cluster. I'll attach the slurm conf and slurmctld logs.