Ticket 11849

Summary: Slurmctld 100% cpu and intermittent connection issues
Product: Slurm Reporter: lhuang
Component: slurmctld Assignee: Tim McMullan <mcmullan>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 20.11.0   
Hardware: Linux   
OS: Linux   
Site: NY Genome
Attachments: slurm conf
slurm ctld logs
sdiag output

Description lhuang 2021-06-16 10:47:29 MDT
Created attachment 19981 [details]
slurm conf

Hi,

Our users have been having problems issuing simple commands to the cluster, such as squeue and sbatch.

The problem seems to be intermittent, and we have around 30-40k jobs. It seems to get worse as more jobs are added to the cluster.

I'll attach the slurm conf and slurmctld logs.
Comment 1 lhuang 2021-06-16 10:48:23 MDT
Created attachment 19982 [details]
slurm ctld logs
Comment 2 lhuang 2021-06-16 10:48:47 MDT
Created attachment 19983 [details]
sdiag output
Comment 3 Tim McMullan 2021-06-16 12:00:55 MDT
Can you confirm the version of slurmctld you are running?

The first thing I'm noticing in sdiag is that you have a fairly large number of REQUEST_PARTITION_INFO and REQUEST_NODE_INFO_SINGLE RPCs. These generally come from user commands like squeue and sinfo.  If you have users who watch queues or their particular nodes, or make these calls inside of jobs, it can cause the slurmctld to slow down.  I would suggest trying to reduce the number of these queries.
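If it helps triage, the RPC table from sdiag can be ranked by total time with a short script. This is just an illustration (not a Slurm tool), using a few rows excerpted from the attached sdiag output:

```python
# Sketch (not part of Slurm): rank RPC types from `sdiag` output by total time.
# SAMPLE is excerpted from the sdiag attachment on this ticket; times are in
# microseconds.
import re

SAMPLE = """
REQUEST_PARTITION_INFO                  ( 2009) count:140747 ave_time:500    total_time:70496373
REQUEST_NODE_INFO_SINGLE                ( 2040) count:125488 ave_time:10226  total_time:1283270778
MESSAGE_EPILOG_COMPLETE                 ( 6012) count:5288   ave_time:692767 total_time:3663354051
"""

def rank_rpcs(text):
    """Return (name, count, total_time_us) tuples sorted by total time, descending."""
    pat = re.compile(r"(\S+)\s+\(\s*\d+\)\s+count:(\d+)\s+ave_time:\d+\s+total_time:(\d+)")
    rows = [(m.group(1), int(m.group(2)), int(m.group(3))) for m in pat.finditer(text)]
    return sorted(rows, key=lambda r: r[2], reverse=True)

for name, count, total_us in rank_rpcs(SAMPLE):
    print(f"{name:40s} count={count:7d} total_us={total_us}")
```

Ranking by total_time (rather than count) surfaces cases like MESSAGE_EPILOG_COMPLETE here, where a modest count still dominates controller time.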

I'm still looking through the slurmctld log to see what information is there, I'll let you know when I have more info!

Thanks!
--Tim
Comment 4 lhuang 2021-06-16 12:07:48 MDT
This is the version we are on.
[root@pe2-slurm01 ~]# sinfo -V
slurm 20.11.0

I suspect the large RPC counts might be coming from workflow tools such as Cromwell and Snakemake. They help users submit large numbers of jobs, and they also constantly probe the cluster to see if jobs have completed. We will see if we can get the users to tweak the polling parameters.
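For reference, the usual shape of such a tweak is to poll less aggressively, e.g. with exponential backoff between status checks. The sketch below is purely illustrative; the function name and values are hypothetical, not actual Cromwell or Snakemake settings:

```python
# Illustrative only: back off between job-status polls to reduce RPC load on
# slurmctld. The name and interval values here are hypothetical, not settings
# from any particular workflow engine.
import itertools

def backoff_intervals(start=5.0, cap=120.0, factor=2.0):
    """Yield sleep intervals that grow geometrically each poll, up to a cap."""
    interval = start
    while True:
        yield interval
        interval = min(interval * factor, cap)

# Doubles from 5s up to a 120s ceiling, then stays there.
print(list(itertools.islice(backoff_intervals(), 7)))
```

The practical effect is that freshly submitted jobs still get checked quickly, while long-running jobs stop hammering the controller with squeue-style queries.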

Thanks!
Comment 5 Tim McMullan 2021-06-16 13:32:37 MDT
(In reply to lhuang from comment #4)
> This is the version we are on.
> [root@pe2-slurm01 ~]# sinfo -V
> slurm 20.11.0

Thanks for the confirmation!

From the logs it appears that you are running into a couple of bugs that have been fixed/improved upon in later 20.11 releases.  I would suggest upgrading to the latest 20.11 release, but note the further behavior change in 20.11.3 regarding srun.

These are errors that I believe have seen improvements:
error: _slurm_rpc_complete_batch_script: Could not find batch step for...
error: _find_node_record(764): lookup failure for...

> I suspect the large RPC counts might be coming from workflow tools such as
> Cromwell and Snakemake. They help users submit large numbers of jobs, and
> they also constantly probe the cluster to see if jobs have completed. We
> will see if we can get the users to tweak the polling parameters.

This sounds like a promising lead on those RPCs!


Looking at the sdiag and slurm.conf together, I think the backfill scheduler could use some tuning.  It's running for a very long time, which can again cause user commands to appear slow or unresponsive.  We can try these changes before an upgrade to see how much they help as well.

I'd like to start with:
SchedulerParameters=bf_continue,bf_interval=45,bf_window=20160,bf_resolution=300,bf_yield_interval=1000000,bf_max_job_test=1000,nohold_on_prolog_fail

The notable changes here are (with a little bit of why):

Addition of bf_interval=45 and removal of bf_max_time=120: a long max time like this can starve out other RPCs.  Changing bf_interval to 45 slightly decreases how often the backfill scheduler runs and limits it to 45 seconds of run time, which should give job submissions and user commands more time to be processed.

Addition of bf_yield_interval=1000000: This has the backfill scheduler pause briefly more often to let other RPCs get serviced.

Addition of bf_resolution=300:  This coarsens the time resolution the backfill scheduler plans with, reducing the number of tests needed for each job to see if it will run.  The trade-off is slightly less precision about when a job can actually start, but it should make getting through a long queue much faster and less problematic.
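As a concrete slurm.conf fragment (the values are the ones proposed above; `scontrol reconfigure` applies them without restarting the daemon, and the sdiag line afterward is just one way to watch the effect):

```shell
# In slurm.conf on the controller, replace the existing SchedulerParameters
# line with the tuned set proposed in this ticket:
#   SchedulerParameters=bf_continue,bf_interval=45,bf_window=20160,bf_resolution=300,bf_yield_interval=1000000,bf_max_job_test=1000,nohold_on_prolog_fail

# Apply the change without restarting slurmctld:
scontrol reconfigure

# Then watch backfill cycle times to confirm they drop:
sdiag | grep -A 12 "Backfilling stats"
```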


Sorry about the novella-length response!
--Tim
Comment 6 lhuang 2021-06-17 08:19:42 MDT
Hi Tim,

Thanks for the detailed information; it's good to understand what we are changing. Since the change, we've not seen any outages and the cluster is more responsive. We are also seeing much shorter backfill cycles.

Here's the latest sdiag. Feel free to close the ticket.

*******************************************************
sdiag output at Thu Jun 17 10:16:08 2021 (1623939368)
Data since      Wed Jun 16 20:36:30 2021 (1623890190)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 4589
Jobs started:   5165
Jobs completed: 4677
Jobs canceled:  37
Jobs failed:    0

Job states ts:  Thu Jun 17 10:15:44 2021 (1623939344)
Jobs pending:   1941
Jobs running:   320

Main schedule statistics (microseconds):
	Last cycle:   142908
	Max cycle:    432444
	Total cycles: 7103
	Mean cycle:   167257
	Mean depth cycle:  91
	Cycles per minute: 8
	Last queue length: 253

Backfilling stats
	Total backfilled jobs (since last slurm start): 1774
	Total backfilled jobs (since last stats cycle start): 1774
	Total backfilled heterogeneous job components: 0
	Total cycles: 1048
	Last cycle when: Thu Jun 17 10:15:38 2021 (1623939338)
	Last cycle: 707272
	Max cycle:  16825802
	Mean cycle: 1361145
	Last depth cycle: 253
	Last depth cycle (try sched): 253
	Depth Mean: 331
	Depth Mean (try depth): 331
	Last queue length: 253
	Queue length mean: 328
	Last table size: 4
	Mean table size: 2

Latency for 1000 calls to gettimeofday(): 712 microseconds

Remote Procedure Call statistics by message type
	REQUEST_PARTITION_INFO                  ( 2009) count:140747 ave_time:500    total_time:70496373
	REQUEST_NODE_INFO_SINGLE                ( 2040) count:125488 ave_time:10226  total_time:1283270778
	MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:49393  ave_time:14888  total_time:735383074
	REQUEST_NODE_INFO                       ( 2007) count:9858   ave_time:17710  total_time:174587734
	REQUEST_STEP_COMPLETE                   ( 5016) count:5297   ave_time:188412 total_time:998018906
	MESSAGE_EPILOG_COMPLETE                 ( 6012) count:5288   ave_time:692767 total_time:3663354051
	REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:5280   ave_time:304777 total_time:1609222582
	REQUEST_COMPLETE_PROLOG                 ( 6018) count:5165   ave_time:150587 total_time:777782389
	REQUEST_JOB_INFO                        ( 2003) count:4938   ave_time:190220 total_time:939306855
	REQUEST_STATS_INFO                      ( 2035) count:4919   ave_time:320    total_time:1578565
	REQUEST_UPDATE_NODE                     ( 3002) count:4734   ave_time:6474   total_time:30651187
	REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:562    ave_time:7706   total_time:4331231
	REQUEST_FED_INFO                        ( 2049) count:501    ave_time:206    total_time:103627
	REQUEST_JOB_INFO_SINGLE                 ( 2021) count:388    ave_time:50136  total_time:19452826
	REQUEST_JOB_USER_INFO                   ( 2039) count:111    ave_time:54517  total_time:6051405
	REQUEST_KILL_JOB                        ( 5032) count:37     ave_time:19598  total_time:725148
	REQUEST_JOB_READY                       ( 4019) count:21     ave_time:23653  total_time:496720
	REQUEST_JOB_STEP_CREATE                 ( 5001) count:15     ave_time:18968  total_time:284533
	REQUEST_CANCEL_JOB_STEP                 ( 5005) count:14     ave_time:7087   total_time:99230
	REQUEST_RESOURCE_ALLOCATION             ( 4001) count:13     ave_time:55337  total_time:719388
	REQUEST_COMPLETE_JOB_ALLOCATION         ( 5017) count:10     ave_time:4886   total_time:48863
	REQUEST_HET_JOB_ALLOC_INFO              ( 4027) count:3      ave_time:108276 total_time:324828
	REQUEST_RECONFIGURE                     ( 1003) count:1      ave_time:819124 total_time:819124
	REQUEST_JOB_ALLOCATION_INFO             ( 4014) count:1      ave_time:178    total_time:178

Remote Procedure Call statistics by user
	root            (       0) count:360569 ave_time:28513  total_time:10281049961
	zgajic          (   50121) count:879    ave_time:22058  total_time:19389533
	jrahman         (   50867) count:558    ave_time:11242  total_time:6273469
	tlanganay       (   50994) count:261    ave_time:7024   total_time:1833438
	schoudhary      (   50929) count:209    ave_time:7052   total_time:1474002
	mbyrska-bishop  (   20158) count:104    ave_time:11614  total_time:1207945
	agalianese      (   50993) count:20     ave_time:14052  total_time:281043
	wliao           (    5042) count:19     ave_time:2858   total_time:54318
	cdurmaz         (   50800) count:18     ave_time:15311  total_time:275604
	lhuang          (   20116) count:18     ave_time:62308  total_time:1121557
	rmurray         (   91015) count:16     ave_time:80080  total_time:1281295
	skasela         (   50484) count:14     ave_time:11931  total_time:167044
	lkluegel        (   50879) count:13     ave_time:1796   total_time:23356
	tprieto         (   50986) count:12     ave_time:29925  total_time:359104
	awidman         (   50507) count:11     ave_time:85422  total_time:939652
	abasile         (   20198) count:11     ave_time:6835   total_time:75192
	ylee            (   20052) count:10     ave_time:30881  total_time:308815
	fgaiti          (   50293) count:10     ave_time:22775  total_time:227751
	fizzo           (   50130) count:7      ave_time:3457   total_time:24199
	jmorris         (   50680) count:4      ave_time:136179 total_time:544719
	dyuan           (   50990) count:4      ave_time:4759   total_time:19037
	vle             (   50574) count:3      ave_time:47363  total_time:142091
	smarks          (   20054) count:3      ave_time:1521   total_time:4565
	pchamely        (   50807) count:3      ave_time:1454   total_time:4362
	daknowles       (   30842) count:3      ave_time:280    total_time:840
	ashaiber        (   50890) count:3      ave_time:2041   total_time:6124
	tlin            (   91013) count:2      ave_time:10289  total_time:20579
Comment 7 Tim McMullan 2021-06-17 09:22:51 MDT
Awesome!  I'm glad to hear that things are running better now.

I'll close this ticket out, but let us know if you encounter any other issues!

Thanks!
--Tim