Ticket 9482

Summary: very large processing time
Product: Slurm Reporter: Sophie Créno <sophie.creno>
Component: slurmctld    Assignee: Chad Vizino <chad>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: bart
Version: 19.05.7   
Hardware: Linux   
OS: Linux   
Site: Institut Pasteur
Attachments: slurm.conf
slurmctld logs
sinfo output
sdiag output
sdiag after second scontrol reconfigure
sdiag output done on the 4th of August 4:25pm

Description Sophie Créno 2020-07-29 08:52:30 MDT
Created attachment 15215 [details]
slurm.conf

Hello,

  We have noticed slowness in commands such as

scontrol reconfigure

(up to 35s) on our preproduction cluster maestro (19.05.7).
We also have warnings and errors in slurmctld logs such as
                                                                                                  
2020-07-29T16:27:23+02:00 maestro-sched slurmctld[31760]: Warning: Note very large processing time from _slurm_rpc_node_registration: usec=5005840 began=16:27:18.964                                                                                                
2020-07-29T16:27:23+02:00 maestro-sched slurmctld[31760]: Warning: Note very large processing time from _slurmctld_background: usec=4017503 began=16:27:19.952                                                                                                       
2020-07-29T16:27:24+02:00 maestro-sched slurmctld[31760]: Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=3025470 began=16:27:20.975   

or

2020-07-29T16:28:49+02:00 maestro-sched slurmctld[31760]: agent/is_node_resp: node:maestro-1033 RPC:REQUEST_TERMINATE_JOB : Socket timed out on send/recv operation                                                                                                  
2020-07-29T16:28:49+02:00 maestro-sched slurmctld[31760]: agent/is_node_resp: node:maestro-1033 RPC:REQUEST_TERMINATE_JOB : Socket timed out on send/recv operation                                                                                                  
2020-07-29T16:28:49+02:00 maestro-sched slurmctld[31760]: agent/is_node_resp: node:maestro-1026 RPC:REQUEST_TERMINATE_JOB : Socket timed out on send/recv operation                                                                                                  
2020-07-29T16:28:49+02:00 maestro-sched slurmctld[31760]: agent/is_node_resp: node:maestro-1022 RPC:REQUEST_TERMINATE_JOB : Socket timed out on send/recv operation                                                                                                  
2020-07-29T16:28:49+02:00 maestro-sched slurmctld[31760]: agent/is_node_resp: node:maestro-1008 RPC:REQUEST_TERMINATE_JOB : Socket timed out on send/recv operation 

At the moment, we have 31 nodes with 96 cores each that are fully used
(see sinfo output in attachment) by 2 job arrays. The array tasks are
rather long jobs (1 to 3 days) with differing durations, so they
definitely do not all complete at the same time. They are not CPU
intensive all the time either.

I have also attached an sdiag output. maestro-sched is a bare-metal
machine running both slurmctld and slurmdbd, and it is absolutely not
overloaded.

  Could you please advise us on how to avoid these errors and warnings?
As you can see, the SchedulerParameters are more or less those suggested
for High Throughput Computing.
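
For reference, the scheduler tuning in question is the following line from our slurm.conf (the same values slurmctld echoes in its log at reconfigure):

```conf
# SchedulerParameters from slurm.conf, loosely following the
# High Throughput Computing tuning guidance:
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=200,sched_min_interval=2000000
```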

Thanks in advance,
Comment 1 Sophie Créno 2020-07-29 08:53:27 MDT
Created attachment 15216 [details]
slurmctld logs
Comment 2 Sophie Créno 2020-07-29 08:53:59 MDT
Created attachment 15217 [details]
sinfo output
Comment 3 Sophie Créno 2020-07-29 08:54:28 MDT
Created attachment 15218 [details]
sdiag output
Comment 6 Chad Vizino 2020-07-29 15:57:52 MDT
I'm looking into this.

 -Chad
Comment 7 Chad Vizino 2020-07-29 18:35:20 MDT
The info you sent was helpful. From your sdiag output, the main thing that stands out is this:

---
*******************************************************
sdiag output at Wed Jul 29 16:26:58 2020 (1596032818)
Data since      Wed Jul 29 02:00:00 2020 (1595980800)
*******************************************************
  ...
Remote Procedure Call statistics by message type
        MESSAGE_COMPOSITE                       (11001) count:650729 ave_time:137186 total_time:89271001097
---

This call relates to message aggregation (see MsgAggregationParams in your slurm.conf) and shows that slurmctld is doing a lot of work handling it. Note that the time units are microseconds.
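To put that total in perspective, here is a quick back-of-the-envelope conversion (my arithmetic, not part of the sdiag output):

```shell
# MESSAGE_COMPOSITE total_time:89271001097 is in microseconds.
# Converting to hours shows roughly a full day of cumulative slurmctld
# time spent on this one RPC type since the counters were reset:
awk 'BEGIN { printf "%.1f hours\n", 89271001097 / 1e6 / 3600 }'
# prints: 24.8 hours
```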

A recent message aggregation fix went into 19.05.7, so you should already have it. However, there may be additional issues that have not yet been addressed.

Would it be possible for you to disable message aggregation to see if this helps your situation? This requires restarting all the Slurm daemons on all nodes and is not advisable while jobs are running, so you may want to wait until you have some downtime.
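
A minimal sketch of that change (assuming MsgAggregationParams is a single line in slurm.conf, that the daemons run under systemd, and a hypothetical /etc/slurm/slurm.conf path and node range; adjust all of these for your site):

```shell
# Comment out the MsgAggregationParams line (slurm.conf must be kept
# identical on the controller and all compute nodes).
sed -i.bak 's/^MsgAggregationParams/#MsgAggregationParams/' /etc/slurm/slurm.conf

# Restart every Slurm daemon -- a plain "scontrol reconfigure" is not
# enough for this parameter to take effect.
systemctl restart slurmctld                               # on the controller
# pdsh -w maestro-[1001-1034] 'systemctl restart slurmd'  # hypothetical node range

# Verify the parameter is now unset:
scontrol show conf | grep -i aggregation
# expect: MsgAggregationParams    = (null)
```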
Comment 8 Sophie Créno 2020-07-30 02:49:06 MDT
Hello and thanks for your quick answer,

Given the number of tasks in the currently running job array and
the time they need to complete, I'll try to work out a trade-off
with the user so we can restart the daemons over the weekend.

Have a nice day,
Comment 9 Sophie Créno 2020-08-03 02:50:45 MDT
Hello,

I've changed slurm.conf,

$ scontrol show conf | grep -i aggregation                                                                              
MsgAggregationParams    = (null)

restarted slurmctld and the slurmd on all nodes and did a
scontrol reconfigure.

I then released the jobs. But if I launch a new scontrol reconfigure
(without any change to slurm.conf), it still takes a long time:

$ time scontrol reconfigure                                                                                       

real    1m20.187s
user    0m0.002s
sys     0m0.003s


$ time scontrol reconfigure

real    0m35.119s
user    0m0.003s
sys     0m0.001s


So of course I have the corresponding warnings in the logs:

2020-08-03T10:27:15+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _set_slurmd_addr: usec=35050418 began=10:26:40.948                                            
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from read_slurm_conf: usec=35097971 began=10:26:40.945                                             
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _slurm_rpc_job_pack_alloc_info: usec=17300526 began=10:26:58.742                              
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _slurm_rpc_job_pack_alloc_info: usec=15521191 began=10:27:00.521                              
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _slurm_rpc_reconfigure_controller: usec=35106756 began=10:26:40.945                           
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from dump_all_job_state: usec=32080389 began=10:26:44.001                                          
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=200,sched_min_interval=2000000
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=34919542 began=10:26:41.162                                    
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _slurm_rpc_epilog_complete: usec=32967448 began=10:26:43.114                                  
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _slurm_rpc_complete_batch_script: usec=34894060 began=10:26:41.188                            
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _slurm_rpc_epilog_complete: usec=33004328 began=10:26:43.115            


It also seems that the reconfigure caused some jobs to be killed
(this happened at the same time as the second scontrol reconfigure):

2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_30110(285126) on node maestro-1008                                                                             
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_29670(284686) on node maestro-1008                                                                             
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_30342(285358) on node maestro-1012                                                                             
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_28956(283972) on node maestro-1011                                                                             
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_32769(287785) on node maestro-1004                                                                             
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_32723(287739) on node maestro-1007                                                                             
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_30036(285052) on node maestro-1006                                                                             
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_30022(285038) on node maestro-1006                                                                             
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_28893(283909) on node maestro-1020                                                                             
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_28855(283871) on node maestro-1016                                                                             
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_28907(283923) on node maestro-1017                                                                             
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_32512(287528) on node maestro-1015                                                                             
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_28804(283820) on node maestro-1014                                                                             
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_30731(285747) on node maestro-1034                                                                             
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_30363(285379) on node maestro-1033                                                                             
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_29076(284092) on node maestro-1023                                                                             
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug:  Killing JobId=204075_30724(285740) on node maestro-1029      


I attach a new sdiag output. 

Thanks for your help,
Comment 10 Sophie Créno 2020-08-03 02:58:15 MDT
Created attachment 15272 [details]
sdiag after second scontrol reconfigure

The load of the machine was near 0, slurmctld was at 2 or 3% of CPU.
Comment 11 Chad Vizino 2020-08-03 10:04:08 MDT
It appears that message aggregation is no longer the issue.

From the sdiag output you supplied, the two biggest time consumers seem to be related to job ends (and possibly starts). Here are the RPC statistics by message type, re-sorted from largest total time to smallest:

>Remote Procedure Call statistics by message type
>
>        REQUEST_STEP_COMPLETE                   ( 5016) count: 18792  ave_time: 61077  total_time: 1147775841
>        REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count: 9396   ave_time: 104812 total_time:  984816565
>        REQUEST_JOB_PACK_ALLOC_INFO             ( 4027) count: 12463  ave_time: 9331   total_time:  116303904
>        REQUEST_RECONFIGURE                    ( 1003) count: 2      ave_time: 57658152 total_time: 115316304
>        REQUEST_JOB_STEP_CREATE                 ( 5001) count: 12463  ave_time: 7552   total_time:   94121707
>        MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count: 950    ave_time: 97993  total_time:   93093454
>        MESSAGE_EPILOG_COMPLETE                 ( 6012) count: 9396   ave_time: 8236   total_time:   77391652
>        REQUEST_COMPLETE_PROLOG                 ( 6018) count: 12468  ave_time: 525    total_time:    6551868
>        REQUEST_JOB_INFO                        ( 2003) count: 2      ave_time: 5058   total_time:      10117
>        REQUEST_NODE_INFO                       ( 2007) count: 14     ave_time: 454    total_time:       6362
>        REQUEST_PARTITION_INFO                  ( 2009) count: 16     ave_time: 299    total_time:       4785
>        MESSAGE_COMPOSITE                      ( 11001) count: 1      ave_time: 1640   total_time:       1640
>        REQUEST_UPDATE_NODE                     ( 3002) count: 1      ave_time: 895    total_time:        895
>        REQUEST_BUILD_INFO                      ( 2001) count: 1      ave_time: 426    total_time:        426
>        REQUEST_STATS_INFO                      ( 2035) count: 1      ave_time: 338    total_time:        338
>        REQUEST_PING                            ( 1008) count: 1      ave_time: 162    total_time:        162

Could you send a full slurmctld.log so I can look closer? Also, if you could send another sdiag output, I'd be interested to see if/how the numbers have changed since your last one.

Thanks.
Comment 12 Sophie Créno 2020-08-04 08:28:34 MDT
Hello,

  I had to wait because I discovered that all the new job array tasks
completed with FAILED status. I checked several things to understand
the reason for this new behavior and put a hold on the job arrays while
investigating.
I figured out that it was due to a change the user had made in the
config file of the program run by the job array tasks, in an attempt
to prevent them from running.
  Now that this config file is back to normal and the job array tasks
complete successfully, I notice that scontrol reconfigure is instantaneous:

$ time scontrol reconfigure

real    0m0.070s
user    0m0.003s
sys     0m0.001s

$ time scontrol reconfigure

real    0m0.073s
user    0m0.002s
sys     0m0.002s

I don't have any warnings in the logs anymore either. I am attaching
another sdiag output.

  I don't see how that could have interfered with Slurm's processing time,
but it's the only thing that has changed since my last message.
Comment 13 Sophie Créno 2020-08-04 08:29:30 MDT
Created attachment 15297 [details]
sdiag output done on the 4th of August 4:25pm
Comment 14 Chad Vizino 2020-08-04 16:03:06 MDT
I'm glad things are better. Let's have you continue running without message aggregation for a few more days and see how things go.
Comment 15 Sophie Créno 2020-08-05 02:36:55 MDT
Okay, let's do that.

For my information, do you know approximately when the next 20.02 release
is supposed to be announced? I would like to upgrade our preproduction
cluster when I have the opportunity.

Thanks and have a nice day,
Comment 16 Chad Vizino 2020-08-05 09:23:09 MDT
We expect 20.02.4 to be released soon, but we don't have a specific date yet, as we are waiting for a few important fixes to be worked out first so we can include them. I'll provide an update once we have something more precise.

Also note that when 20.02.3 (latest version) was released, this statement was made:

--
Message Aggregation is off in Slurm by default, and is only enabled by
MsgAggregationParams=WindowMsgs=<msgs>, where <msgs> is greater than 1.
(Using Message Aggregation on your systems is not a recommended
configuration at this time, and we may retire this subsystem in a future
Slurm release in favor of other RPC aggregation techniques. Although
care must be taken before disabling this to avoid communication issues.)
--
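
Concretely, the enabling syntax that statement refers to is a single slurm.conf line (the WindowMsgs value here is illustrative only):

```conf
# Message aggregation is only enabled by a line of this form with
# WindowMsgs greater than 1; the value 24 is illustrative.
#MsgAggregationParams=WindowMsgs=24
# Leaving the line absent or commented out (as above) keeps
# aggregation off, which is the default.
```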

So, if things continue well for you without it, you may want to leave it that way.

 -Chad
Comment 17 Sophie Créno 2020-08-05 09:33:05 MDT
> So, if things continue well for you without it, you may want to leave
> it that way.

  Definitely.

  Thanks for highlighting that statement. I hope 20.02.4 will be released
by the end of August. Fingers crossed.

  Have a nice day,
Comment 18 Chad Vizino 2020-08-05 14:18:35 MDT
(In reply to Sophie Créno from comment #17)
> Hope that 20.02.4 will be released
> by the end of August. Finger crossed.
Good news: 20.02.4 is scheduled to be released today! If you have not already, you may want to subscribe to the slurm-users mailing list. Release announcements are posted there. More info here:

https://slurm.schedmd.com/mail.html

 -Chad
Comment 19 Chad Vizino 2020-08-11 08:48:14 MDT
I'm going to close this out for now. Feel free to submit a new ticket if you encounter future problems.

 -Chad