Created attachment 15215 [details] slurm.conf

Hello,

We notice slowness of commands such as scontrol reconfigure (up to 35 s) on our preproduction cluster maestro (19.05.7). We also have warnings and errors in the slurmctld logs, such as:

2020-07-29T16:27:23+02:00 maestro-sched slurmctld[31760]: Warning: Note very large processing time from _slurm_rpc_node_registration: usec=5005840 began=16:27:18.964
2020-07-29T16:27:23+02:00 maestro-sched slurmctld[31760]: Warning: Note very large processing time from _slurmctld_background: usec=4017503 began=16:27:19.952
2020-07-29T16:27:24+02:00 maestro-sched slurmctld[31760]: Warning: Note very large processing time from _slurm_rpc_job_step_create: usec=3025470 began=16:27:20.975

or:

2020-07-29T16:28:49+02:00 maestro-sched slurmctld[31760]: agent/is_node_resp: node:maestro-1033 RPC:REQUEST_TERMINATE_JOB : Socket timed out on send/recv operation
2020-07-29T16:28:49+02:00 maestro-sched slurmctld[31760]: agent/is_node_resp: node:maestro-1033 RPC:REQUEST_TERMINATE_JOB : Socket timed out on send/recv operation
2020-07-29T16:28:49+02:00 maestro-sched slurmctld[31760]: agent/is_node_resp: node:maestro-1026 RPC:REQUEST_TERMINATE_JOB : Socket timed out on send/recv operation
2020-07-29T16:28:49+02:00 maestro-sched slurmctld[31760]: agent/is_node_resp: node:maestro-1022 RPC:REQUEST_TERMINATE_JOB : Socket timed out on send/recv operation
2020-07-29T16:28:49+02:00 maestro-sched slurmctld[31760]: agent/is_node_resp: node:maestro-1008 RPC:REQUEST_TERMINATE_JOB : Socket timed out on send/recv operation

At the moment, we have 31 nodes with 96 cores each that are fully used (see the sinfo output in attachment) by two job arrays. But the array tasks are rather long jobs (1 to 3 days) with varying durations, so they definitely don't all complete at the same time. They are not CPU intensive all the time either. I also attach an sdiag output. maestro-sched is a bare-metal machine running both slurmctld and slurmdbd, and it is absolutely not overloaded.
Could you please give us advice on how to avoid these errors and warnings? As you can see, the SchedulerParameters are more or less those suggested for High Throughput Computing.

Thanks in advance,
Created attachment 15216 [details] slurmctld logs
Created attachment 15217 [details] sinfo output
Created attachment 15218 [details] sdiag output
I'm looking into this. -Chad
The info you sent was helpful. From your sdiag output, the main thing that stands out is this:

---
*******************************************************
sdiag output at Wed Jul 29 16:26:58 2020 (1596032818)
Data since      Wed Jul 29 02:00:00 2020 (1595980800)
*******************************************************
...
Remote Procedure Call statistics by message type
	MESSAGE_COMPOSITE (11001) count:650729 ave_time:137186 total_time:89271001097
---

This call relates to message aggregation (see MsgAggregationParams in your slurm.conf) and shows slurmctld is doing a lot of work handling it. Note that the time units are microseconds. There has been some recent work on message aggregation, including a fix that went into 19.05.7, so you should already have that. However, there may be additional issues that have not been addressed. Would it be possible for you to disable message aggregation to see if this helps your situation? This will require restarting all the Slurm daemons on all nodes and is not advisable while jobs are running, so you may want to wait until you have some down time to do this.
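To illustrate the change being suggested, here is a minimal sketch of disabling the parameter. It operates on a demo file — the real slurm.conf path, the exact parameter values, and the restart commands (systemd, pdsh) are all assumptions to adapt to your site:

```shell
# Hypothetical sketch: disable message aggregation by commenting out
# MsgAggregationParams, then restart every daemon (a plain
# "scontrol reconfigure" is not enough for this parameter).
conf=/tmp/slurm.conf.demo

# Demo stand-in for the real slurm.conf (values are made up).
cat > "$conf" <<'EOF'
SlurmctldPort=6817
MsgAggregationParams=WindowMsgs=10,WindowTime=100
SchedulerType=sched/backfill
EOF

# Comment the parameter out rather than deleting it, so it is easy
# to revert if communication issues appear.
sed -i 's/^MsgAggregationParams/#MsgAggregationParams/' "$conf"
grep 'MsgAggregationParams' "$conf"

# After distributing the updated slurm.conf to all nodes:
#   systemctl restart slurmctld           # on the controller
#   pdsh -a 'systemctl restart slurmd'    # on all compute nodes
```

Afterwards, `scontrol show conf | grep -i aggregation` should report `MsgAggregationParams = (null)`.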
Hello, and thanks for your quick answer,

Given the number of tasks in the currently running job array and the time they need to complete, I'll try to find a trade-off with the user and restart the daemons over the weekend.

Have a nice day,
Hello,

I've changed slurm.conf:

$ scontrol show conf | grep -i aggregation
MsgAggregationParams    = (null)

restarted slurmctld and the slurmd on all nodes, and did a scontrol reconfigure. I've then released the jobs. But if I launch a new scontrol reconfigure (without any change in slurm.conf), it still takes a lot of time:

$ time scontrol reconfigure
real	1m20.187s
user	0m0.002s
sys	0m0.003s

$ time scontrol reconfigure
real	0m35.119s
user	0m0.003s
sys	0m0.001s

So of course I have the corresponding warnings in the logs:

2020-08-03T10:27:15+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _set_slurmd_addr: usec=35050418 began=10:26:40.948
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from read_slurm_conf: usec=35097971 began=10:26:40.945
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _slurm_rpc_job_pack_alloc_info: usec=17300526 began=10:26:58.742
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _slurm_rpc_job_pack_alloc_info: usec=15521191 began=10:27:00.521
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _slurm_rpc_reconfigure_controller: usec=35106756 began=10:26:40.945
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from dump_all_job_state: usec=32080389 began=10:26:44.001
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=200,sched_min_interval=2000000
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=34919542 began=10:26:41.162
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _slurm_rpc_epilog_complete: usec=32967448 began=10:26:43.114
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _slurm_rpc_complete_batch_script: usec=34894060 began=10:26:41.188
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _slurm_rpc_epilog_complete: usec=33004328 began=10:26:43.115

But it seems that it causes the killing of some jobs (this happened at the time of the second scontrol reconfigure):

2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_30110(285126) on node maestro-1008
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_29670(284686) on node maestro-1008
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_30342(285358) on node maestro-1012
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_28956(283972) on node maestro-1011
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_32769(287785) on node maestro-1004
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_32723(287739) on node maestro-1007
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_30036(285052) on node maestro-1006
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_30022(285038) on node maestro-1006
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_28893(283909) on node maestro-1020
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_28855(283871) on node maestro-1016
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_28907(283923) on node maestro-1017
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_32512(287528) on node maestro-1015
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_28804(283820) on node maestro-1014
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_30731(285747) on node maestro-1034
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_30363(285379) on node maestro-1033
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_29076(284092) on node maestro-1023
2020-08-03T10:27:51+02:00 maestro-sched slurmctld[16342]: debug: Killing JobId=204075_30724(285740) on node maestro-1029

I attach a new sdiag output.

Thanks for your help,
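As an aside, warning dumps like the one above can be summarized quickly with standard tools. This is a hypothetical sketch run against a sample file (the `grep`/`awk` field positions assume exactly the log format shown in this ticket):

```shell
# Sample of the slurmctld warning format shown above (values copied
# from this ticket; the file path is made up for the demo).
cat > /tmp/slurmctld.sample.log <<'EOF'
2020-08-03T10:27:15+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from _set_slurmd_addr: usec=35050418 began=10:26:40.948
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from read_slurm_conf: usec=35097971 began=10:26:40.945
2020-08-03T10:27:16+02:00 maestro-sched slurmctld[16342]: Warning: Note very large processing time from dump_all_job_state: usec=32080389 began=10:26:44.001
EOF

# Pull out "function usec=N", convert usec to seconds, sort worst-first.
grep -o 'from [^:]*: usec=[0-9]*' /tmp/slurmctld.sample.log \
  | awk '{gsub("usec=","",$3); gsub(":","",$2); printf "%s %.1f\n", $2, $3/1e6}' \
  | sort -k2 -rn
```

This makes it easy to see at a glance which RPC handlers are blocked the longest (here, all three are over 32 seconds).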
Created attachment 15272 [details] sdiag after second scontrol reconfigure

The load of the machine was near 0; slurmctld was at 2 or 3% CPU.
It appears that message aggregation is no longer the issue. From the sdiag output you supplied, the two biggest time consumers seem to be related to job ends (and maybe starts). Here are the sdiag RPC statistics by message type, re-sorted from largest total time to smallest:

>Remote Procedure Call statistics by message type
>
>	REQUEST_STEP_COMPLETE           ( 5016) count: 18792 ave_time: 61077    total_time: 1147775841
>	REQUEST_COMPLETE_BATCH_SCRIPT   ( 5018) count: 9396  ave_time: 104812   total_time: 984816565
>	REQUEST_JOB_PACK_ALLOC_INFO     ( 4027) count: 12463 ave_time: 9331     total_time: 116303904
>	REQUEST_RECONFIGURE             ( 1003) count: 2     ave_time: 57658152 total_time: 115316304
>	REQUEST_JOB_STEP_CREATE         ( 5001) count: 12463 ave_time: 7552     total_time: 94121707
>	MESSAGE_NODE_REGISTRATION_STATUS( 1002) count: 950   ave_time: 97993    total_time: 93093454
>	MESSAGE_EPILOG_COMPLETE         ( 6012) count: 9396  ave_time: 8236     total_time: 77391652
>	REQUEST_COMPLETE_PROLOG         ( 6018) count: 12468 ave_time: 525      total_time: 6551868
>	REQUEST_JOB_INFO                ( 2003) count: 2     ave_time: 5058     total_time: 10117
>	REQUEST_NODE_INFO               ( 2007) count: 14    ave_time: 454      total_time: 6362
>	REQUEST_PARTITION_INFO          ( 2009) count: 16    ave_time: 299      total_time: 4785
>	MESSAGE_COMPOSITE               (11001) count: 1     ave_time: 1640     total_time: 1640
>	REQUEST_UPDATE_NODE             ( 3002) count: 1     ave_time: 895      total_time: 895
>	REQUEST_BUILD_INFO              ( 2001) count: 1     ave_time: 426      total_time: 426
>	REQUEST_STATS_INFO              ( 2035) count: 1     ave_time: 338      total_time: 338
>	REQUEST_PING                    ( 1008) count: 1     ave_time: 162      total_time: 162

Could you send a full slurmctld.log so I can look closer? Also, if you could send another sdiag output, I'd be interested to see if/how the numbers have changed since your last one. Thanks.
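For reference, a re-sort like the one above can be reproduced with standard tools. This is a minimal sketch against an abbreviated three-row sample; it assumes each RPC row has the `total_time:` value as its last colon-separated field (the real sdiag output has a header and other sections you would need to skip first):

```shell
# Abbreviated sample of the sdiag "by message type" rows (values
# copied from this ticket; the file path is made up for the demo).
cat > /tmp/sdiag.sample <<'EOF'
REQUEST_JOB_STEP_CREATE ( 5001) count: 12463 ave_time: 7552 total_time: 94121707
REQUEST_STEP_COMPLETE ( 5016) count: 18792 ave_time: 61077 total_time: 1147775841
REQUEST_PING ( 1008) count: 1 ave_time: 162 total_time: 162
EOF

# Split fields on ':' so the 4th field is the total_time value,
# then sort on it numerically, largest first.
sort -t: -k4 -rn /tmp/sdiag.sample
```

The most expensive RPCs then appear at the top, which is usually the right place to start when slurmctld feels sluggish.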
Hello,

I had to wait because I discovered that all the new job array tasks completed with FAILED status. I checked several things to understand the reason for this new behavior and put a hold on the job arrays while investigating. I figured out that it was due to a change made by the user in the config file of the program run by the job array tasks, to try to prevent them from running. Now that this config file is back to normal and the job array tasks complete successfully, I notice that scontrol reconfigure is instantaneous:

$ time scontrol reconfigure
real	0m0.070s
user	0m0.003s
sys	0m0.001s

$ time scontrol reconfigure
real	0m0.073s
user	0m0.002s
sys	0m0.002s

I don't have any warnings in the logs anymore either. I attach another sdiag output. I don't see how it could have interfered with the Slurm processing time, but it's the only thing that has changed since my last message.
Created attachment 15297 [details] sdiag output done on the 4th of August 4:25pm
I'm glad things are better. Let's have you continue running without the message aggregation for a few more days and see how things go.
Okay, let's do that. For my information, do you know approximately when the next 20.02 release is supposed to be announced? I would like to upgrade our preproduction cluster when I have the opportunity. Thanks and have a nice day,
We expect 20.02.4 to be released soon, but we don't have a specific date yet, as we are waiting on a few important fixes to be worked out first so we can include them. I'll provide an update once we have something more precise.

Also note that when 20.02.3 (the latest version) was released, this statement was made:

--
Message Aggregation is off in Slurm by default, and is only enabled by MsgAggregationParams=WindowMsgs=<msgs>, where <msgs> is greater than 1. (Using Message Aggregation on your systems is not a recommended configuration at this time, and we may retire this subsystem in a future Slurm release in favor of other RPC aggregation techniques. Although care must be taken before disabling this to avoid communication issues.)
--

So, if things continue well for you without it, you may want to leave it that way.

-Chad
> So, if things continue well for you without it, you may want to leave
> it that way.

Definitely. Thanks for highlighting that statement.

I hope that 20.02.4 will be released by the end of August. Fingers crossed.

Have a nice day,
(In reply to Sophie Créno from comment #17)
> Hope that 20.02.4 will be released
> by the end of August. Finger crossed.

Good news: 20.02.4 is scheduled to be released today!

If you are not already subscribed, you may want to join the slurm-users mailing list. Release announcements are posted there. More info here: https://slurm.schedmd.com/mail.html

-Chad
I'm going to close this out for now. Feel free to submit a new ticket if you encounter future problems. -Chad