Hi,

The Slurm compute node's state was marked DOWN and the reason was set to "not responding". All the jobs which were running on that node were cancelled and marked as NODE_FAIL. We are seeing the below messages in the slurmd logs:

[2023-01-05T07:42:14.713] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:15.768] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:28.266] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:36.839] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:41.140] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:45.439] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:49.778] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:54.065] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:58.363] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:58.374] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:02.674] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:06.975] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:11.285] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:19.885] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:24.198] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:28.517] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:32.795] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:33.842] error: Timeout waiting for slurmstepd
[2023-01-05T07:43:33.842] error: gathering job accounting: 0

And many repeated messages like:

error: Timeout waiting for slurmstepd
error: gathering job accounting: 0

Please let me know if you need any logs.

Slurm version is 20.11.8
OS is RHEL 7.9
Could you please share your slurm.conf and the output of sdiag?

My first advice would be to:
1) Set max_rpc_cnt[1] to 100.
2) Increase MessageTimeout[2] if you have it at the default value of 10s.

cheers, Marcin

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_max_rpc_cnt=#
[2]https://slurm.schedmd.com/slurm.conf.html#OPT_MessageTimeout
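For reference, both settings live in slurm.conf. A minimal sketch of the two suggestions (max_rpc_cnt=100 is the value suggested above; the MessageTimeout value of 30 is an illustrative assumption, tune it for your site):

```
# slurm.conf
# Defer scheduling work when more than this many RPCs are being
# processed concurrently by slurmctld (suggested starting point: 100).
SchedulerParameters=max_rpc_cnt=100

# Raise the RPC timeout from its 10s default so that slow slurmd
# replies are not treated as failures (30s is only an example).
MessageTimeout=30
```

Both changes require a restart/reconfigure of slurmctld to take effect.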
Created attachment 28341 [details] sdiag
Created attachment 28342 [details] slurm.conf
Hi,

I have uploaded slurm.conf and sdiag to this ticket. Please check and let us know the root cause of this issue.
I will have Marcin follow up with you on this. It would help to also see the slurmd.log and slurmctld.log; however, here is what I can tell based on the information you have provided.

The nodes seem to be under heavy load. I am curious about the type of job running on these nodes. Do they use all of the available CPUs, or is it a combination of heavy network and CPU usage by these jobs? Do these jobs spawn many threads, such as many sruns or many MPI ranks?

Slurmctld will reach out and connect to nodes at regular intervals. This interval is controlled by SlurmdTimeout:

https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout

You can set a higher value; however, it might be better for us to understand why the node was so busy first. We have other options that can help in this case, such as reserving a CPU for system work, which is then excluded from jobs.
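Both knobs mentioned above are slurm.conf settings. A hedged sketch, assuming nothing about your actual config (the timeout value and node names below are illustrative, not taken from your cluster):

```
# slurm.conf
# Give busy nodes longer to respond before slurmctld marks them DOWN
# (the default is 300 seconds).
SlurmdTimeout=600

# Reserve one core per node for system work (slurmd, slurmstepd, OS);
# jobs cannot be allocated that core. Node names here are hypothetical.
NodeName=compute[01-10] CoreSpecCount=1
```

The CoreSpecCount reservation slightly reduces the cores available to jobs on each node, which is usually an acceptable trade for keeping slurmd responsive.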
Created attachment 28356 [details] slurmd and slurmctld logs
Hi Jason,

Thanks for your email! Our Slurm cluster is only used to launch R-Studio sessions, which are not CPU/memory intensive and do not create a huge number of processes either. When we faced this issue, the actual utilization of the Slurm compute node was not even 15%, and it was during non-business hours as well. I have attached the slurmd and slurmctld logs. I request you to have a look at the logs and share your feedback.
Based on the logs and related code it looks like slurmd is overloaded by RPCs coming from the sstat command. Is sstat called frequently? Gathering details for sstat may take more than a few seconds - your MessageTimeout is at the default of 10s, which is very likely to result in sstat timing out before results are returned to the tool.

I also see you have JobAcctGatherParams=NoOverMemoryKill, which has been deprecated since Slurm 19.05[1].

cheers, Marcin

[1]https://github.com/SchedMD/slurm/blob/slurm-19.05/RELEASE_NOTES#L50-L58
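On the deprecated parameter: per the linked release notes, since 19.05 not killing jobs on over-memory via the accounting plugin is the default, so the flag can simply be removed. If memory limits should still be enforced, cgroup-based enforcement is the usual replacement; a hedged sketch, assuming cgroup plugins are appropriate for this site:

```
# slurm.conf -- drop the deprecated line:
#   JobAcctGatherParams=NoOverMemoryKill
# If memory enforcement is wanted, use the cgroup plugins instead:
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainRAMSpace=yes
```

Whether to enforce memory limits at all is a site policy decision; the fragment above only shows where the modern knobs live.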
Hi,

As I mentioned, this Slurm cluster is used only to launch R-Studio sessions; end users don't log in to the Slurm master/compute nodes and they don't run sstat or similar Slurm commands. I am sure that no one has executed the sstat command. I request your support to find the root cause of the issue. Please let me know if you need more logs.
Error messages in the slurmd log like:

>[2023-01-05T07:44:48.995] error: gathering job accounting: 0

come from stepd_stat_jobacct[1]. This function is called from _enforce_job_mem_limit and _rpc_stat_jobacct. Looking at _enforce_job_mem_limit, it is not effective in your case since you don't have JobAcctGatherParams=OverMemoryKill, and the function returns early[2]. That leaves _rpc_stat_jobacct, which handles the reply to an sstat query, as the only place that can produce the above log. Because those errors come together with

>active_threads == MAX_THREADS(256)

I suppose that sstat was executed multiple times, querying slurmd and resulting in its overload. I'm still looking into the details, but I don't see any other way to trigger the error messages in question with your configuration.

cheers, Marcin

[1]https://github.com/SchedMD/slurm/blob/slurm-21-08-8-2/src/common/stepd_api.c#L1117
[2]https://github.com/SchedMD/slurm/blob/slurm-21-08-8-2/src/slurmd/slurmd/req.c#L2998
Any update from your side?
(In reply to Marcin Stolarek from comment #12)
> Any update from your side?

Sorry! I thought you were still analyzing the issue in detail and would come back with more details. Do you have any other findings?
>Sorry! I thought you were still analyzing the issue in detail and would come back with more details. Do you have any other findings?

Oh, sorry from my side too. I did check things twice and I'm pretty sure the analysis in comment 11 is complete. I don't see any other way to reach [1] - effectively printing a message like:

>[2023-01-05T07:44:48.995] error: gathering job accounting: 0

in your configuration, other than the code path handling the REQUEST_JOB_STEP_STAT RPC, which is generated by a call to the API function slurm_job_step_stat. The only standard Slurm tool that makes use of it is sstat.

cheers, Marcin

[1]https://github.com/SchedMD/slurm/blob/slurm-22-05-7-1/src/common/stepd_api.c#L1193-L1196
Is there anything I can help you with in the case? cheers, Marcin
Please let me know if you have any questions. In case of no reply I'll close the case as information given. cheers, Marcin
I'm closing the case as information given. Should you have any questions please reopen. cheers, Marcin