Ticket 15734

Summary: Slurm compute went to not responding state
Product: Slurm Reporter: Bangarusamy <bangarusamy.kumarasamy_ext>
Component: slurmd Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: cinek
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=14710
Site: Novartis Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: sdiag
slurm.conf
slurmd and slurmctld logs

Description Bangarusamy 2023-01-05 02:13:21 MST
Hi,

The Slurm compute node was marked down with the reason set to "Not responding".
All the jobs that were running on that node were cancelled and marked as node_fail.

We are seeing the below messages in the slurmd logs.

[2023-01-05T07:42:14.713] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:15.768] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:28.266] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:36.839] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:41.140] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:45.439] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:49.778] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:54.065] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:58.363] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:58.374] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:02.674] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:06.975] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:11.285] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:19.885] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:24.198] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:28.517] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:32.795] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:33.842] error: Timeout waiting for slurmstepd
[2023-01-05T07:43:33.842] error: gathering job accounting: 0

And many repeated messages like:
error: Timeout waiting for slurmstepd
error: gathering job accounting: 0
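A quick way to quantify how often these messages recur is to tally the slurmd log by message body. The sketch below is not from the ticket; the sample lines are taken from the excerpt above, and the regex assumes the standard `[timestamp] message` slurmd log prefix:

```python
from collections import Counter
import re

# Sample lines copied from the slurmd log excerpt in this ticket.
SAMPLE_LOG = """\
[2023-01-05T07:42:14.713] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:15.768] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:33.842] error: Timeout waiting for slurmstepd
[2023-01-05T07:43:33.842] error: gathering job accounting: 0
"""

def count_messages(log_text):
    """Tally log lines by message body, with the [timestamp] prefix stripped."""
    counts = Counter()
    for line in log_text.splitlines():
        m = re.match(r"\[[^\]]+\]\s+(.*)", line)
        if m:
            counts[m.group(1)] += 1
    return counts

counts = count_messages(SAMPLE_LOG)
print(counts["active_threads == MAX_THREADS(256)"])  # → 2
```

Running this over the full slurmd.log (instead of `SAMPLE_LOG`) shows at a glance how long the thread pool stayed saturated.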

Please let me know if you need any logs.

Slurm version is 20.11.8
OS is RHEL 7.9
Comment 1 Marcin Stolarek 2023-01-05 03:38:14 MST
Could you please share your slurm.conf and the output of sdiag?

My first advice would be to:
1) Set max_rpc_cnt[1] to 100.
2) Increase MessageTimeout[2] if you have it at the default value of 10s.
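Applied to slurm.conf, the two suggestions would look like the fragment below (the values are the ones proposed above, not site-specific tuning; `max_rpc_cnt` is an option of SchedulerParameters):

```
# slurm.conf — throttle concurrent RPC handling and allow slower replies
SchedulerParameters=max_rpc_cnt=100
MessageTimeout=30        # example value; default is 10 s
```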

cheers,
Marcin

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_max_rpc_cnt=#
[2]https://slurm.schedmd.com/slurm.conf.html#OPT_MessageTimeout
Comment 2 Bangarusamy 2023-01-05 08:01:26 MST
Created attachment 28341 [details]
sdiag
Comment 3 Bangarusamy 2023-01-05 08:01:57 MST
Created attachment 28342 [details]
slurm.conf
Comment 4 Bangarusamy 2023-01-05 08:02:41 MST
Hi, I have uploaded slurm.conf and sdiag on this ticket. Please check and let us know the Root Cause for this issue.
Comment 5 Jason Booth 2023-01-05 15:03:07 MST
I will have Marcin follow up with you on this. It would help to also see the slurmd.log and slurmctld.log. However, here is what I can tell based on the information you have provided.

The nodes seem to be under heavy load. I am curious about the type of job running on these nodes. Do they use all of the available CPUs, or is it a combination of heavy network and CPU usage by these jobs? Do these jobs spawn many threads, such as multiple sruns or many MPI ranks?


Slurmctld will reach out and connect to nodes at regular intervals. This interval is controlled by SlurmdTimeout.

https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout

You can set a higher value; however, it might be better for us to understand why the node was so busy first. We have other options that can help in this case, such as reserving a CPU for system work that is excluded from jobs.
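As a slurm.conf sketch, the two options mentioned here would look roughly like this (the timeout value and node names are illustrative, not recommendations; CoreSpecCount is set per node line):

```
# slurm.conf — give slurmd longer to respond before the node is marked down
SlurmdTimeout=600        # example value; default is 300 s

# Reserve one core per node for system work (slurmd, OS daemons),
# excluded from job allocations. "compute[01-10]" is a placeholder.
NodeName=compute[01-10] CoreSpecCount=1 ...
```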
Comment 6 Bangarusamy 2023-01-06 03:52:28 MST
Created attachment 28356 [details]
slurmd and slurmctld logs
Comment 7 Bangarusamy 2023-01-06 03:55:17 MST
Hi Jason,

Thanks for your email!

Our Slurm cluster is only used to launch R-Studio sessions, which are not CPU/memory intensive and do not create a huge number of processes either.

When we faced this issue, the actual utilization of the Slurm compute node was not even 15%. Also, it happened during non-business hours.

I have attached the slurmd and slurmctld logs. Please have a look at them and share your feedback.
Comment 8 Marcin Stolarek 2023-01-09 03:14:23 MST
Based on the logs and related code, it looks like slurmd is overloaded by RPCs coming from the sstat command. Is sstat called frequently? Gathering details for sstat may take more than a few seconds - your MessageTimeout is at the default of 10s, which is very likely to result in sstat timing out before results are returned to the tool.

I see you have JobAcctGatherParams=NoOverMemoryKill, which has been deprecated since Slurm 19.05[1].
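The cleanup is simply to drop the flag, since (per the 19.05 release notes linked below) not killing jobs on over-memory use became the default behavior; killing is now opt-in via OverMemoryKill. A sketch of the slurm.conf change:

```
# Before (deprecated since 19.05):
JobAcctGatherParams=NoOverMemoryKill

# After — omit the flag entirely; not killing on over-memory use
# is the default. Only set OverMemoryKill if you want killing back.
```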

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/blob/slurm-19.05/RELEASE_NOTES#L50-L58
Comment 9 Bangarusamy 2023-01-09 03:39:52 MST
Hi,

As I mentioned, this Slurm cluster is used only to launch R-Studio sessions; end users don't log in to the Slurm master/compute nodes and don't run sstat or similar Slurm commands.

I am sure that no one has executed the sstat command.

I request your support in finding the root cause of the issue. Please let me know if you need more logs.
Comment 11 Marcin Stolarek 2023-01-09 04:45:09 MST
Error messages in slurmd log like:
>[2023-01-05T07:44:48.995] error: gathering job accounting: 0

come from stepd_stat_jobacct[1]. This function is called from _enforce_job_mem_limit and _rpc_stat_jobacct. _enforce_job_mem_limit is not effective in your case, since you don't have JobAcctGatherParams=OverMemoryKill and the function returns early[2]. That leaves _rpc_stat_jobacct, which handles replies to sstat queries, as the only place that can produce the above log.

Because those errors come together with
>active_threads == MAX_THREADS(256)
I suppose that sstat was executed multiple times, querying slurmd and overloading it.

I'm still looking into the details, but I don't see any other way to trigger the error messages in question with your configuration.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/blob/slurm-21-08-8-2/src/common/stepd_api.c#L1117
[2]https://github.com/SchedMD/slurm/blob/slurm-21-08-8-2/src/slurmd/slurmd/req.c#L2998
Comment 12 Marcin Stolarek 2023-01-20 07:53:04 MST
Any update from your side?
Comment 13 Bangarusamy 2023-01-20 09:08:24 MST
(In reply to Marcin Stolarek from comment #12)
> Any update from your side?

Sorry! I thought you were still analyzing the issue in detail and would come back with more details. Do you have any other findings?
Comment 14 Marcin Stolarek 2023-01-23 05:25:54 MST
>Sorry! I thought you were still analyzing the issue in detail and would come back with more details. Do you have any other findings?

Oh, sorry from my side too. I did check things twice, and I'm pretty sure the analysis in comment 11 is complete. In your configuration, I don't see any way to reach [1] - which effectively prints a message like:
>[2023-01-05T07:44:48.995] error: gathering job accounting: 0
other than the code path handling the REQUEST_JOB_STEP_STAT RPC, which is generated by a call to the API function slurm_job_step_stat. The only standard Slurm tool that makes use of it is sstat.

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/blob/slurm-22-05-7-1/src/common/stepd_api.c#L1193-L1196
Comment 15 Marcin Stolarek 2023-02-02 04:44:23 MST
Is there anything I can help you with on this case?

cheers,
Marcin
Comment 16 Marcin Stolarek 2023-02-09 01:38:57 MST
Please let me know if you have any questions. In case of no reply I'll close the case as information given.

cheers,
Marcin
Comment 17 Marcin Stolarek 2023-02-14 02:33:52 MST
I'm closing the case as information given.

Should you have any questions please reopen.

cheers,
Marcin