Ticket 15734

Summary: Slurm compute went to not responding state
Product: Slurm Reporter: Bangarusamy <bangarusamy.kumarasamy_ext>
Component: slurmd Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: cinek
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=14710
Site: Novartis Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: sdiag
slurm.conf
slurmd and slurmctld logs

Description Bangarusamy 2023-01-05 02:13:21 MST
Hi,

The Slurm compute node was marked down with the reason set to "Not responding".
All the jobs that were running on that node were cancelled and marked as node_fail.

We are seeing the below messages in the slurmd logs.

[2023-01-05T07:42:14.713] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:15.768] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:28.266] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:36.839] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:41.140] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:45.439] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:49.778] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:54.065] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:58.363] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:58.374] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:02.674] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:06.975] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:11.285] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:19.885] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:24.198] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:28.517] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:32.795] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:33.842] error: Timeout waiting for slurmstepd
[2023-01-05T07:43:33.842] error: gathering job accounting: 0

And many repeated messages like:
error: Timeout waiting for slurmstepd
error: gathering job accounting: 0
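A quick way to quantify how often these messages recur is to tally the slurmd log by message body. The sketch below is not from the ticket; the sample lines are taken from the excerpt above, and the regex assumes the standard `[timestamp] message` slurmd log prefix:

```python
from collections import Counter
import re

# Sample lines copied from the slurmd log excerpt in this ticket.
SAMPLE_LOG = """\
[2023-01-05T07:42:14.713] active_threads == MAX_THREADS(256)
[2023-01-05T07:42:15.768] active_threads == MAX_THREADS(256)
[2023-01-05T07:43:33.842] error: Timeout waiting for slurmstepd
[2023-01-05T07:43:33.842] error: gathering job accounting: 0
"""

def count_messages(log_text):
    """Tally log lines by message body, with the [timestamp] prefix stripped."""
    counts = Counter()
    for line in log_text.splitlines():
        m = re.match(r"\[[^\]]+\]\s+(.*)", line)
        if m:
            counts[m.group(1)] += 1
    return counts

counts = count_messages(SAMPLE_LOG)
print(counts["active_threads == MAX_THREADS(256)"])  # → 2
```

Running this over the full slurmd.log (instead of `SAMPLE_LOG`) shows at a glance how long the thread pool stayed saturated.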

Please let me know if you need any logs.

Slurm version is 20.11.8
OS is RHEL 7.9
Comment 1 Marcin Stolarek 2023-01-05 03:38:14 MST
Could you please share your slurm.conf and the output of sdiag?

My first advice would be to:
1) Set max_rpc_cnt[1] to 100.
2) Increase MessageTimeout[2] if you have it at the default value of 10s.
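Applied to slurm.conf, the two suggestions would look like the fragment below (the values are the ones proposed above, not site-specific tuning; `max_rpc_cnt` is an option of SchedulerParameters):

```
# slurm.conf — throttle concurrent RPC handling and allow slower replies
SchedulerParameters=max_rpc_cnt=100
MessageTimeout=30        # example value; default is 10 s
```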

cheers,
Marcin

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_max_rpc_cnt=#
[2]https://slurm.schedmd.com/slurm.conf.html#OPT_MessageTimeout
Comment 2 Bangarusamy 2023-01-05 08:01:26 MST
Created attachment 28341 [details]
sdiag
Comment 3 Bangarusamy 2023-01-05 08:01:57 MST
Created attachment 28342 [details]
slurm.conf
Comment 4 Bangarusamy 2023-01-05 08:02:41 MST
Hi, I have uploaded slurm.conf and sdiag on this ticket. Please check and let us know the Root Cause for this issue.
Comment 5 Jason Booth 2023-01-05 15:03:07 MST
I will have Marcin follow up with you on this. It would help to also see the slurmd.log and slurmctld.log. However, here is what I can tell based on the information you have provided.

The nodes seem to be under heavy load. I am curious about the type of job running on these nodes. Do they use all of the available CPUs, or is it a combination of heavy network and CPU usage by these jobs? Do these jobs spawn many threads, such as multiple sruns or many MPI ranks?


Slurmctld will reach out and connect to nodes at regular intervals. This interval is controlled by SlurmdTimeout.

https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout

You can set a higher value; however, it might be better for us to understand why the node was so busy first. We have other options that can help in this case, such as reserving a CPU for system work that is excluded from jobs.
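As a slurm.conf sketch, the two options mentioned here would look roughly like this (the timeout value and node names are illustrative, not recommendations; CoreSpecCount is set per node line):

```
# slurm.conf — give slurmd longer to respond before the node is marked down
SlurmdTimeout=600        # example value; default is 300 s

# Reserve one core per node for system work (slurmd, OS daemons),
# excluded from job allocations. "compute[01-10]" is a placeholder.
NodeName=compute[01-10] CoreSpecCount=1 ...
```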
Comment 6 Bangarusamy 2023-01-06 03:52:28 MST
Created attachment 28356 [details]
slurmd and slurmctld logs
Comment 7 Bangarusamy 2023-01-06 03:55:17 MST
Hi Jason,

Thanks for your email!

Our Slurm cluster is only used to launch R-Studio sessions, which are not CPU/memory intensive and do not create a huge number of processes either.

When we faced this issue, the actual utilization of the Slurm compute node was not even 15%. Also, it happened during non-business hours.

I have attached the slurmd and slurmctld logs. Please have a look at them and share your feedback.
Comment 8 Marcin Stolarek 2023-01-09 03:14:23 MST
Based on the logs and related code, it looks like slurmd is overloaded by RPCs coming from the sstat command. Is sstat called frequently? Gathering details for sstat may take more than a few seconds - your MessageTimeout is at the default of 10s, which is very likely to result in sstat timing out before results are returned to the tool.

I see you have JobAcctGatherParams=NoOverMemoryKill, which has been deprecated since Slurm 19.05[1].
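The cleanup is simply to drop the flag, since (per the 19.05 release notes linked below) not killing jobs on over-memory use became the default behavior; killing is now opt-in via OverMemoryKill. A sketch of the slurm.conf change:

```
# Before (deprecated since 19.05):
JobAcctGatherParams=NoOverMemoryKill

# After — omit the flag entirely; not killing on over-memory use
# is the default. Only set OverMemoryKill if you want killing back.
```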

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/blob/slurm-19.05/RELEASE_NOTES#L50-L58
Comment 9 Bangarusamy 2023-01-09 03:39:52 MST
Hi,

As I mentioned, this Slurm cluster is used only to launch R-Studio sessions; end users don't log in to the Slurm master/compute nodes and don't run sstat or similar Slurm commands.

I am sure that no one has executed the sstat command.

I request your support in finding the root cause of the issue. Please let me know if you need more logs.
Comment 11 Marcin Stolarek 2023-01-09 04:45:09 MST
Error messages in slurmd log like:
>[2023-01-05T07:44:48.995] error: gathering job accounting: 0

come from stepd_stat_jobacct[1]. This function is called from _enforce_job_mem_limit and _rpc_stat_jobacct. _enforce_job_mem_limit is not effective in your case, since you don't have JobAcctGatherParams=OverMemoryKill and the function returns early[2]. That leaves _rpc_stat_jobacct, which handles replies to sstat queries, as the only place that can produce the above log.

Because those errors come together with
>active_threads == MAX_THREADS(256)
I suppose that sstat was executed multiple times, querying slurmd and overloading it.

I'm still looking into the details, but I don't see any other way to trigger the error messages in question with your configuration.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/blob/slurm-21-08-8-2/src/common/stepd_api.c#L1117
[2]https://github.com/SchedMD/slurm/blob/slurm-21-08-8-2/src/slurmd/slurmd/req.c#L2998
Comment 12 Marcin Stolarek 2023-01-20 07:53:04 MST
Any update from your side?
Comment 13 Bangarusamy 2023-01-20 09:08:24 MST
(In reply to Marcin Stolarek from comment #12)
> Any update from your side?

Sorry! I thought you were still analyzing the issue in detail and would come back with more details. Do you have any other findings?
Comment 14 Marcin Stolarek 2023-01-23 05:25:54 MST
>Sorry! I thought you were still analyzing the issue in detail and would come back with more details. Do you have any other findings?

Oh, sorry from my side too. I did check things twice, and I'm pretty sure the analysis in comment 11 is complete. In your configuration, I don't see any way to reach [1] - which effectively prints a message like:
>[2023-01-05T07:44:48.995] error: gathering job accounting: 0
other than the code path handling the REQUEST_JOB_STEP_STAT RPC, which is generated by a call to the API function slurm_job_step_stat. The only standard Slurm tool that makes use of it is sstat.

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/blob/slurm-22-05-7-1/src/common/stepd_api.c#L1193-L1196
Comment 15 Marcin Stolarek 2023-02-02 04:44:23 MST
Is there anything I can help you with on this case?

cheers,
Marcin
Comment 16 Marcin Stolarek 2023-02-09 01:38:57 MST
Please let me know if you have any questions. In case of no reply I'll close the case as information given.

cheers,
Marcin
Comment 17 Marcin Stolarek 2023-02-14 02:33:52 MST
I'm closing the case as information given.

Should you have any questions please reopen.

cheers,
Marcin