| Summary: | Slurm compute node went into a "not responding" state | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Bangarusamy <bangarusamy.kumarasamy_ext> |
| Component: | slurmd | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | cinek |
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=14710 | ||
| Site: | Novartis | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | sdiag; slurm.conf; slurmd and slurmctld logs | ||
Description
Bangarusamy
2023-01-05 02:13:21 MST
Could you please share your slurm.conf and the output of sdiag?

My first advice would be to:
1) Set max_rpc_cnt [1] to 100.
2) Increase MessageTimeout [2] if you have it at its default value of 10s.

cheers, Marcin

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_max_rpc_cnt=#
[2] https://slurm.schedmd.com/slurm.conf.html#OPT_MessageTimeout

Created attachment 28341 [details]
sdiag
Created attachment 28342 [details]
slurm.conf
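The two suggestions above would translate into slurm.conf roughly as follows. This is only a sketch: max_rpc_cnt=100 comes from the advice in this ticket, while MessageTimeout=30 is an illustrative increase, not a value prescribed in the thread.

```ini
# slurm.conf (sketch; values illustrative)
# Defer new scheduling work when slurmctld already has this many
# active RPC threads, so the controller does not get swamped
SchedulerParameters=max_rpc_cnt=100
# Raise the RPC timeout from the 10 s default so slow replies
# (e.g. accounting queries) are not dropped prematurely
MessageTimeout=30
```

Both changes take effect after reconfiguring/restarting slurmctld.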
Hi, I have uploaded slurm.conf and sdiag to this ticket. Please check and let us know the root cause of this issue.

I will have Marcin follow up with you on this. It would help to also see the slurmd.log and slurmctld.log; however, here is what I can tell based on the information you have provided. The nodes seem to be under heavy load. I am curious as to the type of job running on these nodes. Do they use all of the available CPU, or is it a combination of heavy network and CPU usage by these jobs? Do these jobs spawn many threads, such as sruns or many MPI ranks?

Slurmctld will reach out and connect to nodes at regular intervals. This interval is controlled by SlurmdTimeout: https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout You can set a higher value; however, it might be better for us to understand why the node was so busy first. We have other options that can help in this case, such as reserving a CPU for system work that is excluded from jobs.

Created attachment 28356 [details]
slurmd and slurmctld logs
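The two options mentioned above (a higher SlurmdTimeout and reserving a CPU for system work) might look like this in slurm.conf. The specific timeout value, node names, and CPU count below are illustrative assumptions, not values taken from this site's configuration:

```ini
# slurm.conf (sketch; values and node names are illustrative)
# Allow slurmd up to 600 s of unresponsiveness before the controller
# marks the node DOWN (the default is 300 s)
SlurmdTimeout=600
# Reserve one core per node for system/slurmd work; specialized cores
# are excluded from job allocations
NodeName=node[01-10] CPUs=32 CoreSpecCount=1
```

Reserving a specialized core keeps slurmd responsive to slurmctld pings even when jobs saturate the remaining CPUs.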
Hi Jason, Thanks for your email! Our Slurm cluster is only used to launch R-Studio sessions, which are not CPU/memory intensive and do not create a huge number of processes either. When we faced this issue, actual utilization of the Slurm compute nodes was not even 15%, and it was outside business hours as well. I have attached the slurmd and slurmctld logs. I request you to have a look at the logs and share your feedback.

Based on the logs and related code, it looks like slurmd is overloaded by RPCs coming from the sstat command. Is sstat called frequently? Gathering details for sstat may take more than a few seconds - your MessageTimeout is at the default of 10s, which is very likely to make sstat time out before results are returned to the tool. I also see you have JobAcctGatherParams=NoOverMemoryKill, which has been deprecated since Slurm 19.05 [1]. cheers, Marcin

[1] https://github.com/SchedMD/slurm/blob/slurm-19.05/RELEASE_NOTES#L50-L58

Hi, As I mentioned, this Slurm cluster is used only to launch R-Studio sessions; end users don't log in to the Slurm master/compute nodes and don't run sstat or similar Slurm commands. I am sure that no one has executed the sstat command. I request your support in finding the root cause of the issue. Please let me know if you need more logs.

Error messages in the slurmd log like:

>[2023-01-05T07:44:48.995] error: gathering job accounting: 0

come from stepd_stat_jobacct [1]. This function is called from _enforce_job_mem_limit and _rpc_stat_jobacct. Looking at _enforce_job_mem_limit, it is not effective in your case since you don't have JobAcctGatherParams=OverMemoryKill, and the function returns early [2]. That leaves _rpc_stat_jobacct, which handles the reply to an sstat query, as the only place that can produce the above log. Because those errors come together with

>active_threads == MAX_THREADS(256)

I suppose that sstat was executed multiple times querying slurmd, resulting in its overload.
I'm still looking into the details, but I don't see any other way to trigger the error messages in question in your configuration. cheers, Marcin

[1] https://github.com/SchedMD/slurm/blob/slurm-21-08-8-2/src/common/stepd_api.c#L1117
[2] https://github.com/SchedMD/slurm/blob/slurm-21-08-8-2/src/slurmd/slurmd/req.c#L2998

Any update from your side?

(In reply to Marcin Stolarek from comment #12)
> Any update from your side?

Sorry! I thought you were still analyzing the issue in detail and would come back with more details. Do you have any other findings?

>Sorry! I thought you were still analyzing the issue in detail and would come back with more details. Do you have any other findings?

Oh, sorry from my side too. I did check things twice and I'm pretty sure that the analysis in comment 11 is complete. I don't see any other way to get into [1] - effectively printing a message like:

>[2023-01-05T07:44:48.995] error: gathering job accounting: 0

in your configuration, other than via the code path for handling the REQUEST_JOB_STEP_STAT RPC, which is generated by a call to the API function slurm_job_step_stat. The only standard Slurm tool that makes use of it is sstat. cheers, Marcin

[1] https://github.com/SchedMD/slurm/blob/slurm-22-05-7-1/src/common/stepd_api.c#L1193-L1196

Is there anything I can help you with in this case? cheers, Marcin

Please let me know if you have any questions. In case of no reply, I'll close the case as information given. cheers, Marcin

I'm closing the case as information given. Should you have any questions, please reopen. cheers, Marcin
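As a footnote to the deprecated JobAcctGatherParams=NoOverMemoryKill noted earlier in the thread: since Slurm 19.05 the option can simply be removed, as sketched below. This is an assumption about the cleanup, not a step confirmed in the thread.

```ini
# slurm.conf (sketch)
# Before: deprecated since Slurm 19.05
#JobAcctGatherParams=NoOverMemoryKill
# After: drop the option entirely; since 19.05, over-memory killing by the
# accounting plugin only happens if OverMemoryKill is set explicitly
```

Removing it also silences any deprecation warnings without changing behavior.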