Ticket 9876 - Node failures on the campus-wide Cluster
Summary: Node failures on the campus-wide Cluster
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.02.4
Hardware: Linux
OS: Linux
Severity: 3 - Medium Impact
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-09-22 13:31 MDT by Sarvani Chadalapaka
Modified: 2020-10-14 15:09 MDT
CC: 2 users

See Also:
Site: UC Merced
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
tail on Slurmd.log from failed nodes (1.57 KB, text/plain)
2020-09-22 13:42 MDT, Sarvani Chadalapaka
tail on slurmctld on head node (7.65 KB, text/plain)
2020-09-22 13:45 MDT, Sarvani Chadalapaka
sdiag output (9.46 KB, text/plain)
2020-09-22 13:48 MDT, Sarvani Chadalapaka
sinfo output (2.86 KB, text/plain)
2020-09-22 13:48 MDT, Sarvani Chadalapaka
slurm.conf (7.72 KB, text/plain)
2020-09-22 14:01 MDT, Sarvani Chadalapaka
cgroup.conf (222 bytes, text/plain)
2020-09-22 16:18 MDT, Sarvani Chadalapaka
updated_slurmctld.log (2.35 KB, text/plain)
2020-09-22 17:00 MDT, Sarvani Chadalapaka
updated_slurmd_mrcd69 (13.41 KB, text/plain)
2020-09-22 17:02 MDT, Sarvani Chadalapaka
complete_log_failed_node (29.37 MB, text/plain)
2020-09-22 17:30 MDT, Sarvani Chadalapaka
Complete_slurmctld_log (1.47 MB, text/plain)
2020-09-22 18:55 MDT, Sarvani Chadalapaka
dmesg_22Sep_mrcd69.log (361 bytes, text/plain)
2020-09-23 12:52 MDT, Sarvani Chadalapaka

Description Sarvani Chadalapaka 2020-09-22 13:31:42 MDT
Hi,

Slurm is randomly deciding that nodes are down:
[2020-09-22T08:10:31.020] error: Nodes [xx] not responding, setting DOWN
[2020-09-22T08:37:11.813] error: Nodes [xx] not responding, setting DOWN
[2020-09-22T09:02:33.971] error: Nodes xx not responding, setting DOWN

Users' jobs on those nodes are then cancelled, and the users see error messages like these:

slurmstepd: error: *** JOB 1505514 ON xx CANCELLED AT 2020-xx-xxT16:41:31 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

slurmstepd: error: *** STEP 1505514.0 ON xx CANCELLED AT 2020-xx-xxT16:41:31 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

This error is impacting *all* users of the MERCED cluster and needs to be resolved ASAP.

Please let me know if there's any further information I might provide.

Thanks!
Sarvani Chadalapaka
HPC Manager
University of California, Merced.
Comment 1 Jason Booth 2020-09-22 13:34:19 MDT
Please attach the output of:

sdiag
sinfo


Also, please attach the slurmctld.log and the slurmd.log from a few of the failed compute nodes.
Comment 2 Sarvani Chadalapaka 2020-09-22 13:42:17 MDT
Created attachment 15987 [details]
tail on Slurmd.log from failed nodes

Note that the uptime on these nodes indicates they haven't actually been down after all; it is the Slurm daemon assuming the nodes are down:

[root@mrcd40 Slurm]# uptime
 12:41:58 up 11 days, 21:25,  1 user,  load average: 0.00, 0.22, 5.10
Comment 3 Sarvani Chadalapaka 2020-09-22 13:45:08 MDT
Created attachment 15988 [details]
tail on slurmctld on head node

Slurmctld tail on the head node.
Comment 4 Sarvani Chadalapaka 2020-09-22 13:48:09 MDT
Created attachment 15989 [details]
sdiag output
Comment 5 Sarvani Chadalapaka 2020-09-22 13:48:28 MDT
Created attachment 15990 [details]
sinfo output
Comment 6 Sarvani Chadalapaka 2020-09-22 13:48:45 MDT
(In reply to Jason Booth from comment #1)
> Please attach the output of:
> 
> sdiag
> sinfo
> 
> 
> Also, please attach the slurmctld.log and the slurmd.log from a few of the
> failed compute nodes.

Done.
Comment 7 Jason Booth 2020-09-22 13:57:26 MDT
Would you also please attach your slurm.conf?
Comment 8 Sarvani Chadalapaka 2020-09-22 13:59:45 MDT
We are able to provide the SchedMD support engineers access to the MERCED cluster as needed if it helps expedite the diagnosis and solution process.

Also, if a conference call is helpful to resolve this in real-time I am able to facilitate that.

Please let us know. 

Thanks
Sarvani Chadalapaka
HPC Manager 
University of California Merced
Comment 9 Sarvani Chadalapaka 2020-09-22 14:01:56 MDT
Created attachment 15991 [details]
slurm.conf
Comment 10 Jason Booth 2020-09-22 14:03:49 MDT
>We are able to provide the SchedMD support engineers access to MERCED cluster as needed if it helps expedite the diagnosis and solution process. 

We suspect that you will need to modify "SlurmdTimeout" and increase this value on your cluster. It is still not clear to us why the hiccup in communication happened, which is why we would also need to see the slurmd logs from a few of the compute nodes (mrcd48, mrcd37). Also, if you want to provide access, we will have 2 engineers look at the cluster in a read-only manner, meaning we will still have you make any changes necessary.
Comment 11 Sarvani Chadalapaka 2020-09-22 14:07:54 MDT
Yes, having your engineers inspect the logs and cluster in real-time on a call will be super.

Here's the Zoom call information for 1:45 pm Pacific Time.

https://ucmerced.zoom.us/j/94962776888?pwd=UC84bFIxL21qZXlMVEd6QnFsanROZz09
Comment 12 Sarvani Chadalapaka 2020-09-22 14:12:42 MDT
(In reply to Sarvani Chadalapaka from comment #11)
> Yes, having your engineers inspect the logs and cluster in real-time on a
> call will be super.
> 
> Here's the Zoom call information for 1:45 pm Pacific Time.
> 
> https://ucmerced.zoom.us/j/94962776888?pwd=UC84bFIxL21qZXlMVEd6QnFsanROZz09

Does that time work for you all?

-Sarvani
Comment 13 Marshall Garey 2020-09-22 14:13:57 MDT
Can you increase SlurmdTimeout to 300? And can you also set

SlurmctldDebug=debug
SlurmdDebug=debug

Currently they're both set to 3 (which translates to "info") and aren't giving us helpful logs. The debug level isn't too verbose and gives us a lot more useful information.
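In slurm.conf, those changes would look like this (a sketch; keep the rest of your file as-is):

SlurmdTimeout=300
SlurmctldDebug=debug
SlurmdDebug=debug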


To make these changes propagate you'll need to run this command:

scontrol reconfigure



(We're figuring out the zoom call stuff right now, and I'll let Jason respond about that.)
Comment 14 Jason Booth 2020-09-22 14:15:30 MDT
Sarvani - That time is fine. I am also waiting in the lobby if you wanted to start the call sooner.
Comment 16 Marshall Garey 2020-09-22 15:02:49 MDT
Hi Sarvani,

This was getting a little long so here's a summary:
* Since making the suggested changes, are the slurmd's staying up? Are jobs getting killed?
* Can you upload cgroup.conf?
* What's the difference between RealMemory (in slurm.conf) and the actual system memory on the compute nodes? We recommend a few GB to give space for slurm daemons and the OS.
  * If you change this, you'll need to restart all slurmd's and slurmctld. First stop slurmctld, then stop and restart all slurmd's, then start slurmctld. Doing it in this order avoids timeouts and downed nodes (like you have been seeing).
* CoreSpecCount is the parameter to reserve one or more CPUs for use by Slurm daemons.
  * Don't set CoreSpecCount yet though. I'd like to see where the cluster is at first.
* Where are the Slurm binaries located - local or network filesystem?




A few more details about the above:

The parameter Jason mentioned on the call relating to reserving CPUs for slurmd is CoreSpecCount. From man slurm.conf:

(https://slurm.schedmd.com/slurm.conf.html#OPT_CoreSpecCount)


CoreSpecCount
    Number of cores reserved for system use. These cores will not be available for allocation to user jobs. Depending upon the TaskPluginParam option of SlurmdOffSpec, Slurm daemons (i.e. slurmd and slurmstepd) may either be confined to these resources (the default) or prevented from using these resources. Isolation of the Slurm daemons from user jobs may improve application performance. If this option and CpuSpecList are both designated for a node, an error is generated. For information on the algorithm used by Slurm to select the cores refer to the core specialization documentation ( https://slurm.schedmd.com/core_spec.html ). 


To reserve memory for slurmd/slurmstepd and the OS on the compute nodes, we typically recommend just reducing RealMemory by a few GB from the actual system memory. It looks like you've already rounded RealMemory in slurm.conf for the compute nodes. What's the difference between the system memory and "RealMemory" in slurm.conf?
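For illustration only (hypothetical numbers, not your actual hardware): a compute node with 96 GB (98304 MB) of physical RAM could be defined with about 6 GB of headroom like this:

NodeName=mrcd40 CPUs=24 RealMemory=92160 State=UNKNOWN   # 98304 MB physical - 6144 MB reserved for OS/slurmd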



It's also important to use cgroups to constrain jobs to their CPUs and memory that they were allocated. I see that you have task/cgroup and proctrack/cgroup set in slurm.conf. Can you upload cgroup.conf? I'd like to make sure that cgroups are configured correctly to constrain CPUs and memory.



Where are the Slurm binaries located - local or network filesystem? Slow access to Slurm binaries can be problematic, so we usually recommend storing the binaries on a local filesystem.
Comment 17 Marshall Garey 2020-09-22 15:43:12 MDT
Do you have any updates? Is this still a sev-1 issue, or can we downgrade it to sev-2?
Comment 18 Sarvani Chadalapaka 2020-09-22 16:12:33 MDT
Hi Jason, Marshall,

Apologies. My home internet kept dropping and I had to find a spot with a reliable network.
I am restarting the call while I look at your notes.

Thanks!
Sarvani
Comment 19 Sarvani Chadalapaka 2020-09-22 16:18:43 MDT
Created attachment 15993 [details]
cgroup.conf

Comment 20 Sarvani Chadalapaka 2020-09-22 16:22:47 MDT
I still see node failures after the changes we made.

[2020-09-22T14:10:22.522] error: Nodes mrcd[87-88] not responding, setting DOWN
[2020-09-22T14:17:02.218] error: Nodes mrcd86 not responding, setting DOWN
[2020-09-22T14:18:42.993] error: Nodes mrcd[79,101] not responding, setting DOWN
[2020-09-22T14:20:22.595] error: Nodes mrcd99 not responding, setting DOWN
[2020-09-22T14:38:42.275] error: Nodes mrcdg04 not responding, setting DOWN
[2020-09-22T15:12:02.964] error: Nodes mrcd[80,89,94,100,104] not responding, setting DOWN


* What's the difference between RealMemory (in slurm.conf) and the actual system memory on the compute nodes? We recommend a few GB to give space for slurm daemons and the OS.

- I don't think we have left any headroom for the Slurm daemons. Is 10 GB enough space for the daemon?

  * If you change this, you'll need to restart all slurmd's and slurmctld. First stop slurmctld, then stop and restart all slurmd's, then start slurmctld. Doing it in this order avoids timeouts and downed nodes (like you have been seeing).

Ok - 

* CoreSpecCount is the parameter to reserve one or more CPUs for use by Slurm daemons.
  * Don't set CoreSpecCount yet though. I'd like to see where the cluster is at first.
Ok -

* Where are the Slurm binaries located - local or network filesystem?
Slurm binaries are local
Comment 21 Sarvani Chadalapaka 2020-09-22 16:24:04 MDT
Marshall,

I still think this is a sev-1 issue until it is resolved.

Thanks,
Sarvani
Comment 23 Marshall Garey 2020-09-22 16:37:00 MDT
RE cgroup.conf:

Currently:
CgroupAutomount=yes

ConstrainCores=no
ConstrainRAMSpace=no



Can you set the following in cgroup.conf?

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

Setting these values to "yes" ensures that cgroups are enforced and will prevent jobs from using CPUs and memory that aren't allocated to the job. If a job does try to use more memory than it is allocated then it will immediately be OOM-killed.

If you don't want jobs to use swap space, you should also set AllowedSwapSpace=0.
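Putting those together, cgroup.conf would end up looking something like this (a sketch that assumes you keep your existing CgroupAutomount line and don't want jobs using swap):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedSwapSpace=0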

I will also recommend setting ConstrainDevices=yes in the near future, but don't do it right now. Let's get the current issues resolved first. ConstrainDevices=yes will ensure that jobs can't use devices such as GPUs that they weren't allocated.

> - I don't think we have left any GB to give space for slurm daemon. Is 10GB
> enough space for the daemon?
10 GB is plenty of space for the OS. You will probably be fine with just 5 GB.


Can you make these changes to cgroup.conf and slurm.conf, then restart the Slurm daemons? Then please resume the downed nodes (scontrol update nodename=<names of down nodes> state=resume).
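For example (a sketch assuming systemd-managed daemons and a hypothetical node list - substitute your actual down nodes):

systemctl stop slurmctld                 # on the head node first
systemctl restart slurmd                 # then on every compute node
systemctl start slurmctld                # then back on the head node
scontrol update nodename=mrcd[87-88] state=resume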

Then let's wait for 5-10 minutes. If you're still seeing node unresponsive problems after this, we can try setting CoreSpecCount=1.


Finally, can you upload the slurmctld log file and slurmd log files from a couple downed nodes for the last few hours?
Comment 24 Marshall Garey 2020-09-22 16:41:48 MDT
RE the severity of the ticket - since you are the admin and know the state of your cluster better than I do, I'll defer to you on the severity. But it doesn't have to stay sev-1 for the entire time it's open - I'm hoping we can at least get your cluster stable or semi-stable so that things are working (or mostly working) overnight. Perhaps we can get it to the point where nodes are mostly staying up, even if a handful occasionally go down. In that case it would be better classified as a sev-2 or sev-3.

Here's how we define severity levels: (https://www.schedmd.com/support.php)

    Severity 1 — Major Impact
    A Severity 1 issue occurs when there is a continued system outage that affects a large set of end users. The system is down and non-functional due to Slurm problem(s) and no procedural workaround exists.

    Severity 2 — High Impact
    A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system.

    Severity 3 — Medium Impact
    A Severity 3 issue is a medium-to-low impact problem that includes partial non-critical loss of system access or which impairs some operations on the system but allows the end user to continue to function on the system with workarounds.

    Severity 4 — Minor Issues
    A Severity 4 issue is a minor issue with limited or no loss in functionality within the customer environment. Severity 4 issues may also be used for recommendations for future product enhancements or modifications.


Anyway, please make the changes I recommended, upload the log files I requested, and let's see if the situation improves. In the meantime I'll look at the log files.
Comment 25 Marshall Garey 2020-09-22 16:50:13 MDT
Sorry, a clarification on my last post - I said I'm hoping we can get things mostly stable, but I really hope that we get it all the way stable. :)

Also, a couple clarification questions about the cluster:

How many nodes are down? Are you manually resuming them, or just leaving them down?
How often do nodes go down? From the call it wasn't clear to me whether nodes are constantly going down or only occasionally.
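If it helps to quantify, sinfo can list the downed nodes and the reason Slurm recorded for each:

sinfo -R            # down/drained nodes with reasons
sinfo -N -t down    # one line per node currently down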
Comment 26 Sarvani Chadalapaka 2020-09-22 17:00:51 MDT
Created attachment 15994 [details]
updated_slurmctld.log

It still seems like there are node failures.
Comment 27 Sarvani Chadalapaka 2020-09-22 17:02:30 MDT
Created attachment 15995 [details]
updated_slurmd_mrcd69

This is the node that failed after modifying cgroup.conf and slurm.conf (to account for OS and Slurm memory).
Comment 28 Marshall Garey 2020-09-22 17:13:39 MDT
These log files only cover a couple of seconds. Can you upload the log files covering the last few hours?
Comment 30 Sarvani Chadalapaka 2020-09-22 17:16:03 MDT
Thanks for the notes Marshall - yes, please let's get to a semi-stable state before dialing this down to sev-2 :)

How many nodes are down? Are you manually resuming them, or just leaving them down?
- Slurm automatically shows them back online. 

How often do nodes go down? In the call it wasn't clear to me if nodes are constantly going down or occasionally going down?
- It varies. Some days many nodes go down, and other days not many.
For example, in the past hour there were a LOT of failures, while on the 14th there was one node failure and on the 15th there were 4. On the 19th there were 29 separate node failure instances.

- Also, we realized that we did indeed leave a gap between the memory in slurm.conf and the actual on-the-node system memory.
Comment 31 Sarvani Chadalapaka 2020-09-22 17:18:22 MDT
- Setting the ticket to sev-2.

- Setting cgroup.conf - CoreSpecCount=1

- Will upload the slurmctld and slurmd logs for the past few hours.
Comment 32 Marshall Garey 2020-09-22 17:29:20 MDT
The fact that Slurm automatically marks the nodes up again means that slurmd does eventually respond to a ping from slurmctld, but it sometimes takes a long time.

Also, CoreSpecCount is in slurm.conf on the NodeName line, not cgroup.conf. Make sure to restart all slurmd's and slurmctld after applying that setting.
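For example, on a hypothetical node definition (substitute your real CPU and memory values):

NodeName=mrcd[01-104] CPUs=24 RealMemory=92160 CoreSpecCount=1 State=UNKNOWN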

Looking forward to the new logs
Comment 33 Sarvani Chadalapaka 2020-09-22 17:30:11 MDT
Created attachment 15997 [details]
complete_log_failed_node
Comment 34 Marshall Garey 2020-09-22 17:53:56 MDT
So far in the slurmd log I just see it occasionally failing to contact slurmctld for a few seconds, but that doesn't tell me why.

At 15:47:19 is when a batch slurmstepd can't contact slurmctld.
At 15:48:47 slurmd appears to receive a registration RPC from slurmctld and tries to respond. It can't connect to slurmctld for over a minute, but then succeeds at 15:49:59.


So the time period we're interested in is 15:47:19 to 15:49:59.

Do you have the slurmctld log file? That would help give a more complete picture of what's going on. Can you also get dmesg logs from 15:47:19 to 15:49:59 on mrcd69? And are there any network-related logs from that time?

I'm wondering if slurmctld is really busy at these times, or if there's some sort of network issue, or if the slurmd node was really busy, or something else.
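If it helps, one way to pull just that window (assuming a util-linux dmesg that supports human-readable timestamps) is:

dmesg -T | grep -E 'Sep 22 15:4[7-9]'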
Comment 37 Sarvani Chadalapaka 2020-09-22 18:55:22 MDT
Created attachment 15998 [details]
Complete_slurmctld_log
Comment 38 Sarvani Chadalapaka 2020-09-22 19:02:35 MDT
Marshall,

I am heading out for the day. I will get the dmesg logs from mrcd69 early tomorrow.

Appreciate your support today - I will touch base with you tomorrow.

-Cheers!
Sarvani
Comment 39 Sarvani Chadalapaka 2020-09-23 12:52:12 MDT
Created attachment 16012 [details]
dmesg_22Sep_mrcd69.log

Note that there are no dmesg logs in the required time window on the compute nodes.
Comment 41 Marshall Garey 2020-09-25 17:15:52 MDT
What's it been like the last couple of days since you made the changes?


I've looked at the logs and unfortunately have only one theory at the moment. I researched the "Communication connection failure" messages - they can mean one of three things:

(1) The port is not open on the destination machine.

This seems unlikely, since slurmctld does respond to messages.

(2) The port is open on the destination machine, but its backlog of pending connections is full.

This seems the most likely to me - basically, slurmctld has a lot of clients trying to talk to it at the same time and it maxes out the connections. Even though you don't have a particularly large cluster, I'd like to point you to our "big_sys" guide:

https://slurm.schedmd.com/big_sys.html

From that page:

"
    /proc/sys/fs/file-max: The maximum number of concurrently open files. We recommend a limit of at least 32,832.
    /proc/sys/net/ipv4/tcp_max_syn_backlog: Maximum number of remembered connection requests, which still have not received an acknowledgment from the connecting client. The default value is 1024 for systems with more than 128Mb of memory, and 128 for low memory machines. If server suffers of overload, try to increase this number.
    /proc/sys/net/core/somaxconn: Limit of socket listen() backlog, known in userspace as SOMAXCONN. Defaults to 128. The value should be raised substantially to support bursts of request. For example, to support a burst of 1024 requests, set somaxconn to 1024.
"

What values do you have set for these three parameters? I recommend following the advice on our webpage.
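For example, you can check the current values and raise them persistently like this (the file name and numbers below are illustrative - use the recommendations from the big_sys page):

sysctl fs.file-max net.ipv4.tcp_max_syn_backlog net.core.somaxconn

# e.g. in /etc/sysctl.d/90-slurm.conf:
#   fs.file-max = 65536
#   net.ipv4.tcp_max_syn_backlog = 4096
#   net.core.somaxconn = 4096
# then apply with:
sysctl --system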




(3) A firewall between the client and server is blocking access (also check local firewalls).

This also seems unlikely, since some messages are getting through. Just to check though - do you have any firewalls on slurmctld and compute nodes?
Comment 42 Marshall Garey 2020-09-30 14:27:03 MDT
I'm downgrading this to sev-3 since I haven't heard from you for the last few days.
Comment 43 Marshall Garey 2020-10-08 11:36:36 MDT
Are you still experiencing this issue?
Comment 44 Marshall Garey 2020-10-14 15:09:21 MDT
I'm closing this as timedout. Please re-open it if you still have issues.