Ticket 8851 - Node not responding
Summary: Node not responding
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 20.02.1
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-04-13 17:03 MDT by Anthony DelSorbo
Modified: 2020-05-04 06:36 MDT
1 user

See Also:
Site: NOAA
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: NESCC
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Niagara system slurm.conf (5.85 KB, text/plain)
2020-04-13 17:10 MDT, Anthony DelSorbo
Details
n1c03 syslogs (1.69 MB, text/plain)
2020-04-13 17:12 MDT, Anthony DelSorbo
Details
n1c07 syslogs (1.53 MB, text/plain)
2020-04-13 17:13 MDT, Anthony DelSorbo
Details
Niagara slurmctld logs (406.42 KB, application/x-gzip)
2020-04-13 17:15 MDT, Anthony DelSorbo
Details

Description Anthony DelSorbo 2020-04-13 17:03:32 MDT
Although nodes not responding is nothing new to us, we're concerned because our Niagara system (only 25 nodes) has seen all of its nodes crash over the past 48 hours.  These are the messages from our Slurm controller server, which we've all seen before:

66314 Apr 11 20:53:27 nbqs1 slurmctld[146523]: error: slurm_receive_msgs: Socket timed out on send/recv operation
66315 Apr 11 20:53:27 nbqs1 slurmctld[146523]: agent/is_node_resp: node:n1c03 RPC:REQUEST_PING : Socket timed out on send/recv operation
66316 Apr 11 20:53:28 nbqs1 slurmctld[146523]: error: Nodes n1c03 not responding
66317 Apr 11 20:55:01 nbqs1 systemd: Started Session 78228 of user root.
66318 Apr 11 20:55:01 nbqs1 systemd: Started Session 78229 of user root.
66319 Apr 11 20:55:07 nbqs1 slurmctld[146523]: error: slurm_receive_msgs: Socket timed out on send/recv operation
66320 Apr 11 20:55:08 nbqs1 slurmctld[146523]: error: Nodes n1c03 not responding

I've mapped out the sequence of events for the job as it crosses nodes:

Apr 11 20:31:37 - User submits job (nbqs1 slurmctld[146523]: _slurm_rpc_submit_batch_job: JobId=114773 InitPrio=20000020 usec=2764)
Apr 11 20:31:37 - Job gets backfilled (backfill: Started JobId=114773 in niagara on n1c03)
Apr 11 20:31:39 - User's job output acknowledges job start (Start processing request.template50 at 2020-04-11 20:31:39)
Apr 11 20:35:48 - User's job phase 1 completes (Processing took: 0 days, 0 hours, 4 minutes, 9 seconds)
Apr 11 20:35:48 - User's job phase 2 starts (Start processing request.template.interp50 at 2020-04-11 20:35:48)
Apr 11 20:45:47 - Node fails NHC
Apr 11 20:45:47 - Node is set to DRAINING
Apr 11 20:53:27 - Slurmctld gets first socket timeout
Apr 11 20:55:07 - Slurmctld gets second socket timeout
Apr 11 20:56:50 - Slurmctld gets third socket timeout
Apr 11 20:57:29 - Job gets requeued (requeue job JobId=114773 due to failure of node n1c03)
Apr 11 20:57:29 - Node is set to DOWN (error: Nodes n1c03 not responding, setting DOWN)
Apr 11 20:57:29 - EPILOG-CTLD runs on slurmctld
Apr 11 21:05:08 - Job gets new node (sched: Allocate JobId=114773 NodeList=n1c07 #CPUs=40 Partition=niagara)

Apr 11 21:05:09 - Job starts all over on node n1c07 (Start processing request.template50 at 2020-04-11 21:05:09)

Apr 11 21:05:49 - User's job phase 2 completes, presumably on node n1c03 (Processing took: 0 days, 0 hours, 28 minutes, 1 seconds)


Apr 11 21:08:09 - User's job Ends (Start Epilog v19.07.23 on node n1c07 for job 114773 :: Sat Apr 11 21:08:09 UTC 2020)
Apr 11 21:08:09 - slurmctld agrees job ended ( _job_complete: JobId=114773 WEXITSTATUS 0)

It is particularly troublesome to see the timeouts being identified by the slurm controller, when in fact the original node (n1c03) did actually print out to the user's output file at 21:05:49 after the new node (n1c07) had restarted the job.  Still, it did not continue to print; it was the second node that actually brought the job to a close (or at least was the last writer to the output file).

So what I need to know is what happened to my n1c03 node?  

Why did it stop responding?  In the node's log, I do see a gap between 20:50:44 and 21:00:01

What can I change to prevent this issue from recurring?  

Why did the original node write to the output file when the new node should have owned that job?  Shouldn't the original node be smart enough to know the job is owned by another node?  

If the issue is job related, in that the job is causing the node to be overloaded and thus disrupting communication, what changes can I make to the slurm configuration or system configuration(s) that would leave sufficient resources available to the slurmd to ensure communication between the slurm controller and the node is not disrupted?

I'll upload a few files including the slurm.conf and the slurmctld.log files for your review.  Let me know what else you would like to have to better troubleshoot this issue.

Best,

Tony.
Comment 1 Anthony DelSorbo 2020-04-13 17:10:44 MDT
Created attachment 13769 [details]
Niagara system slurm.conf
Comment 2 Anthony DelSorbo 2020-04-13 17:12:54 MDT
Created attachment 13770 [details]
n1c03 syslogs
Comment 3 Anthony DelSorbo 2020-04-13 17:13:30 MDT
Created attachment 13771 [details]
n1c07 syslogs
Comment 4 Anthony DelSorbo 2020-04-13 17:15:18 MDT
Created attachment 13772 [details]
Niagara slurmctld logs
Comment 6 Felip Moll 2020-04-17 06:47:33 MDT
Hey Tony, how are you doing in these tough days?

It seems you are still seeing this issue, as a continuation of bug 7839 (and others).

> It is particularly troublesome to see the timeouts being identified by the
> slurm controller, when in fact the original node (n1c03) did actually print
> out to the user's output file at 21:05:49 after the new node (n1c07) had
> restarted the job.

A stuck slurmd doesn't mean other processes can't still be alive and writing to the output.
In fact, slurmd can be dead while slurmstepd is still alive and running the job.

Something like this would explain output appearing after the timeouts seen by the ctld.


> So what I need to know is what happened to my n1c03 node?

These messages:
Apr 11 20:45:47 n1c03 nhc[257956]: Health check failed:  check_cmd_output:  139 returned by "/usr/sbin/ip a show ib0".
..
Apr 11 21:01:32 n1c03 nhc[280681]: Health check failed:  Script timed out while executing "check_cmd_status -t 1 -r 1 selinuxenabled".

make me think NHC was stuck, and maybe slurmd as well. From the info I have now I can only suspect an I/O or network issue.

> Why did it stop responding?  In the node's log, I do see a gap between
> 20:50:44 and 21:00:01

I'd like to see a "dmesg -T" like we saw in bug 7839 comment 15.

Note also I see that:


Apr 11 20:50:44 n1c03 slurmd[185579]: debug:  attempting to run health_check [/apps/slurm/tools/bin/nhc-slurm]
- - - GAP GAP GAP GAP GAP GAP - - - 
Apr 11 21:00:01 n1c03 pam_slurm_adopt[278621]: Ignoring root user
Apr 11 21:00:01 n1c03 sshd[278621]: Accepted publickey for root from 10.190.16.238 port 58218 ssh2: RSA ..
..
Apr 11 21:00:02 n1c03 systemd: Stopping Slurm node daemon...
Apr 11 21:00:31 n1c03 sshd[278621]: Received disconnect from 10.190.16.238 port 58218:11: disconnected by user
Apr 11 21:00:31 n1c03 sshd[278621]: Disconnected from 10.190.16.238 port 58218
Apr 11 21:00:31 n1c03 sshd[278621]: pam_unix(sshd:session): session closed for user root
...
Apr 11 21:01:32 n1c03 systemd: slurmd.service stop-sigterm timed out. Killing.
Apr 11 21:01:32 n1c03 systemd: slurmd.service: main process exited, code=killed, status=9/KILL
Apr 11 21:01:32 n1c03 systemd: Stopped Slurm node daemon.
Apr 11 21:01:32 n1c03 systemd: Unit slurmd.service entered failed state.
Apr 11 21:01:32 n1c03 systemd: slurmd.service failed.
Apr 11 21:01:32 n1c03 systemd: Starting Slurm node daemon...

Which makes me think:
1. Slurmd was running NHC and got stuck (maybe because it could not read from /apps?).
2. Root accessed the system about 10 minutes later and restarted the slurmd daemon.
3. Root logged out.
4. Systemd timed out stopping slurmd (it did not respond to SIGTERM), so it issued a kill -9.
5. Then slurmd was started again.

It would have been very helpful if root, who appears to have logged in at 21:00, had analyzed what was going on at that very moment. Getting a backtrace of slurmd and checking the filesystems would probably have given us an important lead.

If this happens again, I encourage you to take these analysis steps.
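
For reference, a minimal sketch of that on-node triage, assuming standard CentOS tooling plus gdb (the paths and process names are the ones from this ticket; everything else is illustrative, not a prescribed procedure):

# Kernel-side view: hung-task warnings, OOM events, IB/Lustre errors, with timestamps
dmesg -T | tail -n 300

# Any slurmd/slurmstepd/NHC process stuck in uninterruptible I/O wait ("D" state)?
ps -eo pid,ppid,stat,wchan:32,args | grep -E 'slurmd|slurmstepd|nhc'

# If slurmd is alive but unresponsive, grab a backtrace before restarting it
# (needs gdb and, ideally, the slurm debuginfo packages)
gdb -p "$(pidof slurmd)" -batch -ex 'thread apply all bt'

# Quick responsiveness check of the shared filesystem that hosts NHC
timeout 10 ls -l /apps/slurm/tools/bin/nhc-slurm || echo "/apps looks hung"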

> What can I change to prevent this issue from recurring?  

Our first action is to analyze the dmesg output.
As a second action, I propose moving the NHC script to a crontab and removing it from slurm.conf.
At the same time, ask your network and filesystem engineers to check whether there was any issue at that time.

This script runs from here:

HealthCheckProgram=/apps/slurm/tools/bin/nhc-slurm

If the filesystem is stuck, slurmd can try to read nhc-slurm and get stuck itself, but I cannot confirm that this is the cause.
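
To illustrate the idea (a sketch only, not a tested change for your site): the script would live on the node's local image and be driven by cron instead of slurmd. The local path, schedule and timeout below are placeholders; the stock NHC packaging can mark the node offline in Slurm on its own when a check fails.

# slurm.conf: stop slurmd from running NHC off the shared /apps filesystem
#HealthCheckProgram=/apps/slurm/tools/bin/nhc-slurm

# /etc/cron.d/nhc on each compute node (illustrative schedule and timeout)
*/5 * * * * root /usr/sbin/nhc -t 60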

> Why did the original node write to the output file when the new node should
> have owned that job?  Shouldn't the original node be smart enough to know
> the job is owned by another node?  

Yes, but if slurmd is not responding it cannot communicate with the ctld.
As I said, the writer may well be a still-alive slurmstepd.

> If the issue is job related in that the job is causing the node to be
> overloaded and thus disrupt communication, what changes can I make to the
> slurm configuration or system configuration(s) that would leave sufficient
> resources available to the slurmd to ensure communication between the slurm
> controller and the node are not disrupted.  

If that is the case, we should analyze which resource is being exhausted by the job.
I think you have memory configured well. Let's wait for the dmesg.

> I'll upload a few files including the slurm.conf and the slurmctld.log files
> for your review.  Let me know what else you would like to have to better
> troubleshoot this issue.

Bug 6285 comment 36 and others, which happened back in 2018, also suggest issues with nodes being stateless.
Do you have swap?

Because of that bug you added the following to your slurm.conf:

> # Per SchedMD #6285 - 20190226 - TD
> LaunchParameters=slurmstepd_memlock

Nate then commented (in comment 41) that this won't change anything on stateless nodes.

So I would encourage you to remove this parameter.
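
The change itself is a one-liner in slurm.conf; a sketch, assuming slurmstepd_memlock is the only value set for LaunchParameters (as your excerpt above suggests):

# Before:
LaunchParameters=slurmstepd_memlock
# After: drop the option (remove the whole line if it carries no other values)
#LaunchParameters=slurmstepd_memlock

An "scontrol reconfigure" or a daemon restart, whichever your change process prefers, should be enough; only newly launched steps are affected, and already-running steps keep the old behavior.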

Just as a data point, this is not something new in your system; it has been happening on different clusters and with different versions (18.x, 19.x and 20.x).
Question: how often does it happen?
Comment 7 Felip Moll 2020-04-17 12:10:59 MDT
Tony,

Just for your info:

I partially reproduced the issue with a stuck filesystem and an NHC script on that filesystem.

When slurmd tries to access the file it gets stuck, and after a while I start seeing this in ctld:

slurmctld: error: slurm_receive_msgs: Socket timed out on send/recv operation

and the node is set DOWN.

That happens after a slurmd restart, with a job that may still be running there.
The difference is that I don't see a restart in your logs, but this could be a lead.
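
For anyone wanting to recreate a hung filesystem in a test environment (this is an assumption about one possible setup, not necessarily how the reproduction above was done), a FUSE mount whose daemon is frozen behaves much like a stuck Lustre mount:

# Test node only: export a scratch dir over sshfs (assumes passwordless ssh to localhost)
mkdir -p /mnt/fakefs
sshfs localhost:/tmp/nhc-src /mnt/fakefs
# Point the test node's HealthCheckProgram at a script under /mnt/fakefs

# Freeze the FUSE daemon: anything touching /mnt/fakefs now blocks on I/O
kill -STOP "$(pgrep -f 'sshfs localhost:/tmp/nhc-src')"
# ...observe the ctld timeouts, then unfreeze:
kill -CONT "$(pgrep -f 'sshfs localhost:/tmp/nhc-src')"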
Comment 8 Felip Moll 2020-04-24 04:32:06 MDT
To summarize what I am waiting for:

1. Our first action is to analyze dmesg output. Waiting for attachment.

2. As a second action, I propose moving the NHC script to a crontab and removing it from slurm.conf.

3. At the same time, ask your network and filesystem engineers to check if there was any issue at that time.

4. Remove this parameter-> LaunchParameters=slurmstepd_memlock

5. Question: how often does it happen?
Comment 9 Anthony DelSorbo 2020-04-27 13:17:35 MDT
(In reply to Felip Moll from comment #8)

2nd time I'm writing this.  Message must have gotten lost ....

> To summarize what I am waiting for:
> 
> 1. Our first action is to analyze dmesg output. Waiting for attachment.

This is a production system.  The nodes needed to be rebooted in order to get them back into service.  I doubt the latest dmesg output will provide you any answers.

> 
> 2. Our second action, I propose to move NHC script in a crontab and remove
> it from slurm.conf.

We don't run cron on our compute nodes.  Not sure why you're suggesting this.  NHC is the same tool SchedMD suggests, from Michael Jennings.  It has been running quite well on our nearly 1600-node production system.

> 3. At the same time, ask your network and filesystem engineers to check if
> there was any issue at that time.

No news as of yet.  Have just re-asked the question.

> 
> 4. Remove this parameter-> LaunchParameters=slurmstepd_memlock

Please provide justification for this change.  This is a (smaller) production system and making changes is a lengthy, involved process.  What would the risk and impact be of making this change?

> 
> 5. Question: how often does it happen?

This is the first time it's happened since the system was put into production.  A few months now.
Comment 10 Felip Moll 2020-04-28 10:48:19 MDT
(In reply to Anthony DelSorbo from comment #9)
> (In reply to Felip Moll from comment #8)
> 
> 2nd time I'm writing this.  Message must have gotten lost ....
> 
> > To summarize what I am waiting for:
> > 
> > 1. Our first action is to analyze dmesg output. Waiting for attachment.
> 
> This is a production system.  The nodes needed to be rebooted in order to
> get them back in service.  I doubt the latest will provide you any answers.

I would ask you to get the dmesg info when the issue happens again; I don't really have much to go on right now to properly diagnose it.
Please work through the other actions I suggested and let me know if you have more information.

> > 
> > 2. Our second action, I propose to move NHC script in a crontab and remove
> > it from slurm.conf.
> 
> We don't run cron on our compute nodes.  Not sure why you're suggesting
> this.  NHC is the same as what SchedMD suggests from Michael Jennings.  Has
> been running quite well on our nearly 1600 node production system.  

NHC is a great piece of software but you are running it from:

/apps/slurm/tools/bin/nhc-slurm

I wanted to remove possible runtime dependencies on this shared filesystem, and NHC is one of them.
That would help us rule out at least one piece of the puzzle.

- Is /apps/ the same filesystem as the one in bug 8826?
- Is that a Lustre filesystem under a Dell Isilon NAS?
- Which kernel version do you use on this system?

> > 3. At the same time, ask your network and filesystem engineers to check if
> > there was any issue at that time.
> 
> No news as of yet.  Have just re-asked the question.

That's important. Is your test cluster (juno) using the same network, switches, or filesystem as the production system?

> > 4. Remove this parameter-> LaunchParameters=slurmstepd_memlock
> 
> Please provide justification for this change.  This is a (smaller)
> production system and making changes is a lengthy involved process.  What
> would the risk an impact be for making this change?

This was an option added incorrectly, as I commented before.
It makes the slurmstepd memory always stay in RAM and never be swapped.
Your nodes are diskless, so it is an unneeded feature that won't make any difference. Removing it should be a low-impact change.

This was introduced in 17.02, bug 2334 commit bd3f87e947c:

 -- Optionally lock slurmstepd in memory for performance reasons and to avoid
    possible SIGBUS if the daemon is paged out at the time of a Slurm upgrade
    (changing plugins). Controlled via new LaunchParameters options of
    slurmstepd_memlock and slurmstepd_memlock_all.

The technical difference is that with this option, during slurmstepd init there will be a call to mlockall() with the MCL_CURRENT and MCL_FUTURE flags.

According to the manpage of mlockall():

mlockall()  locks  all  pages mapped into the address space of the calling process.  This includes the pages of the code,
data and stack segment, as well as shared libraries, user space kernel data, shared memory, and memory-mapped files.
All mapped pages are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay
in RAM until later unlocked.

MCL_CURRENT Lock all pages which are currently mapped into the address space of the process.
MCL_FUTURE  Lock all pages which will become mapped into the address space of the process in the future.  These could be,
            for instance, new pages required by a growing heap and stack as well as new memory-mapped files or shared memory regions.

If MCL_FUTURE has been specified, then a later system call (e.g., mmap(2), sbrk(2), malloc(3)), may fail if it would cause the number
of locked bytes to exceed the permitted maximum (see below). In the same circumstances, stack growth may likewise fail: the kernel will
deny stack expansion and deliver a SIGSEGV signal to the process.

There are other implications that I think are not good at all, and this option should only be used in very specific situations.
Specifically, if slurmstepd memory grows beyond the RLIMIT_MEMLOCK limit, there will be segfaults.
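
If you want to check whether that limit could ever bite on your nodes, a quick illustrative check from a compute node (the pgrep only works while a step is running there):

# Locked-memory limit that slurmd (and therefore slurmstepd) runs with
grep 'Max locked memory' /proc/$(pidof slurmd)/limits

# Memory a running slurmstepd currently has locked; with slurmstepd_memlock
# removed this should read 0 kB for newly launched steps
grep VmLck /proc/$(pgrep -of slurmstepd)/status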

> > 5. Question: how often does it happen?
> 
> This is the first time it's happened since the system has been put in
> production.  A few months now.

Interesting, I thought this was happening more often. That will make it more difficult to catch.
Please get the dmesg output, or reserve the node for our analysis, on the next failure.


I'd also like to know your feeling about this issue: do you think it is something that only happened once? Do you have any particular suspicions?
Comment 11 Anthony DelSorbo 2020-04-29 08:56:32 MDT
(In reply to Felip Moll from comment #10)

> I would ask you to get the dmesg info when the issue happens again....

Will do.


 
> /apps/slurm/tools/bin/nhc-slurm
...
> -Is /apps/ the same filesystem than the one in bug 8826 ?

No.  Niagara is a completely separate cluster and does not use any filesystem used by the test cluster (Juno) or the bigger production cluster (Hera).

> -Is that a Lustre filesystem under a Dell Isilon NAS?

It is a Lustre filesystem on an EXAScaler 14KX-based DDN system.

> -Which kernel versions do you use in this system?

CentOS with 3.10.0-1062.18.1.el7.x86_64

> 
> > > 3. At the same time, ask your network and filesystem engineers to check 

No root cause could be determined.  There is nothing in the logs at all that would even point anyone toward a particular troubleshooting path.
> 
> That's important. Is your test cluster (juno) using the same network,
> switches or filesystem than the production system?

Completely isolated (except for the login and log management networks, of course).

> 
> > > 4. Remove this parameter-> LaunchParameters=slurmstepd_memlock

Thanks for the explanation.  We'll submit a change request and remove it.  As I read your explanation, it didn't seem to be a high impact change.

> 
> > > 5. Question: how often does it happen?
... 
> Interesting, I though this was happening more often. That will be more
> difficult to catch.

Agreed.  My main concern is that it affected the entire cluster, and only after we recently updated to 20.02.1.

> I'd like also to know about your feelings in what regards to this issue, do
> you think it it's been something only happened one time? Do you have any
> particular suspicions?

Honestly, anytime I deal with anything on Lustre the playbook goes out the window.  My opinion is that Lustre is not reliable and any glitch in the system will throw it off.  Especially since it communicates over InfiniBand.  All our communication is also over InfiniBand on this cluster - which is different from our Juno and Hera clusters.

However, there's usually some clue that Lustre had a reaction of some sort.  This is not evident in the logs we have.  So, we're all puzzled.  Hence the reason for this ticket.  Had it been evident that it was a Lustre or InfiniBand issue, we would not have involved SchedMD.

At this point, I don't see a reason to keep this ticket open - unless there's additional information you want me to provide.  We can close it if you'd like, and I'll just re-open it should the issue recur once I've taken the actions you suggested above.

Best,

Tony.
Comment 12 Felip Moll 2020-05-04 06:36:11 MDT
> Honestly, anytime I deal with anything on Lustre the playbook goes out the
> window.  My opinion is that Lustre is not reliable and any glitch in the
> system will throw it off.  Especially since it communicates on InfiniBand. 
> All our communication is also over InfiniBand on this cluster - which is
> different from out Juno and Hera clusters.  
> 
> However, there's usually some clue that Lustre had a reaction of some sort. 
> This is not evident in the logs we have.  So, we're all puzzled.  Hence the
> reason for this ticket.  Had it been evident that it was a Lustre or
> InfiniBand issue, we would not have involved SchedMD.
> 
> At this point, I don't see a reason to keep this ticket open - unless
> there's additional information you want me to provide you.  We can close it
> it you'd like and I'll just re-open it should the issue recur once I take
> the actions you suggested above.
> 
> Best,
> 
> Tony.

Sadly, I am out of ideas at the moment. I am concerned about the NHC scripts getting stuck, and I think their living in /apps may have something to do with it. I am not blaming NHC itself, but its interaction with IB, or opening/reading/writing something on the Lustre filesystem, may have put the process into I/O wait.
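
If it does recur, a minimal way to test that theory from the node itself (the <pid> placeholder is whichever slurmd or NHC process is hung at the time):

# Uninterruptible sleep ("D" state) almost always means stuck on I/O
ps -eo pid,stat,wchan:32,args | awk '$2 ~ /D/'

# Kernel stack of the hung process; Lustre/ptlrpc or IB symbols here would point
# straight at the filesystem or network path (run as root)
cat /proc/<pid>/stack

# Is the filesystem holding NHC answering at all?
timeout 10 stat /apps/slurm/tools/bin/nhc-slurm || echo "/apps not responding"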

Let me know if it happens again and just re-open.

Thanks and sorry for not being of more help in this one.