Ticket 5810 - GPU nodes state unstable
Summary: GPU nodes state unstable
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld (show other tickets)
Version: 18.08.0
Hardware: Linux
Severity: 2 - High Impact
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-10-03 20:08 MDT by Anthony DelSorbo
Modified: 2019-02-07 15:08 MST (History)
0 users

See Also:
Site: NOAA
NOAA Site: NESCC
Machine Name: Theia
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
bqs3 slurmctld.log file. (7.80 MB, application/x-bzip)
2018-10-03 20:08 MDT, Anthony DelSorbo
Details
slurm.conf file (5.89 KB, text/plain)
2018-10-03 20:09 MDT, Anthony DelSorbo
Details
bqs3 sdiag output (1.97 KB, text/plain)
2018-10-04 15:24 MDT, Anthony DelSorbo
Details
bqs3 slurmctld.log reduced output (790.35 KB, text/plain)
2018-10-04 15:27 MDT, Anthony DelSorbo
Details
slurmd extract from syslog for node tg001 (74.64 KB, text/plain)
2018-10-04 15:32 MDT, Anthony DelSorbo
Details
sdiag output (1.77 KB, text/plain)
2018-11-07 17:43 MST, Anthony DelSorbo
Details
sinfo (3.54 KB, text/plain)
2018-11-07 17:44 MST, Anthony DelSorbo
Details
slurmctld log since restart yesterday. (1.53 MB, application/x-bzip)
2018-11-07 17:45 MST, Anthony DelSorbo
Details
Theia slurm.conf (5.77 KB, text/plain)
2018-12-18 10:34 MST, Anthony DelSorbo
Details
Theia topology.conf (15.33 KB, text/plain)
2018-12-18 10:36 MST, Anthony DelSorbo
Details
Node tg002 syslog contains slurmd logs (728.19 KB, text/plain)
2018-12-19 17:53 MST, Anthony DelSorbo
Details
slurmctld log (45.24 MB, text/plain)
2018-12-19 17:58 MST, Anthony DelSorbo
Details

Description Anthony DelSorbo 2018-10-03 20:08:23 MDT
Created attachment 7943 [details]
bqs3 slurmctld.log file.

We have installed Slurm on our production system, Theia, in preparation for production testing.  We have a set of nodes, all GPU, assigned to the  "fge" partition, with node designation tg[001-100].  

These GPU nodes consistently oscillate between the idle and down states.  There are no user jobs on these nodes.  The state changes occur within minutes of each other.  Here is a snippet showing sinfo output over several minutes:

fge          up    8:00:00     67  idle* tg[002,018-027,029-034,036-058,060-068,070-087]
fge          up    8:00:00     33   idle tg[001,003-017,028,035,059,069,088-100]

fge          up    8:00:00      4  down* tg[054-057]
fge          up    8:00:00     54   idle tg[004,008-023,028,035,058-092]
fge          up    8:00:00     42   down tg[001-003,005-007,024-027,029-034,036-053,093-100]

fge          up    8:00:00     21  down* tg[043-046,048-051,053-057,093-100]
fge          up    8:00:00     54   idle tg[004,008-023,028,035,058-092]
fge          up    8:00:00     25   down tg[001-003,005-007,024-027,029-034,036-042,047,052]


When slurmd starts on the node, it reports:
CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=257846 TmpDisk=128923 Uptime=9763 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

The nodes are defined in slurm.conf as follows:
NodeName=tg[001-100]            CPUs=20 CoresPerSocket=10 RealMemory=256000 ThreadsPerCore=1 

The entire system also contains over 1200 other nodes.  These are stable.  

It appears to be a communication issue as I see entries in the log of the form:

Error connecting slurm stream socket at <_IP_>:6818: Connection refused

Please assist in isolating this issue.

I've included the slurmctld log file and will also upload the slurm.conf file, shortly.

Thanks,

Tony.
Comment 1 Anthony DelSorbo 2018-10-03 20:09:25 MDT
Created attachment 7944 [details]
slurm.conf file
Comment 2 Marshall Garey 2018-10-04 09:22:55 MDT
Hi Anthony,

Thanks for the slurmctld log file and slurm.conf. I'm reviewing those. I agree that it looks like a communication problem.

Can you also upload a slurmd log file from one of the afflicted nodes?

- Marshall
Comment 3 Marshall Garey 2018-10-04 09:26:27 MDT
Can you also try setting SlurmctldDebug=debug2? debug3 is extremely verbose and usually doesn't really help us much. Lowering it might help with responsiveness in the slurmctld. In fact, many sites run at verbose or debug. For now, I think it will be useful to try setting it to debug2.

Can you also upload the output of the sdiag command?
Comment 4 Anthony DelSorbo 2018-10-04 15:24:46 MDT
Created attachment 7959 [details]
bqs3 sdiag output
Comment 5 Anthony DelSorbo 2018-10-04 15:27:58 MDT
Created attachment 7960 [details]
bqs3 slurmctld.log reduced output

After sending out an scontrol reconfigure (around 20:43 UTC) I pruned this log to remove most of the outdated info (you have a good chunk of it in the upload I provided yesterday).
Comment 6 Anthony DelSorbo 2018-10-04 15:32:36 MDT
Created attachment 7961 [details]
slurmd extract from syslog for node tg001

This data is not from a slurmd.log as that's not currently enabled.  However, I do have the same data (?) going to the syslog (see slurm.conf).  So I grepped only the slurmd lines from tg001's syslog output.  I'm hopeful this is what you're looking for.
Comment 7 Anthony DelSorbo 2018-10-04 15:34:33 MDT
Any chance there's a daemon performance test we can run between the controller and the node(s) in question?
Comment 8 Marshall Garey 2018-10-04 16:53:53 MDT
I'm not sure if there's such a test, but I'll look into it more.

Can you ping the nodes? Are all the problem nodes on the same rack or close together physically? Have you checked into the possibility of a network problem?

Have you investigated a potential firewall issue? (I doubt this is it, but it's an idea.)
Comment 9 Anthony DelSorbo 2018-10-05 18:38:56 MDT
I've reduced the priority of this ticket to a minor issue until we resolve it.  We use IPoIB for communicating with the nodes.  We've been using datagram mode to date and noted that now we are seeing extremely poor performance since our most recent maintenance period which just happened to coincide with the first slurm installation on this production system.  Strangely, this doesn't seem to be affecting the existing Moab/Torque installation.

Today, we detected a kernel issue as identified in  https://access.redhat.com/solutions/3568891 which is preventing us from switching to connected mode where we expect to significantly increase the system's IPoIB performance.  We suspect the current kernel is also affecting the stability of datagram mode but have no real proof.

So, we've asked the customer for an emergency maintenance session to back-rev the kernel.  It's possible we'll be able to implement such a change next Thursday, 11 Oct.

At this time we ask you for any information you may have with respect to IPoIB performance in datagram vs connected mode as it relates to Slurm.  Further, if you have any additional tuning suggestions we should employ other than those found in the "High Throughput Computing Administration Guide" (none of which seemed to help in our case), we would appreciate hearing about them.

Thanks,

Tony.
Comment 10 Marshall Garey 2018-10-09 10:25:53 MDT
(In reply to Anthony DelSorbo from comment #9)
> At this time we ask of you for any information you may have with respect to
> IpoIB performance in datagram vs connected mode as it relates to Slurm.

I need to research this more. I'll get back to you.


> Further, if you have any additional tuning efforts we should employ other
> than those found in the "High Throughput Computing Administration Guide"
> (none of which seemed to help in our case), we would appreciate hearing
> about them.

We absolutely have tuning suggestions. However, many tuning suggestions are highly dependent on the workload. Will they be running mostly short jobs? Small jobs? Longer jobs? Bigger jobs? A mix of everything? Whenever you run performance tests would be a good time to get information from sdiag - run sdiag periodically (once or twice an hour) for a day or two during performance tests, and we can pursue more specific tuning options, particularly for SchedulerParameters.
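That periodic sdiag sampling is easy to script. Below is a minimal sketch; the sampler is parameterized, and the output directory, interval, and sample count shown in the example are assumptions, not values from this ticket.

```shell
#!/bin/sh
# Sketch: sample a command's output on a fixed interval, writing one
# timestamped file per sample. In real use CMD would be "sdiag"; the
# directory and 30-minute interval below are illustrative only.
capture_stats() {
    cmd=$1; outdir=$2; samples=$3; interval=$4
    mkdir -p "$outdir" || return 1
    i=0
    while [ "$i" -lt "$samples" ]; do
        # Intentional word splitting of $cmd so "sdiag" (or any command
        # plus arguments) can be passed as a single string.
        $cmd > "$outdir/stats.$(date +%Y%m%dT%H%M%S).$i"
        i=$((i + 1))
        [ "$i" -lt "$samples" ] && sleep "$interval"
    done
    return 0
}

# Example: two days of half-hourly sdiag snapshots.
# capture_stats sdiag /var/tmp/sdiag-history 96 1800
```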

Here are a few points of general advice, though I leave it up to you to decide what you want to or can incorporate. This is by no means comprehensive, so for more specific tuning advice we'd need to see sdiag and logs during performance tests.

StateSaveLocation should be on a fast filesystem, since slurmctld writes to and reads from it frequently.

Even if they aren't running a high-throughput environment, CommitDelay=1 in slurmdbd.conf will still help when there are a large number of job completion messages from the slurmctld to the slurmdbd at the same time.

MaxQueryTimeRange in slurmdbd.conf is also useful.

Slurmd log files should ideally be on a fast local filesystem rather than a shared filesystem. We recommend the debug level for all daemons (slurmctld, slurmd, slurmdbd) to be set to debug2 or less; usually debug is fine for slurmctld and gives us plenty of information. debug3 is extremely verbose and rarely gives us helpful debug information.

Optimizing your various prolog and epilog scripts and your job submit plugin is important - basically, avoid large queries when smaller, more specific ones will work.

Ensure that all jobs have a time limit, ideally that they have various time limits. If all jobs have the same time limit, or no time limit, the backfill scheduler won't be able to backfill jobs. Setting a short DefaultTime limit in partition definitions can encourage users to set their own time limits. You could also use the job submit plugin to reject jobs without a time limit explicitly set. You don't have to do either of these things, but however you do it, time limits of varying lengths are important for the backfill scheduler to do its job.
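As a sketch of that last suggestion, here is a hypothetical slurm.conf partition line. Every value is a placeholder loosely modeled on this site's fge partition, not a recommendation.

```
# Hypothetical fragment: a short DefaultTime nudges users to set their own,
# varied time limits, which the backfill scheduler needs to do its job.
PartitionName=fge Nodes=tg[001-100] DefaultTime=00:30:00 MaxTime=08:00:00 State=UP
```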
Comment 11 Marshall Garey 2018-10-09 10:51:22 MDT
(In reply to Anthony DelSorbo from comment #9)
> At this time we ask of you for any information you may have with respect to
> IpoIB performance in datagram vs connected mode as it relates to Slurm. 

Whether you use datagram or connected mode is totally independent of Slurm.

I'd like to close this ticket until the network problems are resolved. Once the network problems are resolved, if you're still seeing the same issues with node state, we can re-open the ticket. Is that okay with you?

For further tuning guidance, I suggest you open a new ticket after you've run performance tests (or just normal daily work) and have the sdiag results and the slurmctld log file during that time.
Comment 12 Anthony DelSorbo 2018-10-12 10:35:36 MDT
Agreed.  Please close ticket.  Will re-open if it turns out to be an alternate issue.

Tony.
Comment 13 Marshall Garey 2018-10-12 10:39:41 MDT
Sounds good. Closing as resolved/infogiven for now.
Comment 14 Anthony DelSorbo 2018-11-07 09:09:50 MST
I'd like to reopen this case as we've considerably improved the throughput from the nodes and still we get these particular nodes changing states.  

How can we determine what the slurmctld is detecting that's forcing these nodes to change state?  Perhaps if we start there, we can backtrack the issue to the root cause.
Comment 15 Marshall Garey 2018-11-07 09:32:42 MST
Hi Tony,

Just to clarify, those nodes are still oscillating between idle and down, correct? Can you upload sinfo output like you did in the original post? If it says down* (with an asterisk), that means the slurmctld couldn't contact the node for SlurmdTimeout seconds (default 300) and marked it down.

Have you made any changes to the slurm.conf for that cluster? If so, can you re-upload it?

Can you also upload a slurmctld log file? Just enough time to show the state of the nodes oscillating - if that's just 10 minutes, then that's great.

Thanks

- Marshall
Comment 16 Anthony DelSorbo 2018-11-07 17:42:13 MST
I'll attach the data you asked for and here's the additional data that was asked for at the meeting today.


The driver version is MLNX_OFED_LINUX-4.4-2.0.7.0-rhel7.5-x86_64 on our GPU nodes with the Mellanox cards. And here is the info for the Connect-IB driver:

filename:       /lib/modules/3.10.0-862.14.4.el7.x86_64/weak-updates/mlnx-ofa_kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
license:        Dual BSD/GPL
description:    Mellanox Connect-IB HCA IB driver
author:         Eli Cohen <eli@mellanox.com>
retpoline:      Y
rhelversion:    7.5
srcversion:     7908D47E05F39523F10BB57
depends:        mlx5_core,ib_core,mlx_compat
vermagic:       3.10.0-862.11.6.el7.x86_64 SMP mod_unload modversions 
parm:           dc_cnak_qp_depth:DC CNAK QP depth (uint)


Here's the info on the Mellanox Connect-IB driver on the controller node.

[root@bqs5 ~]# modinfo mlx5_ib
filename:       /lib/modules/2.6.32-754.6.3.el6.x86_64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
version:        2.2-1
license:        Dual BSD/GPL
description:    Mellanox Connect-IB HCA IB driver
author:         Eli Cohen <eli@mellanox.com>
retpoline:      Y
srcversion:     FA83AF4451243374C1940EC
depends:        mlx5_core,ib_core
vermagic:       2.6.32-754.6.3.el6.x86_64 SMP mod_unload modversions 
parm:           prof_sel:profile selector. Deprecated here. Moved to module mlx5_core (int)
Comment 17 Anthony DelSorbo 2018-11-07 17:43:22 MST
Created attachment 8249 [details]
sdiag output
Comment 18 Anthony DelSorbo 2018-11-07 17:44:17 MST
Created attachment 8250 [details]
sinfo
Comment 19 Anthony DelSorbo 2018-11-07 17:45:39 MST
Created attachment 8251 [details]
slurmctld log since restart yesterday.
Comment 22 Marshall Garey 2018-11-09 09:45:12 MST
Thanks. The driver/firmware versions look like they're compatible according to Mellanox:
http://www.mellanox.com/page/mlnx_ofed_matrix?mtag=linux_sw_drivers


It looks like there may be more than one problem here.

First, there are lots of munge errors:

[2018-11-07T00:12:15.814] debug:  Munge encode failed: Unable to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2018-11-07T00:12:17.817] error: If munged is up, restart with --num-threads=10
[2018-11-07T00:12:17.817] error: Munge encode failed: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
[2018-11-07T00:12:17.817] error: authentication: Munged communication error

[2018-11-07T00:12:25.964] debug:  Munge decode failed: Unable to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2018-11-07T00:12:25.966] error: If munged is up, restart with --num-threads=10
[2018-11-07T00:12:25.966] error: Munge decode failed: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
[2018-11-07T00:12:25.966] error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Munged communication error 


Can you verify that directory exists, that file exists, and that it is owned by user munge? Can you also restart the munge service with --num-threads=10 if you haven't already done that?
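A quick check along those lines might look like this sketch. The socket path is the one from the log above; the helper name and output strings are made up.

```shell
#!/bin/sh
# Sketch: verify a munge socket exists and has the expected owner.
# check_socket PATH OWNER -> prints a diagnosis, returns non-zero on failure.
check_socket() {
    path=$1; owner=$2
    if [ ! -S "$path" ]; then
        echo "FAIL: $path missing or not a socket"
        return 1
    fi
    actual=$(stat -c %U "$path") || return 1
    if [ "$actual" != "$owner" ]; then
        echo "FAIL: $path owned by $actual, expected $owner"
        return 1
    fi
    echo "OK: $path owned by $owner"
}

# check_socket /var/run/munge/munge.socket.2 munge
```

On EL7-style systems, munged's extra options (e.g. --num-threads=10) typically go in /etc/sysconfig/munge before restarting the service; verify the path on your distribution.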


Next, it looks like there are indeed still communication issues, indicated by sinfo with the asterisk next to the unresponsive nodes. Also indicated by log messages such as these:

453088 [2018-11-07T09:54:13.213] debug2: Tree head got back 26
453089 [2018-11-07T09:54:13.214] debug2: Tree head got back 27
453090 [2018-11-07T09:54:13.214] debug2: slurm_connect failed: Connection refused
453091 [2018-11-07T09:54:13.214] debug2: Error connecting slurm stream socket at 10.183.19.221:6818: Connection refused

[2018-11-07T16:43:09.060] error: Nodes tg[001,003-010,012,017,056,060-063] not responding
[2018-11-07T16:43:09.060] debug2: slurm_connect failed: Connection refused
[2018-11-07T16:43:09.060] debug2: Error connecting slurm stream socket at 10.183.19.217:6818: Connection refused


I noticed that the slurmctld log shows it loaded the topology/tree plugin, but the slurm.conf shows topology/tree commented out. Can you upload the slurm.conf file that matches this log, and your topology.conf file?

Can you verify that you can ping each node from the slurmctld, and that you can connect to the slurmd port (with something like telnet <ipaddress> <port>)? Also, that you can connect to the switches in topology.conf, and each switch in topology.conf can connect to the leaf nodes it connects to?

traceroute <bad node>

could be useful to verify that the route to the compute nodes from the slurmctld works, and that topology.conf is correct.
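Those per-node checks can be swept across the whole range with something like this sketch. The port number comes from the logs above; the use of bash's /dev/tcp and the one-second timeout are assumptions.

```shell
#!/bin/bash
# Sketch: test TCP reachability of the slurmd port on each node.
# Relies on bash's /dev/tcp pseudo-device and coreutils "timeout".
check_port() {
    host=$1; port=$2
    timeout 1 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# sweep PORT HOST... -> one line per host, open or UNREACHABLE.
sweep() {
    port=$1; shift
    for host in "$@"; do
        if check_port "$host" "$port"; then
            echo "$host: open"
        else
            echo "$host: UNREACHABLE"
        fi
    done
}

# Expand the node range with scontrol and sweep the slurmd port:
# sweep 6818 $(scontrol show hostnames 'tg[001-100]')
```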


Also, can you upload a slurmd log file from the same time period as this slurmctld log file?

I know this is a lot of requests. Thanks for your patience in debugging this.

- Marshall
Comment 23 Marshall Garey 2018-11-16 09:23:35 MST
Hi, I'm following up on comment 22.
Comment 25 Marshall Garey 2018-11-26 12:25:27 MST
Hi, I'm just following up again. I realize that you all might have been busy because of Thanksgiving, so I'll keep this open for another week before timing it out.
Comment 26 Marshall Garey 2018-12-03 09:20:06 MST
I'm closing this ticket as timed out for now. Please re-open when you have the requested information.
Comment 27 Anthony DelSorbo 2018-12-05 11:31:17 MST
Marshall,

Sorry for the long delay in response.  We've been working a lot of issues and details - some have fallen off the radar.

This one has been a real pain.  We're not sure what's causing this issue.  Our network performance is quite good between these nodes.  It does seem to be some sort of timeout as these nodes continue to switch between idle, idle*, and down* at random:

fge          up    8:00:00    100   idle tg[001-100]

a few minutes later:

fge          up    8:00:00      8  idle* tg[023,056-062]
fge          up    8:00:00     92   idle tg[001-022,024-055,063-100]

Then a few minutes after that: 

fge          up    8:00:00     18  idle* tg[002,025-039,041-042]
fge          up    8:00:00      6  down* tg[001,011,023,056-058]
fge          up    8:00:00     76   idle tg[003-010,012-022,024,040,043-055,059-100]


We've changed the TCPTimeout value to 60 from its default of 2.  This didn't make any difference other than the initial state changes aren't seen until a later time.  The above info is with the setting at 60.

So, at this point we need help in understanding what exactly isn't communicating to what. What's the process between the slurmd and slurmctld I should be monitoring to determine where it's failing?

To be clear, this occurs whether there is work (non slurm) being performed on those nodes or not.

This is a High Importance item now.  Without a resolution to this issue we can't put Slurm into production.

Thanks,

Tony.
Comment 28 Marshall Garey 2018-12-05 16:10:45 MST
No worries on the delayed response.

> What's the process between the slurmd and slurmctld I should
> be monitoring to determine where it's failing?

The slurmctld pings the nodes every SlurmdTimeout/3 seconds, or every 100 seconds if SlurmdTimeout is not defined. (ping_nodes() is called by the _slurmctld_background thread.) Most of these pings are simply a REQUEST_PING rpc, in which the slurmd simply responds to the slurmctld. Every 20 pings, the slurmctld will send a REQUEST_NODE_REGISTRATION_STATUS rpc, which involves a bit more back-and-forth communication between the slurmd's and the slurmctld. The routes these messages take between the slurmctld and slurmd's depends on the route plugin defined in slurm.conf. If a node doesn't respond successfully within SlurmdTimeout seconds, the slurmctld marks the node as down.
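The cadence described above can be restated as a tiny helper — a sketch of the stated rule, not Slurm's actual code:

```shell
#!/bin/sh
# Sketch of the stated rule: nodes are pinged every SlurmdTimeout/3 seconds,
# or every 100 seconds when SlurmdTimeout is not defined (treated here as 0).
ping_interval() {
    t=${1:-0}
    if [ "$t" -le 0 ]; then
        echo 100
    else
        echo $((t / 3))
    fi
}

# With the default SlurmdTimeout of 300, pings go out every 100 seconds.
```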

The previous log files you uploaded indicated there were communication issues between the slurmctld and certain IP addresses.

Can you use the ping command from the slurmctld to each of the problematic compute nodes? (I'm guessing yes.)
Can you connect to the slurmd port on each of those nodes from the slurmctld and vice versa (from the slurmd's to the slurmctld and between each other) with something like telnet <ipaddress> <port>? If not, can you also verify that a firewall isn't blocking ports?
Can you also use traceroute <bad node> to see the route from the slurmctld to the slurmd?


A current slurm.conf file would also be helpful, as well as the topology.conf file if you're using a topology plugin.
Comment 29 Anthony DelSorbo 2018-12-06 11:25:07 MST
Marshall,

Thanks for the info.  We're working to get additional data for you.  In the meantime, is there a means to tell slurm to use TCP instead of UDP?

Tony.
Comment 30 Marshall Garey 2018-12-06 12:00:54 MST
Slurm already uses TCP. There isn't any way to change it regardless.
Comment 31 Marshall Garey 2018-12-17 09:53:56 MST
I realize that a lot of people will be gone during the holiday season, and it may take some time to do the tests, so I'm closing this as timed out for now. It will automatically re-open when you reply. We can't see anything in Slurm that will cause the behavior you're seeing; it still looks like a network problem. Do the tests to debug the network and make sure it isn't a network problem. If it isn't, when you re-open it we can take a look at the results.
Comment 32 Anthony DelSorbo 2018-12-18 10:32:33 MST
Marshall,

I'm including the slurm.conf and topology.conf files you requested.  We're still troubleshooting this issue.  

While troubleshooting one of the nodes (see below, for example), I noted that none of them are sending out "debug5" level to the logs.  The node shown below has even had its slurmd started anew.  As you can see, there's basic information shown only.  

Dec 18 15:45:53 tg017 slurmd[33502]: Message aggregation disabled
Dec 18 15:45:53 tg017 slurmd[33502]: topology tree plugin loaded
Dec 18 15:45:53 tg017 slurmd[33502]: Warning: Note very large processing time from slurm_topo_build_config: usec=106532 began=15:45:53.108
Dec 18 15:45:53 tg017 slurmd[33502]: route default plugin loaded
Dec 18 15:45:53 tg017 slurmd[33502]: Resource spec: system cgroup memory limit set to 2000 MB
Dec 18 15:45:53 tg017 slurmd[33502]: task affinity plugin loaded with CPU mask 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000080000
Dec 18 15:45:53 tg017 slurmd[33502]: Munge cryptographic signature plugin loaded
Dec 18 15:45:53 tg017 slurmd[33506]: slurmd version 18.08.3 started
Dec 18 15:45:53 tg017 slurmd[33506]: slurmd started on Tue, 18 Dec 2018 15:45:53 +0000
Dec 18 15:45:53 tg017 slurmd[33506]: CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=257846 TmpDisk=128923 Uptime=564255 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)


After tailing the syslogs for an extended period of time, we see no additional slurmd debug information.  I would have expected at least some information regarding the reason why the node goes on or off line.  At the very least, one would think that the slurmd on that node should indicate that it hasn't been contacted by the slurmctld, or at least acknowledge that it's in debug5 mode.

Also, there is not one entry in the /var/log/slurm/slurmd.log.  Of course, we'd rather that the entries go directly to the syslogs.   

[root@tg017 ~]# ll /var/log/slurm/
total 0
-rw-r--r-- 1 slurm slurm 0 Dec 12 03:04 slurmd.log


Tue Dec 18 17:06:58 UTC 2018
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
fge          up    8:00:00      1  down* tg017
fge          up    8:00:00     98   idle tg[001-016,018-022,024-100]
fge          up    8:00:00      1   down tg023

To answer some of your other questions:

Yes, we can ping the nodes.

[root@tg017 ~]# ping bqs3
PING bqs3 (10.183.0.170) 56(84) bytes of data.
64 bytes from bqs3 (10.183.0.170): icmp_seq=1 ttl=64 time=0.141 ms
64 bytes from bqs3 (10.183.0.170): icmp_seq=2 ttl=64 time=0.144 ms
...
[root@bqs3 slurm]# ping tg017
PING tg017 (10.183.35.59) 56(84) bytes of data.
64 bytes from tg017 (10.183.35.59): icmp_seq=1 ttl=64 time=0.161 ms
64 bytes from tg017 (10.183.35.59): icmp_seq=2 ttl=64 time=0.189 ms
...


Yes, traceroute works just fine, and traffic is not routed through any router - it's a single hop from an IP standpoint.  There is a hierarchy of InfiniBand switches as it traverses the IB tree.

[root@bqs3 slurm]# traceroute tg017
traceroute to tg017 (10.183.35.59), 30 hops max, 60 byte packets
 1  tg017 (10.183.35.59)  0.119 ms  0.108 ms  0.084 ms

[root@tg017 ~]# traceroute bqs3
traceroute to bqs3 (10.183.0.170), 30 hops max, 60 byte packets
 1  bqs3 (10.183.0.170)  0.155 ms  0.111 ms  0.109 ms


We have no firewalls between nodes.

Can I telnet to the host?  Yes:

[root@bqs3 slurm]# telnet 10.183.35.59 6818
Trying 10.183.35.59...
Connected to 10.183.35.59.
Escape character is '^]'.

and, just to confirm munge:

[root@bqs3 slurm]# munge -n | ssh tg017 unmunge
STATUS:           Success (0)
ENCODE_HOST:      bqs3-1814 (10.181.4.20)
ENCODE_TIME:      2018-12-18 17:30:50 +0000 (1545154250)
DECODE_TIME:      2018-12-18 17:30:53 +0000 (1545154253)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha1 (3)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0

[root@bqs3 slurm]# su - slurm
[slurm@bqs3 ~]$ munge -n | ssh tg017 unmunge
STATUS:           Success (0)
ENCODE_HOST:      bqs3-1814 (10.181.4.20)
ENCODE_TIME:      2018-12-18 17:31:19 +0000 (1545154279)
DECODE_TIME:      2018-12-18 17:31:21 +0000 (1545154281)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha1 (3)
ZIP:              none (0)
UID:              slurm (2035)
GID:              slurm (517)
LENGTH:           0


When you stated the slurmctld performs a "REQUEST_PING" I wrongly assumed it was using UDP in the process.  Thank you for clarifying that it's a TCP process, not UDP.  Since it's TCP, there's all the more reason it should respond to a request.

So we still need your guidance on this issue.  Without information to go on, we're in the dark as to how to even troubleshoot this issue. Perhaps if we can "see" the debug messages in the node's syslogs, it might provide some clues as to how to proceed.

Thanks,

Tony.
Comment 33 Anthony DelSorbo 2018-12-18 10:34:14 MST
Created attachment 8692 [details]
Theia slurm.conf
Comment 34 Anthony DelSorbo 2018-12-18 10:36:14 MST
Created attachment 8693 [details]
Theia topology.conf
Comment 35 Anthony DelSorbo 2018-12-18 12:34:18 MST
Marshall,

Disregard the note in reference to no messages in the syslogs.  While it's true that they're not reaching the /var/log/messages file, the syslogs are being routed to the centralized log server.


A colleague of mine just made me aware of these entries in the syslogs:

Dec 18 17:28:18 tg017 slurmd[33506]: error: slurm_receive_msg_and_forward: Insane message length
Dec 18 17:28:18 tg017 slurmd[33506]: error: service_connection: slurm_receive_msg: Insane message length
Dec 18 17:30:01 tg017 slurmd[33506]: error: slurm_receive_msg_and_forward: Socket timed out on send/recv operation
Dec 18 17:30:01 tg017 slurmd[33506]: error: service_connection: slurm_receive_msg: Socket timed out on send/recv operation

What do you make of that?

Thanks,

Tony.
Comment 37 Marshall Garey 2018-12-18 14:55:09 MST
Thanks, that's very helpful. In particular, the "Insane message length" error message is interesting - it probably indicates a corrupted TCP header. We've seen it caused by a mix of slurmd versions. Are all of your Slurm binaries the same version? (slurmctld, slurmdbd, slurmd, slurmstepd, and client commands.) Have you ever switched Slurm versions on this cluster?

A more complete slurmd log file would be helpful. Can you upload an hour of a slurmd log file, starting with the startup of the slurmd and including a time when the slurmd doesn't respond? I'd like to see "Insane message length" in that log file, and the context surrounding it. I'd like it at debug2; debug3 and up usually aren't helpful and make the log file much more verbose and difficult to parse.
Comment 38 Anthony DelSorbo 2018-12-19 10:46:43 MST
Marshall, 

All nodes and controllers are running the same version of slurm:

/apps/slurm/18.08.3.p2/sbin/slurmd
/apps/slurm/18.08.3.p2/sbin/slurmctld

I just restarted slurmd on tg017 and several other nodes that were in the down state.  At this point in time, all is up.  We'll see what happens in the next hour.

Wed Dec 19 17:45:08 UTC 2018
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
theia*       up    8:00:00   1184   idle t[0001-1012,1027-1198]
shared       up    8:00:00   1184   idle t[0001-1012,1027-1198]
service      up 1-00:00:00     10   idle tfe[01-10]
bigmem       up    8:00:00     14   idle tb[01-14]
fge          up    8:00:00    100   idle tg[001-100]


All nodes (except controller) are running at debug5.  

Tony.
Comment 39 Anthony DelSorbo 2018-12-19 17:53:51 MST
Created attachment 8723 [details]
Node tg002 syslog contains slurmd logs
Comment 40 Anthony DelSorbo 2018-12-19 17:58:22 MST
Created attachment 8724 [details]
slurmctld log
Comment 41 Anthony DelSorbo 2018-12-19 18:14:08 MST
Marshall,

I attached two more files to the case. One is the slurmctld log and the other is the syslog for one of the nodes that went offline.  It wasn't long after I sent the last message that it started showing tg002 and tg003 going into a state of idle* or down*, then finally down, as they are now:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
theia*       up    8:00:00   1184   idle t[0001-1012,1027-1198]
shared       up    8:00:00   1184   idle t[0001-1012,1027-1198]
service      up 1-00:00:00     10   idle tfe[01-10]
bigmem       up    8:00:00     14   idle tb[01-14]
fge          up    8:00:00     98   idle tg[001,004-100]
fge          up    8:00:00      2   down tg[002-003]

This time tg017 and tg023 stayed up - as you can see.

I can't make heads or tails out of the logs.  Let me know if any of it makes sense.

Tony.
Comment 44 Marshall Garey 2018-12-20 15:23:35 MST
Thanks, those logs are helpful. I have a few observations and questions:

I can see that the slurmd does receive and respond to both ping and registration requests a lot of the time. When the slurmctld doesn't get a response from the slurmd, the slurmd never even received the RPC from  the slurmctld. Maybe there is sometimes a problem with message routing. Do you have multiple network interfaces (for example, both ethernet and IB)?


To clarify, it's just the GPU nodes that are having problems, correct? And this happens whether or not you run any jobs with Slurm? Are there things running on the GPU nodes that aren't running on the other nodes? Maybe the GPU nodes are too busy to even receive or respond to messages from the slurmctld. I noticed from the syslog file that nvidia.util.sh runs about once per minute. I also noticed a few logs with the label "pbs_mom." What's that about?

Does each node log to a local filesystem? Or to a shared filesystem? I also recommend turning SlurmdDebug down from debug5 to debug2. debug3 and up isn't helpful for debugging, and will just make slurmd and all the slurmstepds slower. I actually recommend setting SlurmdDebug to debug once this issue is resolved.


Finally, have you investigated the possibility of bad routing tables to these GPU nodes?
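One quick way to spot-check that is to ask the kernel's routing table directly. A sketch; the example IP is one from the earlier "Connection refused" log lines.

```shell
#!/bin/sh
# Sketch: print the route the kernel would use to reach a given address
# (device, source address, and any gateway).
route_to() {
    ip route get "$1"
}

# Run from the slurmctld host against a flapping node's IP:
# route_to 10.183.19.221
```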
Comment 47 Anthony DelSorbo 2018-12-21 06:23:51 MST
Marshall,

"... Maybe there is sometimes a problem with message routing..." 

Perhaps, but how do we troubleshoot that? It would be good to know when the Slurmd responds and to where it is sending its response.  

"Do you have multiple network interfaces (for example, both ethernet and IB)?"

Yes - this is an HPC system and has management and IB interfaces.  Management network is 


"To clarify, it's just the GPU nodes that are having problems, correct? And this happens whether or not you run any jobs with Slurm?"

That's correct.  There are no Slurm jobs running on these nodes at this time.

"Are there things running on the GPU nodes that aren't running on the other nodes?"

Yes - This is a Mellanox network, which has the Mellanox OFED/IB stack.  The 

"Maybe the GPU nodes are too busy to even receive or respond to messages from the slurmctld."  

Doubtful.  This issue occurs whether or not there is any activity on the nodes.  

"I noticed from the syslog file that nvidia.util.sh runs about once per minute."

That utility produces minimal traffic and simply logs node statistics.  If Slurm is very sensitive to this type of activity, we're going to have enormous issues once real jobs are running on these systems.

"I also noticed a few logs with the label "pbs_mom." What's that about?"

This is a production system.  ALL nodes at this time are running both torque and slurm.  Jobs are only being processed by torque at this time using the Moab scheduler.  No jobs are being submitted to these nodes via Slurm yet.  Too many issues to overcome - some political.

"Does each node log to a local filesystem? Or to a shared filesystem?"

Logs are sent to a syslog server.  The logs are rather sparse - this is minimal traffic.  I see even more traffic from the compute nodes, since they are continuously running jobs and job info is tracked regularly.


"I also recommend turning slurmddebug to debug2 down from debug5. debug3 on up isn't helpful for debugging, and will just make slurmd and all the slurmstepd's slower. I actually recommend setting slurmddebug to debug once this issue is resolved."

Sorry about that - I misread your comment 37 as asking for debug3 and up, so I set it to debug5.  Even so, the system has actually been more stable over the past few days, with only those two nodes being out.

fge          up    8:00:00     98   idle tg[001,004-100]
fge          up    8:00:00      2   down tg[002-003]

"Finally, have you investigated the possibility of bad routing tables to these GPU nodes?"

Define "bad."  There are no routers on the IB network between the ctld (10.183.0.170) and the nodes 10.183.X.Y as it is flat. 

default via 10.182.0.199 dev eth0 
10.181.1.0/24 proto static 
	nexthop via 10.183.0.11 dev ib0 weight 1 
	nexthop via 10.183.0.12 dev ib0 weight 1 
	nexthop via 10.183.0.13 dev ib0 weight 1 
	nexthop via 10.183.0.14 dev ib0 weight 1 
10.181.4.0/24 proto static 
	nexthop via 10.183.0.11 dev ib0 weight 1 
	nexthop via 10.183.0.12 dev ib0 weight 1 
	nexthop via 10.183.0.13 dev ib0 weight 1 
	nexthop via 10.183.0.14 dev ib0 weight 1 
10.181.8.0/22 via 10.183.0.156 dev ib0 advmss 2004 
10.181.12.0/22 via 10.183.0.157 dev ib0 advmss 2004 
10.182.0.0/16 dev eth0 proto kernel scope link src 10.182.35.44 
10.183.0.0/16 dev ib0 proto kernel scope link src 10.183.35.44 

So, how can we better understand what is happening between the slurmd and the ctld for these events?  I need to know whether the packet is being received or not - and if not, why not.  Especially since we're using TCP connections between the two, there should be some sort of error message if the connection is lost.
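One way to test the TCP path itself (rather than ICMP, which pings exercise) is to probe the Slurm daemon ports directly from both directions. A minimal sketch is below; the default ports 6817 (slurmctld) and 6818 (slurmd) and the host names are assumptions - check SlurmctldPort/SlurmdPort in slurm.conf before relying on it:

```python
import socket

def tcp_probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical usage - run from the controller toward a GPU node, and
# from the node back toward the controller:
#   tcp_probe("tg002", 6818)        # slurmd port on the node
#   tcp_probe("10.183.0.170", 6817) # slurmctld port on bqs3
```

Running this in a loop while a node is marked down would show whether the TCP connect itself intermittently fails, which a successful flood ping would not reveal.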

If there's something awry with the network, I should see pings drop appreciably.  I can run the following all day long without a problem.

[root@tg002 ~]# ping 10.183.0.170
PING 10.183.0.170 (10.183.0.170) 56(84) bytes of data.
64 bytes from 10.183.0.170: icmp_seq=1 ttl=64 time=0.111 ms
64 bytes from 10.183.0.170: icmp_seq=2 ttl=64 time=0.157 ms
64 bytes from 10.183.0.170: icmp_seq=3 ttl=64 time=0.207 ms
64 bytes from 10.183.0.170: icmp_seq=4 ttl=64 time=0.097 ms

I also performed a flood ping from tg002 to bqs3 with no loss over 23 seconds (I could go longer, but it's a production system).  If there were issues, I should see some packet loss.

[root@tg002 ~]# ping -f bqs3
PING bqs3 (10.183.0.170) 56(84) bytes of data.
.^ 
--- bqs3 ping statistics ---
43355 packets transmitted, 43355 received, 0% packet loss, time 23908ms
rtt min/avg/max/mdev = 0.013/0.058/2.425/0.108 ms, ipg/ewma 0.551/0.095 ms

At this point I just don't understand why tg002 and tg003 are being shown as consistently down.  slurmd is up on these nodes and they've all been "RESUME'ed" since a few days ago. All nodes have been up for nearly 17 days.  Yes they are busy, but so are 40+ other nodes.

[root@bqs3 slurm]# pdsh -w tg[002-004] 'uptime;ps -aux | egrep "PID|slurm" | grep -v grep' | dshbak -c
----------------
tg002
----------------
 13:19:53 up 16 days, 17:56,  0 users,  load average: 20.32, 20.33, 20.42
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      31744  0.0  0.0 3490288 7400 ?        S    Dec06   0:04 /apps/slurm/18.08.3.p2/sbin/slurmd
----------------
tg003
----------------
 13:19:52 up 16 days, 17:56,  0 users,  load average: 20.25, 20.43, 20.51
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      24594  0.0  0.0 3359216 7296 ?        S    Dec06   0:04 /apps/slurm/18.08.3.p2/sbin/slurmd
----------------
tg004
----------------
 13:19:53 up 16 days, 17:56,  0 users,  load average: 20.43, 20.46, 20.64
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      37271  0.0  0.0 3490288 7364 ?        S    Dec06   0:07 /apps/slurm/18.08.3.p2/sbin/slurmd


My biggest frustration at this time is that I don't get enough information in the logs that tells me what's wrong.  What else can we do?

Thanks,

Tony.
Comment 48 Anthony DelSorbo 2018-12-21 08:02:01 MST
Pushed out another RESUME @ 1455 UTC.  All nodes are back in now.  Will let you know how long that lasts.

[root@bqs3 slurm]# date;scontrol update NodeName=t[0039,0053,0291,0326,0372,0434,0572,0574,0576,0621,0900,0939,0973,1166],tg[002-003] State=RESUME
Fri Dec 21 14:55:04 UTC 2018
Comment 49 Marshall Garey 2018-12-21 10:05:46 MST
Thanks, that helps eliminate some things. The nvidia utility is totally fine - I wanted to check if the nodes had close to 100% CPU or were tied up in some sort of I/O. It sounds like this is not the case. I also think that routing tables are not the issue.

The logs seem to indicate that the nodes occasionally miss these ping/registration messages for a few minutes at a time. Maybe there is occasional network contention on these GPU nodes, leading to TCP timeouts and packet retransmissions, so the ping/registration messages time out. Try increasing MessageTimeout in slurm.conf to 30 or even 60 seconds to give the ping/registration messages more time to go through. (The default for MessageTimeout is 10 seconds, which is quite short.) You can just set MessageTimeout in slurm.conf on all the nodes and issue scontrol reconfigure - no need to restart the daemons.
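For reference, the change is a one-line edit; 60 seconds is shown here only as an example value:

```
# slurm.conf - must be identical on the controller and all nodes
MessageTimeout=60
```

Then push it out with `scontrol reconfigure` from the controller; no daemon restart is needed for this parameter.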

These kinds of issues are very difficult to debug, so thanks for your patience.
Comment 54 Marshall Garey 2019-02-01 09:18:28 MST
I realize with the government shutdown things may have been crazy over there. I wanted to follow up on how increasing MessageTimeout worked for the site.
Comment 55 Marshall Garey 2019-02-07 15:08:38 MST
I'm closing this as timed out. Whenever you respond, the ticket will automatically re-open.