| Summary: | GPU nodes state unstable | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Anthony DelSorbo <anthony.delsorbo> |
| Component: | slurmctld | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | ||
| Version: | 18.08.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NOAA | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | NESCC | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | Theia | CLE Version: | |
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | bqs3 slurmctld.log file; slurm.conf file; bqs3 sdiag output; bqs3 slurmctld.log reduced output; slurmd extract from syslog for node tg001; sdiag output; sinfo; slurmctld log since restart yesterday; Theia slurm.conf; Theia topology.conf; node tg002 syslog containing slurmd logs; slurmctld log |
Created attachment 7944 [details]
slurm.conf file
Hi Anthony,

Thanks for the slurmctld log file and slurm.conf; I'm reviewing those. I agree that it looks like a communication problem. Can you also upload a slurmd log file from one of the afflicted nodes? - Marshall

Can you also try setting SlurmctldDebug=debug2? debug3 is extremely verbose and usually doesn't help us much, and lowering it might help with responsiveness in the slurmctld. In fact, many sites run at verbose or debug. For now, I think it will be useful to set it to debug2. Can you also upload the output of the sdiag command?

Created attachment 7959 [details]
bqs3 sdiag output
Created attachment 7960 [details]
bqs3 slurmctld.log reduced output
After sending out an scontrol reconfigure (around 20:43 UTC) I pruned this log to remove most of the outdated info (you have a good chunk of it in the upload I provided yesterday).
Created attachment 7961 [details]
slurmd extract from syslog for node tg001
This data is not from a slurmd.log, as that's not currently enabled. However, I do have the same data (?) going to the syslog (see slurm.conf). So I grepped only the slurmd lines from tg001's syslog output. I'm hopeful this is what you're looking for.
Any chance there's a daemon performance test we can run between the controller and the node(s) in question?

I'm not sure if there's such a test, but I'll look into it more. Can you ping the nodes? Are all the problem nodes on the same rack or close together physically? Have you checked into the possibility of a network problem? Have you investigated a potential firewall issue? (I doubt this is it, but it's an idea.) I've reduced the priority of this ticket to a minor issue until we resolve it.

We use IPoIB for communicating with the nodes. We've been using datagram mode to date, and we have seen extremely poor performance since our most recent maintenance period, which happened to coincide with the first Slurm installation on this production system. Strangely, this doesn't seem to be affecting the existing Moab/Torque installation. Today, we detected a kernel issue as identified in https://access.redhat.com/solutions/3568891 which is preventing us from switching to connected mode, where we expect to significantly increase the system's IPoIB performance. We suspect the current kernel is also affecting the stability of datagram mode, but have no real proof. So, we've asked the customer for an emergency maintenance session to back-rev the kernel. It's possible we'll be able to implement such a change next Thursday, 11 Oct. At this time we ask you for any information you may have with respect to IPoIB performance in datagram vs. connected mode as it relates to Slurm. Further, if you have any additional tuning suggestions we should employ, other than those found in the "High Throughput Computing Administration Guide" (none of which seemed to help in our case), we would appreciate hearing about them. Thanks, Tony.

(In reply to Anthony DelSorbo from comment #9)
> At this time we ask of you for any information you may have with respect to
> IpoIB performance in datagram vs connected mode as it relates to Slurm.

I need to research this more.
I'll get back to you.

> Further, if you have any additional tuning efforts we should employ other
> than those found in the "High Throughput Computing Administration Guide"
> (none of which seemed to help in our case), we would appreciate hearing
> about them.

We absolutely have tuning suggestions. However, many tuning suggestions are highly dependent on the workload. Will they be running mostly short jobs? Small jobs? Longer jobs? Bigger jobs? A mix of everything? Whenever you run performance tests would be a good time to get information from sdiag: run sdiag periodically (once or twice an hour) for a day or two during the tests, and we can pursue more specific tuning options, particularly for SchedulerParameters.

Here are a few points of general advice, though I leave it up to you to decide what you want to or can incorporate. This is by no means comprehensive, so for more specific tuning advice we'd need to see sdiag output and logs during performance tests.

- StateSaveLocation should be on a fast filesystem, since slurmctld writes to and reads from it frequently.
- Even if you aren't running a high-throughput environment, CommitDelay=1 in slurmdbd.conf will still help when a large number of job completion messages go from the slurmctld to the slurmdbd at the same time. MaxQueryTimeRange in slurmdbd.conf is also useful.
- Slurmd log files should ideally be on a fast local filesystem rather than a shared filesystem.
- We recommend the debug level for all daemons (slurmctld, slurmd, slurmdbd) be set to debug2 or lower; usually debug is fine for slurmctld and gives us plenty of information. debug3 is extremely verbose and rarely gives us helpful debug information.
- Optimize your various prolog and epilog scripts and your job submit plugin; basically, avoid large queries when smaller, more specific ones will work.
- Ensure that all jobs have a time limit, ideally time limits of varying lengths.

If all jobs have the same time limit, or no time limit, the backfill scheduler won't be able to backfill jobs. Setting a short DefaultTime in partition definitions can encourage users to set their own time limits. You could also use the job submit plugin to reject jobs without a time limit explicitly set. You don't have to do either of these things, but however you do it, time limits of varying lengths are important for the backfill scheduler to do its job.

(In reply to Anthony DelSorbo from comment #9)
> At this time we ask of you for any information you may have with respect to
> IpoIB performance in datagram vs connected mode as it relates to Slurm.

Whether you use datagram or connected mode is totally independent of Slurm. I'd like to close this ticket until the network problems are resolved. Once they are, if you're still seeing the same issues with node state, we can re-open the ticket. Is that okay with you? For further tuning guidance, I suggest you open a new ticket after you've run performance tests (or just normal daily work) and have the sdiag results and the slurmctld log file from that time.

Agreed. Please close the ticket. We will re-open if it turns out to be a different issue. Tony.

Sounds good. Closing as resolved/infogiven for now.

I'd like to reopen this case: we've considerably improved the throughput from the nodes, and still these particular nodes keep changing states. How can we determine what the slurmctld is detecting that's forcing these nodes to change state? Perhaps if we start there, we can backtrack to the root cause.

Hi Tony, Just to clarify, those nodes are still oscillating between idle and down, correct? Can you upload sinfo output like you did in the original post? If a node shows down* (with an asterisk), that means the slurmctld couldn't contact the node for SlurmdTimeout seconds (default 300) and marked it down. Have you made any changes to the slurm.conf for that cluster?
If so, can you re-upload it? Can you also upload a slurmctld log file? Just enough time to show the state of the nodes oscillating - if that's just 10 minutes, then that's great. Thanks - Marshall

I'll attach the data you asked for, and here's the additional data that was asked for at the meeting today. The driver version is MLNX_OFED_LINUX-4.4-2.0.7.0-rhel7.5-x86_64 on our GPU nodes with the Mellanox cards. And the info for the Connect-IB driver:

    filename:    /lib/modules/3.10.0-862.14.4.el7.x86_64/weak-updates/mlnx-ofa_kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
    license:     Dual BSD/GPL
    description: Mellanox Connect-IB HCA IB driver
    author:      Eli Cohen <eli@mellanox.com>
    retpoline:   Y
    rhelversion: 7.5
    srcversion:  7908D47E05F39523F10BB57
    depends:     mlx5_core,ib_core,mlx_compat
    vermagic:    3.10.0-862.11.6.el7.x86_64 SMP mod_unload modversions
    parm:        dc_cnak_qp_depth:DC CNAK QP depth (uint)

Here's the info on the Mellanox Connect-IB driver on the controller node:

    [root@bqs5 ~]# modinfo mlx5_ib
    filename:    /lib/modules/2.6.32-754.6.3.el6.x86_64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko
    version:     2.2-1
    license:     Dual BSD/GPL
    description: Mellanox Connect-IB HCA IB driver
    author:      Eli Cohen <eli@mellanox.com>
    retpoline:   Y
    srcversion:  FA83AF4451243374C1940EC
    depends:     mlx5_core,ib_core
    vermagic:    2.6.32-754.6.3.el6.x86_64 SMP mod_unload modversions
    parm:        prof_sel:profile selector. Deprecated here. Moved to module mlx5_core (int)

Created attachment 8249 [details]
sdiag output
Created attachment 8250 [details]
sinfo
Created attachment 8251 [details]
slurmctld log since restart yesterday.
Thanks - the driver/firmware versions look like they should be compatible according to Mellanox: http://www.mellanox.com/page/mlnx_ofed_matrix?mtag=linux_sw_drivers

It looks like there may be more than one problem here. First, there are lots of munge errors:

    [2018-11-07T00:12:15.814] debug: Munge encode failed: Unable to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
    [2018-11-07T00:12:17.817] error: If munged is up, restart with --num-threads=10
    [2018-11-07T00:12:17.817] error: Munge encode failed: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
    [2018-11-07T00:12:17.817] error: authentication: Munged communication error
    [2018-11-07T00:12:25.964] debug: Munge decode failed: Unable to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
    [2018-11-07T00:12:25.966] error: If munged is up, restart with --num-threads=10
    [2018-11-07T00:12:25.966] error: Munge decode failed: Unable to access "/var/run/munge/munge.socket.2": No such file or directory
    [2018-11-07T00:12:25.966] error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Munged communication error

Can you verify that the directory exists, that the file exists, and that it is owned by user munge? Can you also restart the munge service with --num-threads=10 if you haven't already done that?

Next, it looks like there are indeed still communication issues, indicated in sinfo by the asterisk next to the unresponsive nodes.
Also indicated by log messages such as these:

    [2018-11-07T09:54:13.213] debug2: Tree head got back 26
    [2018-11-07T09:54:13.214] debug2: Tree head got back 27
    [2018-11-07T09:54:13.214] debug2: slurm_connect failed: Connection refused
    [2018-11-07T09:54:13.214] debug2: Error connecting slurm stream socket at 10.183.19.221:6818: Connection refused
    [2018-11-07T16:43:09.060] error: Nodes tg[001,003-010,012,017,056,060-063] not responding
    [2018-11-07T16:43:09.060] debug2: slurm_connect failed: Connection refused
    [2018-11-07T16:43:09.060] debug2: Error connecting slurm stream socket at 10.183.19.217:6818: Connection refused

I noticed that the slurmctld log shows it loaded the topology/tree plugin, but the slurm.conf shows topology/tree commented out. Can you upload the slurm.conf file that matches this log, and your topology.conf file?

Can you verify that you can ping each node from the slurmctld, and that you can connect to the slurmd port (with something like telnet <ipaddress> <port>)? Also, that you can connect to the switches in topology.conf, and that each switch in topology.conf can reach the leaf nodes below it? traceroute <bad node> could be useful to verify that the route from the slurmctld to the compute nodes works and that topology.conf is correct. Also, can you upload a slurmd log file from the same time period as this slurmctld log file? I know this is a lot of requests. Thanks for your patience in debugging this. - Marshall

Hi, I'm following up on comment 22.

Hi, I'm just following up again. I realize that you all might have been busy because of Thanksgiving, so I'll keep this open for another week before timing it out.

I'm closing this ticket as timed out for now. Please re-open when you have the requested information.

Marshall, Sorry for the long delay in response. We've been working a lot of issues and details - some have fallen off the radar. This one has been a real pain. We're not sure what's causing this issue.
Our network performance is quite good between these nodes. It does seem to be some sort of timeout, as these nodes continue to switch between idle, idle*, and down* at random:

    fge up 8:00:00 100 idle tg[001-100]

A few minutes later:

    fge up 8:00:00 8 idle* tg[023,056-062]
    fge up 8:00:00 92 idle tg[001-022,024-055,063-100]

Then a few minutes after that:

    fge up 8:00:00 18 idle* tg[002,025-039,041-042]
    fge up 8:00:00 6 down* tg[001,011,023,056-058]
    fge up 8:00:00 76 idle tg[003-010,012-022,024,040,043-055,059-100]

We've changed the TCPTimeout value to 60 from its default of 2. This didn't make any difference, other than the initial state changes not being seen until later; the above info is with the setting at 60. So, at this point we need help in understanding what exactly isn't communicating with what. What's the process between the slurmd and slurmctld I should be monitoring to determine where it's failing? To be clear, this occurs whether or not there is (non-Slurm) work being performed on those nodes. This is a high-impact item now: without a resolution to this issue we cannot put Slurm in production. Thanks, Tony.

No worries on the delayed response.
> What's the process between the slurmd and slurmctld I should
> be monitoring to determine where it's failing?
The slurmctld pings the nodes every SlurmdTimeout/3 seconds, or every 100 seconds if SlurmdTimeout is not defined. (ping_nodes() is called by the _slurmctld_background thread.) Most of these pings are simply a REQUEST_PING rpc, in which the slurmd simply responds to the slurmctld. Every 20 pings, the slurmctld will send a REQUEST_NODE_REGISTRATION_STATUS rpc, which involves a bit more back-and-forth communication between the slurmd's and the slurmctld. The routes these messages take between the slurmctld and slurmd's depends on the route plugin defined in slurm.conf. If a node doesn't respond successfully within SlurmdTimeout seconds, the slurmctld marks the node as down.
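One way to see exactly when a node stops answering on that path is to probe its slurmd TCP port from the slurmctld host and timestamp each result. The following is only a rough sketch, not part of the ticket's official debugging steps: it assumes bash with /dev/tcp support and the default SlurmdPort of 6818 seen in the logs above; the 30-second interval in the usage comment is arbitrary.

```shell
#!/bin/bash
# Timestamped probe of a node's slurmd TCP port, to correlate with the
# slurmctld ping cycle. Sketch only -- node names (tg002) and port 6818
# are taken from this ticket; adjust for your site.
watch_slurmd() {
  local node=$1 port=${2:-6818}
  # bash's /dev/tcp pseudo-device attempts a TCP connect; timeout guards
  # against silently dropped packets (as opposed to an immediate refusal).
  if timeout 3 bash -c "exec 3<>/dev/tcp/$node/$port" 2>/dev/null; then
    echo "$(date -u '+%FT%TZ') $node:$port reachable"
  else
    echo "$(date -u '+%FT%TZ') $node:$port NOT reachable"
  fi
}

# Example: probe tg002 every 30 seconds until interrupted:
#   while true; do watch_slurmd tg002; sleep 30; done
```

Logging this alongside the slurmctld log would show whether the node's port stops accepting connections at the same moments the ping/registration RPCs go unanswered.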
The previous log files you uploaded indicated there were communication issues between the slurmctld and certain IP addresses.
Can you use the ping command from the slurmctld to each of the problematic compute nodes? (I'm guessing yes.)
Can you connect to the slurmd port on each of those nodes from the slurmctld and vice versa (from the slurmd's to the slurmctld and between each other) with something like telnet <ipaddress> <port>? If not, can you also verify that a firewall isn't blocking ports?
Can you also use traceroute <bad node> to see the route from the slurmctld to the slurmd?
A current slurm.conf file would also be helpful, as well as the topology.conf file if you're using a topology plugin.
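The checks requested above (ping, slurmd port connect, traceroute) could be batched per node with a small script. This is a sketch under a few assumptions: standard Linux ping/traceroute, bash's /dev/tcp for the port test, and the 6818 slurmd port from this ticket's logs; the script name diagnose.sh is hypothetical.

```shell
#!/bin/bash
# Run basic reachability checks against each node given on the command
# line, e.g.:  ./diagnose.sh tg001 tg002
diagnose_node() {
  local node=$1 port=${2:-6818}
  echo "== $node =="
  # ICMP reachability (2 packets, 2 s timeout each)
  if ping -c 2 -W 2 "$node" >/dev/null 2>&1; then
    echo "ping: ok"
  else
    echo "ping: FAILED"
  fi
  # TCP connect to the slurmd port
  if timeout 3 bash -c "exec 3<>/dev/tcp/$node/$port" 2>/dev/null; then
    echo "slurmd port $port: open"
  else
    echo "slurmd port $port: closed or unreachable"
  fi
  # Short traceroute to confirm the route (errors suppressed if absent)
  traceroute -m 5 "$node" 2>/dev/null
}

for node in "$@"; do
  diagnose_node "$node"
done
```

Running it against the flapping tg nodes and diffing the output over time would narrow down whether the failures are ICMP-level, TCP-level, or specific to the slurmd port.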
Marshall, Thanks for the info. We're working to get additional data for you. In the meantime, is there a way to tell Slurm to use TCP instead of UDP? Tony.

Slurm already uses TCP; there isn't any way to change that. I realize that a lot of people will be gone during the holiday season, and it may take some time to do the tests, so I'm closing this as timed out for now. It will automatically re-open when you reply. We can't see anything in Slurm that would cause the behavior you're seeing; it still looks like a network problem. Please run the tests to debug the network and make sure it isn't a network problem. If it isn't, we can take a look at the results when you re-open the ticket.

Marshall, I'm including the slurm.conf and topology.conf files you requested. We're still troubleshooting this issue. While troubleshooting one of the nodes (see below, for example), I noted that none of them are sending "debug5"-level output to the logs. The node shown below has even had its slurmd restarted. As you can see, only basic information is shown.
    Dec 18 15:45:53 tg017 slurmd[33502]: Message aggregation disabled
    Dec 18 15:45:53 tg017 slurmd[33502]: topology tree plugin loaded
    Dec 18 15:45:53 tg017 slurmd[33502]: Warning: Note very large processing time from slurm_topo_build_config: usec=106532 began=15:45:53.108
    Dec 18 15:45:53 tg017 slurmd[33502]: route default plugin loaded
    Dec 18 15:45:53 tg017 slurmd[33502]: Resource spec: system cgroup memory limit set to 2000 MB
    Dec 18 15:45:53 tg017 slurmd[33502]: task affinity plugin loaded with CPU mask 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000080000
    Dec 18 15:45:53 tg017 slurmd[33502]: Munge cryptographic signature plugin loaded
    Dec 18 15:45:53 tg017 slurmd[33506]: slurmd version 18.08.3 started
    Dec 18 15:45:53 tg017 slurmd[33506]: slurmd started on Tue, 18 Dec 2018 15:45:53 +0000
    Dec 18 15:45:53 tg017 slurmd[33506]: CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=257846 TmpDisk=128923 Uptime=564255 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

After tailing the syslogs for an extended period of time, we see no additional slurmd debug information. I would have expected at least some information about why the node goes on or off line. At the very least, one would think the slurmd on that node would indicate that it hasn't been contacted by the slurmctld, or even acknowledge that it's in debug5 mode. Also, there is not one entry in /var/log/slurm/slurmd.log. Of course, we'd rather the entries go directly to the syslogs.
    [root@tg017 ~]# ll /var/log/slurm/
    total 0
    -rw-r--r-- 1 slurm slurm 0 Dec 12 03:04 slurmd.log

    Tue Dec 18 17:06:58 UTC 2018
    PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
    fge up 8:00:00 1 down* tg017
    fge up 8:00:00 98 idle tg[001-016,018-022,024-100]
    fge up 8:00:00 1 down tg023

To answer some of your other questions: Yes, we can ping the nodes.

    [root@tg017 ~]# ping bqs3
    PING bqs3 (10.183.0.170) 56(84) bytes of data.
    64 bytes from bqs3 (10.183.0.170): icmp_seq=1 ttl=64 time=0.141 ms
    64 bytes from bqs3 (10.183.0.170): icmp_seq=2 ttl=64 time=0.144 ms
    ...
    [root@bqs3 slurm]# ping tg017
    PING tg017 (10.183.35.59) 56(84) bytes of data.
    64 bytes from tg017 (10.183.35.59): icmp_seq=1 ttl=64 time=0.161 ms
    64 bytes from tg017 (10.183.35.59): icmp_seq=2 ttl=64 time=0.189 ms
    ...

Yes, traceroute works just fine and is not routed through any router; it's a single hop from an IP standpoint. There is a hierarchy of InfiniBand switches as it traverses the IB tree.

    [root@bqs3 slurm]# traceroute tg017
    traceroute to tg017 (10.183.35.59), 30 hops max, 60 byte packets
     1 tg017 (10.183.35.59) 0.119 ms 0.108 ms 0.084 ms
    [root@tg017 ~]# traceroute bqs3
    traceroute to bqs3 (10.183.0.170), 30 hops max, 60 byte packets
     1 bqs3 (10.183.0.170) 0.155 ms 0.111 ms 0.109 ms

We have no firewalls between nodes. Can I telnet to the host? Yes:

    [root@bqs3 slurm]# telnet 10.183.35.59 6818
    Trying 10.183.35.59...
    Connected to 10.183.35.59.
    Escape character is '^]'.
And, just to confirm munge:

    [root@bqs3 slurm]# munge -n | ssh tg017 unmunge
    STATUS: Success (0)
    ENCODE_HOST: bqs3-1814 (10.181.4.20)
    ENCODE_TIME: 2018-12-18 17:30:50 +0000 (1545154250)
    DECODE_TIME: 2018-12-18 17:30:53 +0000 (1545154253)
    TTL: 300
    CIPHER: aes128 (4)
    MAC: sha1 (3)
    ZIP: none (0)
    UID: root (0)
    GID: root (0)
    LENGTH: 0

    [root@bqs3 slurm]# su - slurm
    [slurm@bqs3 ~]$ munge -n | ssh tg017 unmunge
    STATUS: Success (0)
    ENCODE_HOST: bqs3-1814 (10.181.4.20)
    ENCODE_TIME: 2018-12-18 17:31:19 +0000 (1545154279)
    DECODE_TIME: 2018-12-18 17:31:21 +0000 (1545154281)
    TTL: 300
    CIPHER: aes128 (4)
    MAC: sha1 (3)
    ZIP: none (0)
    UID: slurm (2035)
    GID: slurm (517)
    LENGTH: 0

When you stated that the slurmctld performs a REQUEST_PING, I wrongly assumed it was using UDP. Thank you for clarifying that it's a TCP process, not UDP. Since it's TCP, there's all the more reason it should respond to a request.

So we still need your guidance on this issue. Without information to go on, we're in the dark as to how to even troubleshoot it. Perhaps if we could "see" the debug messages in the node's syslogs, they might provide some clues as to how to proceed. Thanks, Tony.
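Given the earlier "Unable to access /var/run/munge/munge.socket.2" errors in the slurmctld log, a quick per-node check of the munge socket could also help rule that path out. This is only a sketch: the socket path is the one from the log messages, and would need adjusting if MUNGE is configured with a different socket location.

```shell
#!/bin/bash
# Report whether the munge socket exists and, if so, who owns it.
# Path defaults to the one from the slurmctld errors in this ticket.
check_munge_socket() {
  local sock=${1:-/var/run/munge/munge.socket.2}
  if [ -S "$sock" ]; then
    echo "present: $sock (owner: $(stat -c %U "$sock"))"
  else
    echo "missing: $sock"
  fi
}

check_munge_socket "$@"
```

Run on every node (for example via pdsh), it would quickly show whether any node is intermittently losing its munge socket.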
Theia slurm.conf
Created attachment 8693 [details]
Theia topology.conf
Marshall, Disregard the note about no messages in the syslogs. While it's true that they're not reaching the /var/log/messages file, the syslogs are being routed to the centralized log server. A colleague of mine just made me aware of these entries in the syslogs:

    Dec 18 17:28:18 tg017 slurmd[33506]: error: slurm_receive_msg_and_forward: Insane message length
    Dec 18 17:28:18 tg017 slurmd[33506]: error: service_connection: slurm_receive_msg: Insane message length
    Dec 18 17:30:01 tg017 slurmd[33506]: error: slurm_receive_msg_and_forward: Socket timed out on send/recv operation
    Dec 18 17:30:01 tg017 slurmd[33506]: error: service_connection: slurm_receive_msg: Socket timed out on send/recv operation

What do you make of that? Thanks, Tony.

Thanks, that's very helpful. In particular, the "Insane message length" error message is interesting - it probably indicates a corrupted TCP header. We've seen it when there is a mix of slurmd versions. Are all of your Slurm binaries the same version (slurmctld, slurmdbd, slurmd, slurmstepd, and the client commands)? Have you ever switched Slurm versions on this cluster? A more complete slurmd log file would be helpful. Can you upload an hour of a slurmd log file, starting with slurmd startup and covering a time when the slurmd doesn't respond? I'd like to see "Insane message length" in that log file, and the context surrounding it. I'd like it at debug2; debug3 and up usually isn't helpful and makes the log file much more verbose and difficult to parse.

Marshall, All nodes and controllers are running the same version of Slurm:

    /apps/slurm/18.08.3.p2/sbin/slurmd
    /apps/slurm/18.08.3.p2/sbin/slurmctld

I just restarted slurmd on tg017 and several other nodes that were in the down state. At this point in time, all is up. We'll see what happens in the next hour.
    Wed Dec 19 17:45:08 UTC 2018
    PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
    theia* up 8:00:00 1184 idle t[0001-1012,1027-1198]
    shared up 8:00:00 1184 idle t[0001-1012,1027-1198]
    service up 1-00:00:00 10 idle tfe[01-10]
    bigmem up 8:00:00 14 idle tb[01-14]
    fge up 8:00:00 100 idle tg[001-100]

All nodes (except the controller) are running at debug5. Tony.

Created attachment 8723 [details]
Node tg002 syslog contains slurmd logs
Created attachment 8724 [details]
slurmctld log
Marshall, I attached two more files to the case. One is the slurmctld log and the other the syslog for one of the nodes that went offline. Not long after I sent the last message, tg002 and tg003 started going into idle* or down*, and finally down, as they are now:

    PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
    theia* up 8:00:00 1184 idle t[0001-1012,1027-1198]
    shared up 8:00:00 1184 idle t[0001-1012,1027-1198]
    service up 1-00:00:00 10 idle tfe[01-10]
    bigmem up 8:00:00 14 idle tb[01-14]
    fge up 8:00:00 98 idle tg[001,004-100]
    fge up 8:00:00 2 down tg[002-003]

This time tg017 and tg023 stayed up, as you can see. I can't make heads or tails out of the logs. Let me know if any of it makes sense. Tony.

Thanks, those logs are helpful. I have a few observations and questions:

- I can see that the slurmd does receive and respond to both ping and registration requests a lot of the time. When the slurmctld doesn't get a response from the slurmd, the slurmd never even received the RPC from the slurmctld. Maybe there is sometimes a problem with message routing. Do you have multiple network interfaces (for example, both Ethernet and IB)?
- To clarify, it's just the GPU nodes that are having problems, correct? And this happens whether or not you run any jobs with Slurm?
- Are there things running on the GPU nodes that aren't running on the other nodes? Maybe the GPU nodes are too busy to even receive or respond to messages from the slurmctld. I noticed from the syslog file that nvidia.util.sh runs about once per minute.
- I also noticed a few logs with the label "pbs_mom." What's that about?
- Does each node log to a local filesystem, or to a shared filesystem?

I also recommend turning SlurmdDebug down to debug2 from debug5. debug3 on up isn't helpful for debugging, and will just make slurmd and all the slurmstepds slower. I actually recommend setting SlurmdDebug to debug once this issue is resolved.
Finally, have you investigated the possibility of bad routing tables to these GPU nodes?

Marshall,

> Maybe there is sometimes a problem with message routing.

Perhaps, but how do we troubleshoot that? It would be good to know when the slurmd responds and where it is sending its response.

> Do you have multiple network interfaces (for example, both ethernet and IB)?

Yes - this is an HPC system and has management and IB interfaces. Management network is

> To clarify, it's just the GPU nodes that are having problems, correct? And
> this happens whether or not you run any jobs with Slurm?

That's correct. There are no Slurm jobs running on these nodes at this time.

> Are there things running on the GPU nodes that aren't running on the other nodes?

Yes - this is a Mellanox network, which has the Mellanox OFED/IB stack.

> Maybe the GPU nodes are too busy to even receive or respond to messages from
> the slurmctld.

Doubtful. This issue occurs whether or not there is any activity on the nodes.

> I noticed from the syslog file that nvidia.util.sh runs about once per minute.

That utility produces minimal traffic and simply logs node statistics. If Slurm is very sensitive to this type of activity, we're going to have enormous issues when there are real jobs running on these systems.

> I also noticed a few logs with the label "pbs_mom." What's that about?

This is a production system. ALL nodes at this time are running both Torque and Slurm. Jobs are only being processed by Torque at this time, using the Moab scheduler. No jobs are being submitted to these nodes via Slurm yet - too many issues to overcome, some political.

> Does each node log to a local filesystem? Or to a shared filesystem?

Logs are sent to a syslog server. The logs are rather sparse; this is minimal traffic. I have even more traffic on the compute nodes, since they are continuously running jobs and job info is tracked regularly.

> I also recommend turning slurmddebug to debug2 down from debug5. debug3 on up
> isn't helpful for debugging, and will just make slurmd and all the
> slurmstepd's slower. I actually recommend setting slurmddebug to debug once
> this issue is resolved.

Sorry about that - I misread your comment 37, thinking it said you wanted debug3 and up, so I set it to debug5. Even so, the system has actually been more stable over the past few days, with only those two nodes being out:

    fge up 8:00:00 98 idle tg[001,004-100]
    fge up 8:00:00 2 down tg[002-003]

> Finally, have you investigated the possibility of bad routing tables to
> these GPU nodes?

Define "bad." There are no routers on the IB network between the ctld (10.183.0.170) and the nodes 10.183.X.Y, as it is flat.

    default via 10.182.0.199 dev eth0
    10.181.1.0/24 proto static
        nexthop via 10.183.0.11 dev ib0 weight 1
        nexthop via 10.183.0.12 dev ib0 weight 1
        nexthop via 10.183.0.13 dev ib0 weight 1
        nexthop via 10.183.0.14 dev ib0 weight 1
    10.181.4.0/24 proto static
        nexthop via 10.183.0.11 dev ib0 weight 1
        nexthop via 10.183.0.12 dev ib0 weight 1
        nexthop via 10.183.0.13 dev ib0 weight 1
        nexthop via 10.183.0.14 dev ib0 weight 1
    10.181.8.0/22 via 10.183.0.156 dev ib0 advmss 2004
    10.181.12.0/22 via 10.183.0.157 dev ib0 advmss 2004
    10.182.0.0/16 dev eth0 proto kernel scope link src 10.182.35.44
    10.183.0.0/16 dev ib0 proto kernel scope link src 10.183.35.44

So, how can we better understand what is happening between the slurmd and the ctld during these events? I need to know whether the packet is being received, and if not, why not - especially since we're using TCP connections between the two, there should be some sort of error message if the connection is lost. If there were something awry with the network, I should see pings drop appreciably. I can run the following all day long without a problem:

    [root@tg002 ~]# ping 10.183.0.170
    PING 10.183.0.170 (10.183.0.170) 56(84) bytes of data.
    64 bytes from 10.183.0.170: icmp_seq=1 ttl=64 time=0.111 ms
    64 bytes from 10.183.0.170: icmp_seq=2 ttl=64 time=0.157 ms
    64 bytes from 10.183.0.170: icmp_seq=3 ttl=64 time=0.207 ms
    64 bytes from 10.183.0.170: icmp_seq=4 ttl=64 time=0.097 ms

I also performed a flood ping from tg002 to bqs3 with no loss over 23 seconds (could go longer, but it's a production system). If there were issues, I should see some packet loss.

    [root@tg002 ~]# ping -f bqs3
    PING bqs3 (10.183.0.170) 56(84) bytes of data.
    .^
    --- bqs3 ping statistics ---
    43355 packets transmitted, 43355 received, 0% packet loss, time 23908ms
    rtt min/avg/max/mdev = 0.013/0.058/2.425/0.108 ms, ipg/ewma 0.551/0.095 ms

At this point I just don't understand why tg002 and tg003 are consistently shown as down. slurmd is up on these nodes, and they've all been RESUME'd since a few days ago. All nodes have been up for nearly 17 days. Yes, they are busy, but so are 40+ other nodes.

    [root@bqs3 slurm]# pdsh -w tg[002-004] 'uptime;ps -aux | egrep "PID|slurm" | grep -v grep' | dshbak -c
    ----------------
    tg002
    ----------------
    13:19:53 up 16 days, 17:56, 0 users, load average: 20.32, 20.33, 20.42
    USER PID   %CPU %MEM VSZ     RSS  TTY STAT START TIME COMMAND
    root 31744 0.0  0.0  3490288 7400 ?   S    Dec06 0:04 /apps/slurm/18.08.3.p2/sbin/slurmd
    ----------------
    tg003
    ----------------
    13:19:52 up 16 days, 17:56, 0 users, load average: 20.25, 20.43, 20.51
    USER PID   %CPU %MEM VSZ     RSS  TTY STAT START TIME COMMAND
    root 24594 0.0  0.0  3359216 7296 ?   S    Dec06 0:04 /apps/slurm/18.08.3.p2/sbin/slurmd
    ----------------
    tg004
    ----------------
    13:19:53 up 16 days, 17:56, 0 users, load average: 20.43, 20.46, 20.64
    USER PID   %CPU %MEM VSZ     RSS  TTY STAT START TIME COMMAND
    root 37271 0.0  0.0  3490288 7364 ?   S    Dec06 0:07 /apps/slurm/18.08.3.p2/sbin/slurmd

My biggest frustration at this time is that I don't get enough information in the logs to tell me what's wrong. What else can we do? Thanks, Tony.

Pushed out another RESUME @ 1455 UTC. All nodes are back in now. Will let you know how long that lasts.

    [root@bqs3 slurm]# date; scontrol update NodeName=t[0039,0053,0291,0326,0372,0434,0572,0574,0576,0621,0900,0939,0973,1166],tg[002-003] State=RESUME
    Fri Dec 21 14:55:04 UTC 2018

Thanks, that helps eliminate some things. The nvidia utility is totally fine - I wanted to check whether the nodes were close to 100% CPU or tied up in some sort of I/O, and it sounds like that is not the case. I also think that routing tables are not the issue. The logs seem to indicate that the nodes occasionally miss these ping/registration messages for a few minutes at a time. Maybe there is occasional network contention on these GPU nodes, leading to TCP timeouts and packet retransmissions, so the ping/registration messages time out. Try increasing MessageTimeout in slurm.conf to 30 or even 60 seconds to give the ping/registration messages more time to go through. (The default for MessageTimeout is 10 seconds, which is quite short.) You can just set MessageTimeout in slurm.conf on all the nodes and issue scontrol reconfigure - no need to restart the daemons. These kinds of issues are very difficult to debug, so thanks for your patience.

I realize that with the government shutdown things may have been crazy over there. I wanted to follow up on how increasing MessageTimeout worked for the site. I'm closing this as timed out; whenever you respond, the ticket will automatically re-open.
Created attachment 7943 [details]
bqs3 slurmctld.log file

We have installed Slurm on our production system, Theia, in preparation for production testing. We have a set of nodes, all GPU, assigned to the "fge" partition, with node designation tg[001-100]. These GPU nodes consistently oscillate between the "idle" and "down" states. There is no user work/jobs on these nodes. The state changes occur within minutes of each other. Here is a snippet showing sinfo output over several minutes:

    fge up 8:00:00 67 idle* tg[002,018-027,029-034,036-058,060-068,070-087]
    fge up 8:00:00 33 idle tg[001,003-017,028,035,059,069,088-100]

    fge up 8:00:00 4 down* tg[054-057]
    fge up 8:00:00 54 idle tg[004,008-023,028,035,058-092]
    fge up 8:00:00 42 down tg[001-003,005-007,024-027,029-034,036-053,093-100]

    fge up 8:00:00 21 down* tg[043-046,048-051,053-057,093-100]
    fge up 8:00:00 54 idle tg[004,008-023,028,035,058-092]
    fge up 8:00:00 25 down tg[001-003,005-007,024-027,029-034,036-042,047,052]

When slurmd starts on a node, it reports:

    CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=257846 TmpDisk=128923 Uptime=9763 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

The nodes are defined in slurm.conf as follows:

    NodeName=tg[001-100] CPUs=20 CoresPerSocket=10 RealMemory=256000 ThreadsPerCore=1

The entire system also contains over 1200 other nodes; these are stable. It appears to be a communication issue, as I see entries in the log of the form:

    Error connecting slurm stream socket at <_IP_>:6818: Connection refused

Please assist in isolating this issue. I've included the slurmctld log file and will also upload the slurm.conf file shortly. Thanks, Tony.
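The oscillation described in this ticket can be spotted mechanically: sinfo marks non-responding nodes with a trailing asterisk on the state, so a small filter over sinfo output will flag them for alerting or logging. A sketch only - the `sinfo -h -o "%P %T %N"` format string is an assumption; any format that puts the state in the second column works the same way.

```shell
#!/bin/bash
# Flag node lists whose state ends in '*' (slurmctld considers them
# not responding) from `sinfo -h -o "%P %T %N"` output read on stdin.
flag_unresponsive() {
  awk '$2 ~ /\*$/ { print "not responding:", $3 }'
}

# Example on the cluster:
#   sinfo -h -o "%P %T %N" | flag_unresponsive
```

Run periodically from cron, this would produce a timestamped record of exactly when each node list goes unresponsive, which is easier to correlate with the slurmctld log than manual sinfo snapshots.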