| Summary: | request ping, can't find an address | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Michael Gutteridge <mrg> |
| Component: | slurmctld | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | | |
| Priority: | --- | | |
| Version: | 20.02.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | FHCRC - Fred Hutchinson Cancer Research Center | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave Sites: | --- | Cray Sites: | --- |
| DS9 Clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC Sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | current slurm config | | |

---

What does your network look like between the slurmctld and the slurmd nodes? One flat network, one network with switches, multiple networks (like one IB and one Ethernet), etc.? Just a general description is fine. I'm wondering if there are some weird network shenanigans going on. Is it possible for the same node name to resolve to two different addresses?

I noticed that in slurm.conf you have:

```
# Temporary for upgrade
SlurmdTimeout=1200
```

I assume SlurmdTimeout isn't supposed to be this high for normal use, and we recommend reducing it to something smaller. This should also make any node-resolution problems pop up more often, but hopefully make them easier to debug when they happen.

Also, you seem to have found that adding/removing nodes in Slurm can be tricky. Here's our recommended process (a shell sketch of this sequence appears further down the thread):

* Stop slurmctld.
* Stop all slurmd's.
* Add/remove the nodes to/from slurm.conf.
* Start all slurmd's.
* Start slurmctld.

Do you remember the process used for adding the nodes to slurm.conf? You verified that all daemons were restarted and that slurm.conf is the same on all nodes, so I don't think this will matter, but do you remember in what order the daemons were restarted? Is it easy to restart slurmctld and the slurmd's again, just to double-check whether restarting the daemons fixes the problem?

---

> What does your network look like between the slurmctld and the slurmd nodes?

It's pretty flat: all on the same subnet. These new hosts are connected to a new generation of switches (Cisco 9000 series), but otherwise they're all on the same network.

> Is it possible for the same node name to resolve to two different addresses?

I don't believe so. Everything queries the same DNS servers; I've looked at those servers and can't locate any duplicates or other name issues. We don't use any local host entries.

> I assume that SlurmdTimeout isn't supposed to be this high for normal use, and we recommend reducing it to something smaller.

Yup. We'd increased that for the last maintenance and didn't get it set back down. I'll go ahead and reduce it to a sensible level, which may help in tracking down this problem.

> You verified that all daemons were restarted and slurm.conf is the same on all nodes, so I don't think this will matter, but do you remember in what order the daemons were restarted?

I restarted all of the daemons (one after the other, with about 3 seconds between each node) and then the controller.

> Is it easy to restart slurmctld and the slurmd's again, just to double-check whether restarting the daemons fixes the problem?

Given the other information you provided, perhaps I should take a longer outage so I can shut down the controller and restart the slurmd's at the same time. I'll have to send some notice... I might be able to do that late tonight or early tomorrow morning.

Thanks for the info!

---

Thanks for the responses.

(In reply to Michael Gutteridge from comment #4)
> Given the other information you provided, perhaps I should take a longer
> outage so I can shut down the controller and restart the slurmd's at the
> same time. I'll have to send some notice... I might be able to do that late
> tonight or early tomorrow morning.
>
> Thanks for the info!

I'll wait for you to do this, and we can see how it goes.
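
A minimal shell sketch of the restart sequence recommended above. The systemd unit names, the `pdsh`/`pdcp` fan-out, the config path, and the node list are all assumptions rather than details from this ticket; adapt them to your site.

```sh
# Sketch only -- unit names, tools, paths, and node list are assumptions.
NODES='gizmok[01-64]'   # hypothetical node list

# 1. Stop the controller first.
systemctl stop slurmctld

# 2. Stop every slurmd.
pdsh -w "$NODES" systemctl stop slurmd

# 3. Edit slurm.conf (add/remove nodes), then push the identical file
#    to every node so no stale copies remain.
pdcp -w "$NODES" /etc/slurm/slurm.conf /etc/slurm/slurm.conf

# 4. Start all slurmd's, then the controller last.
pdsh -w "$NODES" systemctl start slurmd
systemctl start slurmctld
```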

---

I've restarted the daemons as you'd suggested. Everything did come up just fine, and I haven't noticed any nodes go not-responding as yet.

I did find another node that had an out-of-date config; not sure how that snuck past the first scan through the cluster. That node has been shut down.

Is there a way to dump information about the topology from Slurm?

- Michael

---

(In reply to Michael Gutteridge from comment #6)
> I've restarted the daemons as you'd suggested. Everything did come up just
> fine, and I haven't noticed any nodes go not-responding as yet.

Good to hear! We can keep this ticket open for a little bit just to make sure.

> I did find another node that had an out-of-date config; not sure how that
> snuck past the first scan through the cluster. That node has been shut down.
>
> Is there a way to dump information about the topology from Slurm?

We don't have any tools to do that. I've heard of some third-party tools that can, but of course we don't support them, and your mileage may vary on how accurate (to the network topology, and syntactically) they are.

---

Since I haven't heard otherwise, I assume this bug is resolved. I'm closing it as infogiven, but please re-open it if you continue to have issues.

---

My apologies; I thought I'd resolved this ticket. Everything is working great.

- Michael
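
Two quick checks that would have caught the issues surfaced in this thread: a node running a stale slurm.conf, and a node name resolving inconsistently across the cluster. A sketch assuming `pdsh`/`dshbak` and the same hypothetical node list as above:

```sh
# Sketch only -- pdsh/dshbak and the node list are assumptions.
NODES='gizmok[01-64]'

# Every node should report the same checksum for slurm.conf; dshbak -c
# collapses identical output, so a single group means the files match.
pdsh -w "$NODES" md5sum /etc/slurm/slurm.conf | dshbak -c

# Every node should resolve a given node name to the same single address.
pdsh -w "$NODES" getent hosts gizmok61 | dshbak -c
```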

---

Created attachment 18474 [details]
current slurm config

We've got a problem with nodes being marked down. When the nodes are subsequently returned to service (`scontrol update nodename= state=resume`), they seem to function fine for a little while. A sample of some (apparently) relevant log messages at debug level 6:

```
[2021-03-15T22:09:00.842] agent/is_node_resp: node:gizmok61 RPC:REQUEST_PING : Can't find an address, check slurm.conf
[2021-03-15T23:55:43.563] Node gizmok61 now responding
[2021-03-15T23:55:43.563] debug2: _slurm_rpc_node_registration complete for gizmok61 usec=110541
[2021-03-15T23:55:45.717] debug2: node_did_resp gizmok61
[2021-03-16T00:09:01.723] debug2: node_did_resp gizmok61
[2021-03-16T00:22:21.171] agent/is_node_resp: node:gizmok61 RPC:REQUEST_PING : Can't find an address, check slurm.conf
[2021-03-16T02:09:04.107] Node gizmok61 now responding
[2021-03-16T02:09:04.107] debug2: _slurm_rpc_node_registration complete for gizmok61 usec=116
[2021-03-16T02:09:06.351] debug2: node_did_resp gizmok61
[2021-03-16T02:22:24.380] debug2: node_did_resp gizmok61
[2021-03-16T02:35:44.663] debug2: node_did_resp gizmok61
[2021-03-16T02:49:05.220] debug2: node_did_resp gizmok61
[2021-03-16T03:02:25.380] debug2: node_did_resp gizmok61
[2021-03-16T03:15:46.083] debug2: node_did_resp gizmok61
[2021-03-16T03:29:06.034] debug2: node_did_resp gizmok61
[2021-03-16T03:42:25.805] debug2: node_did_resp gizmok61
[2021-03-16T03:55:45.162] debug2: node_did_resp gizmok61
[2021-03-16T04:09:05.928] debug2: node_did_resp gizmok61
[2021-03-16T04:22:27.237] debug2: _slurm_rpc_node_registration complete for gizmok61 usec=16
[2021-03-16T04:22:29.494] debug2: node_did_resp gizmok61
[2021-03-16T04:35:45.618] debug2: node_did_resp gizmok61
[2021-03-16T04:49:05.341] agent/is_node_resp: node:gizmok61 RPC:REQUEST_PING : Can't find an address, check slurm.conf
```

It does seem to be correlated with nodes that were recently added to the Slurm config (attached). The node names are resolvable, and there doesn't seem to be an issue with DNS. I've verified that all configured nodes have the same slurm.conf, and I restarted all slurm daemons and the controller (the messages above were logged after that restart). As part of troubleshooting I've also set `ReturnToService` to 0 (it was at 2, for some reason I don't immediately recall).

Thanks
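
For reference, the resume command from the description with a node name filled in, plus two standard ways to inspect a node that slurmctld has marked down. These are stock Slurm commands; `gizmok61` is simply the node seen in the logs above.

```sh
# List down/drained nodes together with the reason slurmctld recorded.
sinfo -R

# Show full state for the node from the logs, including the address
# slurmctld has cached for it (NodeAddr / NodeHostName).
scontrol show node gizmok61

# Return the node to service, as in the description above.
scontrol update nodename=gizmok61 state=resume
```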