| Summary: | Nodes are going offline for unknown reasons - Slurm says "not responding" | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Wei Feinstein <wfeinstein> |
| Component: | slurmctld | Assignee: | David Bigagli <david> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 2 - High Impact | | |
| Priority: | --- | CC: | da |
| Version: | 2.6.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | LBNL - Lawrence Berkeley National Laboratory | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf file, attachment-24638-0.html | | |
Jackie, could you send the slurmd log covering this time for one of the nodes (n0000.cumulus0)?

Created attachment 788 [details]: attachment-24638-0.html

Danny, just a note that we added a bunch of nodes today, and we just started seeing these issues. The clusters that were added are natgas, cumulus, explorer, and musigny. Look at the slurm.conf file and you will see the node names.

[2014-04-22T10:26:24.371] topology tree plugin loaded
[2014-04-22T10:26:24.516] Warning: Note very large processing time from slurm_topo_build_config: usec=145546 began=10:26:24.371
[2014-04-22T10:26:24.517] Gathering cpu frequency information for 12 cpus
[2014-04-22T10:26:24.517] task NONE plugin loaded
[2014-04-22T10:26:24.517] auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2014-04-22T10:26:24.517] Munge cryptographic signature plugin loaded
[2014-04-22T10:26:24.534] Warning: Core limit is only 0 KB
[2014-04-22T10:26:24.534] slurmd version 2.6.4 started
[2014-04-22T10:26:24.535] Job accounting gather LINUX plugin loaded
[2014-04-22T10:26:24.535] switch NONE plugin loaded
[2014-04-22T10:26:24.535] slurmd started on Tue, 22 Apr 2014 10:26:24 -0700
[2014-04-22T10:26:24.535] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=96869 TmpDisk=30042 Uptime=1292
[2014-04-22T10:26:24.535] AcctGatherEnergy NONE plugin loaded
[2014-04-22T10:26:24.535] AcctGatherProfile NONE plugin loaded
[2014-04-22T10:26:24.535] AcctGatherInfiniband NONE plugin loaded
[2014-04-22T10:26:24.536] AcctGatherFilesystem NONE plugin loaded
[2014-04-22T10:41:18.472] error: forward_thread to n0008.baldur0: No route to host
[2014-04-22T15:25:51.182] error: forward_thread to n0008.baldur0: No route to host
[2014-04-22T15:47:01.128] error: forward_thread to n0008.baldur0: No route to host

I see these errors:

[2014-04-22T10:41:18.472] error: forward_thread to n0008.baldur0: No route to host
[2014-04-22T15:25:51.182] error: forward_thread to n0008.baldur0: No route to host
[2014-04-22T15:47:01.128] error: forward_thread to n0008.baldur0: No route to host

The different slurmd daemons must be able to communicate with each other because of the message tree routing mechanism: basically, a message from slurmd A can hop through slurmd B and C before reaching the controller.

David

(In reply to David Bigagli from comment #3)

Can each of your slurmd daemons communicate with each other?
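One quick way to answer that question by hand is to probe the slurmd port on each peer node. Below is a minimal sketch, assuming the default SlurmdPort of 6818, a standard nc (netcat) binary, and an illustrative set of target hosts taken from the logs above:

```
#!/bin/sh
# Read the slurmd port from the running configuration; fall back to
# the default of 6818 if the lookup fails.
port=$(scontrol show config | awk '/^SlurmdPort/ {print $3}')

# Run this from a compute node: probe peers that slurmctld reported
# as "not responding" or "No route to host".
for host in n0008.baldur0 n0000.cumulus0 n0000.natgas0; do
    if nc -z -w 5 "$host" "${port:-6818}"; then
        echo "$host: slurmd port reachable"
    else
        echo "$host: NOT reachable (check routing/firewall)"
    fi
done
```

Any host that fails the probe cannot serve as a hop in the message tree, which would produce exactly the forward_thread errors shown above.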
Hierarchical communications can be disabled, or otherwise configured, using the TreeWidth slurm.conf parameter. Setting a really large value will adversely impact Slurm performance, but each Slurm command and daemon will then manage all of its communications directly, without routing through intermediate slurmd daemons. If desired, you can also configure each node's IP address in slurm.conf; see the NodeName, NodeHostName, and NodeAddr descriptions in man slurm.conf. For example:

NodeName=tux[0-10] NodeHostName=n[0-10].tux[0] NodeAddr=12.3.45.[0-10] ...

I will also add that support for more controlled communications using gateway nodes is under development for a future release.
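Concretely, that advice would look something like the following slurm.conf fragment. This is a minimal sketch, not this site's actual configuration: the TreeWidth value is illustrative, and the node entries simply reuse the tux example from the comment above.

```
# Flatten the message-forwarding tree: with TreeWidth set to at least
# the number of nodes, slurmctld contacts every slurmd directly instead
# of relaying through peer slurmds (at some cost in performance).
TreeWidth=500

# Give each node an explicit address so slurmd-to-slurmd traffic does
# not depend on hostname resolution.
NodeName=tux[0-10] NodeHostName=n[0-10].tux[0] NodeAddr=12.3.45.[0-10]
```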
Created attachment 787 [details]: slurm.conf file

[2014-04-22T15:45:57.648] _slurm_rpc_submit_batch_job JobId=225901 usec=2185
[2014-04-22T15:46:15.010] error: Nodes n0000.cumulus0,n0000.explorer0,n0000.musigny0,n0000.natgas0,n0000.voltaire0,n0001.cumulus0,n0001.explorer0,n0001.musigny0,n0001.natgas0,n0001.voltaire0,n0002.cumulus0,n0002.explorer0,n0002.musigny0,n0002.natgas0,n0002.voltaire0,n0003.cumulus0,n0003.explorer0,n0003.musigny0,n0003.natgas0,n0004.cumulus0,n0004.explorer0,n0004.musigny0,n0004.natgas0,n0005.cumulus0,n0005.explorer0,n0005.musigny0,n0005.natgas0,n0006.cumulus0,n0006.explorer0,n0006.musigny0,n0006.natgas0,n0007.cumulus0,n0007.explorer0,n0007.musigny0,n0012.cumulus0,n0012.musigny0,n0012.natgas0,n0012.voltaire0,n0013.cumulus0,n0013.musigny0,n0013.natgas0,n0013.voltaire0,n0014.cumulus0,n0014.musigny0,n0014.natgas0,n0014.voltaire0,n0015.cumulus0,n0015.musigny0,n0015.natgas0,n0015.voltaire0,n0016.cumulus0,n0016.natgas0,n0016.voltaire0,n0017.cumulus0,n0017.natgas0,n0017.voltaire0,n0018.cumulus0,n0018.natgas0,n0018.voltaire0,n0019.cumulus0,n0019.natgas0,n0019.voltaire0,n0020.cumulus0,n0020.natgas0,n0020.voltaire0,n0021.cumulus0,n0021.natgas0,n0021.voltaire0,n0022.cumulus0,n0022.natgas0,n0022.voltaire0,n0023.cumulus0,n0023.natgas0,n0023.voltaire0,n0024.cumulus0,n0024.natgas0,n0024.voltaire0,n0025.cumulus0,n0025.natgas0,n0026.cumulus0,n0026.natgas0,n0027.cumulus0,n0027.natgas0,n0028.natgas0,n0029.natgas0,n0030.natgas0,n0031.natgas0,n0032.natgas0,n0033.natgas0,n0034.natgas0,n0035.natgas0,n0036.natgas0,n0037.natgas0,n0037.voltaire0,n0038.natgas0,n0038.voltaire0,n0039.voltaire0,n0040.voltaire0,n0041.voltaire0,n0042.natgas0,n0042.voltaire0,n0043.natgas0,n0043.voltaire0,n0044.natgas0,n0045.natgas0,n0046.natgas0,n0047.natgas0,n0048.natgas0,n0049.natgas0,n0050.natgas0,n0051.natgas0,n0052.natgas0,n0053.natgas0,n0054.natgas0,n0055.natgas0,n0056.natgas0,n0057.natgas0,n0058.natgas0,n0059.natgas0,n0060.natgas0,n0061.natgas0,n0062.natgas0,n0063.natgas0,n0064.natgas0,n0065.natgas0,n0066.natgas0,n0067.natgas0,n0068.natgas0,n0069.natgas0,n0070.natgas0,n0071.natgas0,n0072.natgas0,n0073.natgas0,n0074.natgas0,n0075.natgas0,n0076.natgas0,n0077.natgas0,n0078.natgas0,n0079.natgas0,n0080.natgas0,n0081.natgas0,n0082.natgas0,n0083.natgas0,n0084.natgas0,n0085.natgas0,n0086.natgas0,n0087.natgas0,n0088.natgas0 not responding
[2014-04-22T15:46:26.030] Warning: Note very large processing time from _slurmctld_background: usec=2014955 began=15:46:24.015
[2014-04-22T15:46:29.029] Warning: Note very large processing time from _slurmctld_background: usec=1998884 began=15:46:27.030

[root@perceus-00 sysconfig]# sinfo -R

| REASON | USER | TIMESTAMP | NODELIST |
|---|---|---|---|
| IB down - yqin | yqin | 2014-04-09T15:51:20 | n0241.mako0 |
| IB error - yqin | root | 2013-12-09T13:15:04 | n0132.mako0 |
| need to check IB - y | root | 2013-12-02T10:19:23 | n0198.mako0 |
| Out to Finetec | root | 2014-03-11T15:41:27 | n0025.jbei0 |
| Not responding | root | 2014-03-11T20:51:45 | n0026.jbei0 |
| Not responding | slurm | 2014-04-18T16:05:19 | n0044.jbei0 |
| Not responding | slurm | 2014-04-15T09:11:50 | n0050.jbei0 |
| node keeps rebooting | root | 2014-04-08T08:10:41 | n0039.jbei0 |
| contacting Dell for | root | 2014-02-19T17:17:12 | n0000.baldur0 |
| disk not seen / may | root | 2014-02-19T17:18:07 | n0005.baldur0 |
| disk not seen / may | root | 2014-02-19T17:18:40 | n0007.baldur0 |
| disk not seen / may | root | 2014-02-19T17:18:49 | n0008.baldur0 |
| disk not seen / may | root | 2014-02-19T17:19:12 | n0015.baldur0 |
| NHC: check_fs_mount: | root | 2014-02-27T13:36:55 | n0021.baldur0 |
| hard disk failure - | yqin | 2014-03-24T11:30:11 | n0003.jcap0 |
| BIOS issues- sja | root | 2014-04-04T10:16:20 | n0000.mhg0 |
| disk backplane-sja | root | 2014-03-21T06:44:25 | n0014.mhg0 |
| failed raid | root | 2014-04-08T08:13:22 | n0017.mhg0 |
| batch job complete f | root | 2014-04-21T18:55:15 | n0021.mhg0 |
| Memory test - kmwf | yqin | 2014-04-10T14:18:57 | n0063.catamount0 |
| failed disk | root | 2014-04-08T08:15:17 | n0024.hbar0 |
| Not responding | slurm | 2014-04-09T15:35:21 | n0001.hbar0 |
| RAM R/U test failed | root | 2014-04-08T08:15:47 | n0003.hbar0 |
| node unexpectedly re | root | 2014-03-25T08:57:38 | n0032.hbar0 |
| batch job complete f | root | 2014-04-10T07:54:12 | n0006.hbar0 |
| batch job complete f | root | 2014-04-14T23:23:14 | n0017.hbar0 |
| unexpectedly reboots | root | 2014-04-08T08:19:10 | n0007.hbar0 |
| Not responding | slurm | 2014-04-22T15:42:35 | n0039.natgas0,n0040.natgas0,n0041.natgas0 |
| Not responding | slurm | 2014-04-22T15:40:55 | n0106.natgas0,n0107.natgas0,n0108.natgas0,n0109.natgas0,n0110.natgas0,n0111.natgas0,n0112.natgas0,n0113.natgas0,n0114.natgas0,n0115.natgas0,n0116.natgas0,n0117.natgas0,n0118.natgas0,n0119.natgas0,n0120.natgas0,n0121.natgas0,n0122.natgas0,n0123.natgas0,n0124.natgas0,n0125.natgas0 |

Hi, could you please update this ticket. Was the problem solved?

David

The problem was resolved by running scontrol reconfig. We have not seen the problem anymore.
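For the record, the resolution as repeatable commands; a minimal sketch run on the controller host, and it assumes the updated slurm.conf has already been copied to every node, since scontrol reconfig does not distribute the file itself:

```
# Ask slurmctld and all slurmd daemons to re-read slurm.conf.
scontrol reconfig

# Verify recovery: list any nodes still flagged with a reason
# such as "Not responding".
sinfo -R
```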