Ticket 740

Summary: Nodes are going offline for unknown reasons; Slurm reports them as "not responding"
Product: Slurm    Reporter: Wei Feinstein <wfeinstein>
Component: slurmctld    Assignee: David Bigagli <david>
Status: RESOLVED CANNOTREPRODUCE    QA Contact: ---
Severity: 2 - High Impact    
Priority: --- CC: da
Version: 2.6.4   
Hardware: Linux   
OS: Linux   
Site: LBNL - Lawrence Berkeley National Laboratory
Attachments: slurm.conf file
attachment-24638-0.html

Description Wei Feinstein 2014-04-22 10:53:11 MDT
Created attachment 787 [details]
slurm.conf file

[2014-04-22T15:45:57.648] _slurm_rpc_submit_batch_job JobId=225901 usec=2185
[2014-04-22T15:46:15.010] error: Nodes n0000.cumulus0,n0000.explorer0,n0000.musigny0,n0000.natgas0,n0000.voltaire0,n0001.cumulus0,n0001.explorer0,n0001.musigny0,n0001.natgas0,n0001.voltaire0,n0002.cumulus0,n0002.explorer0,n0002.musigny0,n0002.natgas0,n0002.voltaire0,n0003.cumulus0,n0003.explorer0,n0003.musigny0,n0003.natgas0,n0004.cumulus0,n0004.explorer0,n0004.musigny0,n0004.natgas0,n0005.cumulus0,n0005.explorer0,n0005.musigny0,n0005.natgas0,n0006.cumulus0,n0006.explorer0,n0006.musigny0,n0006.natgas0,n0007.cumulus0,n0007.explorer0,n0007.musigny0,n0012.cumulus0,n0012.musigny0,n0012.natgas0,n0012.voltaire0,n0013.cumulus0,n0013.musigny0,n0013.natgas0,n0013.voltaire0,n0014.cumulus0,n0014.musigny0,n0014.natgas0,n0014.voltaire0,n0015.cumulus0,n0015.musigny0,n0015.natgas0,n0015.voltaire0,n0016.cumulus0,n0016.natgas0,n0016.voltaire0,n0017.cumulus0,n0017.natgas0,n0017.voltaire0,n0018.cumulus0,n0018.natgas0,n0018.voltaire0,n0019.cumulus0,n0019.natgas0,n0019.voltaire0,n0020.cumulus0,n0020.natgas0,n0020.voltaire0,n0021.cumulus0,n0021.natgas0,n0021.voltaire0,n0022.cumulus0,n0022.natgas0,n0022.voltaire0,n0023.cumulus0,n0023.natgas0,n0023.voltaire0,n0024.cumulus0,n0024.natgas0,n0024.voltaire0,n0025.cumulus0,n0025.natgas0,n0026.cumulus0,n0026.natgas0,n0027.cumulus0,n0027.natgas0,n0028.natgas0,n0029.natgas0,n0030.natgas0,n0031.natgas0,n0032.natgas0,n0033.natgas0,n0034.natgas0,n0035.natgas0,n0036.natgas0,n0037.natgas0,n0037.voltaire0,n0038.natgas0,n0038.voltaire0,n0039.voltaire0,n0040.voltaire0,n0041.voltaire0,n0042.natgas0,n0042.voltaire0,n0043.natgas0,n0043.voltaire0,n0044.natgas0,n0045.natgas0,n0046.natgas0,n0047.natgas0,n0048.natgas0,n0049.natgas0,n0050.natgas0,n0051.natgas0,n0052.natgas0,n0053.natgas0,n0054.natgas0,n0055.natgas0,n0056.natgas0,n0057.natgas0,n0058.natgas0,n0059.natgas0,n0060.natgas0,n0061.natgas0,n0062.natgas0,n0063.natgas0,n0064.natgas0,n0065.natgas0,n0066.natgas0,n0067.natgas0,n0068.natgas0,n0069.natgas0,n0070.natgas0,n0071.natgas0,n0072.natgas0,n0073.natgas0,n0074.natgas0,n0075.natgas0,n0076.natgas0,n0077.natgas0,n0078.natgas0,n0079.natgas0,n0080.natgas0,n0081.natgas0,n0082.natgas0,n0083.natgas0,n0084.natgas0,n0085.natgas0,n0086.natgas0,n0087.natgas0,n0088.natgas0 not responding
[2014-04-22T15:46:26.030] Warning: Note very large processing time from _slurmctld_background: usec=2014955 began=15:46:24.015
[2014-04-22T15:46:29.029] Warning: Note very large processing time from _slurmctld_background: usec=1998884 began=15:46:27.030


[root@perceus-00 sysconfig]# sinfo -R
REASON               USER      TIMESTAMP           NODELIST
IB down - yqin       yqin      2014-04-09T15:51:20 n0241.mako0
IB error - yqin      root      2013-12-09T13:15:04 n0132.mako0
need to check IB - y root      2013-12-02T10:19:23 n0198.mako0
Out to Finetec       root      2014-03-11T15:41:27 n0025.jbei0
Not responding       root      2014-03-11T20:51:45 n0026.jbei0
Not responding       slurm     2014-04-18T16:05:19 n0044.jbei0
Not responding       slurm     2014-04-15T09:11:50 n0050.jbei0
node keeps rebooting root      2014-04-08T08:10:41 n0039.jbei0
contacting Dell for  root      2014-02-19T17:17:12 n0000.baldur0
disk not seen / may  root      2014-02-19T17:18:07 n0005.baldur0
disk not seen / may  root      2014-02-19T17:18:40 n0007.baldur0
disk not seen / may  root      2014-02-19T17:18:49 n0008.baldur0
disk not seen / may  root      2014-02-19T17:19:12 n0015.baldur0
NHC: check_fs_mount: root      2014-02-27T13:36:55 n0021.baldur0
hard disk failure -  yqin      2014-03-24T11:30:11 n0003.jcap0
BIOS issues- sja     root      2014-04-04T10:16:20 n0000.mhg0
disk backplane-sja   root      2014-03-21T06:44:25 n0014.mhg0
failed raid          root      2014-04-08T08:13:22 n0017.mhg0
batch job complete f root      2014-04-21T18:55:15 n0021.mhg0
Memory test - kmwf   yqin      2014-04-10T14:18:57 n0063.catamount0
failed disk          root      2014-04-08T08:15:17 n0024.hbar0
Not responding       slurm     2014-04-09T15:35:21 n0001.hbar0
RAM R/U test failed  root      2014-04-08T08:15:47 n0003.hbar0
node unexpectedly re root      2014-03-25T08:57:38 n0032.hbar0
batch job complete f root      2014-04-10T07:54:12 n0006.hbar0
batch job complete f root      2014-04-14T23:23:14 n0017.hbar0
unexpectedly reboots root      2014-04-08T08:19:10 n0007.hbar0
Not responding       slurm     2014-04-22T15:42:35 n0039.natgas0,n0040.natgas0,n0041.natgas0
Not responding       slurm     2014-04-22T15:40:55 n0106.natgas0,n0107.natgas0,n0108.natgas0,n0109.natgas0,n0110.natgas0,n0111.natgas0,n0112.natgas0,n0113.natgas0,n0114.natgas0,n0115.natgas0,n0116.natgas0,n0117.natgas0,n0118.natgas0,n0119.natgas0,n0120.natgas0,n0121.natgas0,n0122.natgas0,n0123.natgas0,n0124.natgas0,n0125.natgas0
Comment 1 Danny Auble 2014-04-22 10:57:14 MDT
Jackie, could you send the slurmd log during this time for one of the nodes (n0000.cumulus0)?
Comment 2 Wei Feinstein 2014-04-22 11:12:39 MDT
Created attachment 788 [details]
attachment-24638-0.html

Danny, note that we added a bunch of nodes today and only just started seeing
these issues. The clusters that were added are natgas, cumulus, explorer, and
musigny. Look at the slurm.conf file and you will see the node names.


[2014-04-22T10:26:24.371] topology tree plugin loaded
[2014-04-22T10:26:24.516] Warning: Note very large processing time from slurm_topo_build_config: usec=145546 began=10:26:24.371
[2014-04-22T10:26:24.517] Gathering cpu frequency information for 12 cpus
[2014-04-22T10:26:24.517] task NONE plugin loaded
[2014-04-22T10:26:24.517] auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2014-04-22T10:26:24.517] Munge cryptographic signature plugin loaded
[2014-04-22T10:26:24.534] Warning: Core limit is only 0 KB
[2014-04-22T10:26:24.534] slurmd version 2.6.4 started
[2014-04-22T10:26:24.535] Job accounting gather LINUX plugin loaded
[2014-04-22T10:26:24.535] switch NONE plugin loaded
[2014-04-22T10:26:24.535] slurmd started on Tue, 22 Apr 2014 10:26:24 -0700
[2014-04-22T10:26:24.535] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=96869 TmpDisk=30042 Uptime=1292
[2014-04-22T10:26:24.535] AcctGatherEnergy NONE plugin loaded
[2014-04-22T10:26:24.535] AcctGatherProfile NONE plugin loaded
[2014-04-22T10:26:24.535] AcctGatherInfiniband NONE plugin loaded
[2014-04-22T10:26:24.536] AcctGatherFilesystem NONE plugin loaded
[2014-04-22T10:41:18.472] error: forward_thread to n0008.baldur0: No route to host
[2014-04-22T15:25:51.182] error: forward_thread to n0008.baldur0: No route to host
[2014-04-22T15:47:01.128] error: forward_thread to n0008.baldur0: No route to host


Comment 3 David Bigagli 2014-04-22 11:20:50 MDT
I see these errors:

->[2014-04-22T10:41:18.472] error: forward_thread to n0008.baldur0: No route to host
->[2014-04-22T15:25:51.182] error: forward_thread to n0008.baldur0: No route to host
->[2014-04-22T15:47:01.128] error: forward_thread to n0008.baldur0: No route to host

The different slurmd daemons must be able to communicate with each other because of the message tree routing mechanism: a message from slurmd A can hop through slurmd B and C before reaching the controller, so a host that is unreachable from its peers will be reported as not responding.
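
A quick way to check is to try reaching a peer's slurmd from another compute node. This is only a sketch; the hostname comes from the log above, and 6818 is the default SlurmdPort:

    # run from any other compute node, e.g. n0000.cumulus0
    ping -c 1 n0008.baldur0     # basic IP reachability
    nc -zv n0008.baldur0 6818   # TCP connect to the slurmd port

A "No route to host" from ping here would point to a network or routing problem rather than a Slurm one.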

David
Comment 4 Moe Jette 2014-04-23 03:44:47 MDT
(In reply to David Bigagli from comment #3)
> The different slurmd daemons must be able to communicate with each other
> because of the message tree routing mechanism: a message from slurmd A can
> hop through slurmd B and C before reaching the controller.

Can each of your slurmd daemons communicate with the others?

Hierarchical communications can be disabled, or otherwise configured, using the TreeWidth parameter in slurm.conf. Setting a really large value (at least the number of nodes) can adversely impact Slurm performance at scale, but each Slurm command and daemon will then manage all of its communications directly, without routing through intermediate slurmd daemons.
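
As a sketch, assuming your node count is below 1024 (check your slurm.conf for the exact figure; the default TreeWidth is 50):

    # slurm.conf -- effectively disable hierarchical forwarding:
    # with TreeWidth >= the number of nodes, no message is routed
    # through intermediate slurmd daemons
    TreeWidth=1024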

If desired, you can also configure each node's IP address in slurm.conf. See the NodeName, NodeHostName, and NodeAddr descriptions in man slurm.conf. For example:
NodeName=tux[0-10] NodeHostName=n[0-10].tux0 NodeAddr=12.3.45.[0-10] ...
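
Applied to this site's naming scheme, a hypothetical entry (the addresses below are invented for illustration, not taken from your configuration) could look like:

    NodeName=n[0000-0007].cumulus0 NodeAddr=10.1.0.[0-7]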

I will also add that support for more controlled communications using gateway nodes is under development for a future release.
Comment 5 David Bigagli 2014-04-25 06:31:22 MDT
Hi,
  could you please update this ticket? Was the problem solved?

David
Comment 6 Wei Feinstein 2014-04-28 09:32:45 MDT
The problem was resolved by running scontrol reconfig; we have not seen it since.
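
For reference, a minimal version of that recovery sequence, with a verification step (standard scontrol/sinfo commands; the node names are taken from the earlier sinfo output, and the RESUME step is only needed for nodes still marked down):

    scontrol reconfig       # re-read slurm.conf on the controller and all slurmd daemons
    sinfo -R                # verify no nodes remain listed as "Not responding"
    scontrol update NodeName=n[0039-0041].natgas0 State=RESUME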