Ticket 4788

Summary: Slurm State=IDLE*+POWER
Product: Slurm
Reporter: Jason Repik <jjrepik>
Component: KNL
Assignee: Felip Moll <felip.moll>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
CC: felip.moll
Version: 17.02.9
Version Fixed: 17.11.6
Hardware: Cray XC
OS: Linux
Site: Sandia National Laboratories
Attachments: partial slurmctld.log

Description Jason Repik 2018-02-14 07:42:45 MST
We have a set of KNL nodes in the following state.  
This has persisted through a full system reboot too.

sinfo for the nodes:
knl       up    1-infini 1-00:00:00    272 1:68:4      14 idle~      nid00[203-208,239-242,268-271]

sdb:~ # scontrol show node=nid00203
NodeName=nid00203 Arch=x86_64 CoresPerSocket=68
   CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=1.12
   AvailableFeatures=knl,compute,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
   ActiveFeatures=knl,compute,cache,quad
   Gres=craynetwork:4,hbm:0
   NodeAddr=nid00203 NodeHostName=nid00203 Version=17.02
   OS=Linux RealMemory=96559 AllocMem=0 FreeMem=92035 Sockets=1 Boards=1
   State=IDLE*+POWER ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=knl,all,ccm_queue 
   BootTime=2018-02-14T07:09:21 SlurmdStartTime=2018-02-14T07:18:59
   CfgTRES=cpu=272,mem=96559M
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

After rebooting nid00203 individually, here's its slurmd.log:
nid00203:/var/spool/slurm # cat slurmd.log
[2018-02-14T07:18:59.631] Message aggregation disabled
[2018-02-14T07:18:59.676] Resource spec: Reserved system memory limit not configured for this node
[2018-02-14T07:18:59.688] slurmd version 17.02.9 started
[2018-02-14T07:18:59.689] error: No /var/spool/slurm/job_container_state file for job_container/cncu state recovery
[2018-02-14T07:18:59.689] core_spec/cray: init
[2018-02-14T07:18:59.697] slurmd started on Wed, 14 Feb 2018 07:18:59 -0700
[2018-02-14T07:18:59.698] CPUs=272 Boards=1 Sockets=1 Cores=68 Threads=4 Memory=96559 TmpDisk=48279 Uptime=578 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
nid00203:/var/spool/slurm #

I'm not familiar with this state and I'm not sure how to get them out of it.
Comment 1 Felip Moll 2018-02-14 09:15:08 MST
(In reply to Jason Repik from comment #0)
> We have a set of KNL nodes in the following state.
> This has persisted through a full system reboot too.
> [...]
> I'm not familiar with this state and I'm not sure how to get them out of it.


Hi Jason,

Idle*+power (seen as idle~ in sinfo) indicates that the nodes have booted up and are in a power saving mode.

I suppose the nodes were booted for some job request and that job then finished. Has 'SuspendTimeout' been reached for these nodes? Can you successfully send jobs to them?

Have you seen any indication that the SuspendProgram has problems running on these nodes?


Not directly related to your question, but I want to mention bug 4536 here: when dealing with power saving features you should be aware of this bug, solved in 17.02.10 - https://github.com/SchedMD/slurm/commit/c43df3a78b0f235f3502a395321690e492dc7bcc
Comment 2 Jason Repik 2018-02-14 10:03:44 MST
Again, these nodes were in this state prior to our DST yesterday.  After a full system reboot, these nodes are still in the IDLE*+POWER state.  It was my understanding that they should have returned to a 'normal' state after being rebooted.  

From what I can tell via 'capmc' the power setting on those nodes is no different than the other nodes.

I was able to submit a job that used those nodes as part of the request; however, all of those nodes are now stuck in a 'completing' state.

I'm not aware of any issues with the SuspendProgram but to be fair, I'm not sure what it does or where it runs.
Comment 3 Jason Repik 2018-02-14 10:42:21 MST
Rebooted one of the nodes and no change:

nid00203:~ # cat /var/spool/slurm/slurmd.log
[2018-02-14T10:27:54.699] Message aggregation disabled
[2018-02-14T10:27:54.744] Resource spec: Reserved system memory limit not configured for this node
[2018-02-14T10:27:54.756] slurmd version 17.02.9 started
[2018-02-14T10:27:54.757] error: No /var/spool/slurm/job_container_state file for job_container/cncu state recovery
[2018-02-14T10:27:54.757] core_spec/cray: init
[2018-02-14T10:27:54.765] slurmd started on Wed, 14 Feb 2018 10:27:54 -0700
[2018-02-14T10:27:54.766] CPUs=272 Boards=1 Sockets=1 Cores=68 Threads=4 Memory=96559 TmpDisk=48279 Uptime=578 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
nid00203:~ #

sdb:~ # scontrol show node=nid00203
NodeName=nid00203 Arch=x86_64 CoresPerSocket=68
   CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=0.53
   AvailableFeatures=knl,compute,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
   ActiveFeatures=knl,compute,cache,quad
   Gres=craynetwork:4,hbm:0
   NodeAddr=nid00203 NodeHostName=nid00203 Version=17.02
   OS=Linux RealMemory=96559 AllocMem=0 FreeMem=92084 Sockets=1 Boards=1
   State=IDLE+COMPLETING ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=knl,all,ccm_queue 
   BootTime=2018-02-14T10:18:16 SlurmdStartTime=2018-02-14T10:27:54
   CfgTRES=cpu=272,mem=96559M
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

sdb:~ #
sdb:~ # scontrol update nodename=nid00203 state=resume
slurm_update error: Invalid node state specified
sdb:~ #
Comment 4 Felip Moll 2018-02-15 08:10:48 MST
(In reply to Jason Repik from comment #3)
> Rebooted one of the nodes and no change:
> [...]
> sdb:~ # scontrol update nodename=nid00203 state=resume
> slurm_update error: Invalid node state specified

With respect to the IDLE+COMPLETING problem, you've certainly hit bug 4536.

This is resolved in commit 7d24678477d49c5305792b00883bf79c55ee3307, included in the upcoming 17.02.10 release. You can safely apply it now if you need it.


I am still investigating why your nodes remained in IDLE*+POWER status.
Comment 5 Felip Moll 2018-02-15 08:23:02 MST
(In reply to Felip Moll from comment #4)
> I am still investigating why your nodes remained in IDLE*+POWER status.

Btw, could you check the status of the filesystems, etc. on those nodes and send back the slurmd and slurmctld logs?


> I'm not aware of any issues with the SuspendProgram but to be fair, I'm not 
> sure what it does or where it runs.

The SuspendProgram runs after SuspendTimeout and puts the nodes offline. Would it be possible for you to send your slurm.conf?

So:
- Check filesystems and node health in general (dmesg, etc.)
- Attach the slurmd and slurmctld logs and slurm.conf
- To bring the nodes up, set them to the down state, then restart slurmd, then set them to resume, and tell me if it works.
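
For context on the SuspendProgram question above: Slurm power saving is typically wired up in slurm.conf roughly as in the sketch below. This is a hedged illustration, not this site's actual configuration; the program paths are placeholders.

```
# Sketch of slurm.conf power-saving settings (paths are placeholders):
SuspendProgram=/opt/slurm/etc/capmc_suspend   # invoked to power idle nodes down
ResumeProgram=/opt/slurm/etc/capmc_resume     # invoked to power nodes back up for a job
SuspendTime=600      # seconds a node must sit idle before it is suspended
SuspendTimeout=30    # seconds allowed for a node to complete suspending
ResumeTimeout=600    # seconds allowed for a node to boot before being marked DOWN
```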
Comment 7 Jason Repik 2018-02-15 09:48:38 MST
Created attachment 6153 [details]
slurm.conf
Comment 8 Jason Repik 2018-02-15 09:52:26 MST
slurmd.log for node that was rebooted:

nid00203:~ # cat /var/spool/slurm/slurmd.log
[2018-02-14T10:27:54.699] Message aggregation disabled
[2018-02-14T10:27:54.744] Resource spec: Reserved system memory limit not configured for this node
[2018-02-14T10:27:54.756] slurmd version 17.02.9 started
[2018-02-14T10:27:54.757] error: No /var/spool/slurm/job_container_state file for job_container/cncu state recovery
[2018-02-14T10:27:54.757] core_spec/cray: init
[2018-02-14T10:27:54.765] slurmd started on Wed, 14 Feb 2018 10:27:54 -0700
[2018-02-14T10:27:54.766] CPUs=272 Boards=1 Sockets=1 Cores=68 Threads=4 Memory=96559 TmpDisk=48279 Uptime=578 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
nid00203:~ # 

Only entries for and after the reboot of the node in the slurmctld log:

[2018-02-14T10:27:18.962] error: Nodes nid00203 not responding
[2018-02-14T10:36:49.854] Invalid node state transition requested for node nid00203 from=COMPLETING to=RESUME
[2018-02-14T10:36:49.855] _slurm_rpc_update_node for nid00203: Invalid node state specified
Comment 9 Jason Repik 2018-02-15 10:00:34 MST
- Check filesystems and node health in general (dmesg, etc.)  

  * Everything on the node was fine.

- To put nodes up, set it to down state, then restart slurmd, then set them to resume, tell me if it works.

  * This procedure worked for one of the nodes.
Comment 10 Jason Repik 2018-02-15 12:03:08 MST
Here is a cut/paste from my screen after I completed the procedure to get it back to idle:

sdb:/var/spool/slurm/log # scontrol show node=nid00203
NodeName=nid00203 Arch=x86_64 CoresPerSocket=68
   CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=0.02
   AvailableFeatures=knl,compute,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
   ActiveFeatures=knl,compute,cache,quad
   Gres=craynetwork:4,hbm:0
   NodeAddr=nid00203 NodeHostName=nid00203 Version=17.02
   OS=Linux RealMemory=96559 AllocMem=0 FreeMem=92093 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=knl,all,ccm_queue 
   BootTime=2018-02-14T10:18:17 SlurmdStartTime=2018-02-15T09:58:21
   CfgTRES=cpu=272,mem=96559M
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

sdb:/var/spool/slurm/log #

Somehow the node went back to IDLE+COMPLETING again:

sdb:/var/spool/slurm/log # scontrol show node=nid00203
NodeName=nid00203 Arch=x86_64 CoresPerSocket=68
   CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=0.01
   AvailableFeatures=knl,compute,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
   ActiveFeatures=knl,compute,cache,quad
   Gres=craynetwork:4,hbm:0
   NodeAddr=nid00203 NodeHostName=nid00203 Version=17.02
   OS=Linux RealMemory=96559 AllocMem=0 FreeMem=92090 Sockets=1 Boards=1
   State=IDLE+COMPLETING ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=knl,all,ccm_queue 
   BootTime=2018-02-14T10:18:16 SlurmdStartTime=2018-02-15T09:58:21
   CfgTRES=cpu=272,mem=96559M
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

sdb:/var/spool/slurm/log #
Comment 11 Jason Repik 2018-02-15 12:24:27 MST
No new entries in slurmd.log:

nid00203:/var/spool/slurm # cat slurmd.log
[2018-02-14T10:27:54.699] Message aggregation disabled
[2018-02-14T10:27:54.744] Resource spec: Reserved system memory limit not configured for this node
[2018-02-14T10:27:54.756] slurmd version 17.02.9 started
[2018-02-14T10:27:54.757] error: No /var/spool/slurm/job_container_state file for job_container/cncu state recovery
[2018-02-14T10:27:54.757] core_spec/cray: init
[2018-02-14T10:27:54.765] slurmd started on Wed, 14 Feb 2018 10:27:54 -0700
[2018-02-14T10:27:54.766] CPUs=272 Boards=1 Sockets=1 Cores=68 Threads=4 Memory=96559 TmpDisk=48279 Uptime=578 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2018-02-15T09:58:18.812] core_spec/cray: fini
[2018-02-15T09:58:18.841] Slurmd shutdown completing
[2018-02-15T09:58:21.059] Message aggregation disabled
[2018-02-15T09:58:21.103] Resource spec: Reserved system memory limit not configured for this node
[2018-02-15T09:58:21.114] slurmd version 17.02.9 started
[2018-02-15T09:58:21.116] error: No /var/spool/slurm/job_container_state file for job_container/cncu state recovery
[2018-02-15T09:58:21.116] core_spec/cray: init
[2018-02-15T09:58:21.123] slurmd started on Thu, 15 Feb 2018 09:58:21 -0700
[2018-02-15T09:58:21.124] CPUs=272 Boards=1 Sockets=1 Cores=68 Threads=4 Memory=96559 TmpDisk=48279 Uptime=85204 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
nid00203:/var/spool/slurm #
Comment 12 Jason Repik 2018-02-15 12:33:38 MST
Created attachment 6158 [details]
partial slurmctld.log

grep nid00203 slurmctld.log_partial.txt

In this part I manually set the node to 'suspect' in xtprocadmin
and ran the node health checker against it:

[2018-02-15T09:43:48.991] update_node: node nid00203 reason set to: suspect by xtprocadmin
[2018-02-15T09:43:48.992] update_node: node nid00203 state set to DRAINING
[2018-02-15T09:44:11.065] update_node: node nid00203 state set to IDLE

This is the section where I set the node to down, then restarted slurmd on nid00203:

[2018-02-15T09:57:52.580] update_node: node nid00203 reason set to: cause
[2018-02-15T09:57:52.582] update_node: node nid00203 state set to DOWN
[2018-02-15T09:59:08.596] update_node: node nid00203 state set to IDLE
[2018-02-15T09:59:09.225] Node nid00203 now responding

At this point it showed as 'IDLE'.  At some point later it switched back to
'IDLE+COMPLETING' again.
Comment 13 Felip Moll 2018-02-19 09:40:26 MST
Jason,

I did some more tests and this is what I found:

With your configuration, a node can be set into the idle~ state only when a job is allocated and the capmc command fails to set the correct Cluster or Memory mode for one or more nodes of the allocation, or node_reinit fails to execute.

In this case, the capmc_resume program (the one you've set in slurm.conf) resets the node state to IDLE+POWER.

'capmc: ' strings in the log file should indicate exactly what happened. Setting the 'NodeFeatures' debug flag will help.


After this, a job may be executed on the node and then canceled/requeued because it wasn't able to run, since the node was not rebooted (again because of the capmc problem). At this point you could have hit bug 4536, as I mentioned before. When a job is cancelled or requeued while the node is not ready, the node is inconsistently left in the 'completing' state.

The IDLE+POWER status is not necessarily bad: when a job is allocated to a node in this state, the node is switched back to IDLE if everything is running fine.


I suggest you:

a) Apply commit 7d24678477d49c5305792b00883bf79c55ee3307.

b) Verify how capmc could be failing on these nodes.

For this, set the debug flags to 'NodeFeatures' and the log level to debug2. You should then see 'capmc: ' strings in the log file.

Try, for example, sending a job to the node asking for different Cluster or Memory modes, and monitor what happens to the node. Is it being rebooted? Is the mode changed? Are there 'capmc: ' strings in the log file?
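
One way to enable the suggested logging (a sketch; exact syntax per the slurm.conf and scontrol documentation for this Slurm generation):

```
# In slurm.conf (followed by 'scontrol reconfigure' or a slurmctld restart):
DebugFlags=NodeFeatures
SlurmctldDebug=debug2

# Or at runtime, without editing slurm.conf:
#   scontrol setdebugflags +nodefeatures
#   scontrol setdebug debug2
```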
Comment 14 Jason Repik 2018-02-21 12:48:44 MST
We typically get our Slurm RPMs spun from LANL. They are working on integrating the patch and spinning us a new set.

At this point though:
The system had filesystem issues this past weekend (power related).
We had a full system reboot yesterday.

We have not seen any of the IDLE+ states since then, and reprovisioning seems to be working normally too.
Comment 15 Felip Moll 2018-02-26 11:47:01 MST
(In reply to Jason Repik from comment #14)
> [...]
> We have not seen any of the IDLE+ states since then and reprovisioning seems to be working normally too.

It is good that the nodes are reestablished back to normal; this confirms some of my thoughts about problems with capmc.

Do you by chance have any of these capmc strings in log files from other past occurrences? I am interested in investigating this, since it is not the first time I have seen it.
Comment 16 Felip Moll 2018-03-06 05:43:26 MST
> Don't you have by chance any of this capmc strings in log files from other
> possible past situation? I am interested in investigating this since it is
> not the first time I see it.

Hi Jason,

Since no more replies have been made here, I assume everything is working as expected. If you have any capmc logs for my own investigation, I would appreciate it.

If not, is it OK with you if we close this bug?

Thanks
Comment 17 Jason Repik 2018-03-06 07:39:18 MST
We do have another node in that state, and I'm trying to find where that node 'changed' in the capmc log.

mutrino-int:/opt/modules/default/etc # scontrol show node=nid00194
NodeName=nid00194 Arch=x86_64 CoresPerSocket=68
   CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=1.14
   AvailableFeatures=knl,compute,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
   ActiveFeatures=knl,compute,cache,a2a
   Gres=craynetwork:4,hbm:0
   NodeAddr=nid00194 NodeHostName=nid00194 Version=17.02
   OS=Linux RealMemory=96559 AllocMem=0 FreeMem=92063 Sockets=1 Boards=1
   State=IDLE*+POWER ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=knl,all,ccm_queue 
   BootTime=2018-03-05T09:13:48 SlurmdStartTime=2018-03-05T09:23:21
   CfgTRES=cpu=272,mem=96559M
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

mutrino-int:/opt/modules/default/etc #
Comment 18 Jason Repik 2018-03-06 07:44:41 MST
According to the boot time info, our xtremoted.out should have the
right info, but there is no valid data:

-rw-rw-r-- 1 root    crayadm  33042 Feb 28 21:35 xtremoted-20180228
-rw-r--r-- 1 crayadm crayadm    287 Mar  5 13:08 xtremoted.out
-rw-r--r-- 1 crayadm crayadm   1053 Feb 27 00:47 xtremoted.out.1
-rw-r--r-- 1 crayadm crayadm    164 Dec 20 14:25 xtremoted.out.1.gz

root@mutrino-smw:/var/opt/cray/log # cat xtremoted.out

(xtremoted:14152): GLib-CRITICAL **: Source ID 261 was not found when attempting to remove it

(xtremoted:14152): GLib-CRITICAL **: Source ID 1073 was not found when attempting to remove it

(xtremoted:14152): GLib-CRITICAL **: Source ID 1510 was not found when attempting to remove it
root@mutrino-smw:/var/opt/cray/log #
Comment 19 Jason Repik 2018-03-06 07:48:30 MST
For the older node, based on the boot time in the show node output, here's the relevant capmc log data:

<157>1 2018-02-14T06:07:12.804258-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] authorizer_check_token: URI(/capmc/get_ssd_diags) x509 authorization result: success client DN: /C=XX/ST=XX/O=XX/OU=XX/CN=DWS-capmc
<157>1 2018-02-14T06:07:12.804296-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] auth_cb: Remote IP (10.128.0.7) URI(/capmc/get_ssd_diags) request is authorized
<157>1 2018-02-14T07:07:12.801289-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] authorizer_check_token: URI(/capmc/get_ssd_diags) x509 authorization result: success client DN: /C=XX/ST=XX/O=XX/OU=XX/CN=DWS-capmc
<157>1 2018-02-14T07:07:12.801323-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] auth_cb: Remote IP (10.128.0.7) URI(/capmc/get_ssd_diags) request is authorized
<157>1 2018-02-14T07:19:01.362140-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] authorizer_check_token: URI(/capmc/get_node_status) x509 authorization result: success client DN: /C=XX/ST=XX/O=XX/OU=XX/CN=crayadm
<157>1 2018-02-14T07:19:01.362174-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] auth_cb: Remote IP (10.3.1.253) URI(/capmc/get_node_status) request is authorized
<157>1 2018-02-14T07:19:01.488107-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] authorizer_check_token: URI(/capmc/get_mcdram_capabilities) x509 authorization result: success client DN: /C=XX/ST=XX/O=XX/OU=XX/CN=crayadm
<157>1 2018-02-14T07:19:01.488131-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] auth_cb: Remote IP (10.3.1.253) URI(/capmc/get_mcdram_capabilities) request is authorized
<157>1 2018-02-14T07:19:01.721332-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] authorizer_check_token: URI(/capmc/get_mcdram_cfg) x509 authorization result: success client DN: /C=XX/ST=XX/O=XX/OU=XX/CN=crayadm
<157>1 2018-02-14T07:19:01.721350-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] auth_cb: Remote IP (10.3.1.253) URI(/capmc/get_mcdram_cfg) request is authorized
<189>1 2018-02-14T07:19:01.729821-07:00 mutrino-smw xtremoted_dbutil 43394 - [llm_status@34] craylog: message type: hss_xtremoted, connected to port: 5177
<189>1 2018-02-14T07:19:01.729880-07:00 mutrino-smw xtremoted_dbutil 43394 - [llm_connection_test@34] craylog: testing connectivity
<157>1 2018-02-14T07:19:02.324278-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] authorizer_check_token: URI(/capmc/get_numa_capabilities) x509 authorization result: success client DN: /C=XX/ST=XX/O=XX/OU=XX/CN=crayadm
<157>1 2018-02-14T07:19:02.324311-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] auth_cb: Remote IP (10.3.1.253) URI(/capmc/get_numa_capabilities) request is authorized
<157>1 2018-02-14T07:19:02.388290-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] authorizer_check_token: URI(/capmc/get_numa_cfg) x509 authorization result: success client DN: /C=XX/ST=XX/O=XX/OU=XX/CN=crayadm
<157>1 2018-02-14T07:19:02.388305-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] auth_cb: Remote IP (10.3.1.253) URI(/capmc/get_numa_cfg) request is authorized
<189>1 2018-02-14T07:19:02.396868-07:00 mutrino-smw xtremoted_dbutil 43401 - [llm_status@34] craylog: message type: hss_xtremoted, connected to port: 5177
<189>1 2018-02-14T07:19:02.396932-07:00 mutrino-smw xtremoted_dbutil 43401 - [llm_connection_test@34] craylog: testing connectivity
<157>1 2018-02-14T08:07:12.805593-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] authorizer_check_token: URI(/capmc/get_ssd_diags) x509 authorization result: success client DN: /C=XX/ST=XX/O=XX/OU=XX/CN=DWS-capmc
<157>1 2018-02-14T08:07:12.805626-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] auth_cb: Remote IP (10.128.0.7) URI(/capmc/get_ssd_diags) request is authorized
Comment 20 Felip Moll 2018-03-06 08:31:24 MST
(In reply to Jason Repik from comment #19)
> For the older node, based on the boot time in the show node output, here's the relevant capmc log data:
> [...]

Jason,

By 'capmc strings' I was referring to the logs generated by capmc_resume.c and capmc_suspend.c and written to the Logfile defined in knl_cray.conf, or to the slurmctld log if none is defined.

To get these, set the debug flags to 'NodeFeatures' and the log level to debug2.

Sorry, I should have specified this better in comment 13.

Of course, correlating the Slurm log files with xtremoted.out would be useful.
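
For reference, the knl_cray.conf log setup mentioned above might look like the following sketch (the paths are placeholders, not this site's actual values):

```
# Sketch of a knl_cray.conf fragment; when LogFile is set, capmc_resume /
# capmc_suspend messages go there instead of the slurmctld log.
CapmcPath=/opt/cray/capmc/default/bin/capmc
LogFile=/var/spool/slurm/log/slurm_capmc.log
```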
Comment 22 Jason Repik 2018-03-06 11:01:15 MST
For our current node (nid00194) I don't see any info:

sdb:/etc/opt/slurm # tail /var/spool/slurm/log/slurm_capmc.log
[2018-03-03T14:34:24.908] debug:  capmc_resume[15749]: key=on not found in nid specification
[2018-03-03T14:35:10.028] debug:  capmc_resume[15749]: key=on not found in nid specification
[2018-03-03T14:35:55.148] debug:  capmc_resume[15749]: key=on not found in nid specification
[2018-03-05T13:05:33.985] debug:  capmc_resume[25610]: set_mcdram_cfg sent to 192
[2018-03-05T13:05:34.535] debug:  capmc_resume[25610]: set_numa_cfg sent to 192
[2018-03-05T13:05:34.851] debug:  capmc_resume[25610]: node_reinit sent to 192
[2018-03-05T13:05:34.852] debug:  Munge authentication plugin loaded
[2018-03-05T13:06:19.972] debug:  capmc_resume[25610]: key=on not found in nid specification
[2018-03-05T13:07:05.096] debug:  capmc_resume[25610]: key=on not found in nid specification
[2018-03-05T13:07:50.221] debug:  capmc_resume[25610]: key=on not found in nid specification

For the older node (nid00203) we just see the reinit:

sdb:/etc/opt/slurm # grep "^\[2018\-02\-14" /var/spool/slurm/log/slurm_capmc.log
[2018-02-14T09:25:41.026] debug:  capmc_resume[13783]: node_reinit sent to 203-208,239-242,268-271
[2018-02-14T09:26:26.147] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:27:11.271] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:27:56.391] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:28:41.675] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:29:26.799] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:30:11.920] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:30:57.043] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:31:42.163] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:32:27.283] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:33:12.403] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:33:57.523] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:34:42.643] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:35:27.763] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:36:12.883] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:36:58.003] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:37:43.123] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:38:28.243] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:39:13.363] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:39:58.483] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:40:43.603] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:41:28.723] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:42:13.843] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:42:58.963] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:43:44.083] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:44:29.203] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:45:14.323] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:45:59.443] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:46:44.563] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:47:29.688] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:48:14.811] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:48:59.935] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:49:45.055] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:50:30.175] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:51:15.295] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:52:00.415] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:52:45.535] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:53:30.655] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:54:15.779] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:55:00.899] debug:  capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:55:46.020] debug:  capmc_resume[13783]: key=on not found in nid specification
Comment 23 Felip Moll 2018-03-07 06:52:10 MST
(In reply to Jason Repik from comment #22)
> For our current node (nid00194) I don't see any info:
> 
> sdb:/etc/opt/slurm # tail /var/spool/slurm/log/slurm_capmc.log
> [2018-03-03T14:34:24.908] debug:  capmc_resume[15749]: key=on not found in
> nid specification


Thanks for the log, this is what I needed.

This 'key=on not found in nid specification' message comes from _json_parse_nids(), called from _wait_all_nodes_on().

If you look at the timestamps, the messages for the older node last for exactly 30 minutes, which means _wait_all_nodes_on() reached its timeout while trying to confirm the 'on' status of all nodes.

_wait_all_nodes_on() issues a 'capmc node_status' and, while nodes remain in its 'node_bitmap' and 30 minutes have not passed, it keeps parsing the JSON received from capmc to determine whether those nodes are 'on'. When a node is detected as 'on', it is removed from node_bitmap.

So here your capmc_resume is stuck for 30 minutes.
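That polling behavior can be sketched roughly in shell. This is an illustrative approximation of the C logic in capmc_resume.c, not the actual code: 'query_node_status' is a stub standing in for the real 'capmc node_status' call, and the JSON shape it returns is an assumption.

```shell
#!/bin/sh
# Approximation of _wait_all_nodes_on(): poll node status until every
# pending nid reports 'on' or a hard 30-minute deadline passes.
query_node_status() {
    # Stub: report all queried nids as 'on'. A real call would invoke capmc.
    echo '{"e":0,"err_msg":"","on":[203,204]}'
}

deadline=$(( $(date +%s) + 30 * 60 ))   # 30-minute hard timeout
pending="203 204"                       # nids still waiting to come up

while [ -n "$pending" ] && [ "$(date +%s)" -lt "$deadline" ]; do
    json=$(query_node_status)
    still=""
    for nid in $pending; do
        # Drop a nid from the pending set once it appears in the "on" list
        if ! printf '%s' "$json" | grep -q "\"on\":\[[^]]*\b$nid\b"; then
            still="$still $nid"
        fi
    done
    pending=$still
    [ -n "$pending" ] && sleep 45   # the log shows roughly 45s between polls
done

if [ -n "$pending" ]; then
    echo "timeout: nids$pending never reported on"
else
    echo "all nodes on"
fi
```

When the JSON never contains the 'on' key for the pending nids (as in your log), the loop spins for the full 30 minutes before giving up.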


I am trying to reproduce it now.
Comment 24 Felip Moll 2018-03-08 04:50:24 MST
Hi Jason,

I reproduced your problem.

What is happening is not KNL-specific; it is a power-saving 'issue'.

To test, I have this config:

SuspendProgram          = /nfs/virsh/slurm_suspend.sh
ResumeProgram           = /nfs/virsh/slurm_resume.sh

ResumeTimeout           = 70 sec
SuspendTime             = 30000000 sec
SuspendTimeout          = 30 sec (default)

My slurm_resume.sh starts the node, blocks for 120 seconds, and then starts slurmd -b.
The same approach works if I first start the node, then start 'slurmd' (no -b), and then block.
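A minimal version of such a resume script might look like the sketch below. The node-start command and the stall length are placeholders (the actual reproduction stalled for 120 seconds, past the 70-second ResumeTimeout); 'start_node' is a hypothetical helper, not a Slurm-provided command.

```shell
#!/bin/sh
# Sketch of a ResumeProgram that reproduces the stall: start the node,
# block long enough to exceed ResumeTimeout, then start slurmd -b.
nodes="${1:-moll3}"
STALL_SECONDS=2   # the actual reproduction used 120; shortened here

start_node() {
    # Placeholder for whatever powers the node on (virsh start, capmc, ...).
    echo "starting node(s): $1"
}

start_node "$nodes"
sleep "$STALL_SECONDS"   # resume program "stuck" here past ResumeTimeout
echo "slurmd -b would be started on $nodes here"
```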

This simulates capmc_resume when the node never reaches the 'on' state, i.e. when it is not rebooted but slurmd is started on it.

To simulate the KNL capmc behavior I first have to force nodes into the suspended state, so I temporarily set SuspendTime to 10 seconds, wait for the nodes to be suspended (stop slurmd and power off/suspend the node), and then restore the SuspendTime parameter (scontrol reconfig).

The timeline is:
- Nodes suspended
- srun -w moll3 hostname
- 'node_reinit' done, resume program is stuck, slurmd -b not started yet | other option: 'node_reinit' done, slurmd started without reboot (no -b), resume program stuck
- Resume timeout reached, node set to DOWN*+POWER and Reason: resume timeout reached
- scontrol update node state=resume
- Node set to IDLE*+POWER, or with the other option IDLE+POWER
If first option:
- Capmc finishes, slurmd -b started
- IDLE+POWER

Then there's no way to exit IDLE+POWER unless a job is launched and the resume completes successfully within the ResumeTimeout window.

The key to making it fail here is to exhaust ResumeTimeout and only then start slurmd -b.


In your situation:

- You launch a job
- capmc does the node_reinit, but the node doesn't reboot in time, as the capmc logs show
- The state is set to IDLE+POWER

To exit IDLE+POWER, the ResumeProgram (capmc) must be able to finish correctly in time (2h in your setup).

Does it make sense?

I will look into whether it is possible to work around this issue by clearing the reboot bitmaps when slurmd registers.
Comment 25 Jason Repik 2018-03-08 06:43:56 MST
The ResumeProgram is part of capmc, correct?  So I'm not
sure why this would be timing out.  The reboot of a node
is in the 10-12 minute time frame.  Is it just not reporting
back correctly?
Comment 26 Felip Moll 2018-03-08 09:53:46 MST
(In reply to Jason Repik from comment #25)
> The ResumeProgram is part of capmc, correct?  So I'm not
> sure why this would be timing out.  The reboot of a node
> is in the 10-12 minute time frame.  Is it just not reporting
> back correctly?

ResumeProgram is capmc_resume, a Cray contrib program compiled within Slurm, and the capmc_resume you've configured as ResumeProgram calls the official 'capmc'.

The timeout is hardcoded in capmc_resume.c and set to 30 minutes. The problem is that it makes consecutive calls to 'capmc node_status' and never gets a correct response (one containing the 'on' string that reflects the node's status).

You should be able to run "capmc node_status ..." the same way capmc_resume.c does, and you will see the JSON response it emits.

Responding to your question: yes, it is just not reporting back correctly for the failing nodes.
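As a quick sanity check, the response can be inspected for the 'on' key the code expects. The sample JSON below is an assumed approximation of capmc's node_status output, not a captured response:

```shell
#!/bin/sh
# Mimic the check capmc_resume.c performs on the node_status JSON:
# look for the "on" key. The sample response format is an assumption.
response='{"e":0,"err_msg":"","on":[194],"off":[199]}'

if printf '%s' "$response" | grep -q '"on"'; then
    echo "nodes reported on"
else
    echo "key=on not found in nid specification"
fi
```

On a failing node, the real response would lack the 'on' entry, producing the 'key=on not found' message seen in slurm_capmc.log.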
Comment 32 Jason Repik 2018-03-08 13:43:36 MST
Thanks for the additional explanation.
Comment 33 Felip Moll 2018-03-09 03:04:06 MST
Jason,

If you still have nodes in this state, try to clean them (reboot and start slurmd) and then issue:

 scontrol update NodeName=xyz state=power_down

Tell me if this returns the node to a normal state.
Comment 34 Jason Repik 2018-03-13 08:10:33 MDT
sdb:/cray_home/crayadm # scontrol show node=nid00198
NodeName=nid00198 Arch=x86_64 CoresPerSocket=68
   CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=1.35
   AvailableFeatures=knl,compute,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
   ActiveFeatures=knl,compute,cache,quad
   Gres=craynetwork:4,hbm:0
   NodeAddr=nid00198 NodeHostName=nid00198 Version=17.02
   OS=Linux RealMemory=96559 AllocMem=0 FreeMem=92068 Sockets=1 Boards=1
   State=IDLE*+POWER ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=knl,all,ccm_queue 
   BootTime=2018-03-12T07:12:30 SlurmdStartTime=2018-03-12T07:22:02
   CfgTRES=cpu=272,mem=96559M
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

sdb:/cray_home/crayadm # scontrol update NodeName=nid00198 state=power_down
sdb:/cray_home/crayadm # 

State changes:
State -->    State=IDLE*    -->   State=DOWN*

Staying at 'DOWN*'
Comment 35 Jason Repik 2018-03-13 08:32:06 MDT
I take that back.  It eventually changed to IDLE, apparently because it is now allocated:

 sdb:/cray_home/crayadm # scontrol show node=nid00198
NodeName=nid00198 Arch=x86_64 CoresPerSocket=68
   CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=0.02
   AvailableFeatures=knl,compute,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
   ActiveFeatures=knl,compute,cache,quad
   Gres=craynetwork:4,hbm:0
   NodeAddr=nid00198 NodeHostName=nid00198 Version=17.02
   OS=Linux RealMemory=96559 AllocMem=0 FreeMem=92120 Sockets=1 Boards=1
   State=ALLOCATED ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=knl,all,ccm_queue 
   BootTime=2018-03-12T07:12:30 SlurmdStartTime=2018-03-12T07:22:02
   CfgTRES=cpu=272,mem=96559M
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

sdb:/cray_home/crayadm #
Comment 36 Jason Repik 2018-03-13 08:34:51 MDT
We have one more node in that state if you want me to try anything else.

sdb:/cray_home/crayadm # scontrol show node=nid00199 |grep State
   State=IDLE*+POWER ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A MCS_label=N/A
sdb:/cray_home/crayadm #
Comment 37 Felip Moll 2018-03-13 09:06:47 MDT
(In reply to Jason Repik from comment #36)
> We have one more node in that state if you want me to try anything else.
> 
> sdb:/cray_home/crayadm # scontrol show node=nid00199 |grep State
>    State=IDLE*+POWER ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A
> MCS_label=N/A
> sdb:/cray_home/crayadm #

That's enough for me. I have the problem identified and need to discuss internally what to do about it.

At this point, the diagnosis is what we discussed before: capmc is failing in some way, and since the node cannot be brought up correctly within ResumeTimeout, it remains in this state.

The way to fix it is to make capmc work correctly, or to clean the node manually. To clean it manually, start slurmd with the -b flag or reboot the node, and then clear the power flag with 'scontrol update NodeName=nid00198 state=power_down'.

Does it make sense to you?
Comment 38 Felip Moll 2018-03-26 11:59:12 MDT
Hi Jason,

Did you see more nodes in this state, and if so, did the workaround I described work in all cases?

Thanks
Comment 40 Jason Repik 2018-03-27 06:59:52 MDT
We've actually just gone through a software update (Cray CLE6.0UP05) and updated our Slurm in the process.  We're now running 17.02.10-SSE.0.  The workaround was working prior to the update.  We have not had any more nodes in that state since the update, though.
Comment 41 Felip Moll 2018-03-27 07:09:19 MDT
(In reply to Jason Repik from comment #40)
> We've actually just went through a software update (Cray CLE6.0UP05) and
> updated our slurm in the process.  We're now running 17.02.10-SSE.0.  The
> work around was working prior to the update.  We have not had any more nodes
> in that state since the update though.

Thank you. I think the analysis and diagnosis were correct and will also apply to your upgraded version. After discussing internally, we agreed to just document this situation, since it is uncommon and a sporadic capmc failure.

I have added some documentation lines in commit 391415ad for 17.11.6.

I hope it will help in the future.

I am closing the issue now; if you have any further questions, don't hesitate to ask us again.

Thanks for your patience
Regards,
Felip