| Summary: | Slurm State=IDLE*+POWER | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Jason Repik <jjrepik> |
| Component: | KNL | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | Priority: | --- |
| Version: | 17.02.9 | Version Fixed: | 17.11.6 |
| Hardware: | Cray XC | OS: | Linux |
| Site: | Sandia National Laboratories | CC: | felip.moll |
| Attachments: | slurm.conf, partial slurmctld.log | | |
Description
Jason Repik
2018-02-14 07:42:45 MST
(In reply to Jason Repik from comment #0)
> We have a set of KNL nodes in the following state.
> This has persisted through a full system reboot too.
>
> sinfo for the nodes:
> knl up 1-infini 1-00:00:00 272 1:68:4 14 idle~ nid00[203-208,239-242,268-271]
>
> sdb:~ # scontrol show node=nid00203
> NodeName=nid00203 Arch=x86_64 CoresPerSocket=68
> CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=1.12
> AvailableFeatures=knl,compute,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
> ActiveFeatures=knl,compute,cache,quad
> Gres=craynetwork:4,hbm:0
> NodeAddr=nid00203 NodeHostName=nid00203 Version=17.02
> OS=Linux RealMemory=96559 AllocMem=0 FreeMem=92035 Sockets=1 Boards=1
> State=IDLE*+POWER ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A MCS_label=N/A
> Partitions=knl,all,ccm_queue
> BootTime=2018-02-14T07:09:21 SlurmdStartTime=2018-02-14T07:18:59
> CfgTRES=cpu=272,mem=96559M
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
> After rebooting nid00203 individually, here's its slurmd.log:
>
> nid00203:/var/spool/slurm # cat slurmd.log
> [2018-02-14T07:18:59.631] Message aggregation disabled
> [2018-02-14T07:18:59.676] Resource spec: Reserved system memory limit not configured for this node
> [2018-02-14T07:18:59.688] slurmd version 17.02.9 started
> [2018-02-14T07:18:59.689] error: No /var/spool/slurm/job_container_state file for job_container/cncu state recovery
> [2018-02-14T07:18:59.689] core_spec/cray: init
> [2018-02-14T07:18:59.697] slurmd started on Wed, 14 Feb 2018 07:18:59 -0700
> [2018-02-14T07:18:59.698] CPUs=272 Boards=1 Sockets=1 Cores=68 Threads=4 Memory=96559 TmpDisk=48279 Uptime=578 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
> nid00203:/var/spool/slurm #
>
> I'm not familiar with this state and I'm not sure how to get them out of it.

Felip Moll:
Hi Jason,

IDLE*+POWER (seen as idle~ in sinfo) indicates that the nodes have booted up and are in a power-saving mode; I suppose the nodes were booted for some job request and that job then finished.

Has 'SuspendTimeout' been reached for these nodes? Can you successfully send jobs to them? Have you seen any indication that the SuspendProgram has problems running on these nodes?

Not directly related to your question, but I want to mention bug 4536 here: when dealing with power-saving features you should be aware of this bug, solved in 17.02.10 - https://github.com/SchedMD/slurm/commit/c43df3a78b0f235f3502a395321690e492dc7bcc

Jason Repik:
Again, these nodes were in this state prior to our DST yesterday. After a full system reboot, these nodes are still in the IDLE*+POWER state. It was my understanding that they should have returned to a 'normal' state after being rebooted. From what I can tell via 'capmc', the power setting on those nodes is no different from the other nodes. I was able to submit a job that used those nodes as part of the request; however, all of those nodes are now stuck in a 'completing' state. I'm not aware of any issues with the SuspendProgram, but to be fair, I'm not sure what it does or where it runs.
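For context, the power-saving cycle Felip describes (boot on job request, suspend after idling) is driven by a few slurm.conf parameters. A minimal sketch with illustrative values, not Sandia's actual settings (only the 2-hour ResumeTimeout is taken from later in this thread):

# slurm.conf excerpt - power saving on a Cray KNL system (illustrative)
SuspendProgram=/opt/slurm/default/bin/capmc_suspend  # assumed install path
ResumeProgram=/opt/slurm/default/bin/capmc_resume    # reboots nodes, sets MCDRAM/NUMA modes
SuspendTime=1800       # idle seconds before a node is powered down
SuspendTimeout=120     # seconds allowed for SuspendProgram to complete
ResumeTimeout=7200     # seconds allowed for ResumeProgram (2 h on this system)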
Jason Repik:
Rebooted one of the nodes and no change:

nid00203:~ # cat /var/spool/slurm/slurmd.log
[2018-02-14T10:27:54.699] Message aggregation disabled
[2018-02-14T10:27:54.744] Resource spec: Reserved system memory limit not configured for this node
[2018-02-14T10:27:54.756] slurmd version 17.02.9 started
[2018-02-14T10:27:54.757] error: No /var/spool/slurm/job_container_state file for job_container/cncu state recovery
[2018-02-14T10:27:54.757] core_spec/cray: init
[2018-02-14T10:27:54.765] slurmd started on Wed, 14 Feb 2018 10:27:54 -0700
[2018-02-14T10:27:54.766] CPUs=272 Boards=1 Sockets=1 Cores=68 Threads=4 Memory=96559 TmpDisk=48279 Uptime=578 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
nid00203:~ #

sdb:~ # scontrol show node=nid00203
NodeName=nid00203 Arch=x86_64 CoresPerSocket=68
CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=0.53
AvailableFeatures=knl,compute,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
ActiveFeatures=knl,compute,cache,quad
Gres=craynetwork:4,hbm:0
NodeAddr=nid00203 NodeHostName=nid00203 Version=17.02
OS=Linux RealMemory=96559 AllocMem=0 FreeMem=92084 Sockets=1 Boards=1
State=IDLE+COMPLETING ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A MCS_label=N/A
Partitions=knl,all,ccm_queue
BootTime=2018-02-14T10:18:16 SlurmdStartTime=2018-02-14T10:27:54
CfgTRES=cpu=272,mem=96559M
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

sdb:~ #
sdb:~ # scontrol update nodename=nid00203 state=resume
slurm_update error: Invalid node state specified
sdb:~ #

Felip Moll:
(In reply to Jason Repik from comment #3)

With respect to the IDLE+COMPLETING problem, you've certainly hit bug 4536.

This is resolved in commit 7d24678477d49c5305792b00883bf79c55ee3307, in the upcoming 17.02.10 release. You can safely apply it now if you need it.

I am still investigating why your nodes remained in IDLE*+POWER status.
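For sites building from source, a sketch of pulling in that single fix ahead of the release (the tag name is an assumption based on SchedMD's usual tagging scheme; sites on vendor RPMs would instead carry it as a patch):

# Apply the bug 4536 fix to a local 17.02.9 checkout.
git clone https://github.com/SchedMD/slurm.git
cd slurm
git checkout slurm-17-02-9-1   # assumed tag for the running 17.02.9 release
git cherry-pick 7d24678477d49c5305792b00883bf79c55ee3307
# Rebuild and redeploy slurmctld/slurmd afterwards.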
Felip Moll:
(In reply to Felip Moll from comment #4)

By the way, could you check the status of the filesystems, etc. on those nodes and send back the slurmd and slurmctld logs?

> I'm not aware of any issues with the SuspendProgram but to be fair, I'm not
> sure what it does or where it runs.

The SuspendProgram runs on the nodes after SuspendTimeout and puts the nodes offline. Would it be possible for you to send slurm.conf?

So:
- Check filesystems and node health in general (dmesg, etc.)
- Attach the slurmd and slurmctld logs and slurm.conf
- To bring the nodes up, set them to the down state, then restart slurmd, then set them to resume (a sketch follows below); tell me if it works.
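A minimal sketch of that three-step recovery, assuming nid00203 as the target and that slurmd is restarted however the compute-node image normally launches it (systemctl is only an example):

# Step 1: force the node down, with a free-form reason string.
scontrol update NodeName=nid00203 State=DOWN Reason="clearing stuck POWER state"
# Step 2: restart slurmd on the node itself.
ssh nid00203 systemctl restart slurmd
# Step 3: return the node to service.
scontrol update NodeName=nid00203 State=RESUME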
Created attachment 6153 [details]
slurm.conf
Jason Repik:
slurmd.log for the node that was rebooted:

nid00203:~ # cat /var/spool/slurm/slurmd.log
[2018-02-14T10:27:54.699] Message aggregation disabled
[2018-02-14T10:27:54.744] Resource spec: Reserved system memory limit not configured for this node
[2018-02-14T10:27:54.756] slurmd version 17.02.9 started
[2018-02-14T10:27:54.757] error: No /var/spool/slurm/job_container_state file for job_container/cncu state recovery
[2018-02-14T10:27:54.757] core_spec/cray: init
[2018-02-14T10:27:54.765] slurmd started on Wed, 14 Feb 2018 10:27:54 -0700
[2018-02-14T10:27:54.766] CPUs=272 Boards=1 Sockets=1 Cores=68 Threads=4 Memory=96559 TmpDisk=48279 Uptime=578 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
nid00203:~ #

The only entries for and after the reboot of the node in the slurmctld log:

[2018-02-14T10:27:18.962] error: Nodes nid00203 not responding
[2018-02-14T10:36:49.854] Invalid node state transition requested for node nid00203 from=COMPLETING to=RESUME
[2018-02-14T10:36:49.855] _slurm_rpc_update_node for nid00203: Invalid node state specified

- Check filesystems and node health in general (dmesg, etc.)
  * Everything on the node was fine.
- To put nodes up, set them to down state, then restart slurmd, then set them to resume, tell me if it works.
  * This procedure worked for one of the nodes. Here is a cut/paste from my screen after I completed the procedure to get it back to idle:

sdb:/var/spool/slurm/log # scontrol show node=nid00203
NodeName=nid00203 Arch=x86_64 CoresPerSocket=68
CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=0.02
AvailableFeatures=knl,compute,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
ActiveFeatures=knl,compute,cache,quad
Gres=craynetwork:4,hbm:0
NodeAddr=nid00203 NodeHostName=nid00203 Version=17.02
OS=Linux RealMemory=96559 AllocMem=0 FreeMem=92093 Sockets=1 Boards=1
State=IDLE ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A MCS_label=N/A
Partitions=knl,all,ccm_queue
BootTime=2018-02-14T10:18:17 SlurmdStartTime=2018-02-15T09:58:21
CfgTRES=cpu=272,mem=96559M
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
sdb:/var/spool/slurm/log #

Somehow the node went back to IDLE+COMPLETING again:

sdb:/var/spool/slurm/log # scontrol show node=nid00203
NodeName=nid00203 Arch=x86_64 CoresPerSocket=68
CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=0.01
AvailableFeatures=knl,compute,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
ActiveFeatures=knl,compute,cache,quad
Gres=craynetwork:4,hbm:0
NodeAddr=nid00203 NodeHostName=nid00203 Version=17.02
OS=Linux RealMemory=96559 AllocMem=0 FreeMem=92090 Sockets=1 Boards=1
State=IDLE+COMPLETING ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A MCS_label=N/A
Partitions=knl,all,ccm_queue
BootTime=2018-02-14T10:18:16 SlurmdStartTime=2018-02-15T09:58:21
CfgTRES=cpu=272,mem=96559M
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
sdb:/var/spool/slurm/log #

No new entries in slurmd.log:

nid00203:/var/spool/slurm # cat slurmd.log
[2018-02-14T10:27:54.699] Message aggregation disabled
[2018-02-14T10:27:54.744] Resource spec: Reserved system memory limit not configured for this node
[2018-02-14T10:27:54.756] slurmd version 17.02.9 started
[2018-02-14T10:27:54.757] error: No /var/spool/slurm/job_container_state file for job_container/cncu state recovery
[2018-02-14T10:27:54.757] core_spec/cray: init
[2018-02-14T10:27:54.765] slurmd started on Wed, 14 Feb 2018 10:27:54 -0700
[2018-02-14T10:27:54.766] CPUs=272 Boards=1 Sockets=1 Cores=68 Threads=4 Memory=96559 TmpDisk=48279 Uptime=578 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2018-02-15T09:58:18.812] core_spec/cray: fini
[2018-02-15T09:58:18.841] Slurmd shutdown completing
[2018-02-15T09:58:21.059] Message aggregation disabled
[2018-02-15T09:58:21.103] Resource spec: Reserved system memory limit not configured for this node
[2018-02-15T09:58:21.114] slurmd version 17.02.9 started
[2018-02-15T09:58:21.116] error: No /var/spool/slurm/job_container_state file for job_container/cncu state recovery
[2018-02-15T09:58:21.116] core_spec/cray: init
[2018-02-15T09:58:21.123] slurmd started on Thu, 15 Feb 2018 09:58:21 -0700
[2018-02-15T09:58:21.124] CPUs=272 Boards=1 Sockets=1 Cores=68 Threads=4 Memory=96559 TmpDisk=48279 Uptime=85204 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
nid00203:/var/spool/slurm #

Created attachment 6158 [details]
partial slurmctld.log
grep nid00203 slurmctld.log_partial.txt
This part is where I manually set the node to 'suspect' in xtprocadmin
and ran the node health checker against it:
[2018-02-15T09:43:48.991] update_node: node nid00203 reason set to: suspect by xtprocadmin
[2018-02-15T09:43:48.992] update_node: node nid00203 state set to DRAINING
[2018-02-15T09:44:11.065] update_node: node nid00203 state set to IDLE
This is the section where I set it to down, then restarted slurmd on nid00203:
[2018-02-15T09:57:52.580] update_node: node nid00203 reason set to: cause
[2018-02-15T09:57:52.582] update_node: node nid00203 state set to DOWN
[2018-02-15T09:59:08.596] update_node: node nid00203 state set to IDLE
[2018-02-15T09:59:09.225] Node nid00203 now responding
At this point it showed as 'IDLE'. At some point later it switched back to
'IDLE+COMPLETING' again.
Felip Moll:
Jason, I did some more tests and this is what I found.

With your configuration, a node can be set into the idle~ state only when a job is allocated and the capmc command fails to set the correct Cluster or Memory mode for one or more nodes of the allocation, or node_reinit fails to execute. In this case the capmc_resume program (the one you've set in slurm.conf) resets the node state to IDLE+POWER. 'capmc' strings in the log file should indicate what exactly happened; setting the 'NodeFeatures' debug flag will help.

After this, a job may be executed on the node and then cancelled/requeued because it wasn't able to run, since the node was not being rebooted (again because of the capmc problem). At this point you could have hit bug 4536, as I commented before: when a job is cancelled or requeued while the node is not ready, the node is left in the 'completing' state inconsistently.

The IDLE+POWER status is not necessarily bad: when a job is allocated to a node in this state, the node will be switched back to IDLE if everything runs fine.

I suggest you:
a) Apply commit 7d24678477d49c5305792b00883bf79c55ee3307.
b) Verify how capmc could be failing on these nodes. For this, set the debug flags to 'NodeFeatures' and the debug level to debug2; you should then see 'capmc: ' strings in the log file. Try, for example, sending a job to the node asking for different Cluster or Memory modes, and monitor what happens to the node. Is it being rebooted? Is the mode changed? Is there a 'capmc: ' string in the log file?

Jason Repik:
We typically get our Slurm RPMs spun from LANL. They are working on integrating the patch and spinning us a new set.

At this point, though: the system had filesystem issues this past weekend (power related), and we had a full system reboot yesterday. We have not seen any of the IDLE+ states since then, and reprovisioning seems to be working normally too.

Felip Moll:
(In reply to Jason Repik from comment #14)

It is good that the nodes are reestablished back to normal; this confirms some of my thoughts about the problems with capmc. Don't you have by chance any of these capmc strings in log files from other possible past situations? I am interested in investigating this, since it is not the first time I see it.
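As an aside, the extra logging Felip suggests in (b) can be turned on at runtime without restarting slurmctld; a sketch (the node name and constraint string are examples, and KNL constraint syntax should be checked against the Slurm KNL guide):

# Raise controller verbosity and enable KNL node-feature debugging.
scontrol setdebug debug2
scontrol setdebugflags +NodeFeatures
# Exercise a mode change: ask for a different MCDRAM mode than the active one.
srun -C 'knl&flat&quad' -w nid00194 hostname
# Then watch for 'capmc: ' lines in the slurmctld log (or the knl_cray.conf LogFile).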
Felip Moll:
> Don't you have by chance any of these capmc strings in log files from other
> possible past situations? I am interested in investigating this since it is
> not the first time I see it.
Hi Jason,
Since no more replies have been made here, I assume everything is working as expected. If you have any capmc logs for my own investigation, I would appreciate them.
If not, is it OK for you to close this bug?
Thanks
Jason Repik:
We do have another node in that state, and I'm trying to find where that node 'changed' in the capmc log.

mutrino-int:/opt/modules/default/etc # scontrol show node=nid00194
NodeName=nid00194 Arch=x86_64 CoresPerSocket=68
CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=1.14
AvailableFeatures=knl,compute,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
ActiveFeatures=knl,compute,cache,a2a
Gres=craynetwork:4,hbm:0
NodeAddr=nid00194 NodeHostName=nid00194 Version=17.02
OS=Linux RealMemory=96559 AllocMem=0 FreeMem=92063 Sockets=1 Boards=1
State=IDLE*+POWER ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A MCS_label=N/A
Partitions=knl,all,ccm_queue
BootTime=2018-03-05T09:13:48 SlurmdStartTime=2018-03-05T09:23:21
CfgTRES=cpu=272,mem=96559M
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
mutrino-int:/opt/modules/default/etc #

According to the boot time info, our xtremoted.out should have the right data, but there is no valid data:

-rw-rw-r-- 1 root crayadm 33042 Feb 28 21:35 xtremoted-20180228
-rw-r--r-- 1 crayadm crayadm 287 Mar 5 13:08 xtremoted.out
-rw-r--r-- 1 crayadm crayadm 1053 Feb 27 00:47 xtremoted.out.1
-rw-r--r-- 1 crayadm crayadm 164 Dec 20 14:25 xtremoted.out.1.gz

root@mutrino-smw:/var/opt/cray/log # cat xtremoted.out
(xtremoted:14152): GLib-CRITICAL **: Source ID 261 was not found when attempting to remove it
(xtremoted:14152): GLib-CRITICAL **: Source ID 1073 was not found when attempting to remove it
(xtremoted:14152): GLib-CRITICAL **: Source ID 1510 was not found when attempting to remove it
root@mutrino-smw:/var/opt/cray/log #

For the older node, based on the boot time in the show node output, here's the relevant capmc log data:

<157>1 2018-02-14T06:07:12.804258-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] authorizer_check_token: URI(/capmc/get_ssd_diags) x509 authorization result: success client DN: /C=XX/ST=XX/O=XX/OU=XX/CN=DWS-capmc
<157>1 2018-02-14T06:07:12.804296-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] auth_cb: Remote IP (10.128.0.7) URI(/capmc/get_ssd_diags) request is authorized
<157>1 2018-02-14T07:07:12.801289-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] authorizer_check_token: URI(/capmc/get_ssd_diags) x509 authorization result: success client DN: /C=XX/ST=XX/O=XX/OU=XX/CN=DWS-capmc
<157>1 2018-02-14T07:07:12.801323-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] auth_cb: Remote IP (10.128.0.7) URI(/capmc/get_ssd_diags) request is authorized
<157>1 2018-02-14T07:19:01.362140-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] authorizer_check_token: URI(/capmc/get_node_status) x509 authorization result: success client DN: /C=XX/ST=XX/O=XX/OU=XX/CN=crayadm
<157>1 2018-02-14T07:19:01.362174-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] auth_cb: Remote IP (10.3.1.253) URI(/capmc/get_node_status) request is authorized
<157>1 2018-02-14T07:19:01.488107-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] authorizer_check_token: URI(/capmc/get_mcdram_capabilities) x509 authorization result: success client DN: /C=XX/ST=XX/O=XX/OU=XX/CN=crayadm
<157>1 2018-02-14T07:19:01.488131-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] auth_cb: Remote IP (10.3.1.253) URI(/capmc/get_mcdram_capabilities) request is authorized
<157>1 2018-02-14T07:19:01.721332-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] authorizer_check_token: URI(/capmc/get_mcdram_cfg) x509 authorization result: success client DN: /C=XX/ST=XX/O=XX/OU=XX/CN=crayadm
<157>1 2018-02-14T07:19:01.721350-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] auth_cb: Remote IP (10.3.1.253) URI(/capmc/get_mcdram_cfg) request is authorized
<189>1 2018-02-14T07:19:01.729821-07:00 mutrino-smw xtremoted_dbutil 43394 - [llm_status@34] craylog: message type: hss_xtremoted, connected to port: 5177
<189>1 2018-02-14T07:19:01.729880-07:00 mutrino-smw xtremoted_dbutil 43394 - [llm_connection_test@34] craylog: testing connectivity
<157>1 2018-02-14T07:19:02.324278-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] authorizer_check_token: URI(/capmc/get_numa_capabilities) x509 authorization result: success client DN: /C=XX/ST=XX/O=XX/OU=XX/CN=crayadm
<157>1 2018-02-14T07:19:02.324311-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] auth_cb: Remote IP (10.3.1.253) URI(/capmc/get_numa_capabilities) request is authorized
<157>1 2018-02-14T07:19:02.388290-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] authorizer_check_token: URI(/capmc/get_numa_cfg) x509 authorization result: success client DN: /C=XX/ST=XX/O=XX/OU=XX/CN=crayadm
<157>1 2018-02-14T07:19:02.388305-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] auth_cb: Remote IP (10.3.1.253) URI(/capmc/get_numa_cfg) request is authorized
<189>1 2018-02-14T07:19:02.396868-07:00 mutrino-smw xtremoted_dbutil 43401 - [llm_status@34] craylog: message type: hss_xtremoted, connected to port: 5177
<189>1 2018-02-14T07:19:02.396932-07:00 mutrino-smw xtremoted_dbutil 43401 - [llm_connection_test@34] craylog: testing connectivity
<157>1 2018-02-14T08:07:12.805593-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] authorizer_check_token: URI(/capmc/get_ssd_diags) x509 authorization result: success client DN: /C=XX/ST=XX/O=XX/OU=XX/CN=DWS-capmc
<157>1 2018-02-14T08:07:12.805626-07:00 mutrino-smw xtremoted 26663 - [hss_xtremoted@34] auth_cb: Remote IP (10.128.0.7) URI(/capmc/get_ssd_diags) request is authorized

Felip Moll:
(In reply to Jason Repik from comment #19)

Jason, by 'capmc strings' I was referring to the log lines generated by capmc_resume.c and capmc_suspend.c and written to the LogFile defined in knl_cray.conf (or to the slurmctld log if none is defined). To get these, set the debug flags to 'NodeFeatures' and the debug level to debug2. Sorry, I should have specified it better in my comment 13. Of course, correlating the Slurm log files with xtremoted.out would be useful.
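For a dedicated log for those messages, knl_cray.conf accepts a LogFile entry; a sketch with illustrative values (the LogFile path matches where Jason's slurm_capmc.log shows up below; the other values are the documented defaults, so verify against your knl_cray.conf man page):

# knl_cray.conf excerpt (illustrative)
CapmcPath=/opt/cray/capmc/default/bin/capmc    # capmc wrapper used by Slurm
CapmcTimeout=10000                             # milliseconds per capmc call
LogFile=/var/spool/slurm/log/slurm_capmc.log   # where capmc_resume debug lines land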
Jason Repik:
For our current node (nid00194) I don't see any info:

sdb:/etc/opt/slurm # tail /var/spool/slurm/log/slurm_capmc.log
[2018-03-03T14:34:24.908] debug: capmc_resume[15749]: key=on not found in nid specification
[2018-03-03T14:35:10.028] debug: capmc_resume[15749]: key=on not found in nid specification
[2018-03-03T14:35:55.148] debug: capmc_resume[15749]: key=on not found in nid specification
[2018-03-05T13:05:33.985] debug: capmc_resume[25610]: set_mcdram_cfg sent to 192
[2018-03-05T13:05:34.535] debug: capmc_resume[25610]: set_numa_cfg sent to 192
[2018-03-05T13:05:34.851] debug: capmc_resume[25610]: node_reinit sent to 192
[2018-03-05T13:05:34.852] debug: Munge authentication plugin loaded
[2018-03-05T13:06:19.972] debug: capmc_resume[25610]: key=on not found in nid specification
[2018-03-05T13:07:05.096] debug: capmc_resume[25610]: key=on not found in nid specification
[2018-03-05T13:07:50.221] debug: capmc_resume[25610]: key=on not found in nid specification

For the older node (nid00203) we just see the reinit:

sdb:/etc/opt/slurm # grep "^\[2018\-02\-14" /var/spool/slurm/log/slurm_capmc.log
[2018-02-14T09:25:41.026] debug: capmc_resume[13783]: node_reinit sent to 203-208,239-242,268-271
[2018-02-14T09:26:26.147] debug: capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:27:11.271] debug: capmc_resume[13783]: key=on not found in nid specification
[... the same 'key=on not found in nid specification' line repeats every ~45 seconds ...]
[2018-02-14T09:55:00.899] debug: capmc_resume[13783]: key=on not found in nid specification
[2018-02-14T09:55:46.020] debug: capmc_resume[13783]: key=on not found in nid specification

Felip Moll:
(In reply to Jason Repik from comment #22)

Thanks for the log; this is what I needed.

This 'key=on not found in nid specification' message comes from _json_parse_nids(), called from _wait_all_nodes_on(). If you look at the times, you see the messages for the older node last for exactly 30 minutes, which means that _wait_all_nodes_on() reached its timeout while trying to get the 'on' status of all nodes.

_wait_all_nodes_on() issues a 'capmc node_status' and, while it still has nodes in its 'node_bitmap' list and 30 minutes have not passed, it keeps parsing the JSON received from capmc to identify whether those nodes are 'on' or not. When a node is detected as 'on', it is removed from node_bitmap. So here your capmc_resume was stuck for 30 minutes. I am trying to reproduce it now.

Felip Moll:
Hi Jason, I reproduced your problem. What is happening is not KNL specific but a power-saving issue. To test, I have this configuration:

SuspendProgram = /nfs/virsh/slurm_suspend.sh
ResumeProgram = /nfs/virsh/slurm_resume.sh
ResumeTimeout = 70 sec
SuspendTime = 30000000 sec
SuspendTimeout = 30 sec (default)

My slurm_resume.sh starts the node, gets stuck for 120 seconds, then starts slurmd -b. The same approach works if I first start the node, then start slurmd (without -b), then get stuck. This simulates capmc_resume when the node is not reaching the 'on' state, i.e. when it is not rebooted but slurmd is started on it.

To simulate the KNL capmc behavior, I first have to force the nodes into the suspend state, so I temporarily set SuspendTime to 10 seconds, wait for the nodes to be suspended (stop slurmd and poweroff/suspend the node), then restore the SuspendTime parameter (scontrol reconfig).
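A hypothetical sketch of such a test resume script (Felip's actual slurm_resume.sh is not attached; the virsh-based boot and the paths are assumptions matching his /nfs/virsh naming):

#!/bin/bash
# Test ResumeProgram mimicking Felip's description: boot the node, stall
# past ResumeTimeout, then start slurmd with -b (the "booted" flag).
# Slurm passes the name(s) of the nodes to resume as the argument.
NODE="$1"
virsh start "$NODE"      # assumption: the test nodes are libvirt VMs
sleep 120                # simulate capmc being stuck (> ResumeTimeout of 70 s)
ssh "$NODE" slurmd -b    # register with slurmctld as freshly booted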
The timeline is:
- Nodes suspended
- srun -w moll3 hostname
- 'node_reinit' done, resume program stuck, slurmd -b not started yet (other option: 'node_reinit' done, slurmd started without reboot (no -b), resume program stuck)
- Resume timeout reached, node set to DOWN*+POWER with Reason: resume timeout reached
- scontrol update node state=resume
- Node set to IDLE*+POWER (or, in the other option, IDLE+POWER)

If the first option:
- capmc finishes, slurmd -b started
- IDLE+POWER

Then there is no way to exit IDLE+POWER unless a job is launched and completes successfully within the ResumeTimeout window. The way to make it fail here is to exhaust ResumeTimeout and then start slurmd -b.

In your situation:
- You launch a job
- capmc does the node_reinit but the node doesn't reboot in time, as the capmc logs show
- The state is set to IDLE+POWER

To exit from IDLE+POWER, the ResumeProgram (capmc) must be able to finish correctly in time (2 h in your setup). Does it make sense? I will look into whether it is possible to work around this issue by clearing the reboot bitmaps when slurmd registers.

Jason Repik:
The ResumeProgram is part of capmc, correct? So I'm not sure why this would be timing out. The reboot of a node is in the 10-12 minute time frame. Is it just not reporting back correctly?

Felip Moll:
(In reply to Jason Repik from comment #25)

ResumeProgram is capmc_resume, which comes from a Cray contrib compiled within Slurm, and the capmc_resume you've configured in ResumeProgram calls the official 'capmc'. The timeout is hardcoded in capmc_resume.c and set to 30 minutes. The problem is that it makes consecutive calls to 'capmc node_status' and does not get a correct response (one containing the 'on' string that reflects the status of the node). You should be able to run "capmc node_status ..." the same way capmc_resume.c does, and you will see the JSON response it emits.

Answering your question: yes, it is just not reporting back correctly for the failing nodes.
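A sketch of that manual check (the --nids spelling and the JSON fields follow typical capmc usage and may vary by CLE release; nid 198 is taken from the transcript below):

# Poll the node's power state the same way capmc_resume does.
capmc node_status --nids 198
# Healthy output is JSON listing the nid under the "on" key, e.g.:
#   {"e":0, "err_msg":"", "on":[198]}
# capmc_resume logs "key=on not found" when that key never appears.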
Jason Repik:
Thanks for the additional explanation.

Felip Moll:
Jason, if you still have nodes in this state, try to clean them (reboot and start slurmd) and then issue:

scontrol update NodeName=xyz state=power_down

Tell me if this returns the node to a normal state.

Jason Repik:
sdb:/cray_home/crayadm # scontrol show node=nid00198
NodeName=nid00198 Arch=x86_64 CoresPerSocket=68
CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=1.35
AvailableFeatures=knl,compute,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
ActiveFeatures=knl,compute,cache,quad
Gres=craynetwork:4,hbm:0
NodeAddr=nid00198 NodeHostName=nid00198 Version=17.02
OS=Linux RealMemory=96559 AllocMem=0 FreeMem=92068 Sockets=1 Boards=1
State=IDLE*+POWER ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A MCS_label=N/A
Partitions=knl,all,ccm_queue
BootTime=2018-03-12T07:12:30 SlurmdStartTime=2018-03-12T07:22:02
CfgTRES=cpu=272,mem=96559M
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
sdb:/cray_home/crayadm # scontrol update NodeName=nid00198 state=power_down
sdb:/cray_home/crayadm #

State changes: State --> State=IDLE* --> State=DOWN*

Staying at 'DOWN*'

I take that back. It eventually changed to IDLE, apparently because it is now allocated:

sdb:/cray_home/crayadm # scontrol show node=nid00198
NodeName=nid00198 Arch=x86_64 CoresPerSocket=68
CPUAlloc=0 CPUErr=0 CPUTot=272 CPULoad=0.02
AvailableFeatures=knl,compute,flat,split,equal,cache,a2a,snc2,snc4,hemi,quad
ActiveFeatures=knl,compute,cache,quad
Gres=craynetwork:4,hbm:0
NodeAddr=nid00198 NodeHostName=nid00198 Version=17.02
OS=Linux RealMemory=96559 AllocMem=0 FreeMem=92120 Sockets=1 Boards=1
State=ALLOCATED ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A MCS_label=N/A
Partitions=knl,all,ccm_queue
BootTime=2018-03-12T07:12:30 SlurmdStartTime=2018-03-12T07:22:02
CfgTRES=cpu=272,mem=96559M
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
sdb:/cray_home/crayadm #

We have one more node in that state if you want me to try anything else.

sdb:/cray_home/crayadm # scontrol show node=nid00199 | grep State
State=IDLE*+POWER ThreadsPerCore=4 TmpDisk=48279 Weight=1 Owner=N/A MCS_label=N/A
sdb:/cray_home/crayadm #

Felip Moll:
(In reply to Jason Repik from comment #36)

It's enough for me. I have the problem identified and must discuss internally what to do about it.

At this point, the diagnosis is what we discussed before: capmc is failing in some way, and since the node cannot be brought up correctly during ResumeTimeout, it remains in this state. The way to fix it is to make capmc work correctly, or to manually clean the node. To manually clean it, you can start slurmd with the -b flag or reboot the node, and then clear the power flag with 'scontrol update NodeName=nid00198 state=power_down'. Does it make sense to you?

Felip Moll:
Hi Jason, did you see more nodes in this state, and if so, did the described workaround work in all cases? Thanks

Jason Repik:
We actually just went through a software update (Cray CLE 6.0UP05) and updated our Slurm in the process. We're now running 17.02.10-SSE.0. The workaround was working prior to the update. We have not had any more nodes in that state since the update, though.

Felip Moll:
(In reply to Jason Repik from comment #40)

Thank you. I think the analysis and diagnosis were correct and will also apply to your upgraded version. After discussing internally, we agreed to just document this situation, since it is uncommon and a sporadic capmc failure. I have documented some lines in commit 391415ad for 17.11.6. I hope it will help in the future.

I am closing the issue now; if you have any further questions, don't hesitate to ask us again.

Thanks for your patience.

Regards,
Felip
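For future readers, the manual cleanup Felip describes, condensed into a short sketch (node name is illustrative; restart slurmd however your compute-node image normally launches it):

# On the stuck node: restart slurmd with the "booted" flag (or fully reboot it).
ssh nid00199 'pkill slurmd; slurmd -b'
# On the controller: clear the POWER flag.
scontrol update NodeName=nid00199 state=power_down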