Ticket 4536 - nodes left in completing state following reboot during cancelled knl-mode-change job
Summary: nodes left in completing state following reboot during cancelled knl-mode-change job
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.02.9
Hardware: Cray XC Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-12-15 17:22 MST by Doug Jacobsen
Modified: 2018-02-16 02:37 MST
4 users

See Also:
Site: NERSC
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 17.02.10
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
c43df3a78b0f235f3502a395321690e492dc7bcc (1.61 KB, patch)
2018-02-16 02:37 MST, Felip Moll
Details | Diff

Description Doug Jacobsen 2017-12-15 17:22:09 MST
Hello,

I'm fairly certain this is a regression with slurm 17.02.9 (or .8) relative to .7.

Since updating to slurm 17.02.9 yesterday I'm noticing that node reboots can leave nodes in a completing state (with no CG job) if the user cancels the job during a reboot.

ctl1:/var/tmp/slurm # sacct -j 9001462 --format=nodelist,start,end,state -p
NodeList|Start|End|State|
nid[11904-11919,11924-11971]|2017-12-15T13:33:21|2017-12-15T13:54:28|CANCELLED by 57075|
ctl1:/var/tmp/slurm #

ctl1:/var/tmp/slurm # sinfo --node=nid[11904-11919,11924-11971] -p system
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
system       up 1-12:00:00     64   comp nid[11904-11919,11924-11971]
ctl1:/var/tmp/slurm #

ctl1:/var/tmp/slurm # squeue --node=nid[11904-11919,11924-11971]
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
ctl1:/var/tmp/slurm #


ctl1:/var/tmp/slurm # xzcat slurmctld.log* | grep 9001462
xzcat: slurmctld.log: File format not recognized
[2017-12-15T13:29:20.397] debug:  Job 9001462 feature list: knl&quad&cache
[2017-12-15T13:29:20.414] _slurm_rpc_submit_batch_job JobId=9001462 usec=17846
[2017-12-15T13:29:49.205] backfill test for JobID=9001462 Prio=69120 Partition=knl
[2017-12-15T13:31:15.556] backfill test for JobID=9001462 Prio=69121 Partition=knl
[2017-12-15T13:33:21.392] backfill test for JobID=9001462 Prio=69121 Partition=knl
[2017-12-15T13:33:21.423] email msg to rhager@pppl.gov: SLURM Job_id=9001462 Name=xgca_jet_85794 Began, Queued time 00:04:01
[2017-12-15T13:33:21.515] backfill: Started JobId=9001462 in knl on nid[11904-11919,11924-11971]
[2017-12-15T13:54:28.978] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 9001462 uid 57075
[2017-12-15T13:54:28.978] email msg to rhager@pppl.gov: SLURM Job_id=9001462 Name=xgca_jet_85794 Ended, Run time 00:21:07, CANCELLED, ExitCode 0
[2017-12-15T13:54:28.978] debug:  jobcomp/nersc: took 0.0000s to collect data for job 9001462
[2017-12-15T13:54:29.072] _job_signal: 9 of configuring JobID=9001462 State=0x4 NodeCnt=0 successful



ctl1:/var/tmp/slurm # xzcat slurmctld.log* | grep nid11904
xzcat: slurmctld.log: File format not recognized
[2017-12-14T14:35:58.211] agent/is_node_resp: node:nid11904 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2017-12-14T15:11:22.126] Node nid11904 now responding
[2017-12-14T15:11:22.126] node nid11904 returned to service
[2017-12-14T17:01:06.842] debug:  validate_node_specs: node nid11904 registered with 0 jobs
[2017-12-14T17:25:38.325] _update_node_active_features: nodes nid11904 active features set to: quad,cache,knl
[2017-12-14T17:26:02.159] update_node: node nid11904 state set to ALLOCATED
[2017-12-14T17:26:34.243] update_node: node nid11904 state set to ALLOCATED
[2017-12-14T18:10:04.894] Node nid11904 now responding
[2017-12-14T18:50:58.678] debug:  validate_node_specs: node nid11904 registered with 0 jobs
[2017-12-15T00:49:33.752] _update_node_active_features: nodes nid11904 active features set to: quad,flat,knl
[2017-12-15T01:31:15.246] Node nid11904 now responding
[2017-12-15T13:33:29.170] _update_node_active_features: nodes nid11904 active features set to: quad,cache,knl
[2017-12-15T13:33:48.779] debug:  Still waiting for boot of node nid11904
[2017-12-15T14:06:36.836] Node nid11904 now responding

For now we are detecting this issue, then marking the nodes down, then resuming them.
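A minimal sketch of that detect-then-cycle workaround (my own assumption, not NERSC's actual tooling): find nodes that sinfo reports as completing but that squeue ties to no job, then mark each down and resume it. `stuck_nodes` is plain text processing on pre-sorted node lists so it can be checked offline; the node-list expansion relies on `scontrol show hostnames`.

```shell
#!/bin/sh
# Sketch of the manual workaround described above: cycle down/resume any
# node stuck in CG with no associated job. Not the site's real script.

# Print names present in the completing list but absent from the busy list.
# Both arguments must be files of sorted, one-per-line node names.
stuck_nodes() {
    comm -23 "$1" "$2"
}

fix_stuck_cg_nodes() {
    comp=$(mktemp); busy=$(mktemp)
    # Nodes currently in completing state, one per line.
    sinfo -N --states=completing --noheader -o '%N' | sort -u > "$comp"
    # Nodes that still have a job on them, expanded from compact nodelists.
    squeue --noheader -o '%N' | xargs -r -n1 scontrol show hostnames \
        | sort -u > "$busy"
    for node in $(stuck_nodes "$comp" "$busy"); do
        scontrol update nodename="$node" state=down reason="stuck CG (bug 4536)"
        scontrol update nodename="$node" state=resume
    done
    rm -f "$comp" "$busy"
}
```

Calling `fix_stuck_cg_nodes` from cron would automate the interim fix until the patched slurmctld is deployed.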


Thanks,
Doug
Comment 1 Doug Jacobsen 2017-12-15 17:22:58 MST
nid11904:~ # cat /var/spool/slurmd/nid11904.log
[2017-12-15T14:06:36.345] Message aggregation disabled
[2017-12-15T14:06:36.541] Resource spec: Reserved system memory limit not configured for this node
[2017-12-15T14:06:36.574] slurmd version 17.02.9 started
[2017-12-15T14:06:36.575] error: No /var/spool/slurmd/job_container_state file for job_container/cncu state recovery
[2017-12-15T14:06:36.575] core_spec/cray: init
[2017-12-15T14:06:36.820] slurmd started on Fri, 15 Dec 2017 14:06:36 -0800
[2017-12-15T14:06:36.820] CPUs=272 Boards=1 Sockets=1 Cores=68 Threads=4 Memory=96563 TmpDisk=48281 Uptime=1718 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
nid11904:~ #
Comment 3 Doug Jacobsen 2017-12-19 06:28:09 MST
We had another strange incident with cori, slurm 17.02.9, and knl mode changes yesterday involving a mode change of ~3800 KNL nodes.

The Cray system failed to boot the nodes, and eventually they hit the ResumeTimeout.  In the past these nodes would be marked Down with reason "Not Responding".  Now, in slurm 17.02.9 they were given reason "ResumeTimeout".  I assume that this was from commit 7d24678477d49c530.

This created a problem for us in that when we manually rebooted the nodes and then resumed them, they did not come back into service.  Note that we also restarted slurmctld since we manually forced the KNL MCDRAM mode back to cache.
The nodes had status "Idle~", then "Alloc~", and eventually "Alloc#" (configuring).  And yet they never rebooted again.  The job finally started after slurmctld received a SIGHUP (sent so we could rotate its logs) and initiated a ping.  Before that, however, the nodes sat for nearly two hours.  In almost the same timeframe I restarted slurmd on _one_ of the 3800 "power saving" nodes with the "-b" flag.


Timeline:
3:58 AM - massive mode change from Cache to Flat initiated
3:59 AM - node reinitialization fails owing to system management software issue
5:59 AM - nodes get marked down with ResumeTimeout
6:30 AM - nodes get marked down (by me) with reason "reboot:dmj:failed mode change", to meet node maintenance documentation requirements
6:30 AM - I manually use capmc to force mode back to Cache
6:30 AM - I manually use capmc to initiate a reboot of the nodes
6:35 AM - I manually restart slurmctld so it sees the change in modes
7:10 - 7:25 AM - nodes finish booting
7:30 AM - slurmctld receives sighup to rotate logs, nodes were still marked down
8:00 AM - full scale knl reservation starts
8:05 AM - I manually resume the nodes
8:06 AM - reservation job starts, ~800 nodes actually reboot (unrelated to my power-saving set); the ResumeTimeout nodes get "Alloc~" state and ___slowly___ move from Alloc~ to Alloc#
9:00 AM - We notice that the ResumeTimeout nodes are not actually rebooting, or doing anything for that matter.
9:27 AM - We get the idea that maybe slurmd -b might fix this
9:30 AM - slurmctld receives sighup to rotate logs
9:30 AM - I manually run slurmd -b on one node (and then restart slurmd again so it isn't in the sshd systemd cgroups)
9:31 AM - slurmctld initiates a ping, all the nodes wake up
9:31 AM - job starts

So, I see at least two issues here.  (1) The nodes were left in a power-saving state with no obvious way to resume them.  (2) Somewhere between the slurmctld restart, the SIGHUP after the nodes were resumed, and the ping, all the nodes were able to wake from this state -- which may be a different bug in power saving.

Remember, I have no interest in the power saving capabilities that slurm provides -- it is not clear how they can be used in practice on a Cray system, but the power saving capabilities were leveraged to provide knl mode change support.

Thank you,
Doug
Comment 5 Felip Moll 2017-12-19 06:40:02 MST
Hi Doug,

We seem to have identified two different problems here. The first one, from your initial comment, happens even without KNL: it occurs with --reboot in 17.02.9 and later, but not in 17.02.8 (same config).

The difference is that in 17.02.8, the job and node remain in CG state until the node comes back to life and slurmctld issues a:

[2017-12-19T14:01:36.395] Resending TERMINATE_JOB request JobId=4 Nodelist=moll1

In 17.02.9 and 17.11, the job ends instantly and the node remains in CG state forever. I am now looking at the changes between .8 and .9.


The other issue you describe in comment #3 seems to be a different problem.

Let me check all of it and I will get back to you when I have more info.

Thanks,
Felip M
Comment 9 Felip Moll 2017-12-21 12:31:41 MST
(In reply to Doug Jacobsen from comment #3)
> We had another strange incident with cori, slurm 17.02.9, and knl mode
> changes yesterday involving a mode change of ~3800 KNL nodes.
> 
> The Cray system failed to boot the nodes, and eventually they hit the
> ResumeTimeout.  In the past these nodes would be marked Down with reason
> "Not Responding".  Now, in slurm 17.02.9 they were given reason
> "ResumeTimeout".  I assume that this was from commit 7d24678477d49c530.
> 
> This created a problem for us in that when we manually rebooted the nodes
> and then resumed them they did not come back into service.

Hi Doug,

I have a solution pending to review and commit for the first issue.

For the second issue, I tried to reproduce it without luck, here is my sequence.

Please let me know if this is the same case as the one you are experiencing (with a different result, though):

1. Nodes are powered off and idle~
2. Start a job with sbatch
3. Nodes become alloc#, job CF
4. Nodes do not boot
5. Nodes are set down*+power (shown as down* in sinfo) with reason ResumeTimeout
6. Job is requeued with a begintime of +120s
7. Issue an scontrol update nodename=xxx state=resume before the 120s elapse.
8. Nodes are now idle*+power (shown as idle~ in sinfo)
9. Wait for the begintime
10. Nodes are booted again
11. Nodes go to alloc state and job to CF
12. Job runs


In this sequence everything went as expected.
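For reference, the operator-driven steps of the sequence above (2, 7, 8) can be sketched as shell commands. This is a dry-run sketch: the node range is hypothetical, and `run` only echoes each command so nothing touches a live cluster until it is replaced with `run() { "$@"; }`.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction sequence; swap in
#   run() { "$@"; }
# to actually execute against a cluster.
run() { printf '+ %s\n' "$*"; }

NODES='nid[11904-11905]'    # hypothetical powered-off nodes

# Step 2: submit a job to the powered-off nodes (job goes CF, nodes alloc#).
run sbatch --wrap=hostname -w "$NODES"

# ...boot fails; nodes hit ResumeTimeout and go down*+power (step 5)...

# Step 7: clear the down state before the requeued job's begintime.
run scontrol update nodename="$NODES" state=resume

# Step 8: verify the nodes read idle~ again.
run sinfo --nodes="$NODES" -o '%N %T'
```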
Comment 12 Felip Moll 2018-01-04 01:53:07 MST
Hi Doug, do you have any feedback about comment #9?
Comment 16 Felip Moll 2018-01-05 15:16:32 MST
(In reply to Felip Moll from comment #9)
> (In reply to Doug Jacobsen from comment #3)
> > We had another strange incident with cori, slurm 17.02.9, and knl mode
> > changes yesterday involving a mode change of ~3800 KNL nodes.
> > 
> > The Cray system failed to boot the nodes, and eventually they hit the
> > ResumeTimeout.  In the past these nodes would be marked Down with reason
> > "Not Responding".  Now, in slurm 17.02.9 they were given reason
> > "ResumeTimeout".  I assume that this was from commit 7d24678477d49c530.
> > 
> > This created a problem for us in that when we manually rebooted the nodes
> > and then resumed them they did not come back into service.
> 
> Hi Doug,
> 
> I have a solution pending to review and commit for the first issue.
> 

Doug, the fix for the first problem was committed in c43df3a78b0f235f3502a395321690e492dc7bcc and is available in version 17.02.10.

Any news about the second problem?
Comment 17 Felip Moll 2018-01-08 09:14:34 MST
Closing bug.

The fix for the first problem was committed in c43df3a78b0f235f3502a395321690e492dc7bcc and is available in version 17.02.10.

Please reopen if you need more clarification.
Comment 18 S Senator 2018-02-15 15:00:28 MST
Would you please attach the patch to ticket #4536 so that we can cherry-pick it into our Cray 17.02.9 build environment? Thank you.
Comment 19 Felip Moll 2018-02-16 02:36:36 MST
(In reply to S Senator from comment #18)
> Would you please attach the patch to ticket #4536 so that we can cherry-pick
> it into our Cray 17.02.9 build environment? Thank you.

Please grab it from here:

https://github.com/SchedMD/slurm/commit/c43df3a78b0f235f3502a395321690e492dc7bcc.patch
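One typical way to cherry-pick that patch into a 17.02.9 source tree (a sketch on my part, not an official procedure; the directory name is an assumption, and `run` only echoes the commands so they can be reviewed before being executed for real):

```shell
#!/bin/sh
# Dry-run sketch of applying the fix to a local 17.02.9 source tree.
# Replace run with:  run() { "$@"; }  to execute.
run() { printf '+ %s\n' "$*"; }

COMMIT=c43df3a78b0f235f3502a395321690e492dc7bcc

run cd slurm-17.02.9     # assumed unpacked source directory
run curl -LO "https://github.com/SchedMD/slurm/commit/${COMMIT}.patch"
run patch -p1 -i "${COMMIT}.patch"   # or `git am` in a git checkout
```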
Comment 20 Felip Moll 2018-02-16 02:37:41 MST
Created attachment 6163 [details]
c43df3a78b0f235f3502a395321690e492dc7bcc

https://github.com/SchedMD/slurm/commit/c43df3a78b0f235f3502a395321690e492dc7bcc.patch