Hello,

I'm fairly certain this is a regression in slurm 17.02.9 (or .8) relative to .7. Since updating to slurm 17.02.9 yesterday, I'm noticing that node reboots can leave nodes in a completing state (with no CG job) if the user cancels the job during a reboot.

ctl1:/var/tmp/slurm # sacct -j 9001462 --format=nodelist,start,end,state -p
NodeList|Start|End|State|
nid[11904-11919,11924-11971]|2017-12-15T13:33:21|2017-12-15T13:54:28|CANCELLED by 57075|
ctl1:/var/tmp/slurm #
ctl1:/var/tmp/slurm # sinfo --node=nid[11904-11919,11924-11971] -p system
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
system       up 1-12:00:00     64   comp nid[11904-11919,11924-11971]
ctl1:/var/tmp/slurm #
ctl1:/var/tmp/slurm # squeue --node=nid[11904-11919,11924-11971]
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
ctl1:/var/tmp/slurm #
ctl1:/var/tmp/slurm # xzcat slurmctld.log* | grep 9001462
xzcat: slurmctld.log: File format not recognized
[2017-12-15T13:29:20.397] debug: Job 9001462 feature list: knl&quad&cache
[2017-12-15T13:29:20.414] _slurm_rpc_submit_batch_job JobId=9001462 usec=17846
[2017-12-15T13:29:49.205] backfill test for JobID=9001462 Prio=69120 Partition=knl
[2017-12-15T13:31:15.556] backfill test for JobID=9001462 Prio=69121 Partition=knl
[2017-12-15T13:33:21.392] backfill test for JobID=9001462 Prio=69121 Partition=knl
[2017-12-15T13:33:21.423] email msg to rhager@pppl.gov: SLURM Job_id=9001462 Name=xgca_jet_85794 Began, Queued time 00:04:01
[2017-12-15T13:33:21.515] backfill: Started JobId=9001462 in knl on nid[11904-11919,11924-11971]
[2017-12-15T13:54:28.978] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 9001462 uid 57075
[2017-12-15T13:54:28.978] email msg to rhager@pppl.gov: SLURM Job_id=9001462 Name=xgca_jet_85794 Ended, Run time 00:21:07, CANCELLED, ExitCode 0
[2017-12-15T13:54:28.978] debug: jobcomp/nersc: took 0.0000s to collect data for job 9001462
[2017-12-15T13:54:29.072] _job_signal: 9 of configuring JobID=9001462 State=0x4 NodeCnt=0 successful

ctl1:/var/tmp/slurm # xzcat slurmctld.log* | grep nid11904
xzcat: slurmctld.log: File format not recognized
[2017-12-14T14:35:58.211] agent/is_node_resp: node:nid11904 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2017-12-14T15:11:22.126] Node nid11904 now responding
[2017-12-14T15:11:22.126] node nid11904 returned to service
[2017-12-14T17:01:06.842] debug: validate_node_specs: node nid11904 registered with 0 jobs
[2017-12-14T17:25:38.325] _update_node_active_features: nodes nid11904 active features set to: quad,cache,knl
[2017-12-14T17:26:02.159] update_node: node nid11904 state set to ALLOCATED
[2017-12-14T17:26:34.243] update_node: node nid11904 state set to ALLOCATED
[2017-12-14T18:10:04.894] Node nid11904 now responding
[2017-12-14T18:50:58.678] debug: validate_node_specs: node nid11904 registered with 0 jobs
[2017-12-15T00:49:33.752] _update_node_active_features: nodes nid11904 active features set to: quad,flat,knl
[2017-12-15T01:31:15.246] Node nid11904 now responding
[2017-12-15T13:33:29.170] _update_node_active_features: nodes nid11904 active features set to: quad,cache,knl
[2017-12-15T13:33:48.779] debug: Still waiting for boot of node nid11904
[2017-12-15T14:06:36.836] Node nid11904 now responding

For now we are detecting this issue, then marking the nodes down, then resuming them.

Thanks,
Doug
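For anyone hitting the same thing, a simplified sketch of the kind of detect-and-cycle loop described above (not our production script; it uses only stock Slurm CLI tools, and any site-specific safety checks are elided):

#!/bin/bash
# Sketch: find nodes stuck in COMPLETING with no corresponding CG job and
# cycle them through DOWN -> RESUME.
for node in $(sinfo --noheader --Node --states=completing --format=%N); do
    # Skip nodes that still have a legitimately completing job on them.
    if [ -z "$(squeue --noheader --states=CG --nodelist="$node")" ]; then
        echo "cycling stuck node $node"
        scontrol update nodename="$node" state=down reason="stuck CG, no CG job"
        scontrol update nodename="$node" state=resume
    fi
done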
nid11904:~ # cat /var/spool/slurmd/nid11904.log
[2017-12-15T14:06:36.345] Message aggregation disabled
[2017-12-15T14:06:36.541] Resource spec: Reserved system memory limit not configured for this node
[2017-12-15T14:06:36.574] slurmd version 17.02.9 started
[2017-12-15T14:06:36.575] error: No /var/spool/slurmd/job_container_state file for job_container/cncu state recovery
[2017-12-15T14:06:36.575] core_spec/cray: init
[2017-12-15T14:06:36.820] slurmd started on Fri, 15 Dec 2017 14:06:36 -0800
[2017-12-15T14:06:36.820] CPUs=272 Boards=1 Sockets=1 Cores=68 Threads=4 Memory=96563 TmpDisk=48281 Uptime=1718 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
nid11904:~ #
We had another strange incident with cori, slurm 17.02.9, and knl mode changes yesterday, involving a mode change of ~3800 KNL nodes.

The Cray system failed to boot the nodes, and eventually they hit the ResumeTimeout. In the past these nodes would be marked Down with reason "Not Responding". Now, in slurm 17.02.9, they were given reason "ResumeTimeout". I assume that this came from commit 7d24678477d49c530.

This created a problem for us: when we manually rebooted the nodes and then resumed them, they did not come back into service. Note that we also restarted slurmctld, since we manually forced the KNL MCDRAM mode back to cache. The nodes had status "Idle~", then "Alloc~", and eventually "Alloc#" (configuring). And yet they never rebooted again. The job finally started after slurmctld received a SIGHUP (issued so we could rotate its logs) and slurmctld initiated a ping. Before that, however, the nodes sat for nearly two hours. In almost the same timeframe I restarted slurmd on _one_ of the 3800 "power saving" nodes with the "-b" flag.

Timeline:
3:58 AM - massive mode change from Cache to Flat initiated
3:59 AM - node reinitialization fails owing to a system management software issue
5:59 AM - nodes get marked down with ResumeTimeout
6:30 AM - nodes get marked down (by me) with reason "reboot:dmj:failed mode change", to meet node maintenance documentation requirements
6:30 AM - I manually use capmc to force the mode back to Cache
6:30 AM - I manually use capmc to initiate a reboot of the nodes
6:35 AM - I manually restart slurmctld so it sees the change in modes
7:10 - 7:25 AM - nodes finish booting
7:30 AM - slurmctld receives a SIGHUP to rotate logs; nodes are still marked down
8:00 AM - full scale knl reservation starts
8:05 AM - I manually resume the nodes
8:06 AM - reservation job starts, ~800 nodes actually reboot (unrelated to my power saving set), ResumeTimeout nodes get "Alloc~" state. Nodes ___slowly___ move from Alloc~ to Alloc#
9:00 AM - We notice that the ResumeTimeout nodes are not actually rebooting, or doing anything at all for that matter
9:27 AM - We get the idea that maybe slurmd -b might fix this
9:30 AM - slurmctld receives a SIGHUP to rotate logs
9:30 AM - I manually run slurmd -b on one node (and then restart slurmd again so it isn't in the sshd systemd cgroups)
9:31 AM - slurmctld initiates a ping, all the nodes wake up
9:31 AM - job starts

So, I see at least two issues here. (1) The nodes were left in a power saving state and there was no obvious way to resume them. (2) Somewhere between the slurmctld restart, the SIGHUP after the nodes were resumed, and the ping, all the nodes were able to wake from this state -- which may be a different bug in power saving.

Remember, I have no interest in the power saving capabilities that slurm provides -- it is not clear how they can be used for real on a cray system -- but the power saving capabilities were leveraged to provide knl mode change support.

Thank you,
Doug
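Roughly, the recovery steps from the timeline amount to the following; the capmc subcommand flag spellings here are illustrative and may differ by capmc version, and the nid list is just an example:

# Hypothetical nid list for illustration:
NIDS=11904-11919,11924-11971

# Force MCDRAM mode back to cache and reboot the nodes out-of-band.
capmc set_mcdram_cfg --nids $NIDS --mode cache
capmc node_reinit --nids $NIDS

# Restart slurmctld so it picks up the externally changed modes,
# then resume the nodes once they finish booting.
systemctl restart slurmctld
scontrol update nodename=nid[11904-11919,11924-11971] state=resume

# Restarting slurmd with -b on one node reported a fresh boot time to
# slurmctld; per the timeline, the subsequent ping then woke everything.
slurmd -b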
Hi Doug,

We seem to have two different problems here.

The first one, from your initial comment, happens even without knl. It occurs with --reboot in 17.02.9 and later, but not in 17.02.8 (same config). The difference is that in 17.02.8, the job and node remain in the CG state until the node comes back to life and slurmctld issues a:

[2017-12-19T14:01:36.395] Resending TERMINATE_JOB request JobId=4 Nodelist=moll1

In 17.02.9 and 17.11, the job ends instantly and the node remains in the CG state forever. I am now looking at the changes between .8 and .9.

The other issue you describe in comment #3 seems to be a different problem. Let me check all of it and I will come back to you when I have more info.

Thanks,
Felip M
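For reference, a minimal reproducer for this first problem looks roughly like the following, assuming a test node whose RebootProgram actually reboots it (the job id is a placeholder):

# Submit a job that forces a node reboot before it starts.
sbatch --reboot --wrap="sleep 600"
# While the node is still rebooting and the job is in CF:
scancel <jobid>
# 17.02.8: job and node stay in CG until the node returns and slurmctld
#          resends TERMINATE_JOB, then both clear.
# 17.02.9/17.11: job ends instantly; the node is stuck in CG forever.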
(In reply to Doug Jacobsen from comment #3)
> We had another strange incident with cori, slurm 17.02.9, and knl mode
> changes yesterday involving a mode change of ~3800 KNL nodes.
>
> The Cray system failed to boot the nodes, and eventually they hit the
> ResumeTimeout. In the past these nodes would be marked Down with reason
> "Not Responding". Now, in slurm 17.02.9 they were given reason
> "ResumeTimeout". I assume that this was from commit 7d24678477d49c530.
>
> This created a problem for us in that when we manually rebooted the nodes
> and then resumed them they did not come back into service.

Hi Doug,

I have a solution pending review and commit for the first issue.

For the second issue, I tried to reproduce it without luck; here is my sequence. Please let me know if this is the same case as the one you are experiencing (albeit with a different result):

1. Nodes are powered off and idle~
2. Start a job with sbatch
3. Nodes become alloc#, job CF
4. Nodes do not boot
5. Nodes are set down*+power (seen as down* in sinfo) with reason ResumeTimeout
6. Job is requeued with a begintime of +120s
7. Issue an scontrol update nodename=xxx state=resume before the 120s elapse
8. Nodes are now idle*+power (seen as idle~ in sinfo)
9. Wait for the begintime
10. Boot of the nodes is attempted again
11. Nodes go to the alloc state and the job to CF
12. Job runs

In this sequence everything went as expected.
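A reproduction like the above relies on power-saving parameters along these lines in slurm.conf; the program paths and timing values below are placeholders rather than my exact test configuration:

# Placeholder power-saving excerpt (slurm.conf):
SuspendTime=60        # idle seconds before a node is powered down
SuspendTimeout=30     # seconds allowed for SuspendProgram to finish
ResumeTimeout=120     # boot seconds allowed before the node is marked
                      # down with reason ResumeTimeout
SuspendProgram=/usr/local/sbin/node_poweroff.sh   # placeholder
ResumeProgram=/usr/local/sbin/node_poweron.sh     # placeholder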
Hi Doug, do you have any feedback about comment #9?
(In reply to Felip Moll from comment #9)
> (In reply to Doug Jacobsen from comment #3)
> > We had another strange incident with cori, slurm 17.02.9, and knl mode
> > changes yesterday involving a mode change of ~3800 KNL nodes.
> >
> > The Cray system failed to boot the nodes, and eventually they hit the
> > ResumeTimeout. In the past these nodes would be marked Down with reason
> > "Not Responding". Now, in slurm 17.02.9 they were given reason
> > "ResumeTimeout". I assume that this was from commit 7d24678477d49c530.
> >
> > This created a problem for us in that when we manually rebooted the nodes
> > and then resumed them they did not come back into service.
>
> Hi Doug,
>
> I have a solution pending review and commit for the first issue.

Doug, the fix for the first problem was committed in c43df3a78b0f235f3502a395321690e492dc7bcc and is available in version 17.02.10.

Any news on the second problem?
Closing bug.

The fix for the first problem was committed in c43df3a78b0f235f3502a395321690e492dc7bcc and is available in version 17.02.10.

Please reopen if you need more clarification.
Would you please attach the patch to ticket #4536 so that we can cherry-pick it into our Cray 17.02.9 build environment? Thank you.
(In reply to S Senator from comment #18)
> Would you please attach the patch to ticket #4536 so that we can cherry-pick
> it into our Cray 17.02.9 build environment? Thank you.

Please grab it from here:

https://github.com/SchedMD/slurm/commit/c43df3a78b0f235f3502a395321690e492dc7bcc.patch
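If it helps, one way to apply it to a local 17.02.9 source tree (the checkout path is just an example):

cd ~/build/slurm-17.02.9    # example checkout location
curl -LO https://github.com/SchedMD/slurm/commit/c43df3a78b0f235f3502a395321690e492dc7bcc.patch
git am c43df3a78b0f235f3502a395321690e492dc7bcc.patch
# or, for a non-git source tree:
# patch -p1 < c43df3a78b0f235f3502a395321690e492dc7bcc.patch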
Created attachment 6163 [details]
c43df3a78b0f235f3502a395321690e492dc7bcc

https://github.com/SchedMD/slurm/commit/c43df3a78b0f235f3502a395321690e492dc7bcc.patch