Running a slightly-modified b5ace9a808.

At 03:16, logrotate issued a "slurm reconfig". Here's the beginning of the new slurmctld.log:

> [2014-05-13T03:16:01.690] error: read_slurm_conf: default partition not set.
> [2014-05-13T03:16:01.704] restoring original state of nodes
> [2014-05-13T03:16:01.704] restoring original partition state
> [2014-05-13T03:16:01.747] cons_res: select_p_node_init
> [2014-05-13T03:16:01.747] cons_res: preparing for 75 partitions
> [2014-05-13T03:16:01.832] Killing job 1624824 on DOWN node clus658

The only problem was, job 1624824 wasn't running on clus658! It was running on clus645.

As proof, I offer the previous entry for this job in the pre-rotation log:

> [2014-05-12T23:10:47.737] sched: Allocate JobId=1624824 NodeList=clus645 #CPUs=40

This also matches what is in the job's log file:

> == JOB START HOST=clus645 DATE=Fri May 9 14:29:16 WST 2014 ==
(In reply to Phil Schwan from comment #0)
> Running a slightly-modified b5ace9a808.
>
> At 03:16, logrotate issued a "slurm reconfig".

Note, this does not save/restore state from disk. It is continuing to work with the job table already in memory.

> Here's the beginning of the new slurmctld.log:
>
> > [2014-05-13T03:16:01.690] error: read_slurm_conf: default partition not set.
> > [2014-05-13T03:16:01.704] restoring original state of nodes
> > [2014-05-13T03:16:01.704] restoring original partition state
> > [2014-05-13T03:16:01.747] cons_res: select_p_node_init
> > [2014-05-13T03:16:01.747] cons_res: preparing for 75 partitions
> > [2014-05-13T03:16:01.832] Killing job 1624824 on DOWN node clus658
>
> The only problem was, job 1624824 wasn't running on clus658! It was running
> on clus645.
>
> As proof, I offer the previous entry for this job in the pre-rotation log:
>
> > [2014-05-12T23:10:47.737] sched: Allocate JobId=1624824 NodeList=clus645 #CPUs=40

Did you happen to add or remove compute nodes in your slurm.conf and reconfigure? Were there other changes in slurm.conf?

> This also matches what is in the job's log file:
>
> > == JOB START HOST=clus645 DATE=Fri May 9 14:29:16 WST 2014 ==

Note, this is a different time (and date) than the record in the slurmctld log file, so I presume the job has been requeued at least once since the time this job log was written.

For what it's worth, Slurm is reconfigured on hundreds of computers every day and I've never heard of such a failure.
(In reply to Moe Jette from comment #1)
> Did you happen to add or remove compute nodes in your slurm.conf and
> reconfigure? Were there other changes in slurm.conf?

It looks like there was _some_ change -- the slurm.conf had been modified since the previous reconfigure, 24 hours ago -- but I'll have to ask Stu what it was. I would be very surprised if it were a node change.

> Note, this is a different time (and date) than the record in the slurmctld
> log file, so I presume the job has been requeued at least once since the
> time this job log was written.

Good point! No, it appears that it's a bug in our logger: it's printing the *submit* time instead of the *start* time. That's consistent with its first appearance in the slurmctld log, and the other tasks for this job all have that May 9 date. I don't think it was requeued.
(In reply to Phil Schwan from comment #2) > > I would be very surprised if it were a node change. I stand corrected. It looks like there were node changes -- we finally got around to removing some nodes from the slurm.conf. Mostly, but not entirely, those that had never been online (i.e. had failed before we switched to slurm).
David,

You did some work related to adding and removing nodes on a reconfiguration. Can you pursue this?
Sure. Phil, could you please update me on this problem? I was away on a business trip.

Thanks,
David
I think it's all here...

Comment 0 sets the scene.

Comment 2 clarifies that the job was not requeued (the time in the log written by our script was the submit time, not the start time).

Comment 3 answers Moe's question in the affirmative: nodes were removed from the slurm.conf prior to this reconfig. Although, for the avoidance of doubt, neither of the nodes mentioned in comment 0 was amongst the nodes that were removed.

Please let us know if you need anything else!
To rule out the possibility that the job was requeued, this command will tell us for sure:

sacct -D -j 1624824 -o jobid,nodelist,alloccpus,partition,submit,start,end

A requeued job is basically resubmitted, so multiple submission times indicate a requeue (unless the jobid rolled over).

If hosts are added to or removed from slurm.conf, the controller must be restarted; if not, the code detects the inconsistency and slurmctld aborts. Perhaps your watchdog restarted it? Do you find these messages in your slurmctld.log:

> error: _compare_hostnames: node count has changed before reconfiguration from 4
> to 3. You have to restart slurmctld.
> fatal: read_slurm_conf: hostnames inconsistency detected

Upon restart the node bitmaps slurmctld uses are rebuilt, so it is hard to understand how a job could end up on a wrong host.

Do you frequently add/remove hosts from the slurm.conf? If you do, could you please tell us why, since we think it should be an infrequent operation?

Thanks,
David
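The requeue heuristic David describes can be sketched in a few lines. This is a toy illustration with made-up rows shaped like the sacct output (not a real sacct parser): a requeued job is effectively resubmitted, so distinct Submit times on the job's own records indicate a requeue.

```python
# Toy illustration of the requeue heuristic: distinct Submit times on a
# job's whole-job records (not its ".batch"/".extern" steps) => requeue.

rows = [
    # (jobid, submit) -- made-up rows in the shape of the sacct output
    ("1624824",      "2014-05-09T14:29:16"),
    ("1624824.bat+", "2014-05-12T23:10:47"),
    ("1624824",      "2014-05-13T03:16:11"),
]

# Keep only whole-job records; step records carry a "." in their jobid.
submit_times = {submit for jobid, submit in rows if "." not in jobid}
requeued = len(submit_times) > 1

print(requeued)  # more than one distinct submit time => requeued
```

The jobid-rollover caveat above still applies: if job IDs wrap around, two unrelated jobs can share an ID and fake a "requeue" in this check.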
> # sacct -D -j 1624824 -o jobid,nodelist,alloccpus,partition,submit,start,end
> JobID NodeList AllocCPUS Partition Submit Start End
> ------------ --------------- ---------- ---------- ------------------- ------------------- -------------------
> 1624824 None assigned 1 teamfrase+ 2014-05-09T14:29:16 2014-05-13T00:10:47 2014-05-13T00:10:47
> 1624824.bat+ clus645 1 2014-05-12T23:10:47 2014-05-12T23:10:47 2014-05-13T03:16:09
> 1624824 None assigned 1 teamfrase+ 2014-05-13T03:16:11 2014-05-13T07:23:43 2014-05-13T07:23:43
> 1624824.bat+ clus639 1 2014-05-13T07:23:43 2014-05-13T07:23:43 2014-05-13T07:23:43
> 1624824 None assigned 1 teamfrase+ 2014-05-13T07:23:44 2014-05-14T06:50:06 2014-05-14T06:50:06
> 1624824.bat+ clus639 1 Unknown Unknown 2014-05-13T07:23:45

The first entry (on clus645) was the one that was running when the event happened.

> If hosts are added or removed from slurm.conf the controller must be restarted,
> if not the code detects the inconsistency and the slurmctld aborts, perhaps
> your watchdog restarted it?

This slurm reconfig occurred as a result of logrotate running; of that I'm certain. Unfortunately, I now only have logs going back to 11 May (I've changed our logrotate to keep much more in future).

Having dug more deeply, I think it's possible that *this* slurm reconfig wasn't actually the one that removed nodes. I now think that probably several reconfigs occurred after the node list was changed, before the one in which this problem occurred. I asked the IT team about a particular date range, someone answered the question with an event that occurred before that range, and I didn't pick it up. I apologise if I've sent you down any dead ends.

> Do you find these messages in your slurmctld.log:
>
> error: _compare_hostnames: node count has changed before reconfiguration from 4
> to 3. You have to restart slurmctld.
> fatal: read_slurm_conf: hostnames inconsistency detected

Nope; I can't find any fatals in our logs.
> Do you frequently add/remove hosts from the slurm.conf? and if you do
> could you please tell us why since we think it should be an infrequent
> operation.

No, definitely not. As far as I know, this was the first and only time that we've changed the list of hosts. And per above, I'm no longer even convinced that it happened on this reconfig (in fact, I think it very likely didn't).

Interestingly, that log message -- "Killing job j on DOWN node n" -- has appeared only once in the 12 days of logs that we have. This at least reassures me that it's not happening regularly.
Hi,

it is hard to tell what happened at this point. Clearly the job appears to be running on clus645. Is it possible there were some name resolution problems which would cause clus645 to be resolved as clus658? (Just an idea.)

David
I have discovered and fixed a possible scenario responsible for this problem. It would be very difficult to reproduce and, given some other recent changes by David, should not be able to happen again, but this is cleaner code in any case. See:

https://github.com/SchedMD/slurm/commit/f8a38c4826969d9ad4c2637d1dbdd35329c2821e
(In reply to David Bigagli from comment #9)
> it is hard to tell what happen at this point. Clearly the job appears to
> be running on clus645. Is it possible there were some name resolution
> problems which would cause clus645 to be resolved as clus658? (just an idea..)

I ... really can't see how :)
Have you installed the patch sent by Moe on 5/27? Has this problem happened again since?

David
I don't think we have yet, but I expect that we will today, along with the couple other patches you've provided for other issues.
This might be due to both a primary and a backup slurmctld running at the same time and assigning duplicate job IDs. Also see bug 860.
Were you running the backup controller at the time when this happened? David
It was sufficiently long ago that the logs are gone, so I'm not sure we can prove it one way or the other. But it's possible -- we only gave up on the backup controller as a concept last week.
Let's close this for now and reopen if necessary. David
Great news, that I'm sure you're all totally excited to hear! It's recurred!

I'll start by answering some of the questions from older comments:

(In reply to David Bigagli from comment #12)
> Have you installed the patch sent by Moe on 5/27? Has this problem happened
> again after?

This has now happened with 16.05.7.

(In reply to David Bigagli from comment #15)
> Were you running the backup controller at the time when this happened?

We no longer ever run a backup controller, so there's no risk of conflict there.

(In reply to Moe Jette from comment #1)
> Did you happen to add or remove compute nodes in your slurm.conf and
> reconfigure? Were there other changes in slurm.conf?

YES -- we had definitely added (and possibly removed, though I don't think so) nodes in the vicinity of this event. I don't know if this was the first "slurm reconfigure" since the nodes were added, but this was a week when we were adding dozens of nodes a day.

=====

Executive summary:

- At 03:00, logrotate runs, decides that it needs to rotate slurmctld.log, and afterward runs "slurm reconfigure".
- Post-reconfigure, slurmctld gets confused about which jobs are running on which nodes, and in this case kills 3 of them.
- Worst of all, the jobs that get killed disappear completely from the queue, which for us is the cardinal sin. As you'll recall, we've organised everything (via the epilog, and with your help) to ensure that a task should NEVER be removed from the queue except in two situations: successful completion, or explicit user cancellation.
Logs:
=====

slurmctld (redacted to include the highlights plus anything referencing job 5078025 or hnod0750):

> [2017-03-18T03:00:01.428] Reconfigure signal (SIGHUP) received
> [2017-03-18T03:00:01.941] Killing job 5078025 on DOWN node hnod0797
> [2017-03-18T03:00:01.969] _sync_nodes_to_comp_job: Job 5078025 in completing state
> [2017-03-18T03:00:01.969] error: Node hnod0750 run_job_cnt underflow in make_node_comp
> [2017-03-18T03:00:01.969] error: Node hnod0750 no_share_job_cnt underflow in make_node_comp
> [2017-03-18T03:00:09.722] Resending TERMINATE_JOB request JobId=5078025 Nodelist=hnod0750
> [2017-03-18T03:00:17.317] job_complete: JobID=5051858_222(5078025) State=0x8007 NodeCnt=1 WTERMSIG 15

Note that on the second line, slurmctld says that 5078025 is running on hnod0797 (incorrect). By the fifth line, it seems to know that it's running on hnod0750 (correct). But it's too late -- it's already decided to kill it.

Job 5078025 (task 222):

> slurmstepd: error: *** JOB 5078025 ON hnod0750 CANCELLED AT 2017-03-18T03:00:01 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

hnod0750 syslog:

> 2017-03-18T03:00:01-05:00 hnod0750 slurmd[25427]: debug: _rpc_terminate_job, uid = 601
> 2017-03-18T03:00:01-05:00 hnod0750 slurmd[25427]: debug: task_p_slurmd_release_resources: 5078025
> 2017-03-18T03:00:01-05:00 hnod0750 slurmd[25427]: debug: credential for job 5078025 revoked
> 2017-03-18T03:00:01-05:00 hnod0750 slurmstepd[145483]: debug: Handling REQUEST_SIGNAL_CONTAINER
> 2017-03-18T03:00:01-05:00 hnod0750 slurmstepd[145483]: debug: _handle_signal_container for step=5078025.4294967294 uid=0 signal=998
> 2017-03-18T03:00:01-05:00 hnod0750 slurmstepd[145483]: error: *** JOB 5078025 ON hnod0750 CANCELLED AT 2017-03-18T03:00:01 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

(Should we happen to need more client-side data, tasks 225 and 310 show exactly the same on hnod0878 and hnod0804.)
Meanwhile, on hnod0797, we see these weird kill signals with invalid task IDs:

> 2017-03-18T03:00:01-05:00 hnod0797 slurmstepd[240359]: debug: Handling REQUEST_STATE
> 2017-03-18T03:00:01-05:00 hnod0797 slurmstepd[240359]: debug: Handling REQUEST_STEP_TERMINATE
> 2017-03-18T03:00:01-05:00 hnod0797 slurmstepd[240359]: debug: _handle_terminate for step=5078025.4294967294 uid=0
> 2017-03-18T03:00:01-05:00 hnod0797 slurmstepd[240359]: Sent SIGKILL signal to 5078025.4294967294

...and it continues to get that command once per second for the next 20 HOURS, until the node is rebooted, which presumably stopped slurmctld's attempted murder spree.

slurmctld, it will surprise nobody, was still confused more than a day later:

> [2017-03-19T13:46:29.649] error: Orphan job 5078025.batch reported on node hnod0797
> [2017-03-19T13:46:29.649] debug: Aborting job 5078025 on node hnod0797

> $ sacct -j 5078025 -a -D --format "jobid,jobname,end,exitcode,nodelist"
> JobID JobName End ExitCode NodeList
> ------------ ---------- ------------------- -------- ---------------
> 5051858_222 200_SWB16+ 2017-03-18T03:00:01 1:0 hnod0750
> 5051858_222+ batch 2017-03-18T03:00:17 0:15 hnod0750

In closing:

- The most important problem is that tasks are vanishing from the queue (very similar to bug 1889 and bug 1985 in that regard). Our entire workflow relies on the queueing system keeping accurate track of what has and has not run successfully.
- The second problem is that tasks get killed at all. This is bad because it obviously wastes cluster time, but at least it would be cleanly (indeed, automatically) recoverable if it were treated like a normal node failure and the tasks placed back in the queue to be rescheduled.
- Finally, there are the endless once-per-second kill signals being sent for tasks that don't exist. Apart from filling up logs, this may not be especially harmful, but it's certainly a troubling indicator of confusion.
> > Did you happen to add or remove compute nodes in your slurm.conf and
> > reconfigure? Were there other changes in slurm.conf?
>
> YES -- we had definitely added (and possibly removed, though I don't think
> so) nodes in the vicinity of this event. I don't know if this was the first
> "slurm reconfigure" since the nodes were added, but this was a week when we
> were adding dozens of nodes a day.

Can you walk me through the potential scenario here? I suspect this is the root cause of the problem -- if the Node definitions in slurm.conf are out of sync across the cluster, bad things will happen.

Is it possible that the slurm.conf on the master node was changed, but slurmctld had not been restarted until that reconfigure?

I'm guessing that slurmctld itself was SIGHUP'd for log rotation?
(In reply to Tim Wickberg from comment #19)
> Can you walk me through the potential scenario here? I suspect this is the
> root cause of the problem - if the Node definitions in slurm.conf are out of
> sync in the cluster bad things will happen.

Sure thing. We have a single copy of the slurm.conf, on an NFS server mounted by the controller and all client nodes. The list of nodes would've been updated from something like this:

NodeName=hnod[0500-0796] CPUs=256 RealMemory=193327 Sockets=1 CoresPerSocket=64 ThreadsPerCore=4 State=UNKNOWN MemSpecLimit=6144

To something like this:

NodeName=hnod[0500-1499] CPUs=256 RealMemory=193327 Sockets=1 CoresPerSocket=64 ThreadsPerCore=4 State=UNKNOWN MemSpecLimit=6144

(Unfortunately I don't know the exact set of nodes before and after.)

> Is it possible that the slurm.conf on the master node was changed, but
> slurmctld had not been restarted until that reconfigure?

Yes, I think that's totally possible.

> I'm guessing that slurmctld itself was SIGHUP'd for logrotation?

Actually, this is our logrotate config:

postrotate
    /bin/kill -HUP `cat /var/run/slurm/slurmctl.pid 2>/dev/null` 2>/dev/null || true
    /etc/init.d/slurm reconfig

Is that best practice? Or are we basically reloading twice in rapid succession?
> postrotate
>     /bin/kill -HUP `cat /var/run/slurm/slurmctl.pid 2>/dev/null` 2>/dev/null || true
>     /etc/init.d/slurm reconfig
>
> Is that best practice? Or are we basically reloading twice in rapid
> succession?

Use one or the other -- if you look into the init script you'll find it does roughly the same thing. That might be what's causing this; I'll see if we can reproduce a scenario where that causes a direct problem for slurmctld.
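A single-reload version of that postrotate stanza, keeping the pid-file path from the config quoted above (a sketch of the "use one or the other" advice, not an official SchedMD recommendation), might look like:

```
postrotate
    /bin/kill -HUP `cat /var/run/slurm/slurmctl.pid 2>/dev/null` 2>/dev/null || true
endscript
```

This sends exactly one SIGHUP per rotation, so slurmctld reopens its log and re-reads slurm.conf once rather than twice in rapid succession.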
(In reply to Tim Wickberg from comment #21) > > Use one of the other - if you look into the init script you'll find that's > roughly the same thing it's doing. Will do. I thought that seemed weird.
More grist for the mill.

In the task's log:

> slurmstepd: error: *** JOB 2429631 ON knod0189 CANCELLED AT 2017-04-10T13:00:02 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

In the (perfectly healthy) knod0189's log:

> 2017-04-10T12:59:54+08:00 knod0189 slurmstepd[708]: debug: jag_common_poll_data: Task average frequency = 20872 pid 712 mem size 30440920 48388512 time 108702.970000(82370+26332)
> 2017-04-10T13:00:02+08:00 knod0189 slurmd[15735]: debug: _rpc_terminate_job, uid = 601
> 2017-04-10T13:00:02+08:00 knod0189 slurmd[15735]: debug: task_p_slurmd_release_resources: 2429631
> 2017-04-10T13:00:02+08:00 knod0189 slurmd[15735]: debug: credential for job 2429631 revoked
> 2017-04-10T13:00:02+08:00 knod0189 slurmstepd[708]: debug: Handling REQUEST_SIGNAL_CONTAINER
> 2017-04-10T13:00:02+08:00 knod0189 slurmstepd[708]: debug: _handle_signal_container for step=2429631.4294967294 uid=0 signal=998
> 2017-04-10T13:00:02+08:00 knod0189 slurmstepd[708]: error: *** JOB 2429631 ON knod0189 CANCELLED AT 2017-04-10T13:00:02 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

In slurmctld.log:

> [2017-04-10T13:00:02.112] Reconfigure signal (SIGHUP) received
> [2017-04-10T13:00:02.244] error: read_slurm_conf: default partition not set.
> [2017-04-10T13:00:02.267] restoring original state of nodes
> [2017-04-10T13:00:02.267] restoring original partition state
> [2017-04-10T13:00:02.268] cons_res: select_p_node_init
> [2017-04-10T13:00:02.269] cons_res: preparing for 67 partitions
> [2017-04-10T13:00:02.333] Killing job 2429631 on DOWN node knod0236
> [2017-04-10T13:00:02.334] _sync_nodes_to_jobs updated state of 1 nodes
> [2017-04-10T13:00:02.337] _sync_nodes_to_comp_job: Job 2429631 in completing state
> [2017-04-10T13:00:02.337] error: Node knod0189 run_job_cnt underflow in make_node_comp
> [2017-04-10T13:00:02.337] error: Node knod0189 no_share_job_cnt underflow in make_node_comp
> [2017-04-10T13:00:02.337] _sync_nodes_to_comp_job: completing 1 jobs
> ...
> [2017-04-10T13:00:05.802] job_complete: JobID=2428544_21(2429631) State=0x8007 NodeCnt=1 WTERMSIG 15

I find it interesting just how *quickly* the killing stroke occurs. Almost like it's making a decision before it's even unpacked all its previous state.

To the best of our knowledge, we'd removed 20 nodes from slurm.conf since the previous reload.
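For illustration only, here is a toy Python model (not Slurm's actual code) of one way a node mix-up like this can happen: if a job's allocation is stored as a set of *positions* into the node table, and a reconfigure rebuilds that table with nodes added or removed before existing jobs' positions are remapped, a stale position resolves to the wrong hostname. The node names are taken from the logs above purely for flavor.

```python
# Toy model: a job records its allocation as indexes into the node table.
old_table = ["knod0170", "knod0189", "knod0236"]
job_alloc = {old_table.index("knod0189")}        # job runs on knod0189

# A reconfigure removes knod0170 and rebuilds the table; positions shift.
new_table = ["knod0189", "knod0236", "knod0300"]

# Resolving the stale index against the new table names the wrong node.
resolved = [new_table[i] for i in job_alloc]
print(resolved)  # no longer ["knod0189"]
```

If the reconfigure path acted on such stale mappings before rebuilding them, that would fit the observation that the kill decision lands within milliseconds of the SIGHUP -- but this is speculation about the mechanism, not a reading of the Slurm source.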
Phil,

For this bug I made a patch that logs all the down nodes on slurmctld startup, so there's no question as to what the controller considers "down". This includes both nodes in a down state that were read in from the state file, and nodes marked as down in the slurm.conf file. Here's what the log line looks like:

Down nodes: knod[0236,1234]

Here's the commit:

https://github.com/SchedMD/slurm/commit/d8e953946d9bfcc02412b1a72c1efc6be25aede8

This patch has been checked into the master branch but should apply cleanly to 16.05 and beyond if you'd like to backport it. Hopefully you will find it helpful.

Regards,
Tim
Phil,

Am I okay to resolve this bug? Is there any further action needed?

Thanks,
Tim
(In reply to Tim Shaw from comment #35)
> Am I okay to resolve this bug? Is there any further action needed?

Maybe I misunderstood your comment #34, but it doesn't sound like it actually fixes anything -- it only adds some logging? If that's true, then the further action is that we need it to stop killing tasks on nodes that aren't actually down. :)

And *especially* to stop removing them from the queue entirely, which is an unforgivable sin. Our entire workflow relies on being able to trust that tasks never exit the queue except by either (a) completing successfully, or (b) being explicitly cancelled by user action. When tasks just vanish, it means part of the job never runs, and we lose data.

Lastly, there was the case of the endless once-per-second kill signals being sent for tasks that don't exist. It's not clear whether that caused actual harm or not, but it definitely indicates a system that's deeply confused about its state.

Cheers,
-Phil
Marking resolved/timedout as discussed.