Running a slightly-modified b5ace9a808.

At 03:16, logrotate issued a "slurm reconfig". Here's the beginning of the new slurmctld.log:

> [2014-05-13T03:16:01.690] error: read_slurm_conf: default partition not set.
> [2014-05-13T03:16:01.704] restoring original state of nodes
> [2014-05-13T03:16:01.704] restoring original partition state
> [2014-05-13T03:16:01.747] cons_res: select_p_node_init
> [2014-05-13T03:16:01.747] cons_res: preparing for 75 partitions
> [2014-05-13T03:16:01.832] Killing job 1624824 on DOWN node clus658

The only problem was, job 1624824 wasn't running on clus658! It was running on clus645.

As proof, I offer the previous entry for this job in the pre-rotation log:

> [2014-05-12T23:10:47.737] sched: Allocate JobId=1624824 NodeList=clus645 #CPUs=40

This also matches what is in the job's log file:

> == JOB START HOST=clus645 DATE=Fri May 9 14:29:16 WST 2014 ==
(In reply to Phil Schwan from comment #0)
> Running a slightly-modified b5ace9a808.
>
> At 03:16, logrotate issued a "slurm reconfig".

Note, this does not save/restore state from disk. It is continuing to work with the job table already in memory.

> Here's the beginning of the new slurmctld.log:
>
> > [2014-05-13T03:16:01.690] error: read_slurm_conf: default partition not set.
> > [2014-05-13T03:16:01.704] restoring original state of nodes
> > [2014-05-13T03:16:01.704] restoring original partition state
> > [2014-05-13T03:16:01.747] cons_res: select_p_node_init
> > [2014-05-13T03:16:01.747] cons_res: preparing for 75 partitions
> > [2014-05-13T03:16:01.832] Killing job 1624824 on DOWN node clus658
>
> The only problem was, job 1624824 wasn't running on clus658! It was running
> on clus645.
>
> As proof, I offer the previous entry for this job in the pre-rotation log:
>
> > [2014-05-12T23:10:47.737] sched: Allocate JobId=1624824 NodeList=clus645 #CPUs=40

Did you happen to add or remove compute nodes in your slurm.conf and reconfigure? Were there other changes in slurm.conf?

> This also matches what is in the job's log file:
>
> > == JOB START HOST=clus645 DATE=Fri May 9 14:29:16 WST 2014 ==

Note, this is a different time (and date) than the record in the slurmctld log file, so I presume the job has been requeued at least once since the time this job log was written.

For what it's worth, Slurm is reconfigured on hundreds of computers every day and I've never heard of such a failure.
(In reply to Moe Jette from comment #1)
> Did you happen to add or remove compute nodes in your slurm.conf and
> reconfigure? Were there other changes in slurm.conf?

It looks like there was _some_ change -- the slurm.conf had been modified since the previous reconfigure, 24 hours ago -- but I'll have to ask Stu what it was. I would be very surprised if it were a node change.

> Note, this is a different time (and date) than the record in the slurmctld
> log file, so I presume the job has been requeued at least once since the
> time this job log was written.

Good point! No, it appears that it's a bug in our logger: it's printing the *submit* time instead of the *start* time. That's consistent with its first appearance in the slurmctld log, and the other tasks for this job all have that May 9 date. I don't think it was requeued.
(In reply to Phil Schwan from comment #2) > > I would be very surprised if it were a node change. I stand corrected. It looks like there were node changes -- we finally got around to removing some nodes from the slurm.conf. Mostly, but not entirely, those that had never been online (i.e. had failed before we switched to slurm).
David,

You did some work related to adding and removing nodes on a reconfiguration. Can you pursue this?
Sure. Phil, could you please update me on this problem? I was away on a business trip.

Thanks,
David
I think it's all here...

Comment 0 sets the scene.

Comment 2 clarifies that the job was not requeued (the time in the log written by our script was the submit time, not the start time).

Comment 3 answers Moe's question in the affirmative: nodes were removed from the slurm.conf prior to this reconfig. Although, for the avoidance of doubt, neither of the nodes mentioned in comment 0 was amongst the nodes that were removed.

Please let us know if you need anything else!
To rule out the possibility that the job was requeued, this command will tell us for sure:

sacct -D -j 1624824 -o jobid,nodelist,alloccpus,partition,submit,start,end

A requeued job is basically resubmitted, so multiple submission times indicate a requeue (unless the jobid rolled over).

If hosts are added to or removed from slurm.conf, the controller must be restarted; if not, the code detects the inconsistency and slurmctld aborts. Perhaps your watchdog restarted it? Do you find these messages in your slurmctld.log:

> error: _compare_hostnames: node count has changed before reconfiguration from 4
> to 3. You have to restart slurmctld.
> fatal: read_slurm_conf: hostnames inconsistency detected

Upon restart the node bitmaps slurmctld uses are rebuilt, so it is hard to understand how a job could end up on a wrong host.

Do you frequently add/remove hosts from the slurm.conf? If you do, could you please tell us why, since we think it should be an infrequent operation?

Thanks,
David
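The requeue heuristic David describes can be sketched in a few lines. This is a toy illustration with made-up rows shaped like the sacct output (not a real sacct parser): a requeued job is effectively resubmitted, so distinct Submit times on the job's own records indicate a requeue.

```python
# Toy illustration of the requeue heuristic: distinct Submit times on a
# job's whole-job records (not its ".batch"/".extern" steps) => requeue.

rows = [
    # (jobid, submit) -- made-up rows in the shape of the sacct output
    ("1624824",      "2014-05-09T14:29:16"),
    ("1624824.bat+", "2014-05-12T23:10:47"),
    ("1624824",      "2014-05-13T03:16:11"),
]

# Keep only whole-job records; step records carry a "." in their jobid.
submit_times = {submit for jobid, submit in rows if "." not in jobid}
requeued = len(submit_times) > 1

print(requeued)  # more than one distinct submit time => requeued
```

The jobid-rollover caveat above still applies: if job IDs wrap around, two unrelated jobs can share an ID and fake a "requeue" in this check.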
> # sacct -D -j 1624824 -o jobid,nodelist,alloccpus,partition,submit,start,end
> JobID NodeList AllocCPUS Partition Submit Start End
> ------------ --------------- ---------- ---------- ------------------- ------------------- -------------------
> 1624824 None assigned 1 teamfrase+ 2014-05-09T14:29:16 2014-05-13T00:10:47 2014-05-13T00:10:47
> 1624824.bat+ clus645 1 2014-05-12T23:10:47 2014-05-12T23:10:47 2014-05-13T03:16:09
> 1624824 None assigned 1 teamfrase+ 2014-05-13T03:16:11 2014-05-13T07:23:43 2014-05-13T07:23:43
> 1624824.bat+ clus639 1 2014-05-13T07:23:43 2014-05-13T07:23:43 2014-05-13T07:23:43
> 1624824 None assigned 1 teamfrase+ 2014-05-13T07:23:44 2014-05-14T06:50:06 2014-05-14T06:50:06
> 1624824.bat+ clus639 1 Unknown Unknown 2014-05-13T07:23:45

The first entry (on clus645) was the one that was running when the event happened.

> If hosts are added or removed from slurm.conf the controller must be restarted,
> if not the code detects the inconsistency and the slurmctld aborts, perhaps
> your watchdog restarted it?

This slurm reconfig occurred as a result of logrotate running; of that I'm certain. Unfortunately, I now only have logs going back to 11 May (I've changed our logrotate to keep much more in future).

Having dug more deeply, I think it's possible that *this* slurm reconfig wasn't actually the one that removed nodes. I now think that probably several reconfigs occurred after the node list was changed, before the one in which this problem occurred. I asked the IT team about a particular date range, someone answered the question with an event that occurred before that range, and I didn't pick it up. I apologise if I've sent you down any dead ends.

> Do you find these messages in your slurmctld.log:
>
> error: _compare_hostnames: node count has changed before reconfiguration from 4
> to 3. You have to restart slurmctld.
> fatal: read_slurm_conf: hostnames inconsistency detected

Nope; I can't find any fatals in our logs.
> Do you frequently add/remove hosts from the slurm.conf? and if you do
> could you please tell us why since we think it should be an infrequent
> operation.

No, definitely not. As far as I know, this was the first and only time that we've changed the list of hosts. And per above, I'm no longer even convinced that it happened on this reconfig (in fact, I think it very likely didn't).

Interestingly, that log message -- "Killing job j on DOWN node n" -- has appeared only once in the 12 days of logs that we have. This at least reassures me that it's not happening regularly.
Hi,

it is hard to tell what happened at this point. Clearly the job appears to be running on clus645. Is it possible there were some name resolution problems which would cause clus645 to be resolved as clus658? (Just an idea.)

David
I have discovered and fixed a possible scenario responsible for this problem. It would be very difficult to reproduce and, given some other recent changes by David, should not be able to happen again, but this is cleaner code in any case. See:

https://github.com/SchedMD/slurm/commit/f8a38c4826969d9ad4c2637d1dbdd35329c2821e
(In reply to David Bigagli from comment #9)
> it is hard to tell what happen at this point. Clearly the job appears to
> be running on clus645. Is it possible there were some name resolution
> problems which would cause clus645 to be resolved as clus658? (just an idea..)

I ... really can't see how :)
Have you installed the patch sent by Moe on 5/27? Has this problem happened again since?

David
I don't think we have yet, but I expect that we will today, along with the couple other patches you've provided for other issues.
This might be due to both a primary and a backup slurmctld running at the same time and assigning duplicate job IDs. Also see bug 860.
Were you running the backup controller at the time when this happened? David
It was sufficiently long ago that the logs are gone, so I'm not sure we can prove it one way or the other. But it's possible -- we only gave up on the backup controller as a concept last week.
Let's close this for now and reopen if necessary. David
Great news, that I'm sure you're all totally excited to hear! It's recurred!

I'll start by answering some of the questions from older comments:

(In reply to David Bigagli from comment #12)
> Have you installed the patch sent by Moe on 5/27? Has this problem happened
> again after?

This has now happened with 16.05.7.

(In reply to David Bigagli from comment #15)
> Were you running the backup controller at the time when this happened?

We no longer ever run a backup controller, so there's no risk of conflict there.

(In reply to Moe Jette from comment #1)
> Did you happen to add or remove compute nodes in your slurm.conf and
> reconfigure? Were there other changes in slurm.conf?

YES -- we had definitely added (and possibly removed, though I don't think so) nodes in the vicinity of this event. I don't know if this was the first "slurm reconfigure" since the nodes were added, but this was a week when we were adding dozens of nodes a day.

=====

Executive summary:

- At 03:00, logrotate runs, decides that it needs to rotate slurmctld.log, and afterward runs "slurm reconfigure".
- Post-reconfigure, slurmctld gets confused about which jobs are running on which nodes, and in this case kills 3 of them.
- Worst of all, the jobs that get killed disappear completely from the queue, which for us is the cardinal sin. As you'll recall, we've organised everything (via the epilog, and with your help) to ensure that a task should NEVER be removed from the queue except in two situations: successful completion, or explicit user cancellation.
Logs:
=====

slurmctld (redacted to include the highlights plus anything referencing job 5078025 or hnod0750):

> [2017-03-18T03:00:01.428] Reconfigure signal (SIGHUP) received
> [2017-03-18T03:00:01.941] Killing job 5078025 on DOWN node hnod0797
> [2017-03-18T03:00:01.969] _sync_nodes_to_comp_job: Job 5078025 in completing state
> [2017-03-18T03:00:01.969] error: Node hnod0750 run_job_cnt underflow in make_node_comp
> [2017-03-18T03:00:01.969] error: Node hnod0750 no_share_job_cnt underflow in make_node_comp
> [2017-03-18T03:00:09.722] Resending TERMINATE_JOB request JobId=5078025 Nodelist=hnod0750
> [2017-03-18T03:00:17.317] job_complete: JobID=5051858_222(5078025) State=0x8007 NodeCnt=1 WTERMSIG 15

Note that on the second line, slurmctld says that 5078025 is running on hnod0797 (incorrect). By the fifth line, it seems to know that it's running on hnod0750 (correct). But it's too late -- it's already decided to kill it.

Job 5078025 (task 222):

> slurmstepd: error: *** JOB 5078025 ON hnod0750 CANCELLED AT 2017-03-18T03:00:01 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

hnod0750 syslog:

> 2017-03-18T03:00:01-05:00 hnod0750 slurmd[25427]: debug: _rpc_terminate_job, uid = 601
> 2017-03-18T03:00:01-05:00 hnod0750 slurmd[25427]: debug: task_p_slurmd_release_resources: 5078025
> 2017-03-18T03:00:01-05:00 hnod0750 slurmd[25427]: debug: credential for job 5078025 revoked
> 2017-03-18T03:00:01-05:00 hnod0750 slurmstepd[145483]: debug: Handling REQUEST_SIGNAL_CONTAINER
> 2017-03-18T03:00:01-05:00 hnod0750 slurmstepd[145483]: debug: _handle_signal_container for step=5078025.4294967294 uid=0 signal=998
> 2017-03-18T03:00:01-05:00 hnod0750 slurmstepd[145483]: error: *** JOB 5078025 ON hnod0750 CANCELLED AT 2017-03-18T03:00:01 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

(Should we happen to need more client-side data, tasks 225 and 310 show exactly the same on hnod0878 and hnod0804.)
Meanwhile, on hnod0797, we see these weird kill signals with invalid task IDs:

> 2017-03-18T03:00:01-05:00 hnod0797 slurmstepd[240359]: debug: Handling REQUEST_STATE
> 2017-03-18T03:00:01-05:00 hnod0797 slurmstepd[240359]: debug: Handling REQUEST_STEP_TERMINATE
> 2017-03-18T03:00:01-05:00 hnod0797 slurmstepd[240359]: debug: _handle_terminate for step=5078025.4294967294 uid=0
> 2017-03-18T03:00:01-05:00 hnod0797 slurmstepd[240359]: Sent SIGKILL signal to 5078025.4294967294

...and it continues to get that command once per second for the next 20 HOURS, until the node is rebooted, which presumably stopped slurmctld's attempted murder spree.

slurmctld, it will surprise nobody, was still confused more than a day later:

> [2017-03-19T13:46:29.649] error: Orphan job 5078025.batch reported on node hnod0797
> [2017-03-19T13:46:29.649] debug: Aborting job 5078025 on node hnod0797

> $ sacct -j 5078025 -a -D --format "jobid,jobname,end,exitcode,nodelist"
> JobID JobName End ExitCode NodeList
> ------------ ---------- ------------------- -------- ---------------
> 5051858_222 200_SWB16+ 2017-03-18T03:00:01 1:0 hnod0750
> 5051858_222+ batch 2017-03-18T03:00:17 0:15 hnod0750

In closing:

- The most important problem is that tasks are vanishing from the queue (very similar to bug 1889 and bug 1985 in that regard). Our entire workflow relies on the queueing system keeping accurate track of what has and has not run successfully.
- The second problem is that tasks get killed at all. This is bad because it obviously wastes cluster time, but at least it would be cleanly (indeed, automatically) recoverable if it were treated like a normal node failure and the tasks placed back in the queue to be rescheduled.
- Finally, there are the endless once-per-second kill signals being sent for tasks that don't exist. Apart from filling up logs, this may not be especially harmful, but it's certainly a troubling indicator of confusion.
> > Did you happen to add or remove compute nodes in your slurm.conf and
> > reconfigure? Were there other changes in slurm.conf?
>
> YES -- we had definitely added (and possibly removed, though I don't think
> so) nodes in the vicinity of this event. I don't know if this was the first
> "slurm reconfigure" since the nodes were added, but this was a week when we
> were adding dozens of nodes a day.

Can you walk me through the potential scenario here? I suspect this is the root cause of the problem -- if the Node definitions in slurm.conf are out of sync across the cluster, bad things will happen.

Is it possible that the slurm.conf on the master node was changed, but slurmctld had not been restarted until that reconfigure?

I'm guessing that slurmctld itself was SIGHUP'd for log rotation?
(In reply to Tim Wickberg from comment #19)
> Can you walk me through the potential scenario here? I suspect this is the
> root cause of the problem - if the Node definitions in slurm.conf are out of
> sync in the cluster bad things will happen.

Sure thing. We have a single copy of the slurm.conf, on an NFS server mounted by the controller and all client nodes. The list of nodes would've been updated from something like this:

NodeName=hnod[0500-0796] CPUs=256 RealMemory=193327 Sockets=1 CoresPerSocket=64 ThreadsPerCore=4 State=UNKNOWN MemSpecLimit=6144

To something like this:

NodeName=hnod[0500-1499] CPUs=256 RealMemory=193327 Sockets=1 CoresPerSocket=64 ThreadsPerCore=4 State=UNKNOWN MemSpecLimit=6144

(Unfortunately I don't know the exact set of nodes before and after.)

> Is it possible that the slurm.conf on the master node was changed, but
> slurmctld had not been restarted until that reconfigure?

Yes, I think that's totally possible.

> I'm guessing that slurmctld itself was SIGHUP'd for logrotation?

Actually, this is our logrotate config:

postrotate
    /bin/kill -HUP `cat /var/run/slurm/slurmctl.pid 2>/dev/null` 2>/dev/null || true
    /etc/init.d/slurm reconfig

Is that best practice? Or are we basically reloading twice in rapid succession?
> postrotate
>     /bin/kill -HUP `cat /var/run/slurm/slurmctl.pid 2>/dev/null` 2>/dev/null || true
>     /etc/init.d/slurm reconfig
>
> Is that best practice? Or are we basically reloading twice in rapid
> succession?

Use one or the other -- if you look into the init script you'll find it does roughly the same thing. That might be what's causing this; I'll see if we can reproduce a scenario where that causes a direct problem for slurmctld.
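A single-reload version of that postrotate stanza, keeping the pid-file path from the config quoted above (a sketch of the "use one or the other" advice, not an official SchedMD recommendation), might look like:

```
postrotate
    /bin/kill -HUP `cat /var/run/slurm/slurmctl.pid 2>/dev/null` 2>/dev/null || true
endscript
```

This sends exactly one SIGHUP per rotation, so slurmctld reopens its log and re-reads slurm.conf once rather than twice in rapid succession.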
(In reply to Tim Wickberg from comment #21) > > Use one of the other - if you look into the init script you'll find that's > roughly the same thing it's doing. Will do. I thought that seemed weird.
More grist for the mill.

In the task's log:

> slurmstepd: error: *** JOB 2429631 ON knod0189 CANCELLED AT 2017-04-10T13:00:02 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

In the (perfectly healthy) knod0189's log:

> 2017-04-10T12:59:54+08:00 knod0189 slurmstepd[708]: debug: jag_common_poll_data: Task average frequency = 20872 pid 712 mem size 30440920 48388512 time 108702.970000(82370+26332)
> 2017-04-10T13:00:02+08:00 knod0189 slurmd[15735]: debug: _rpc_terminate_job, uid = 601
> 2017-04-10T13:00:02+08:00 knod0189 slurmd[15735]: debug: task_p_slurmd_release_resources: 2429631
> 2017-04-10T13:00:02+08:00 knod0189 slurmd[15735]: debug: credential for job 2429631 revoked
> 2017-04-10T13:00:02+08:00 knod0189 slurmstepd[708]: debug: Handling REQUEST_SIGNAL_CONTAINER
> 2017-04-10T13:00:02+08:00 knod0189 slurmstepd[708]: debug: _handle_signal_container for step=2429631.4294967294 uid=0 signal=998
> 2017-04-10T13:00:02+08:00 knod0189 slurmstepd[708]: error: *** JOB 2429631 ON knod0189 CANCELLED AT 2017-04-10T13:00:02 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

In slurmctld.log:

> [2017-04-10T13:00:02.112] Reconfigure signal (SIGHUP) received
> [2017-04-10T13:00:02.244] error: read_slurm_conf: default partition not set.
> [2017-04-10T13:00:02.267] restoring original state of nodes
> [2017-04-10T13:00:02.267] restoring original partition state
> [2017-04-10T13:00:02.268] cons_res: select_p_node_init
> [2017-04-10T13:00:02.269] cons_res: preparing for 67 partitions
> [2017-04-10T13:00:02.333] Killing job 2429631 on DOWN node knod0236
> [2017-04-10T13:00:02.334] _sync_nodes_to_jobs updated state of 1 nodes
> [2017-04-10T13:00:02.337] _sync_nodes_to_comp_job: Job 2429631 in completing state
> [2017-04-10T13:00:02.337] error: Node knod0189 run_job_cnt underflow in make_node_comp
> [2017-04-10T13:00:02.337] error: Node knod0189 no_share_job_cnt underflow in make_node_comp
> [2017-04-10T13:00:02.337] _sync_nodes_to_comp_job: completing 1 jobs
> ...
> [2017-04-10T13:00:05.802] job_complete: JobID=2428544_21(2429631) State=0x8007 NodeCnt=1 WTERMSIG 15

I find it interesting just how *quickly* the killing stroke occurs. Almost like it's making a decision before it's even unpacked all its previous state.

To the best of our knowledge, we'd removed 20 nodes from slurm.conf since the previous reload.
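For illustration only, here is a toy Python model (not Slurm's actual code) of one way a node mix-up like this can happen: if a job's allocation is stored as a set of *positions* into the node table, and a reconfigure rebuilds that table with nodes added or removed before existing jobs' positions are remapped, a stale position resolves to the wrong hostname. The node names are taken from the logs above purely for flavor.

```python
# Toy model: a job records its allocation as indexes into the node table.
old_table = ["knod0170", "knod0189", "knod0236"]
job_alloc = {old_table.index("knod0189")}        # job runs on knod0189

# A reconfigure removes knod0170 and rebuilds the table; positions shift.
new_table = ["knod0189", "knod0236", "knod0300"]

# Resolving the stale index against the new table names the wrong node.
resolved = [new_table[i] for i in job_alloc]
print(resolved)  # no longer ["knod0189"]
```

If the reconfigure path acted on such stale mappings before rebuilding them, that would fit the observation that the kill decision lands within milliseconds of the SIGHUP -- but this is speculation about the mechanism, not a reading of the Slurm source.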
Phil,

For this bug I made a patch that logs all the down nodes on slurmctld startup, so there's no question as to what the controller considers "down". This includes both nodes in a down state that were read in from the state file, and nodes marked as down in the slurm.conf file. Here's what the log line looks like:

Down nodes: knod[0236,1234]

Here's the commit:

https://github.com/SchedMD/slurm/commit/d8e953946d9bfcc02412b1a72c1efc6be25aede8

This patch has been checked into the master branch but should apply cleanly to 16.05 and beyond if you'd like to backport it. Hopefully you will find it helpful.

Regards,
Tim
Phil,

Am I okay to resolve this bug? Is there any further action needed?

Thanks,
Tim
(In reply to Tim Shaw from comment #35)
> Am I okay to resolve this bug? Is there any further action needed?

Maybe I misunderstood your comment #34, but it doesn't sound like it actually fixes anything -- it only adds some logging? If that's true, then the further action is that we need it to stop killing tasks on nodes that aren't actually down. :)

And *especially* to stop removing them from the queue entirely, which is an unforgivable sin. Our entire workflow relies on being able to trust that tasks never exit the queue except by either (a) completing successfully, or (b) being explicitly cancelled by user action. When tasks just vanish, it means part of the job never runs, and we lose data.

Lastly, there was the case of the endless once-per-second kill signals being sent for tasks that don't exist. It's not clear whether that caused actual harm or not, but it definitely indicates a system that's deeply confused about its state.

Cheers,
-Phil
Marking resolved/timedout as discussed.