Ticket 7500

Summary: Jobs that have been preempted, requeued, and launched on a new node are cancelled by "Preemption GraceTime reached"
Product: Slurm
Reporter: Matt Jay <mattjay>
Component: Scheduling
Assignee: Broderick Gardner <broderick>
Status: RESOLVED FIXED
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: brian, jess, kilian
Version: 19.05.1
Hardware: Linux
OS: Linux
Site: University of Washington
Version Fixed: 19.05.2
Attachments: Excerpts from slurmctld and slurmd logs related to jobs demonstrating the issue.
dump of our slurm.conf
grep slurmctld.log for 'Preemption GraceTime', which appears to always correlate to this happening
slurmctld.log tarball
Fixes incorrect reason for cancellation of requeued jobs
Reproduction example with reservation in place on nodes.

Description Matt Jay 2019-07-31 16:40:00 MDT
Created attachment 11073 [details]
Excerpts from slurmctld and slurmd logs related to jobs demonstrating the issue.

I'll preface with a brief summary of the relevant aspects of our config:

slurm 19.05.1-2

We have global preemption settings of:
PreemptMode=REQUEUE
PreemptType=preempt/partition_prio

We have NOT set GraceTime anywhere (though per the documentation it should not matter since we are not using PreemptMode=CANCEL). 

We have a low priority partition across our compute nodes open to all users called 'ckpt'.

We have various higher priority partitions across subsets of nodes open to various groups of users (with PreemptMode=OFF, though that should not be relevant here).
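For context, the layout above can be sketched as a minimal slurm.conf fragment. Only the PreemptType/PreemptMode lines come from our actual config; the node lists and PriorityTier values here are illustrative placeholders, not our real layout:

```
# Global preemption settings (as stated above)
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# Illustrative partition layout only -- node lists and tiers are placeholders
PartitionName=ckpt Nodes=n[001-100] PriorityTier=1
PartitionName=hpc  Nodes=n[001-050] PriorityTier=10 PreemptMode=OFF
```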


The issue:

We are seeing the following happen on a significant number of jobs in the 'ckpt' partition:

- Job is running on node A in the 'ckpt' partition.
- Job is preempted and requeued (and a different job goes on to run on node A).
- The original requeued job is started on a different node B.
- Often (but not always) we see a bunch of "error: _will_run_test: Active JobId=<job id here> has zero end_time" messages in the slurmctld log.
- In the slurmctld log there is a "job_time_limit: Preemption GraceTime reached JobId=<jobid>" message.
- Almost immediately (usually less than 0.005 seconds later per log timestamps) a "job cancelled due to preemption" error appears in slurmd on node B.
- The job remains cancelled and does not requeue (and users complain to us).


Specifically, the jobs getting cancelled is what is impacting our users, as they expect jobs to be requeued until they can run to completion (or fail).

I'm also baffled by the GraceTime message as we haven't set that, and it supposedly isn't relevant without PreemptMode=CANCEL.  



We've never seen this before, and for what it is worth only recently upgraded to 19.05 (from 18.08.3).

I am attaching a text file that includes log excerpts from the slurmctld log and slurmd logs from relevant nodes related to specific jobs where this occurred.  I've interleaved the slurmctld and slurmd entries related to each job sorted by timestamp which helped me to understand the sequencing.

I'll also attach a dump of our slurm.conf -- let me know if I can provide anything further.

Thanks,
-Matt
Comment 1 Matt Jay 2019-07-31 16:40:47 MDT
Created attachment 11075 [details]
dump of our slurm.conf

Attaching dump of our slurm.conf
Comment 4 Broderick Gardner 2019-08-01 09:28:43 MDT
I'm looking into this. There were a couple changes made to GraceTime, specifically with PreemptMode=REQUEUE. I'm determining if there is a regression.

If it interests you, the discussion about the changes made to GraceTime is here: Bug 5867
Comment 5 Matt Jay 2019-08-01 10:16:02 MDT
Interesting read.  For what it is worth, I don't believe any of our users are making any attempt to catch signals (anybody who is checkpointing is just doing so periodically).

It is in the .conf I provided, but the thread mentioned KillWait, so I'll point out we are using:

KillWait=10

Also somewhat related: UnkillableStepTimeout=150 (sometimes comes into play when GPFS is slammed and applications are in uninterruptible sleep state waiting on file IO).

Thanks,
-Matt
Comment 6 Broderick Gardner 2019-08-02 10:54:55 MDT
Does this happen every time a job in ckpt is preempted and requeued? 

Are you able to reproduce this by submitting a low priority job to ckpt and waiting for it to be preempted? If so, please set the logging level to debug (`scontrol setdebug debug`), reproduce this, and send me the slurmctld log. Also include relevant details like the job IDs involved.

As far as you can tell, is there a correlation between whether the preempted job is an array job and whether the message "error: _will_run_test ... has zero end_time" is logged? (array jobs have ids like ####_##)
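One way to check that correlation mechanically is a grep along these lines. The log file here is a fabricated stand-in with made-up JobIds that only mimics the message format quoted above:

```shell
# Fabricated slurmctld.log excerpt; only the message format is real.
cat > /tmp/ctld_sample.log <<'EOF'
error: _will_run_test: Active JobId=1106522_8 has zero end_time
error: _will_run_test: Active JobId=1106541 has zero end_time
error: _will_run_test: Active JobId=1106522_6 has zero end_time
EOF

# Count zero-end_time errors belonging to array jobs (ids like ####_##)
grep 'has zero end_time' /tmp/ctld_sample.log | grep -cE 'JobId=[0-9]+_[0-9]+'
# → 2
```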

I have not been able to reproduce this yet; in my tests PreemptMode=REQUEUE is working correctly. So I just need a bit more information about what is happening. Looking at the code, there are a few specific debug messages that will tell me more of what it's doing. 
eg. "debug: send_job_warn_signal: warning signal %u to %pJ", signal, job_ptr
Comment 8 Matt Jay 2019-08-02 12:12:08 MDT
Created attachment 11093 [details]
grep slurmctld.log for 'Preemption GraceTime', which appears to always correlate to this happening
Comment 9 Matt Jay 2019-08-02 15:15:20 MDT
Hi Broderick,

Thanks for looking into this.

***
> Does this happen every time a job in ckpt is preempted and requeued? 

It seems to happen fairly often, but not all of the time...

Quick estimate from grepping current slurmctld.log:

Occurrences (of "GraceTime reached" -- sampling seems to indicate they never go on to requeue/complete): 93

Number of non-array job occurrences: 10 (may be misleading, as volume-wise a high percentage of jobs in ckpt are array jobs).

Total number of preempted jobs: 204

Looking at the grep of the "Preemption GraceTime" messages there are clusters where like 5-20 of them occur at the same time -- some of the clumps were preempted by the same multi-node job, but others were not.

It may be relevant to mention that the partitions involved are all 'exclusive' (ie, jobs get whole nodes).
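For the record, the quick estimate above came from greps along these lines, shown here against a fabricated three-line log with made-up JobIds and timestamps (only the "GraceTime reached" message text matches the real log):

```shell
# Fabricated stand-in for slurmctld.log
cat > /tmp/ctld_estimate.log <<'EOF'
[2019-08-02T09:00:01] job_time_limit: Preemption GraceTime reached JobId=1106541
[2019-08-02T09:00:01] job_time_limit: Preemption GraceTime reached JobId=1106522_8
[2019-08-02T09:05:12] sched: Allocate JobId=1106700
EOF

# Suspect cancellations ("GraceTime reached")
grep -c 'Preemption GraceTime reached' /tmp/ctld_estimate.log
# → 2

# Of those, non-array jobs (no ####_## style id)
grep 'Preemption GraceTime reached' /tmp/ctld_estimate.log | grep -vcE 'JobId=[0-9]+_[0-9]+'
# → 1
```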


***
> Are you able to reproduce this by submitting a low priority job to ckpt and waiting for it to be preempted? 

Yes; I submitted 10 array jobs and 10 non-array jobs, so far 2 array jobs and one non-array job have reproduced.

One of the jobs has been requeued twice and is still happily running.

The rest just haven't been requeued.

***
> If so, please set logging level to debug (`scontrol setdebug debug`), reproduce this, and send me the slurmctld log?

I will attach the slurmctld.log after I submit this message. Here are my jobs that reproduced this, but there are also real jobs that have triggered it after I started the debug logging (grepping "Preemption GraceTime reached" will find them).

Here are the jobs that I launched that have reproduced it thus far:
1106541      preempt-b+       ckpt   hpc-ckpt         28  PREEMPTED      0:0
1106541.bat+      batch              hpc-ckpt         28  CANCELLED     0:15
1106541.ext+     extern              hpc-ckpt         28  COMPLETED      0:0
1106522_8    preempt-b+       ckpt   hpc-ckpt         28  PREEMPTED      0:0
1106522_8.b+      batch              hpc-ckpt         28  CANCELLED     0:15
1106522_8.e+     extern              hpc-ckpt         28  COMPLETED      0:0
1106522_6    preempt-b+       ckpt   hpc-ckpt         28  PREEMPTED      0:0
1106522_6.b+      batch              hpc-ckpt         28  CANCELLED     0:15
1106522_6.e+     extern              hpc-ckpt         28  COMPLETED      0:0


***
> As far as you can tell, is there a correlation between whether the preempted job is an array job and whether the message "error: _will_run_test ... has zero end_time" is logged? (array jobs have ids like ####_##)

No, zero end_time occurs for both array and non-array, and seems to occur most of the time when this happens.


***
> I have not been able to reproduce this yet; in my tests PreemptMode=REQUEUE is working correctly. So I just need a bit more information about what is happening. Looking at the code, there are a few specific debug messages that will tell me more of what it's doing. 
eg. "debug: send_job_warn_signal: warning signal %u to %pJ", signal, job_ptr

I enabled debug mode several hours ago, but grepping has not returned the string 'send_job_warn_signal' in the slurmctld.log or in the slurmd.log of any node.


I will attach our slurmctld log momentarily

Thanks,
-Matt
Comment 10 Matt Jay 2019-08-02 15:20:16 MDT
Created attachment 11097 [details]
slurmctld.log tarball

slurmctld.log (debug enabled at 10:56:14)
Comment 13 Broderick Gardner 2019-08-06 16:07:36 MDT
After some testing we have determined that this is likely happening when a job is cancelled for reasons other than preemption, like time limit. There is a regression here: a leftover value from the preemption results in the log message and cancellation reason being reported as preemption.

I don't have a full explanation for the cause of the _will_run_test errors; I haven't reproduced that myself. It is probably caused by the same issue with unique behavior on a busy production cluster (as opposed to my test cluster). 

While I have reproduced the GraceTime issue with TimeLimit, there appears to be something else going on in your case. As an example, job 1102843_6(1102849). It ran for 3 hours, it was requeued at 9:56:04, started again at 9:58:05, and cancelled at 9:58:34. 

Since the regression is hiding the true cancellation cause, are you able to apply a patch from me that fixes it? If so, we can see if it fixes everything or uncovers another problem. 


My steps to reproduce:

❯ sbatch --wrap="sleep 9999" -n4 -pgeneral -t1
Submitted batch job 44

❯ sbatch --wrap="sleep 9999" -n4 -ppgrp0 -wnode0
Submitted batch job 45

-- partition 'pgrp0' preempts 'general'
-- job 44 is requeued and runs until it hits its time limit

job 44 error message:
slurmstepd-node1: error: *** JOB 44 ON node1 CANCELLED AT 2019-08-06T15:24:49 DUE TO PREEMPTION ***

slurmctld.log:
[2019-08-06T15:24:49.869] job_time_limit: Preemption GraceTime reached JobId=44

❯ sacct -Xj44
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
44                 wrap    general        dev          8  PREEMPTED      0:0
Comment 14 Broderick Gardner 2019-08-06 16:08:56 MDT
Created attachment 11128 [details]
Fixes incorrect reason for cancellation of requeued jobs
Comment 15 Matt Jay 2019-08-07 14:44:40 MDT
Hi Broderick, 

We have maintenance downtime scheduled for next Tuesday, I can give the patch a try then -- this is just for slurmctld, correct?

...that said, from further testing, I don't think the job can be ending from anything other than something related to the original preemption. To test, I've created a two-node reservation for myself so that I don't have to worry about random 'real' jobs clouding things. The job task itself is just a sleep, and the time limit/reservation is hours while the testing takes minutes to reproduce (or not).

Methodology:
- Submit a sleep job to ckpt (preemptable, lower priority partition) via the reservation.
- Submit a sleep job to hpc (non-preemptable, higher priority partition) via the reservation, restricted via --nodelist to the node the ckpt job is running on.
- The second job preempts the first; the first job requeues onto the other reservation node. It sits in Pending with reason BeginTime for a little while, then launches at its start time.
- Sometimes the requeued job runs fine, sometimes it dies of "cancelled due to preemption" a short time later -- seems about 50/50. Note the reservation is still in place, so there are no other jobs that can run on those nodes to cause issues, etc.

I'll attach a file with details on an example (reproduction always looks about the same).

-Matt
Comment 16 Matt Jay 2019-08-07 14:47:06 MDT
Created attachment 11150 [details]
Reproduction example with reservation in place on nodes.

Reproduction example with a reservation on the nodes.
Comment 17 Matt Jay 2019-08-07 15:35:12 MDT
Broderick,

I also wanted to mention that we have a second older cluster also running 19.05.1-2 (though under CentOS 6 instead of CentOS 7, and with Myrinet high speed interconnects instead of OmniPath -- apparently we excel at picking doomed network interconnects) -- it is a completely separate cluster, though has a very similar Slurm configuration.

We are getting reports of similar behavior on that cluster, and I confirm it seems to be the same pattern, with the ckpt jobs culminating in the same manner after getting allocated on a new node after preemption:

[2019-08-02T16:53:53.197] job_time_limit: Preemption GraceTime reached JobId=1906643
[2019-08-02T16:53:53.197] email msg to redacted: Slurm Job_id=1906643 Name=abn_aggregated Failed, Run time 00:00:04, PREEMPTED, ExitCode 0

-Matt
Comment 20 Broderick Gardner 2019-08-08 14:25:30 MDT
(In reply to Matt Jay from comment #15)
> We have maintenance downtime scheduled for next Tuesday, I can give the
> patch a try then -- this is just for slurmctld, correct?
Yes

Out of curiosity, is there any particular reason your client commands are still 18.08? This isn't/shouldn't be a problem; just wondering (there is a slight difference in output of `scontrol show jobs` in 19.05).


... time passes ...

The above is still relevant, but we have made a lot of progress diagnosing this. The patch from before will almost certainly fix the problem completely. We still haven't reproduced the _will_run_test errors, but we understand how it could get into such a state. We do now know why the requeued jobs are cancelled due to "GraceTime" so quickly, and why it does not occur every time. 

In short, you can run with the little patch I provided until the release of 19.05.2.

Keep reading if you care about more internal details. I'm writing this in case you're interested and for posterity. 

Basically, there are two schedulers in Slurm: the main and backfill. After a change in 19.05 where we set a job's PreemptTime whenever it is preempted, as opposed to only preemption with GraceTime, the PreemptTime was left set after the requeue. This inadvertently affected some logic in the main scheduler (not the backfill), resulting in the end_time of the job not being set when it should have been. It does start the requeued job correctly, but a combination of preempt_time being set and end_time being 0 results in the job being cancelled at the next TimeLimit check (30 second resolution). If the backfill starts the job, it does set the end_time, so the bug only presents itself as an incorrect reason for cancellation if it is cancelled due to TimeLimit normally.
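As a toy illustration of that interaction (plain shell, not Slurm source code; the variable names loosely mirror the job fields discussed, and the check is a heavy simplification of the time-limit sweep that emits the log line quoted in this ticket):

```shell
# Hypothetical state of the requeued, restarted job under the bug
preempt_time=1565000000   # left over from the original preemption (the regression)
end_time=0                # main-scheduler path failed to set it on restart
grace_time=0              # GraceTime was never configured, so it is 0

now=1565000030            # next time-limit sweep (~30 second resolution)

# Simplified stand-in for the check that produces the misleading log line
msg=""
if [ "$preempt_time" -ne 0 ] && [ "$now" -ge "$((end_time + grace_time))" ]; then
  msg="job_time_limit: Preemption GraceTime reached"
fi
echo "$msg"
```

With end_time correctly set by the backfill scheduler (a real future timestamp), the condition would not fire and the job would run on.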

The reason you only see it sometimes is because the main scheduler runs every SchedulerParameters=sched_interval (default 60) seconds and every time a job is submitted, whereas the backfill scheduler runs every SchedulerParameters=bf_interval (default 30) seconds. If the backfill schedules the requeued job, you probably won't notice anything. If other jobs are submitted frequently, it is likely that the main scheduler will schedule the requeued job after its new BeginTime has passed, leading to it being cancelled.

Let me know if you have any questions and of any problems or concerns when applying the patch during your maintenance window.

Thanks
Comment 21 Broderick Gardner 2019-08-08 14:29:18 MDT
The fix is now committed to the next release of 19.05.2

commit 6cbcfdb81cdbe074fe7d0f1fa66377c478f2773f (HEAD -> slurm-19.05, origin/slurm-19.05)
Author:     Broderick Gardner <broderick@schedmd.com>
AuthorDate: Thu Aug 8 13:37:07 2019 -0600
Commit:     Brian Christiansen <brian@schedmd.com>
CommitDate: Thu Aug 8 13:45:00 2019 -0600

    Fix jobs being killed after being requeued by preemption
    
    Bug 7500


I will close this ticket once we have verified that your cluster is running correctly with the patch applied.
Comment 22 Matt Jay 2019-08-08 14:50:49 MDT
Broderick,

So, it looks like we missed the update on the submit host (and also a few nodes still waiting on uber-long jobs).

Glad the patch will do the trick.  I'll get it installed Tuesday during our scheduled maintenance if I can't sneak it in sooner and let you know.

Thanks,
-Matt
Comment 23 Matt Jay 2019-08-13 16:54:11 MDT
Broderick,

I confirm that the patch appears to have fixed the requeue issue (ironically, I saw the mailing list post about 19.05.2 just minutes after we announced our maintenance period was complete).

I'll mark this resolved.

-Matt
Comment 24 Broderick Gardner 2019-08-19 13:09:16 MDT
Assuming your cluster is running correctly now, I will close this ticket. Reopen it or submit a new ticket if something goes wrong. 

Until you are back on a main release without local patches, consider referring to this ticket as the source of the patch you are running in any other tickets you submit; it could save time.

Thanks