Ticket 5253

Summary: Job preemption
Product: Slurm
Reporter: Damien <damien.leong>
Component: Scheduling
Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---
CC: alex
Version: 17.11.5
Hardware: Linux
OS: Linux
Site: Monash University

Description Damien 2018-06-04 01:38:28 MDT
Dear SchedMD,

We would like to enable job preemption on our cluster. In particular, we would like jobs with one QoS to preempt jobs with another QoS. We'd like the job being stopped to receive a signal (such as the SIGCONT and SIGTERM used in REQUEUE mode), then be killed after a grace period and resubmitted to the queue (unless --no-requeue was given with the job request). We anticipate most of the preempted jobs will be machine learning tasks in Caffe or TensorFlow, so implementing a signal handler to checkpoint should be easy. We'd rather not use SUSPEND because of RAM limitations.

I think the appropriate settings for what we want are:
PreemptMode=REQUEUE
PreemptType=preempt/qos

but the documentation suggests that signals and the grace period are not used with PreemptMode=REQUEUE.

Should I be combining these as follows?

PreemptMode=CANCEL,REQUEUE
GracePeriod=60 # 1 minute available for checkpointing

Have we misunderstood how REQUEUE works? Otherwise, can you advise further on this implementation?



Regards,
--
Monash Uni HPC team
Comment 1 Alejandro Sanchez 2018-06-05 04:46:13 MDT
Hi Damien,

PreemptMode cannot be combined as PreemptMode=CANCEL,REQUEUE. If you set it in slurm.conf you'd get:

slurmd: error: PreemptMode=CANCEL,REQUEUE invalid
slurmd: fatal: Unable to process configuration file

or if defined directly as a QOS specific option:

alex@ibiza:~/t$ sacctmgr modify qos highprio set preemptmode=cancel,requeue
 Bad Preempt Mode given: preemptmode=cancel,requeue
alex@ibiza:~/t$

To set up preempt/qos you must define the following in slurm.conf:

alex@ibiza:~/t$ scontrol show conf | grep -i preempt
PreemptMode             = REQUEUE
PreemptType             = preempt/qos
alex@ibiza:~/t$

Then, through sacctmgr, you can define the Preempt option for the preemptor QOS, PreemptMode if you want to override the slurm.conf PreemptMode, and the GraceTime.

Currently GraceTime is only meaningful for PreemptMode=CANCEL.

I'm not sure whether it would be possible to support GraceTime with PreemptMode=REQUEUE, or how much time such a change would require.
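
The sacctmgr step described in this comment could look like the following sketch. The QOS name lowprio and the helper setup_qos_preemption are hypothetical (highprio is the example QOS used above), and the SACCTMGR variable is an assumption added only so the commands can be previewed with echo before touching the accounting database.

```shell
#!/bin/sh
# Sketch only: QOS names passed in are site-specific.
# Set SACCTMGR="echo sacctmgr" to print the commands instead of running them.
SACCTMGR=${SACCTMGR:-sacctmgr}

setup_qos_preemption() {
    preemptor=$1   # QOS whose jobs are allowed to preempt
    preemptee=$2   # QOS whose jobs may be preempted

    # -i (--immediate) commits without the interactive confirmation prompt.
    $SACCTMGR -i modify qos "$preemptor" set Preempt="$preemptee"

    # Optional per-QOS override of the slurm.conf PreemptMode.
    $SACCTMGR -i modify qos "$preemptee" set PreemptMode=requeue

    # Show only the preemption-related columns to verify the result.
    $SACCTMGR show qos format=Name,Priority,GraceTime,Preempt,PreemptMode
}
```

For example, `SACCTMGR="echo sacctmgr" setup_qos_preemption highprio lowprio` prints the three commands without executing them.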
Comment 2 Alejandro Sanchez 2018-06-20 04:07:51 MDT
Damien, I'm gonna close this as resolved/infogiven. Please, re-open if there's anything else you need from here. Thank you.
Comment 3 Damien 2018-06-20 08:09:52 MDT
Thanks for this information.
Comment 4 Damien 2018-06-28 00:37:33 MDT
Hi Alejandro

Sorry, I need some advice with regards to this same query.


Existing settings:

PreemptType             = preempt/qos


sacctmgr show qos normal
      Name   Priority  GraceTime    Preempt PreemptMode
---------- ---------- ---------- ---------- -----------
    normal         50   00:00:00        irq     requeue
(all remaining columns are empty)


PreemptMode = requeue

GracePeriod = Not set.
GraceTime = Not set


The documentation at https://slurm.schedmd.com/preempt.html says:

GraceTime: Specifies a time period for a job to execute after it is selected to be preempted. This option can be specified by partition or QOS using the slurm.conf file or database respectively. This option is only honored if PreemptMode=CANCEL. The GraceTime is specified in seconds and the default value is zero, which results in no preemption delay. Once a job has been selected for preemption, its end time is set to the current time plus GraceTime. The job is immediately sent SIGCONT and SIGTERM signals in order to provide notification of its imminent termination. This is followed by the SIGCONT, SIGTERM and SIGKILL signal sequence upon reaching its new end time



Question: since we are now using "PreemptMode = requeue", once a job is selected for preemption, will it immediately be sent SIGCONT and SIGTERM signals? (The text above describes "PreemptMode = cancel".)


If this is not the case, how can we use "PreemptMode = requeue" and still have Slurm send proper SIGCONT and SIGTERM signals so the job can end gracefully?



Kindly advise.


Many Thanks.

Damien
Comment 5 Alejandro Sanchez 2018-07-03 09:23:43 MDT
Hi Damien,

With PreemptType=preempt/qos and PreemptMode=REQUEUE, when a job is requeued it gets signaled like this:

slurmd: debug2: Processing RPC: REQUEST_KILL_PREEMPTED
slurmd: debug2: container signal 994 to job 20028.4294967294
slurmd: debug2: No steps in jobid 20028 to send signal 15
slurmd: Job 20028: timeout: sent SIGTERM to 0 active steps
slurmd: debug:  _rpc_terminate_job, uid = 1000
slurmd: debug:  task_p_slurmd_release_resources: affinity jobid 20028
slurmd: debug:  credential for job 20028 revoked
slurmd: debug2: container signal 993 to job 20028.4294967294
slurmd: debug2: container signal 18 to job 20028.4294967294
slurmd: debug2: container signal 15 to job 20028.4294967294
slurmd: debug2: set revoke expiration for jobid 20028 to 1530630964 UTS

src/common/slurm_protocol_defs.h:#define SIG_REQUEUED	993	/* Dummy signal value to job requeue */
src/common/slurm_protocol_defs.h:#define SIG_PREEMPTED	994	/* Dummy signal value for job preemption */

In this case, the preempted job only had a batch script without steps inside, so first it got signaled 994 (SIG_PREEMPTED), then 993 (SIG_REQUEUED), then 18 (SIGCONT) and finally 15 (SIGTERM), as stated in the documentation.
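
To read the same sequence out of a longer slurmd log, a small filter along these lines may help. It is a sketch: the function names signal_name and decode_signals are hypothetical, the numbers 993 and 994 come from the slurm_protocol_defs.h lines quoted above, and 18/15 are taken as SIGCONT/SIGTERM as in this log (Linux numbering).

```shell
#!/bin/sh
# Map a signal number from a "container signal N" log line to a name.
signal_name() {
    case "$1" in
        994) echo "SIG_PREEMPTED" ;;  # dummy value from slurm_protocol_defs.h
        993) echo "SIG_REQUEUED" ;;   # dummy value from slurm_protocol_defs.h
        18)  echo "SIGCONT" ;;        # Linux numbering, as in the log above
        15)  echo "SIGTERM" ;;
        *)   echo "signal $1" ;;      # anything else is printed undecoded
    esac
}

# Read slurmd log lines on stdin and print the decoded signal sequence.
# Only "container signal N to job ..." lines are considered.
decode_signals() {
    sed -n 's/.*container signal \([0-9][0-9]*\) to job.*/\1/p' |
    while read -r num; do
        signal_name "$num"
    done
}
```

Usage would be something like `decode_signals < slurmd.log`, where slurmd.log is whatever file the slurmd debug output was captured to.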

The QOS GraceTime doesn't apply for PreemptMode=REQUEUE as I mentioned earlier. Please, let me know if you have further questions.
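
Since the batch script is sent SIGCONT and then SIGTERM on requeue, one way a job could react is to trap SIGTERM in the script. The following is a minimal sketch under assumptions: checkpoint, on_term, run_with_checkpoint and the CKPT_FILE marker are hypothetical names, and the checkpoint body is a placeholder for whatever the application provides. How long the job has before it is killed is not covered in this ticket, so the handler should do as little as possible.

```shell
#!/bin/bash
# Placeholder checkpoint action: records that a checkpoint was requested.
# A real job would save application state here instead.
checkpoint() {
    echo "checkpoint requested at $(date +%s)" >> "${CKPT_FILE:-checkpoint.marker}"
}

# SIGTERM handler: checkpoint, stop the workload, then exit.
on_term() {
    checkpoint
    if [ -n "$child" ]; then
        kill -TERM "$child" 2>/dev/null
    fi
    exit 143   # 128 + 15, the conventional status for termination by SIGTERM
}

# Run the workload in the background and wait for it, so the shell can
# run the trap as soon as SIGTERM arrives instead of after the command ends.
run_with_checkpoint() {
    trap on_term TERM
    "$@" &
    child=$!
    wait "$child"
}
```

In a batch script this would be called as, for example, `run_with_checkpoint python train.py`, where train.py stands in for the actual workload.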
Comment 6 Alejandro Sanchez 2018-07-17 00:54:25 MDT
Damien, is there anything else you need from here? thanks.
Comment 7 Alejandro Sanchez 2018-08-08 07:10:01 MDT
Marking as resolved/infogiven. Please, reopen if there's anything else you need from here. Thanks.