Ticket 8283

Summary: jobs erroneously preempted
Product: Slurm
Reporter: frank.schluenzen
Component: Scheduling
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: bart
Version: 19.05.2
Hardware: Linux
OS: Linux
Site: DESY
Attachments: slurm.conf

Description frank.schluenzen 2020-01-03 06:16:38 MST
Hi,

We see low-priority jobs being preempted by high-priority jobs even though the high-priority jobs have constraints/features that cannot be satisfied by the nodes freed by the preemption.

For example:

JobId=3860565 starts running in the low-priority partition (all). It uses the constraint Features=Gold-6140.

The job is immediately preempted by a high-priority job (e.g. JobId=3858001 in partition upex).

JobId=3858001 specifies "Features=P100" as a constraint.
The nodes allocated to JobId=3860565 do not satisfy the constraint "Features=P100".
There are in fact no nodes satisfying both features, i.e. Gold-6140 and P100 are mutually exclusive.

JobId=3858001 should therefore never preempt JobId=3860565, but it does. Naturally, JobId=3858001 remains pending and keeps preempting low-priority jobs.

JobId=3858001 is pending because all nodes satisfying the constraint/feature are allocated:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3858001      upex Class_to     user PD       0:00      1 (Resources)
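The claimed mutual exclusivity can be verified against the cluster's node/feature table (e.g. from `sinfo -N -o "%N %f"`). A minimal sketch of that check, with a hypothetical node-to-feature mapping standing in for the real sinfo output:

```python
# Sketch: verify that no node carries both features, i.e. that preempting
# a "Gold-6140" job can never free a node usable by a "P100" job.
# The node -> feature mapping below is hypothetical illustration data;
# in practice it would be parsed from `sinfo -N -o "%N %f"`.

node_features = {
    "max-exfl132": {"Gold-6140"},
    "max-exfl147": {"Gold-6140"},
    "max-exfl200": {"P100"},
}

def nodes_satisfying(constraint, features=node_features):
    """Return the set of nodes whose feature set contains the constraint."""
    return {n for n, f in features.items() if constraint in f}

# An empty intersection means the two features are mutually exclusive,
# so preempting Gold-6140 jobs cannot help a P100-constrained job.
overlap = nodes_satisfying("Gold-6140") & nodes_satisfying("P100")
print(overlap)  # set() -- empty: mutually exclusive
```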


An extract from the logs illustrating the behavior (all preempting jobs are identical):

[2020-01-03T11:17:34.488] Requeuing JobId=3860565
[2020-01-03T11:20:04.682] backfill: Started JobId=3860565 in all on max-exfl[147-154]
[2020-01-03T11:20:15.924] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3857997
[2020-01-03T11:20:15.928] job_step_signal: JobId=3860565 StepId=28 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:20:16.355] job_step_signal: JobId=3860565 StepId=28 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:20:18.593] Requeuing JobId=3860565
[2020-01-03T11:22:35.491] backfill: Started JobId=3860565 in all on max-exfl[132-139]
[2020-01-03T11:22:36.440] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3857998
[2020-01-03T11:22:36.444] job_step_signal: JobId=3860565 StepId=29 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:22:40.101] Requeuing JobId=3860565
[2020-01-03T11:25:06.239] backfill: Started JobId=3860565 in all on max-exfl[158-165]
[2020-01-03T11:25:31.135] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3858000
[2020-01-03T11:25:31.139] job_step_signal: JobId=3860565 StepId=30 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:25:31.207] job_step_signal: JobId=3860565 StepId=30 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:25:33.796] Requeuing JobId=3860565
[2020-01-03T11:27:37.010] backfill: Started JobId=3860565 in all on max-exfl[132-139]
[2020-01-03T11:27:37.956] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3858000
[2020-01-03T11:27:37.961] job_step_signal: JobId=3860565 StepId=31 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:27:38.205] job_step_signal: JobId=3860565 StepId=31 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:27:39.622] Requeuing JobId=3860565
[2020-01-03T11:30:07.782] backfill: Started JobId=3860565 in all on max-exfl[132-139]
[2020-01-03T11:30:16.523] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3858000
[2020-01-03T11:30:16.527] job_step_signal: JobId=3860565 StepId=32 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:30:16.957] job_step_signal: JobId=3860565 StepId=32 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:30:19.230] Requeuing JobId=3860565
[2020-01-03T11:32:38.588] backfill: Started JobId=3860565 in all on max-exfl[132-139]
[2020-01-03T11:33:31.857] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3858000
[2020-01-03T11:33:31.991] job_step_signal: JobId=3860565 StepId=33 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:33:32.302] job_step_signal: JobId=3860565 StepId=33 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:33:34.562] Requeuing JobId=3860565
[2020-01-03T11:35:39.512] backfill: Started JobId=3860565 in all on max-exfl[132-139]
[2020-01-03T11:36:31.716] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3858000
[2020-01-03T11:36:31.737] job_step_signal: JobId=3860565 StepId=34 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:36:32.152] job_step_signal: JobId=3860565 StepId=34 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:36:34.427] Requeuing JobId=3860565
[2020-01-03T11:38:40.444] backfill: Started JobId=3860565 in all on max-exfl[132-139]
[2020-01-03T11:39:01.072] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3858000
[2020-01-03T11:39:01.076] job_step_signal: JobId=3860565 StepId=35 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:39:01.501] job_step_signal: JobId=3860565 StepId=35 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:39:03.797] Requeuing JobId=3860565
[2020-01-03T11:41:11.198] backfill: Started JobId=3860565 in all on max-exfl[158-165]
[2020-01-03T11:41:11.915] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3858001
[2020-01-03T11:41:11.959] job_step_signal: JobId=3860565 StepId=36 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:41:12.048] job_step_signal: JobId=3860565 StepId=36 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:41:12.971] Requeuing JobId=3860565
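The preempt/requeue churn above can be quantified by scanning slurmctld.log for the "has been requeued to reclaim resources" messages. A sketch, using a few of the lines above in place of the real log file:

```python
import re
from collections import Counter

# Sketch: count how often each job is preempted and requeued, from
# slurmctld log lines. `log_lines` is a stand-in for the contents of
# /var/log/slurm/slurmctld.log.
log_lines = [
    "[2020-01-03T11:20:15.924] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3857997",
    "[2020-01-03T11:22:36.440] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3857998",
    "[2020-01-03T11:41:11.915] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3858001",
]

pat = re.compile(
    r"preempted JobId=(\d+) has been requeued to reclaim resources for JobId=(\d+)"
)

victims = Counter()
for line in log_lines:
    m = pat.search(line)
    if m:
        victims[m.group(1)] += 1  # group(1): preempted job, group(2): preemptor

print(victims)  # Counter({'3860565': 3})
```

A count much larger than 1 for the same victim, as in the full log above, shows the scheduler repeatedly preempting a job without the preemptor ever starting.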

Occasionally, preempted jobs seem to be requeued more than once, leading to duplicate job IDs:

From /var/log/slurm/slurmctld.log:

[2020-01-03T11:38:31.320] preempted JobId=3860461 has been requeued to reclaim resources for JobId=3858000
[2020-01-03T11:38:31.377] job_step_signal: JobId=3860461 StepId=40 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:38:31.735] job_step_signal: JobId=3860461 StepId=40 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:38:33.015] Requeuing JobId=3860461
[2020-01-03T11:40:41.040] backfill: Started JobId=3860461 in all on max-exfl[132-139]
[2020-01-03T11:40:41.044] _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
[2020-01-03T11:40:41.056] drain_nodes: node max-exfl132 state set to DRAIN
[2020-01-03T11:40:41.056] error: Duplicate jobid on nodes max-exfl132, set to state DRAIN
[2020-01-03T11:40:42.192] Requeuing JobId=3860461
[2020-01-03T11:40:42.200] job_signal: 9 of running JobId=3860665 successful 0x8008
[2020-01-03T11:40:42.200] preempted JobId=3860665 has been killed to reclaim resources for JobId=3858001


It seems that constraints/features are not properly taken into account when jobs are evaluated for preemption. My impression is that it primarily happens when both the preempting and the preempted job declare constraints/features.
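The check the reporter expects the scheduler to perform can be modeled simply: before preempting a victim, verify that at least one of the nodes it would free satisfies the preemptor's constraint. A sketch with hypothetical data structures (this is an illustration of the expected logic, not Slurm's internal implementation):

```python
# Sketch of the expected feasibility check: only preempt a job if the
# nodes it occupies can actually satisfy the preemptor's constraint.
# node_features is hypothetical illustration data, not Slurm internals.

node_features = {
    "max-exfl132": {"Gold-6140"},
    "max-exfl158": {"Gold-6140"},
    "max-exfl200": {"P100"},
}

def preemption_helps(victim_nodes, preemptor_constraint):
    """True only if at least one freed node satisfies the constraint."""
    return any(preemptor_constraint in node_features[n] for n in victim_nodes)

# JobId=3860565 runs on Gold-6140 nodes; the preemptor needs P100,
# so preempting it cannot help and should not happen:
print(preemption_helps(["max-exfl132", "max-exfl158"], "P100"))  # False
```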


Cheers, Frank.
Comment 1 Dominik Bartkiewicz 2020-01-06 06:17:39 MST
Hi

Could you send me your current slurm.conf?

Dominik
Comment 3 frank.schluenzen 2020-01-06 06:20:07 MST
Created attachment 12674 [details]
slurm.conf

attached
Comment 4 Dominik Bartkiewicz 2020-01-06 07:52:53 MST
Hi

Could you send me the output of
"scontrol show job 3858001", or the script used to submit it?

Dominik
Comment 5 frank.schluenzen 2020-01-06 08:07:47 MST
Hi,

preempting jobs looked like this:

#!/bin/bash
#SBATCH --partition=upex   
#SBATCH --time=6-00:00:00
#SBATCH --nodes=1  
#SBATCH --constraint=P100 
#SBATCH --job-name  Class_tom
#SBATCH --mail-type ALL
#SBATCH --mail-user [omitted]
#SBATCH --export=NONE
#SBATCH --export=LD_PRELOAD=""
#SBATCH --output    ./out_err/TrainU-newpos_%N-%j-%a.out
#SBATCH --error     ./out_err/TrainU-newpos_%N-%j-%a.err 

echo hello world
[omitted]

Cheers, Frank.
Comment 6 Dominik Bartkiewicz 2020-01-07 06:37:20 MST
Hi

I have analyzed your configuration and I am sure that this is a duplicate of bug 7708.

This issue was fixed in 19.05.3:
https://github.com/SchedMD/slurm/commit/0d432caedad

Let me know if we can close this bug as a duplicate of bug 7708.

Dominik
Comment 7 frank.schluenzen 2020-01-07 09:23:37 MST
hi,

I'm pretty sure we are running slurmctld with the patch proposed in bug 7708 (sorry, I had missed and forgotten that one), yet we didn't see the issue until recently, when both the preempting and the preempted jobs declared mutually exclusive features.

Anyhow, we will simply update to 19.05.5 and see if we can reproduce the issue after the update. 

Please close the issue and apologies for wasting your time.

Cheers, Frank.
Comment 8 Dominik Bartkiewicz 2020-01-07 10:53:37 MST
Hi

Thanks for the info.
I'm closing this now as a duplicate.
If you hit this bug again on 19.05.5, please reopen.

Dominik

*** This ticket has been marked as a duplicate of ticket 7708 ***