| Summary: | jobs erroneously preempted | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | frank.schluenzen |
| Component: | Scheduling | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | bart |
| Version: | 19.05.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | DESY | | |
| Attachments: | slurm.conf | | |
Hi,
Could you send me your current slurm.conf?
Dominik

Created attachment 12674 [details]
slurm.conf

attached
Hi,
Could you send me the output from "scontrol show job 3858001", or the script used for submitting?
Dominik

Hi,
the preempting jobs looked like this:

#!/bin/bash
#SBATCH --partition=upex
#SBATCH --time=6-00:00:00
#SBATCH --nodes=1
#SBATCH --constraint=P100
#SBATCH --job-name Class_tom
#SBATCH --mail-type ALL
#SBATCH --mail-user [omitted]
#SBATCH --export=NONE
#SBATCH --export=LD_PRELOAD=""
#SBATCH --output ./out_err/TrainU-newpos_%N-%j-%a.out
#SBATCH --error ./out_err/TrainU-newpos_%N-%j-%a.err

echo hello world
[omitted]

Cheers, Frank.

Hi,
I have analyzed your configuration and I am sure that this is a duplicate of bug 7708. This issue was fixed in 19.05.3:
https://github.com/SchedMD/slurm/commit/0d432caedad
Let me know if we can close this bug as a duplicate of bug 7708.
Dominik

Hi,
I'm pretty sure we are running slurmctld with the patch proposed in bug 7708 (sorry, I had missed and forgotten that one), and we didn't see the issue until recently, when both the preempting and preempted jobs declare mutually exclusive features. Anyhow, we will simply update to 19.05.5 and see if we can reproduce the issue after the update. Please close the issue, and apologies for wasting your time.
Cheers, Frank.

Hi,
Thanks for the info. I'm closing this now as "duplicate". If you hit this bug again on 19.05.5, please reopen.
Dominik

*** This ticket has been marked as a duplicate of ticket 7708 ***
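As an aside, the claim that the two features never coexist on a node can be checked from the node feature list. Below is a minimal sketch of such a check; the node names and feature sets in the heredoc are invented for illustration (on a live cluster the sample data would be replaced by the real output of `sinfo -N -h -o "%N %f"`):

```shell
#!/bin/sh
# Sketch: verify that Gold-6140 and P100 are mutually exclusive node
# features. The node/feature pairs below are made-up sample data standing
# in for:  sinfo -N -h -o "%N %f"
sample_nodes() {
cat <<'EOF'
max-exfl132 Gold-6140
max-exfl133 Gold-6140
max-exfl200 P100,V100
max-exfl201 P100
EOF
}

# Collect nodes advertising both features; an empty result confirms that a
# job constrained to P100 can never use nodes freed by a Gold-6140 job.
both=$(sample_nodes | awk -F'[ ,]' '{
    g = 0; p = 0
    for (i = 2; i <= NF; i++) { if ($i == "Gold-6140") g = 1; if ($i == "P100") p = 1 }
    if (g && p) print $1
}')
echo "nodes with both features: ${both:-none}"
# prints: nodes with both features: none
```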
Hi,
we see low-priority jobs being preempted by high-priority jobs although the high-priority jobs have constraints/features which cannot be satisfied by the nodes being freed on preemption.

For example, JobId=3860565 starts to run on a low-priority partition (all). It uses the constraint Features=Gold-6140. The job gets immediately preempted by a high-priority job (e.g. JobId=3858001 on partition upex). JobId=3858001 specifies "Features=P100" as constraint. The nodes allocated to JobId=3860565 do not satisfy the constraint "Features=P100". There are actually no nodes available satisfying both features, i.e. Gold-6140 and P100 are mutually exclusive. JobId=3858001 should hence never preempt JobId=3860565, but it does. Naturally, JobId=3858001 will remain pending and continue preempting low-priority jobs.

JobId=3858001 is pending because all nodes satisfying the constraint/feature are allocated:

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3858001 upex Class_to user PD 0:00 1 (Resources)

An extract from the logs illustrating the behavior (all preempting jobs are identical):

[2020-01-03T11:17:34.488] Requeuing JobId=3860565
[2020-01-03T11:20:04.682] backfill: Started JobId=3860565 in all on max-exfl[147-154]
[2020-01-03T11:20:15.924] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3857997
[2020-01-03T11:20:15.928] job_step_signal: JobId=3860565 StepId=28 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:20:16.355] job_step_signal: JobId=3860565 StepId=28 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:20:18.593] Requeuing JobId=3860565
[2020-01-03T11:22:35.491] backfill: Started JobId=3860565 in all on max-exfl[132-139]
[2020-01-03T11:22:36.440] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3857998
[2020-01-03T11:22:36.444] job_step_signal: JobId=3860565 StepId=29 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:22:40.101] Requeuing JobId=3860565
[2020-01-03T11:25:06.239] backfill: Started JobId=3860565 in all on max-exfl[158-165]
[2020-01-03T11:25:31.135] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3858000
[2020-01-03T11:25:31.139] job_step_signal: JobId=3860565 StepId=30 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:25:31.207] job_step_signal: JobId=3860565 StepId=30 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:25:33.796] Requeuing JobId=3860565
[2020-01-03T11:27:37.010] backfill: Started JobId=3860565 in all on max-exfl[132-139]
[2020-01-03T11:27:37.956] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3858000
[2020-01-03T11:27:37.961] job_step_signal: JobId=3860565 StepId=31 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:27:38.205] job_step_signal: JobId=3860565 StepId=31 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:27:39.622] Requeuing JobId=3860565
[2020-01-03T11:30:07.782] backfill: Started JobId=3860565 in all on max-exfl[132-139]
[2020-01-03T11:30:16.523] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3858000
[2020-01-03T11:30:16.527] job_step_signal: JobId=3860565 StepId=32 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:30:16.957] job_step_signal: JobId=3860565 StepId=32 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:30:19.230] Requeuing JobId=3860565
[2020-01-03T11:32:38.588] backfill: Started JobId=3860565 in all on max-exfl[132-139]
[2020-01-03T11:33:31.857] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3858000
[2020-01-03T11:33:31.991] job_step_signal: JobId=3860565 StepId=33 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:33:32.302] job_step_signal: JobId=3860565 StepId=33 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:33:34.562] Requeuing JobId=3860565
[2020-01-03T11:35:39.512] backfill: Started JobId=3860565 in all on max-exfl[132-139]
[2020-01-03T11:36:31.716] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3858000
[2020-01-03T11:36:31.737] job_step_signal: JobId=3860565 StepId=34 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:36:32.152] job_step_signal: JobId=3860565 StepId=34 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:36:34.427] Requeuing JobId=3860565
[2020-01-03T11:38:40.444] backfill: Started JobId=3860565 in all on max-exfl[132-139]
[2020-01-03T11:39:01.072] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3858001
[2020-01-03T11:39:01.076] job_step_signal: JobId=3860565 StepId=35 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:39:01.501] job_step_signal: JobId=3860565 StepId=35 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:39:03.797] Requeuing JobId=3860565
[2020-01-03T11:41:11.198] backfill: Started JobId=3860565 in all on max-exfl[158-165]
[2020-01-03T11:41:11.915] preempted JobId=3860565 has been requeued to reclaim resources for JobId=3858001
[2020-01-03T11:41:11.959] job_step_signal: JobId=3860565 StepId=36 can not be sent signal 9 from state=COMPLETING
[2020-01-03T11:41:12.048] job_step_signal: JobId=3860565 StepId=36 can not be sent signal 991 from state=COMPLETING
[2020-01-03T11:41:12.971] Requeuing JobId=3860565

Occasionally, preempted jobs seem to be requeued more than once, leading to duplicate job IDs:

/var/log/slurm/slurmctld.log:[2020-01-03T11:38:31.320] preempted JobId=3860461 has been requeued to reclaim resources for JobId=3858000
/var/log/slurm/slurmctld.log:[2020-01-03T11:38:31.377] job_step_signal: JobId=3860461 StepId=40 can not be sent signal 9 from state=COMPLETING
/var/log/slurm/slurmctld.log:[2020-01-03T11:38:31.735] job_step_signal: JobId=3860461 StepId=40 can not be sent signal 991 from state=COMPLETING
/var/log/slurm/slurmctld.log:[2020-01-03T11:38:33.015] Requeuing JobId=3860461
/var/log/slurm/slurmctld.log:[2020-01-03T11:40:41.040] backfill: Started JobId=3860461 in all on max-exfl[132-139]
/var/log/slurm/slurmctld.log:[2020-01-03T11:40:41.044] _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=0
/var/log/slurm/slurmctld.log:[2020-01-03T11:40:41.056] drain_nodes: node max-exfl132 state set to DRAIN
/var/log/slurm/slurmctld.log:[2020-01-03T11:40:41.056] error: Duplicate jobid on nodes max-exfl132, set to state DRAIN
/var/log/slurm/slurmctld.log:[2020-01-03T11:40:42.192] Requeuing JobId=3860461
/var/log/slurm/slurmctld.log:[2020-01-03T11:40:42.200] job_signal: 9 of running JobId=3860665 successful 0x8008
/var/log/slurm/slurmctld.log:[2020-01-03T11:40:42.200] preempted JobId=3860665 has been killed to reclaim resources for JobId=3858001

It seems that constraints/features are not properly taken into account when jobs are evaluated for preemption. I have the impression that it primarily happens when both the preempting job and the preempted job declare constraints/features.

Cheers, Frank.
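The sanity check the report argues for can be illustrated in isolation: before requeueing a low-priority job, at least one of the nodes it would free should satisfy the preemptor's feature constraint, otherwise the preemption reclaims nothing usable. The sketch below is not Slurm's actual preemption code; the `node_features` lookup table and node names are invented for the example (real feature sets would come from `sinfo -N -h -o "%N %f"`):

```shell
#!/bin/sh
# Sketch: would preempting this victim job free any node the preemptor
# can actually use? All data below is hypothetical sample data.
constraint="P100"                        # preemptor's --constraint (cf. JobId=3858001)
victim_nodes="max-exfl132 max-exfl133"   # nodes held by the victim (cf. JobId=3860565)

node_features() {                        # hypothetical node -> features lookup
    case "$1" in
        max-exfl132|max-exfl133) echo "Gold-6140" ;;
        max-exfl200)             echo "P100,V100" ;;
    esac
}

useful=no
for node in $victim_nodes; do
    # Match the constraint as a whole comma-delimited feature token.
    case ",$(node_features "$node")," in
        *",$constraint,"*) useful=yes ;;
    esac
done
echo "preempting frees a usable node: $useful"
# prints: preempting frees a usable node: no
```

With mutually exclusive features the answer is always "no", which is why the preempt-requeue-backfill loop in the logs above can never converge.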