Ticket 7708

Summary: Nodes will be preempted even if they don't match constraint
Product: Slurm Reporter: Sven Sternberger <sven.sternberger>
Component: SchedulingAssignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: bart, frank.schluenzen
Version: 19.05.2   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=7962
https://support.schedmd.com/show_bug.cgi?id=21200
Site: DESY Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed: 19.05.3
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: maxwell slurm.conf
slurmctld log
Log with massive duplicate-id errors due to preemption bug
slurmctld log file after patch

Description Sven Sternberger 2019-09-09 10:15:07 MDT
Hello!

We have several partitions. Among them are two special ones for groups without sufficient resources of their own:
one includes all nodes with CPUs and one includes all nodes with GPUs. Both are configured for preemption:
PartitionName=all PreemptMode=REQUEUE
PartitionName=allgpu PreemptMode=REQUEUE
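For context, a minimal partition-based preemption setup along these lines might look like the fragment below. This is a hedged sketch only: the PreemptType, node lists, and PriorityTier values are illustrative assumptions, not the site's actual maxwell configuration (attached later in this ticket).

```
# Illustrative slurm.conf fragment, NOT the actual maxwell configuration.
PreemptType=preempt/partition_prio          # preempt based on partition priority
# Low-priority scavenger partitions; their jobs are requeued when preempted:
PartitionName=all    Nodes=ALL              PriorityTier=1  PreemptMode=REQUEUE
PartitionName=allgpu Nodes=max-wng[004-019] PriorityTier=1  PreemptMode=REQUEUE
# Privileged partition whose pending jobs may trigger preemption:
PartitionName=upex   Nodes=max-wng[004-019] PriorityTier=10 PreemptMode=OFF
```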

Users submit jobs to one of the privileged queues (not all*). Such a job has a constraint like "V100" which cannot be fulfilled at submit time,
because all nodes which match the constraint are running jobs from the privileged queue.

Surprisingly, Slurm then preempts a job on a node which is in both the privileged queue and the all* queue but does not have the constraint!
The job stays pending, and after 3 minutes another node is preempted; this continues until a node in the privileged queue which fulfills the constraint
becomes free.
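The expected behavior can be sketched as follows. This is a hedged illustration in Python, not Slurm's actual C code (the function and data structures below are made up for the example): when choosing preemption victims, the scheduler should first intersect the preemptable node set with the nodes that actually satisfy the pending job's feature constraint. The reported bug behaves as if the feature check were skipped.

```python
# Hedged illustration, not Slurm source: preemption victim selection
# should only ever consider nodes that could satisfy the pending job's
# feature constraint. The reported bug preempts nodes regardless.

def preemption_candidates(constraint, nodes):
    """Return names of preemptable nodes that also carry the required feature."""
    return [name for name, (features, preemptable) in nodes.items()
            if preemptable and constraint in features]

# Toy node table: name -> (feature set, is the node in a preemptable all* partition)
nodes = {
    "max-wng005": ({"INTEL", "GPU", "P100"}, True),   # preemptable, wrong GPU type
    "max-wng012": ({"INTEL", "GPU", "V100"}, True),   # preemptable, matches V100
    "max-wng019": ({"INTEL", "GPU", "V100"}, False),  # matches, but not preemptable
}

print(preemption_candidates("V100", nodes))  # -> ['max-wng012']
```

With the feature check in place, only max-wng012 is a legitimate victim; the buggy behavior corresponds to also requeueing jobs on max-wng005, which can never run the pending V100 job.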



slurm.conf:
NodeName=max-wng[004-007]       Weight=25  Realmemory=256000  Sockets=2 CoresPerSocket=10  ThreadsPerCore=2 State=Unknown Feature=INTEL,V4,E5-2640,GPU,P100,GPUx1,256G
NodeName=max-wng[010-019]       Weight=40  RealMemory=384000  Sockets=2 CoresPerSocket=10  ThreadsPerCore=2 State=Unknown Feature=INTEL,V4,Silver-4114,GPU,V100,GPUx1,384G



/var/log/slurm/job_completions:
JobId=3127852 UserId=foo(42) GroupId=upex(616) Name=spawner-jupyterhub JobState=PREEMPTED Partition=allgpu TimeLimit=480 StartTime=2019-09-09T13:39:04 EndTime=2019-09-09T13:39:43 NodeList=max-wng007 NodeCnt=1 ProcCnt=40 WorkDir=/home/foo ReservationName= Gres= Account=upex QOS=normal WcKey= Cluster=maxwell SubmitTime=2019-09-09T13:36:45 EligibleTime=2019-09-09T13:38:46 DerivedExitCode=0:0 ExitCode=0:0
JobId=3128450 UserId=faa(43) GroupId=cfel(626) Name=0.8_1.0 JobState=COMPLETED Partition=maxgpu TimeLimit=240 StartTime=2019-09-09T13:50:17 EndTime=2019-09-09T15:13:30 NodeList=max-wng[012-013] NodeCnt=2 ProcCnt=80 WorkDir=/beegfs/desy/group/lux-users/localized_injection_2019/simulation/scan_first_profile_laser_scale_0.800_plasma_scale_1.000 ReservationName= Gres= Account=cfel QOS=cfel WcKey= Cluster=maxwell SubmitTime=2019-09-09T12:27:37 EligibleTime=2019-09-09T12:27:37 DerivedExitCode=0:0 ExitCode=0:0 



message.log:
Sep  9 10:06:28 max-adm01 slurmctld[24911]: backfill: Started JobId=3127852 in allgpu on max-wng005
Sep  9 13:29:44 max-adm01 slurmctld[24911]: preempted JobId=3127852 has been requeued to reclaim resources for JobId=3128450
Sep  9 13:29:46 max-adm01 slurmctld[24911]: Requeuing JobId=3127852
Sep  9 13:32:03 max-adm01 slurmctld[24911]: backfill: Started JobId=3127852 in allgpu on max-wng007
Sep  9 13:32:43 max-adm01 slurmctld[24911]: preempted JobId=3127852 has been requeued to reclaim resources for JobId=3128450
Sep  9 13:32:46 max-adm01 slurmctld[24911]: Requeuing JobId=3127852
Sep  9 13:35:04 max-adm01 slurmctld[24911]: backfill: Started JobId=3127852 in allgpu on max-wng005
Sep  9 13:36:43 max-adm01 slurmctld[24911]: preempted JobId=3127852 has been requeued to reclaim resources for JobId=3128450
Sep  9 13:36:45 max-adm01 slurmctld[24911]: Requeuing JobId=3127852
Sep  9 13:39:04 max-adm01 slurmctld[24911]: backfill: Started JobId=3127852 in allgpu on max-wng007
Sep  9 13:39:43 max-adm01 slurmctld[24911]: preempted JobId=3127852 has been requeued to reclaim resources for JobId=3128450
Sep  9 13:39:46 max-adm01 slurmctld[24911]: Requeuing JobId=3127852
...
Sep  9 13:50:17 max-adm01 slurmctld[24911]: sched: Allocate JobId=3128450 NodeList=max-wng[012-013] #CPUs=80 Partition=maxgpu
Sep  9 15:13:30 max-adm01 slurmctld[24911]: _job_complete: JobId=3128450 WEXITSTATUS 0
Sep  9 15:13:30 max-adm01 slurmctld[24911]: _job_complete: JobId=3128450 done


1. While the pending job is preempting innocent all* jobs:
# sacct -j 3128450 --format=jobid,state,Node,start,end,AllocCPUS,Constraints
       JobID      State        NodeList               Start                 End  AllocCPUS         Constraints
------------ ---------- --------------- ------------------- ------------------- ---------- -------------------
3128450         PENDING   None assigned             Unknown             Unknown          8                V100

2. After finishing:
# sacct -j 3128450 --format=jobid,state,Node,start,end,AllocCPUS,Constraints
       JobID      State        NodeList               Start                 End  AllocCPUS         Constraints 
------------ ---------- --------------- ------------------- ------------------- ---------- ------------------- 
3128450       COMPLETED max-wng[012-01+ 2019-09-09T13:50:17 2019-09-09T15:13:30         80                V100 
3128450.bat+  COMPLETED      max-wng012 2019-09-09T13:50:17 2019-09-09T15:13:30         40                     
3128450.0     COMPLETED max-wng[012-01+ 2019-09-09T13:50:17 2019-09-09T15:13:30          2
Comment 1 Jason Booth 2019-09-09 13:27:02 MDT
Hi Sven, Dominik will reach out to you about this ticket, but I wanted to get some initial information from you for him. The slurm.conf that we have on file is for maxwell, but the configuration you describe below looks different. Can you attach a recent copy of your slurm.conf for us to review, and would you also attach your gres.conf?
Comment 2 Sven Sternberger 2019-09-09 16:35:49 MDT
Created attachment 11520 [details]
maxwell slurm.conf

Attached is our current slurm.conf. We don't have a gres.conf.

At the moment we mostly see the problem as pending jobs in the upex and exfl* partitions preempting jobs in the all* partitions.
Comment 3 Dominik Bartkiewicz 2019-09-10 02:43:51 MDT
Hi

If these jobs still exist in the system could you send me the output from:
scontrol show job 3127852
scontrol show job 3128450

Full slurmctld.log will be useful too.

Dominik
Comment 4 Dominik Bartkiewicz 2019-09-10 03:48:25 MDT
Hi

I can reproduce this.
I will inform you when I find a solution.

Dominik
Comment 5 Sven Sternberger 2019-09-10 06:31:29 MDT
Created attachment 11523 [details]
slurmctld log

We updated Slurm to 19.05.2 on Mon, 02 Sep 2019 10:47.
The problem was reported to me over the weekend of September 6th-8th.
Comment 7 Sven Sternberger 2019-09-12 02:38:43 MDT
Created attachment 11555 [details]
Log with massive duplicate-id errors due to preemption bug

Hi!

Last evening we again had 18000 preemption events. We also found
that 50 nodes were set to DRAIN because of "Duplicate jobid" errors:

Sep 12 00:13:00 max-adm01 slurmctld[10333]: backfill: Started JobId=3130441 in all on max-exfl022
Sep 12 01:40:41 max-adm01 slurmctld[10333]: email msg to foo@desy.de: Slurm Job_id=3130441 Name=SSdef2! Ended, Run time 00:00:30, PREEMPTED, ExitCode 0
Sep 12 01:40:41 max-adm01 slurmctld[10333]: preempted JobId=3130441 has been requeued to reclaim resources for JobId=3140569
Sep 12 01:40:43 max-adm01 slurmctld[10333]: Requeuing JobId=3130441

..repeat 50 times ..

Sep 12 01:43:43 max-adm01 slurmctld[10333]: drain_nodes: node max-exfl022 state set to DRAIN
Sep 12 01:43:43 max-adm01 slurmctld[10333]: error: Duplicate jobid on nodes max-exfl022, set to state DRAIN
Comment 8 Dominik Bartkiewicz 2019-09-12 03:36:31 MDT
Hi

We have a patch which should solve this issue.
It is currently under internal quality assurance.
I will let you know when it is in the repo.

Dominik
Comment 10 Dominik Bartkiewicz 2019-09-12 08:32:02 MDT
Hi

This commit should fix this issue:
https://github.com/SchedMD/slurm/commit/0d432caed

It will be included in 19.05.3.
Please let me know if you find any issues after applying it.

Dominik
Comment 11 Sven Sternberger 2019-09-12 08:45:20 MDT
Hello! 

Great news! Could you give me an estimate of when 19.05.3 will arrive? Or could
I simply apply the patch to 19.05.2 and just replace the patched slurmctld?

best regards! 

Comment 12 Dominik Bartkiewicz 2019-09-12 08:57:45 MDT
We plan to release 19.05.3 before the end of the month, but we have no firm date yet.

You can apply this patch locally on top of 19.05.2.

Dominik
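Applying a single upstream commit locally, as suggested above, can be sketched as follows. This is a hedged sketch, not an official procedure: the GitHub `.patch` URL pattern and build steps in the comments are assumptions, and the runnable part below only demonstrates the `patch -p1` mechanics on a toy file rather than on the real Slurm sources.

```shell
# Real-world flow would be roughly (paths/URL assumed, not verified here):
#   wget https://github.com/SchedMD/slurm/commit/0d432caed.patch
#   cd slurm-19.05.2 && patch -p1 < ../0d432caed.patch
#   then rebuild and replace only the patched slurmctld

# Toy demonstration of the patch mechanics on a stand-in file:
printf 'line one\nline two\n' > node_scheduler.c
cat > fix.patch <<'EOF'
--- a/node_scheduler.c
+++ b/node_scheduler.c
@@ -1,2 +1,3 @@
 line one
+inserted fix
 line two
EOF
patch -p1 < fix.patch      # -p1 strips the leading a/ and b/ components
cat node_scheduler.c       # the one-line fix is now in place
```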
Comment 13 Sven Sternberger 2019-09-15 13:42:04 MDT
Created attachment 11587 [details]
slurmctld log file after patch

Hello!

I replaced src/slurmctld/node_scheduler.c in the 19.05.2 sources with the
"new" one and recompiled slurmctld. I couldn't just add the single line,
as there are more changes in that file compared to 19.05.2.

I hope this is OK, or should I clone the whole repository and rebuild instead?

For us it looks OK now: no unnecessary preemptions anymore. Attached you
will find the current log.

Best regards and thanks for your help
Sven
Comment 14 Dominik Bartkiewicz 2019-09-16 05:22:10 MDT
Hi

This patch contains only one line.
I'm glad to hear things are working.
Can we drop the severity to 3 now, as the patch is already in the git repo?

Dominik
Comment 15 Sven Sternberger 2019-09-16 07:05:34 MDT
Hello! 

So would you recommend using only the one-line change?
And yes, since everything looks OK now, we can drop it to 3.

best regards! 

Comment 24 Dominik Bartkiewicz 2019-10-02 03:23:52 MDT
Hi

These commits contain additional fixes related to preemption when jobs request multiple features. Both will be included in 19.05.3:

https://github.com/SchedMD/slurm/commit/f2fcf3af981
https://github.com/SchedMD/slurm/commit/c2a57967cef

I'm closing the case now as fixed. In case of any questions related to this
issue, please feel free to reopen.

Dominik
Comment 25 Dominik Bartkiewicz 2020-01-07 10:53:37 MST
*** Ticket 8283 has been marked as a duplicate of this ticket. ***