Ticket 6593 - Main scheduler preferring regular jobs vs hetjobs when it shouldn't
Summary: Main scheduler preferring regular jobs vs hetjobs when it shouldn't
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 18.08.5
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-02-26 11:13 MST by Alejandro Sanchez
Modified: 2019-03-20 04:48 MDT

See Also:
Site: Jülich
Version Fixed: 18.08.7 19.05.0pre2


Description Alejandro Sanchez 2019-02-26 11:13:55 MST
Split off from bug 5579 so that this issue can be tracked in a separate bug.
Comment 4 Alejandro Sanchez 2019-03-20 03:01:47 MDT
Hi Jülich colleagues,

This has been fixed in the following commit, available since 18.08.7:

https://github.com/SchedMD/slurm/commit/cb599ecfcc24706e

Behavior before:

alex@polaris:~/t$ sbatch --exclusive : --exclusive --wrap "sleep 9999"
Submitted batch job 20001
alex@polaris:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           20001+0        p1     wrap     alex PD       0:00      1 (None)
           20001+1        p1     wrap     alex PD       0:00      1 (None)
alex@polaris:~/t$ sbatch --exclusive --wrap "sleep 9999"
Submitted batch job 20003
alex@polaris:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           20001+0        p1     wrap     alex PD       0:00      1 (None)
           20001+1        p1     wrap     alex PD       0:00      1 (None)
             20003        p1     wrap     alex  R       0:01      1 compute1
alex@polaris:~/t$

The hetjob has higher priority, but the main scheduler allocates resources to the regular job while the hetjob waits for the backfill cycle.

Behavior after patch:

alex@polaris:~/t$ sbatch --exclusive : --exclusive --wrap "sleep 9999"
Submitted batch job 20010
alex@polaris:~/t$ sbatch --exclusive --wrap "sleep 9999"
Submitted batch job 20012
alex@polaris:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           20010+0        p1     wrap     alex PD       0:00      1 (None)
           20010+1        p1     wrap     alex PD       0:00      1 (None)
             20012        p1     wrap     alex PD       0:00      1 (Priority)
alex@polaris:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           20010+0        p1     wrap     alex PD       0:00      1 (None)
           20010+1        p1     wrap     alex PD       0:00      1 (None)
             20012        p1     wrap     alex PD       0:00      1 (Priority)
alex@polaris:~/t$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             20012        p1     wrap     alex PD       0:00      1 (Priority)
           20010+0        p1     wrap     alex  R       0:04      1 compute1
           20010+1        p1     wrap     alex  R       0:04      1 compute2
alex@polaris:~/t$
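
The before/after behavior in the transcripts above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Slurm's actual code and not a description of what the commit changes internally: the `Job` class, the `main_scheduler_pass` function, the `hold_behind_hetjob` flag, and the priority values 100 and 50 are all hypothetical names and numbers chosen for illustration. The model assumes only what the transcripts show: hetjob components stay pending after a main scheduler pass, and after the patch a lower-priority regular job queued behind them is held with reason Priority instead of being started.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    # Hypothetical, simplified job record for this sketch only.
    job_id: str
    priority: int
    is_hetjob: bool = False
    state: str = "PD"
    reason: str = "None"
    node: Optional[str] = None


def main_scheduler_pass(queue: List[Job], free_nodes: List[str],
                        hold_behind_hetjob: bool) -> None:
    """One pass of a toy main scheduler over the queue in priority order."""
    hetjob_pending = False
    for job in sorted(queue, key=lambda j: j.priority, reverse=True):
        if job.state != "PD":
            continue
        if job.is_hetjob:
            # In this model, hetjob components are left for a later backfill cycle.
            hetjob_pending = True
            continue
        if hold_behind_hetjob and hetjob_pending:
            # Patched behavior: do not start jobs ranked below a pending hetjob.
            job.reason = "Priority"
            continue
        if free_nodes:
            job.state, job.node = "R", free_nodes.pop(0)


def demo(hold_behind_hetjob: bool) -> List[Job]:
    # Priorities 100 and 50 are made up; the ticket only says the hetjob ranks higher.
    queue = [Job("20001+0", 100, True), Job("20001+1", 100, True), Job("20003", 50)]
    main_scheduler_pass(queue, ["compute1", "compute2"], hold_behind_hetjob)
    return queue
```

Calling `demo(False)` mirrors the "before" transcript, where the regular job starts on compute1 while both hetjob components stay pending. Calling `demo(True)` mirrors the first `squeue` output of the "after" transcript, where the regular job stays pending with reason Priority.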
Comment 6 Chrysovalantis Paschoulas 2019-03-20 03:50:54 MDT
Hi Alejandro!

Great job, thanks! :)

Is there any possibility of a backport/patch for 17.11? It will still take a few more months until we can actually update to 18.08 or jump to 19.05...

Best Regards,
Valantis
Comment 7 Alejandro Sanchez 2019-03-20 04:21:48 MDT
(In reply to Chrysovalantis Paschoulas from comment #6)

I'd rather wait for the further patches coming in bug 6710 and bug 6594. Once those are checked into 18.08, I can prepare a single standalone backport for 17.11 that combines all the hetjob-related fixes, if that sounds good to you. Right now I'm not sure what you currently have backported, and all of these fixes change the same area of code, which is why I prefer to combine everything once it is all checked in.
Comment 8 Chrysovalantis Paschoulas 2019-03-20 04:48:24 MDT
(In reply to Alejandro Sanchez from comment #7)
> (In reply to Chrysovalantis Paschoulas from comment #6)
> > Hi Alejandro!
> > 
> > Great job, thanks! :)
> > 
> > Any possibility for a backport/patch for 17.11? Still it will take a few
> > more months until we can actually update to 18.08 or jump to 19.05...
> > 
> > Best Regards,
> > Valantis
> 
> I'd rather wait for more patches that are to come in bug 6710 and bug 6594,
> and once checked-in into 18.08 I can prepare a single patch for 17.11
> combining all the different fixes related to hetjobs into a single
> standalone backport, if that sounds good to you. Right now I'm not sure what
> you currently have backported, and all the different fixes change same area
> of code that's why I prefer to combine everything once all is checked-in.

I agree with you, that would be great! Thanks :)

We would also like to avoid any mess with the patches.