Ticket 16246

Summary: Cannot get PreemptExemptTime to work as documented
Product: Slurm Reporter: Rob <rug262>
Component: Scheduling    Assignee: Ben Roberts <ben>
Status: OPEN --- QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: ben, marshall
Version: 22.05.8   
Hardware: Linux   
OS: Linux   
Site: PSU
Version Fixed: 23.02.2, 23.11.0rc1

Description Rob 2023-03-10 15:57:13 MST
I'm trying to see PreemptExemptTime in action.  I'm using the QOS to accomplish this instead of the global value.  Here are my pertinent settings:

slurm.conf
PreemptType=preempt/qos
PreemptMode=SUSPEND,GANG
PreemptExemptTime=00:00:00

Partitions:
PartitionName=active Default=YES QOS=normal Nodes=t-sc-1101  (this node has 48 cores)
PartitionName=hipri Default=NO QOS=expedite Nodes=t-sc-1101

QOS:
normal:  PreemptExemptTime=00:01:00, PreemptMode=requeue, Priority=50
expedite: Preempt=normal, PreemptMode=requeue, Priority=100
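
For reference, QOS settings like the two above are normally created with sacctmgr. A minimal sketch, using the names and values from this ticket (exact option support can vary by Slurm version):

```shell
# Create the two QOS entries, then set the preemption-related
# attributes described above (run with Slurm admin privileges)
sacctmgr add qos normal
sacctmgr modify qos normal set PreemptExemptTime=00:01:00 PreemptMode=requeue Priority=50

sacctmgr add qos expedite
sacctmgr modify qos expedite set Preempt=normal PreemptMode=requeue Priority=100
```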


With these settings, if I submit a job to the normal partition that uses all 48 cores (or simply requests one node), that job starts running.  If I then submit the same job to the hipri partition, THAT job also starts running.  That means two jobs are running on one node and must be sharing the CPUs.  My understanding from the QOS documentation was that this would only happen if both jobs came from the same partition.  In this case, they don't.

Please help me figure out a setup where PreemptExemptTime is honored and jobs don't time-share.  Thanks.
Comment 2 Jason Booth 2023-03-10 16:23:20 MST
Hi Rob,

There is no support for this parameter and 'SUSPEND,GANG'.

> https://slurm.schedmd.com/sacctmgr.html#OPT_PreemptExemptTime

PreemptExemptTime

Specifies a minimum run time for jobs of this QOS before they are considered for preemption. This QOS option takes precedence over the global PreemptExemptTime. This is only honored for PreemptMode=REQUEUE and PreemptMode=CANCEL. Setting to -1 disables the option, allowing another QOS or the global option to take effect. Setting to 0 indicates no minimum run time and supersedes the lower priority QOS (see OverPartQOS) and/or the global option in slurm.conf.
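
As a sketch of those two special values (using a hypothetical QOS name `lowpri`; these are admin commands, not from the ticket):

```shell
# -1 disables the QOS-level value, so another QOS or the global
# PreemptExemptTime in slurm.conf takes effect for these jobs
sacctmgr modify qos lowpri set PreemptExemptTime=-1

# An explicit 0 means no minimum run time, overriding the
# global value in slurm.conf
sacctmgr modify qos lowpri set PreemptExemptTime=0
```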
Comment 3 Rob 2023-03-13 07:44:49 MDT
I have found a large amount of ambiguity in the Slurm docs, since properties such as "PreemptMode" are repeated at multiple levels. Given that this passage is in the sacctmgr docs, I assumed that the "PreemptMode" it says cannot be SUSPEND,GANG was the QOS PreemptMode setting (which I showed set to REQUEUE), not the global setting in slurm.conf.

So you're telling me that, no matter what convoluted combination of global, partition, and QOS settings I choose, there is no way to have PreemptExemptTime and the Gang scheduler (which is needed in order to reschedule suspended jobs) functional in the same cluster?

Thanks,

Rob
Comment 5 Rob 2023-03-13 10:00:34 MDT
I'm now attempting to reconfigure the system for requeue only, with no suspending.  I have changed the slurm.conf and QOS entries to remove all manner of suspending or gang scheduling.

slurm.conf
PreemptType=preempt/qos
PreemptMode=REQUEUE
PreemptExemptTime=00:00:00
JobRequeue=1

Partitions:
PartitionName=normal Default=YES QOS=normal Oversubscribe=No Nodes=t-sc-1101  (this node has 48 cores)
PartitionName=expedite Default=NO QOS=expedite Oversubscribe=No Nodes=t-sc-1101

QOS:
normal:  PreemptExemptTime=00:01:00, PreemptMode=requeue
expedite: Preempt=normal, PreemptMode=requeue


If I do this, start a job in the normal partition (taking up all cores on the one node), and then start another job in the expedite partition, the expedite job stays pending with reason Resources and does not preempt the normal partition/QOS job even after the minute of exempt time has passed.  Please help me figure out why preemption is not happening.
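
To see why a job stays pending in a case like this, a couple of standard inspection commands can help (a sketch; `<jobid>` is a placeholder):

```shell
# Show the pending job's state and reason (%T = state, %r = reason)
squeue -j <jobid> -o '%i %T %r'

# Confirm the preemption settings actually in effect for each QOS
sacctmgr show qos normal,expedite format=name,preempt,preemptmode,preemptexempttime
```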
Comment 6 Marshall Garey 2023-03-13 10:50:45 MDT
Hi Rob,

I'm happy to help you get this working.

(1) You are correct that PreemptExemptTime does not work with PreemptMode = GANG or SUSPEND.

(2) With preempt/qos:
* Only job QOS is considered when determining if one job can preempt another job. The Partition QOS Preempt setting is not used to determine if a job can preempt another job.
* Partition QOS PreemptExemptTime overrides job PreemptExemptTime, unless the job's QOS has the flag OverPartQOS.
* A job cannot preempt another job with the same QOS unless the QOS has PreemptMode=WITHIN (new in 23.02).


In your example, you are trying to use Partition QOS to preempt; that won't work. Remove the QOS from the partition definitions and make sure that the jobs request the QOS with the --qos option for salloc, sbatch, and srun. This should give you the behavior that you want: the job that requests --qos=expedite should preempt the job that requests --qos=normal.
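
A minimal sketch of that suggestion, using the names from this ticket (the sizes and `sleep` payload are illustrative, not from the original reproduction):

```shell
# slurm.conf excerpt: partitions defined WITHOUT a QOS= setting
#   PartitionName=normal Default=YES Oversubscribe=No Nodes=t-sc-1101
#   PartitionName=expedite Default=NO Oversubscribe=No Nodes=t-sc-1101

# Jobs then request the QOS explicitly, so job QOS (not partition
# QOS) drives the preemption decision:
sbatch --qos=normal   -n1 -c48 --wrap='sleep 600'
sbatch --qos=expedite -n1 -c48 --wrap='sleep 600'
```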

Can you let me know if this works for you?

We will work to clarify the documentation about preempt/qos.
Comment 8 Rob 2023-03-13 12:48:05 MDT
Yes, thank you, I did actually see that work that time.

Can you tell me if this statement is true:

If I am going to use PreemptExemptTime at a cluster or QOS level (since there is no setting for partitions), then that means that:

-- The cluster PreemptMode (slurm.conf) cannot be SUSPEND,GANG
-- No partition definition at all can be PreemptMode=SUSPEND,GANG even if I don't want to use PreemptExemptTime in that partition
-- No QOS definition at all can contain a PreemptMode=SUSPEND,GANG even if I don't want to use PreemptExemptTime in that QOS.

Is that all true?

In other words, if I want to use PreemptExemptTime anywhere in my cluster, then I will have to completely give up the ability to suspend and resume jobs, in any way, within that same cluster?

Thanks.  That so far has not been clear.  It is confusing because you say "(1) You are correct that PreemptExemptTime does not work with PreemptMode = GANG or SUSPEND.", but there is a PreemptMode in cluster, partition, and QOS settings.  Which setting?  Or do you mean ALL of them?

Thanks.
Comment 9 Rob 2023-03-13 15:18:28 MDT
I think I have answered that question myself, as I see in the documentation that for ANY PreemptMode to be SUSPEND, "GANG" has to be specified at the cluster level.  And my testing has shown me that putting GANG in the cluster-level PreemptMode makes the PreemptExemptTime setting no longer work.  So I believe I now know for sure that SUSPEND and PreemptExemptTime cannot exist in the same cluster.

Thanks.
Comment 10 Marshall Garey 2023-03-13 17:00:06 MDT
(In reply to Rob from comment #9)
> I think I have answered that question myself, as I see in the documentation
> that in order for ANY preemptmode to be suspend, then "gang" has to be
> specified at the cluster level.  And my testing has showed me that putting
> gang in the cluster level preemptMode makes the preemptExemptTime setting no
> longer work.  So I believe I now know for sure that Suspend and ExemptTime
> cannot exist in the same cluster.

Yes, that is correct. We will clarify the documentation about PreemptExemptTime and what is allowed with PreemptMode=GANG/SUSPEND.
Comment 11 Marshall Garey 2023-03-14 05:58:47 MDT
I need to make a correction. I'm not sure why I didn't get this to work yesterday, but I did get a QOS PreemptMode=requeue to override the cluster setting SUSPEND,GANG, and PreemptExemptTime worked with it.

slurm.conf:
PreemptType=preempt/qos
PreemptMode=suspend,gang


$ sacctmgr show qos low,high format=name,preempt,preemptexempttime,preemptmode
      Name    Preempt   PreemptExemptTime PreemptMode
---------- ---------- ------------------- -----------
      high        low                         cluster
       low                       00:01:00     requeue


In a default partition with one node, 8 cpus per node, I submit two jobs which take the whole node so that only one can run at a time:

$ sbatch --qos=low -Dtmp -n1 -c8 --wrap='whereami 600'
$ sbatch --qos=high -Dtmp -n1 -c8 --wrap='whereami 600'

After a little more than one minute of runtime, the job in qos low was preempted and requeued, and the job in qos high started running.
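
One way to watch that requeue happen from the outside (a sketch; `<jobid>` is a placeholder):

```shell
# List both jobs with QOS, state, pending reason, and run time
# (%q = QOS, %T = state, %r = reason, %M = time used)
squeue -o '%i %q %T %r %M'

# Per-job detail; a preempted-and-requeued job shows a nonzero
# Restarts count in this output
scontrol show job <jobid>
```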

Can you test this to see if it works for you?
Comment 12 Rob 2023-03-14 07:54:49 MDT
I've tested it, and PreemptExemptTime is indeed respected.  That is perfect. I am just going to confirm that suspend/gang also works in its own partitions as well.

I just want to note about the documentation:  The confusion was not about what PreemptMode should be set to, but *WHICH* PreemptMode, since there is a cluster, partition, and QOS one.
Comment 13 Rob 2023-03-14 08:50:10 MDT
I have indeed now confirmed that I can suspend and resume jobs, and that other jobs have a working preemptexempttime.  Thank you so much.  Everything I had done and read had led me to believe that wasn't possible.

Rob
Comment 14 Rob 2023-03-14 08:51:10 MDT
Thanks!
Comment 15 Marshall Garey 2023-03-14 09:03:29 MDT
I'm glad that it's all working for you. I'm reopening the bug to track fixing the documentation. We'll re-close the bug once we've updated the docs.
Comment 24 Ben Roberts 2023-04-14 11:39:15 MDT
Hi Rob,

I wanted to let you know that we have checked in some updates to the documentation to better explain some of the questions you had.  If you're interested you can see the commits here:

https://github.com/SchedMD/slurm/commit/a53f83e34ec7b95898abf40e5ebc20a9bb4f05cd
https://github.com/SchedMD/slurm/commit/3e4762a1dfeb6d222f6027ca21f6f2850c006fab
https://github.com/SchedMD/slurm/commit/8b4c18e93ac9b456c09b5b8589eae8891bc17204

These will show up in the online documentation with the release of 23.02.2.

Thanks,
Ben
Comment 25 Rob 2023-04-14 12:04:43 MDT
I appreciate the chance to review the changes, but I don't see that any of them address the concern I expressed in comments 8 and 12: in cases where PreemptMode had to be a certain value, there was no distinction of WHICH PreemptMode was meant, since there is a PreemptMode set globally, in partitions, and in QOS.  It is this ambiguity that I found difficult to navigate.
Comment 27 Ben Roberts 2023-04-17 11:45:31 MDT
Hi Rob,

My apologies, we identified our own shortcomings that we saw in the documentation and fixed those but overlooked the thing that was confusing to you.  I'll reopen this ticket and work on updating the documentation to clarify this aspect of it.

Thanks,
Ben
Comment 28 Rob 2023-04-17 11:48:08 MDT
Thank you Ben.
Comment 35 Ben Roberts 2023-05-24 13:24:02 MDT
Hi Rob,

I'm lowering the priority of this ticket while working on documentation changes.  I'll let you know as things progress.

Thanks,
Ben
Comment 36 Rob 2023-05-24 13:29:35 MDT
Ok, thanks for keeping me updated.
Comment 37 Rob 2023-06-15 07:19:30 MDT
I've been researching PreemptExemptTime with partition preemption instead of QOS, and I've discovered some interesting things and possibly a bug.  Would you like me to add it here, or start a new ticket?
Comment 38 Ben Roberts 2023-06-19 12:31:26 MDT
Hi Rob,

My apologies for the delayed response, I was out of the office for a couple of days at the end of last week.  I think it would be best in a new ticket so that the documentation changes we're working on don't get lost in the shuffle.

Thanks,
Ben
Comment 39 Rob 2023-06-20 07:24:20 MDT
Ok, will do.