Ticket 10905 - How do we force a job to run immediately?
Summary: How do we force a job to run immediately?
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 20.02.3
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Ben Roberts
Reported: 2021-02-19 06:36 MST by Randall Radmer
Modified: 2021-03-02 08:42 MST
Site: SLAC

Description Randall Radmer 2021-02-19 06:36:34 MST
To test our nodes regularly, we want to be able to force a Slurm job to run immediately. The only way we have found to do this is with "srun --no-allocate", but that option doesn't work even with admin privileges. Instead it requires particular user names like "root", which we would like to avoid (see the documentation excerpt below). Is there a better way to force a Slurm job to run immediately? Note that we do not want to kill running jobs to get this done.

-Z, --no-allocate
Run the specified tasks on a set of nodes without creating a Slurm "job" in the
Slurm queue structure, bypassing the normal resource allocation step. The list of
nodes must be specified with the -w, --nodelist option. This is a privileged
option only available for the users "SlurmUser" and "root". This option applies to job allocations.

Thanks,
Randy
Comment 1 Ben Roberts 2021-02-19 13:31:21 MST
Hi Randall,

It looks like you have already found the method we have for starting a job on a node immediately. A job submitted normally has to wait for resources to become available before it can be scheduled on a particular node. Alternatively, you could use preemption to get a job to start without waiting for other jobs to finish. You said that you don't want to kill existing jobs, so you wouldn't want to configure preemption to cancel them (PreemptMode=CANCEL), but you could configure it to suspend them (PreemptMode=SUSPEND). Keep in mind that a suspended job stays resident in memory, so the preempting job has to request little enough memory that both jobs fit in the memory available on the node. With preemption your job starts sooner than it otherwise would, but it still has to wait for a scheduling cycle and then for the existing job(s) to be suspended, so it is faster but not immediate.
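As a rough sketch of the suspend-based setup described above, a slurm.conf fragment might look like the following. The partition names (batch, nodetest) and the node range are hypothetical, and other settings that interact with suspension (for example OverSubscribe and memory limits) are omitted, so check it against the preemption documentation before using it.

```
# Cluster-wide: preempt by partition priority, suspending lower-priority jobs.
# SUSPEND is combined with GANG so suspended jobs are resumed afterwards.
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG

# Hypothetical partitions sharing the same nodes. Jobs in the partition with
# the higher PriorityTier can preempt jobs in the one with the lower tier.
PartitionName=batch    Nodes=node[001-004] Default=YES PriorityTier=10 PreemptMode=SUSPEND
PartitionName=nodetest Nodes=node[001-004] Default=NO  PriorityTier=30 PreemptMode=OFF
```

Here the per-partition PreemptMode describes what happens to that partition's own jobs when they are preempted, so the test partition is set to OFF and the regular partition to SUSPEND.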

Outside of preemption, the mechanism available is the '--no-allocate' flag for srun. As the documentation says, it doesn't create a job allocation, so it can get onto a node even if the node is busy. This ability is limited to 'root' and the 'SlurmUser' because of its potential to cause serious problems for existing jobs.
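For illustration, the invocation being discussed could be assembled as in the sketch below. The helper name health_check_cmd, the node name node001, and the hostname payload are placeholders rather than anything from this ticket. The sketch only prints the command line; the printed command would have to be run as root or as the SlurmUser.

```shell
# Print the srun command line for a no-allocation check on a single node.
# --no-allocate requires an explicit node list (-w / --nodelist).
health_check_cmd() {
  local node="$1"
  printf 'srun --no-allocate --nodelist=%s hostname\n' "$node"
}

# Show the command instead of executing it.
health_check_cmd node001
```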

If using the '--no-allocate' flag doesn't meet your needs, does it sound like suspending jobs with preemption would? If you have any additional questions about suspending jobs, feel free to let me know. You can also read more about it in the documentation here:
https://slurm.schedmd.com/preempt.html

Thanks,
Ben
Comment 2 Randall Radmer 2021-02-19 14:00:31 MST
Thanks much Ben.  Feel free to resolve this ticket.

-Randy
Comment 3 Ben Roberts 2021-02-19 20:22:54 MST
You're welcome.  Closing now.
Comment 4 Randall Radmer 2021-03-02 08:05:20 MST
Hi Ben,

Quick follow-up question: Is it possible to define a QOS that will preempt jobs running in queues when PreemptType=preempt/partition_prio is set in slurm.conf?
Comment 5 Ben Roberts 2021-03-02 08:28:55 MST
No, I'm afraid not.  PreemptType is a global setting that allows you to do Partition based preemption or QOS based preemption cluster-wide.
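To make the either/or concrete, the relevant slurm.conf line could be sketched as follows. Only one PreemptType value is active for the whole cluster, and preemption rules between QOSes (set through the QOS Preempt option in sacctmgr) are only consulted when preempt/qos is the selected plugin.

```
# slurm.conf: a single, cluster-wide choice of preemption plugin.
# Preemption decided by partition PriorityTier:
PreemptType=preempt/partition_prio
# Alternative, preemption decided by QOS rules (not both at once):
#PreemptType=preempt/qos
```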

Thanks,
Ben
Comment 6 Randall Radmer 2021-03-02 08:42:48 MST
Thank you for the clarification.
