Ticket 15737

Summary: Root user starting a job from the queue, bypassing account and queue limits
Product: Slurm    Reporter: Jay McGlothlin <mcglow2>
Component: Limits    Assignee: Director of Support <support>
Status: RESOLVED TIMEDOUT    QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: RPI/CCNI - Rensselaer Polytechnic Institute

Description Jay McGlothlin 2023-01-05 08:40:19 MST
Is there a recommended way for a root user to manually move a job whose Pending reason is AssocMaxJobsLimit to Running?

The use case is that we set MaxJobs on accounts to keep users from single-handedly filling the cluster with many small jobs in parallel.  However, every once in a while this results in a lot of idle nodes when the queue is empty except for that user's jobs.  I am looking for an easy way for our admins to release another group of pending jobs to run when they see this.
Comment 1 Caden Ellis 2023-01-09 16:18:31 MST
Jay,

Specifically for what you want, you need to override the limit at the user level. Once the limit is changed, the scheduler will re-evaluate the pending jobs against the new limit and allow more of them to run. This will take several seconds to happen once the limit is changed.

Example:

sacctmgr modify user where name=<username> set maxjobs=<more jobs>

Then once more are scheduled, change it back.

sacctmgr modify user where name=<username> set maxjobs=<old amount>
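The bump-and-restore steps above could be wrapped in a small helper. This is only a sketch: the bump_maxjobs name, the default 30-second delay, and the assumption that the user has a single association are mine, not anything Slurm provides.

```shell
#!/bin/sh
# Hypothetical helper for the bump-and-restore procedure described above.

bump_maxjobs() {
    user="$1"; new_limit="$2"; delay="${3:-30}"

    # Read the current MaxJobs from the user's association
    # (-n: no header, -P: parsable output; head in case of multiple associations).
    old_limit=$(sacctmgr -n -P show assoc where user="$user" format=MaxJobs | head -n 1)

    # Raise the limit so the scheduler releases more of the user's pending jobs
    # (-i: apply immediately without the confirmation prompt).
    sacctmgr -i modify user where name="$user" set maxjobs="$new_limit"

    # Give the scheduler a few seconds to start jobs under the new limit,
    # then restore the original value; already-running jobs keep running.
    sleep "$delay"
    sacctmgr -i modify user where name="$user" set maxjobs="$old_limit"
}
```

For example, `bump_maxjobs someuser 20` would raise that user's MaxJobs to 20, wait 30 seconds, and then put the old limit back.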

There may be other limits that you could mix and match instead, to avoid needing a manual override like this while still keeping a single user from filling the cluster:

https://slurm.schedmd.com/resource_limits.html
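As one illustration of mixing limits from that page (the account name and the cpu value below are placeholders, not a recommendation for your site), a group TRES cap bounds the account's total CPU footprint without capping its job count:

```shell
# Example only: cap the account's total in-use CPUs instead of its job count.
sacctmgr -i modify account where name=myacct set GrpTRES=cpu=512

# A per-user limit can be cleared by setting it to -1.
sacctmgr -i modify user where name=someuser set maxjobs=-1
```

With a cap like this, many small jobs from one user can still backfill idle nodes up to the CPU ceiling, without an admin having to intervene.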

But the manual override I suggested will allow more jobs to run for that user until you change it back, as you requested. When I raised the limit and saw in squeue that more jobs had been scheduled, I immediately lowered the limit again, so I did not have to wait for all of those jobs to finish before restoring it. I saw no issues handling it like this.
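If it helps, the squeue check I described can be reduced to a one-liner counting the user's running jobs (the username is a placeholder):

```shell
# Count running jobs for the user (-h: no header, -t R: running state only).
squeue -h -u someuser -t R | wc -l
```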

Does this answer your question?

Caden
Comment 2 Caden Ellis 2023-01-17 14:45:23 MST
Do you have an update for me on this?
Comment 3 Caden Ellis 2023-01-27 10:29:09 MST
Feel free to open this back up if you have further questions.

Caden