Ticket 16720

Summary: user reservations
Product: Slurm
Reporter: William J Edsall <wjedsall>
Component: reservations
Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
Version: 22.05.2
Hardware: Linux
OS: Linux
Site: Dow Chemical
Slinky Site: ---

Description William J Edsall 2023-05-12 09:18:12 MDT
Hello support,
 We often get requests from users to create a standing reservation for them. Often these users are already flooding the queue but just need a high priority resource that they can access on demand.

 Is there any way that a user can achieve this result within their own control? Our team would prefer not to create reservations with scontrol ourselves. For example, can they run salloc, leave that job running for, say, a week, and send tasks to that allocation as needed?

 Thanks
 Will
Comment 1 Ben Roberts 2023-05-12 11:34:22 MDT
Hi Will,

I'll throw out a few options that may help you find something that works for you.  If you like the reservation idea, but just don't want to have to create them yourself, there is the option of making another user an 'Operator'.  This gives them some administrative privileges (including the ability to create a reservation) without making them a full administrator.
https://slurm.schedmd.com/user_permissions.html
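
As a rough sketch of the Operator option, the snippet below only prints the sacctmgr command so it can be reviewed before anyone runs it. The user name "alice" is hypothetical, and this assumes accounting through slurmdbd is in place, since the admin level is stored there.

```shell
# Print (do not run) the sacctmgr command that would make a user an Operator.
# $1 is the user name; "alice" below is a hypothetical example.
operator_cmd() {
    printf 'sacctmgr modify user where name=%s set adminlevel=Operator\n' "$1"
}

operator_cmd alice
```

An administrator could run the printed line as-is once the user name is correct. Setting adminlevel=None would take the privilege away again.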

It would be possible for a user to submit a job that ran for a long period of time that they could come back to and use when needed.  If you created a special partition for users to do this then you could enforce different limits (including the maximum amount of time allowed per job) than you do for the typical jobs to prevent users from abusing this partition.  Users could do this by starting an interactive job, leaving the terminal open and coming back to the terminal when they need to do something.  You would need to make sure you don't have InactiveLimit set, otherwise the scheduler would kill these jobs after they sat idle for a while.
https://slurm.schedmd.com/slurm.conf.html#OPT_InactiveLimit
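
A minimal sketch of the long-running allocation idea, again only printing the commands for review. The partition name "standing", the one-week time limit and the job ID 12345 are assumptions for illustration; the partition would need to exist with a MaxTime that allows the request.

```shell
# Hypothetical partition name; assumes the site has defined it with a suitable MaxTime.
partition="standing"

# Print the command that would hold one node for a week.
hold_cmd() {
    printf 'salloc --partition=%s --nodes=1 --time=7-00:00:00\n' "$partition"
}

# Print the command that would launch a task inside an existing allocation.
# $1 is the job ID reported by salloc; the remaining arguments are the task.
step_cmd() {
    jobid="$1"
    shift
    printf 'srun --jobid=%s %s\n' "$jobid" "$*"
}

hold_cmd
step_cmd 12345 hostname
```

The second command is what lets a user send tasks to the allocation on demand from another terminal, as long as the allocation itself is still alive.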

It may be difficult for some users to leave a terminal open for long periods of time if their computer shuts down or sleeps regularly.  If this is the case then you may consider configuring the pam_slurm_adopt module.  This would allow users to submit a batch job (rather than an interactive job) that runs for however long you want.  The job can just sleep the whole time it's there, but then users can ssh to the node the job is running on and their session would be adopted into the job's external step.  Note that this would only allow users who have a job running on a node to ssh to that node.  You can find more details about this here:
https://slurm.schedmd.com/pam_slurm_adopt.html
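
The batch-job variant could look roughly like this, assuming pam_slurm_adopt is already configured on the compute nodes. The snippet prints the commands rather than running them; the one-week limit and the job ID are hypothetical, and "sleep infinity" assumes GNU coreutils on the node.

```shell
# Print the submission for a placeholder job that just sleeps for up to a week.
placeholder_cmd() {
    printf 'sbatch --nodes=1 --time=7-00:00:00 --wrap="sleep infinity"\n'
}

# Print the query that reports which node(s) job $1 is running on.
node_query_cmd() {
    printf 'squeue --noheader --jobs=%s --format=%%N\n' "$1"
}

placeholder_cmd
node_query_cmd 12345
```

Once the job is running, the user would ssh to the node name that squeue reports, and the session would be adopted into the job.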

A final thought is that you may consider creating a time_float reservation that's there all the time.  This would only work if it's the same users or account that always need you to create a reservation for them.  This also wouldn't give them immediate access to a node, but would guarantee that they never have to wait more than X minutes for a node.  What this flag does is say that the reservation you're creating is always X minutes in the future.  The scheduler sees that there's a short block of time before this reservation starts, so it can fit short jobs in the available time.  Then if a user who qualifies for the reservation submits a job, the scheduler will not backfill any more jobs in that window and that user's job will be the next to start.  For example, if you created a reservation with time_float that was always 10 minutes out, any job that requested 10 minutes or less of time would be able to start while the node was idle.  When someone needed to use the reservation, the longest they would have to wait is 10 minutes.

Your users may not want to wait this long, but you could shorten that window.  The smaller the window the less functional work can get done by other jobs.  It may not work in your situation, but I wanted to throw it out as an option.  You can read more about it here:
https://slurm.schedmd.com/reservations.html#float
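
For reference, a floating reservation along these lines might be created as sketched below. The reservation name, user, node count and 12-hour duration are all placeholders, and the snippet only prints the scontrol command so it can be checked first.

```shell
# Print the scontrol command for a reservation that always starts N minutes from now.
# $1 = reservation name, $2 = user, $3 = minutes in the future (all hypothetical values).
float_resv_cmd() {
    printf 'scontrol create reservation reservationname=%s users=%s nodecnt=1 starttime=now+%sminutes duration=12:00:00 flags=time_float\n' "$1" "$2" "$3"
}

float_resv_cmd standing alice 10
```

A qualifying user would then submit with --reservation=standing. Lowering the third argument shortens the worst-case wait, at the cost of leaving a smaller window for other jobs to backfill into.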

Let me know if any of these sound like they would work for you or if you have any questions.

Thanks,
Ben
Comment 2 Ben Roberts 2023-06-08 13:55:37 MDT
Hi Will,

I wanted to follow up and see if any of the suggestions I sent sounded like something that might work for you.  Let me know if you still need help with this ticket or if it's ok to close.

Thanks,
Ben
Comment 3 William J Edsall 2023-06-08 14:57:24 MDT
Hi Ben! Your notes on pam_slurm_adopt led us to a solution. One of the big challenges is that if a user's process dies while they're testing a new workflow, they lose all of the time they spent in the queue. So we suggested srun and/or ssh so they can do some more manual submission. We seem to be OK for now. Thanks!
Comment 4 Ben Roberts 2023-06-09 08:10:28 MDT
I'm glad to hear that worked for you.  I'll go ahead and close this ticket.

Thanks,
Ben