| Summary: | Question Partition/Queue Setup | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | John Johnston <jbjohnston> |
| Component: | Configuration | Assignee: | Ben Glines <ben.glines> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 21.08.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Oakland U | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave Sites: | --- | Cray Sites: | --- |
| DS9 Clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC Sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf file | | |
Hi John, you should be able to issue a "$ scontrol show config" on that other cluster. What is most likely happening is some type of cli_filter or job_submit plugin which changes aspects of the job submission, such as the partition, based on time limits. Frankly, without knowing what the other cluster is doing, it is hard to say for sure exactly how they are doing it.

Example job_submit (run server side by the slurmctld):

> JobSubmitPlugins=lua

https://github.com/SchedMD/slurm/blob/master/contribs/lua/job_submit.lua

Or cli_filter (run client side):

> CliFilterPlugins=lua

https://github.com/SchedMD/slurm/blob/master/etc/cli_filter.lua.example

What would be helpful is to understand your requirements for how jobs should be routed. Based on your description, it sounds like you want the user to submit without knowing the partitions and have the correct partition selected based on the time limit of the job. The only exception here is account-based jobs, which should be routed to your buy-in nodes. Please let me know if this assumption is incorrect.

Hi Jason,
Yes, you are correct - we want the user to be able to submit without
specifying a partition, and based on time limit (or in the case of
buy-in users, account) have the job routed to the appropriate queue.
I took your suggestion and ran a "scontrol show config" on the other
cluster to have another look. They are using "JobSubmitPlugins=lua" and
I suspect this is how they're doing it (nothing for CliFilterPlugins).
I've also looked at the lua script you linked. Just for grins, I
dumped it into the same directory as slurm.conf, and set
"JobSubmitPlugins=lua", then restarted Slurm.
It did not work, but I can tell from the log it is "hitting" that script
when the job is submitted. I'm just trying to better understand how
this script needs to be modified for our needs. Do you have a good
reference or perhaps an example on how this script should/could be
customized?
The first problem I encountered was that the script was not identifying
the default account for the user. I added the following to the canned
script:
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.account == nil then
        local getacct = io.popen("sacctmgr -n list USER $USER -o format=DefaultAccount")
        local defacct = getacct:read("*a")
        local account = defacct
        slurm.log_info("slurm_job_submit: job from uid %u, setting default account value: %s",
                       submit_uid, account)
        job_desc.account = account
    end
This seems to use the current $USER to determine the default account if
none is specified. However, the portion where it is supposed to obtain
a partition list (part_list) seems to be giving me issues now. I tried
hard-coding in a list of partitions just to see if it would work, but
I'm uncertain as to what the format of this should be. I've tried:
part_list="general-short,general-long,science"
part_list="PartitionName=general-short,PartitionName=general-long,PartitionName=science"
part_list="1:general-short,2:general-long,3:science"
None of these seem to work. It seems to want some sort of table. The
issue is I'm not sure how those things might be passed directly from
SLURM into that script (without me using shell commands, hardcoded
strings, or other kludges). Any insight you can provide would be
appreciated.
Thanks,
On 8/10/22 12:07, bugs@schedmd.com wrote:
> Comment #1 <https://bugs.schedmd.com/show_bug.cgi?id=14719#c1> on bug 14719 <https://bugs.schedmd.com/show_bug.cgi?id=14719> from Jason Booth <jbooth@schedmd.com>
(In reply to John Johnston from comment #2)

> It did not work, but I can tell from the log it is "hitting" that script
> when the job is submitted. I'm just trying to better understand how
> this script needs to be modified for our needs. Do you have a good
> reference or perhaps an example on how this script should/could be
> customized?

This page explains a lot about the job_submit plugin API, including the functions and passed parameters:
https://slurm.schedmd.com/job_submit_plugins.html

Here you can see an example of how the job_submit plugin can be used for various tasks:
https://github.com/SchedMD/slurm/blob/master/contribs/lua/job_submit.lua

> The first problem I encountered was that the script was not identifying
> the default account for the user. I added the following to the canned
> script:
>
> function slurm_job_submit(job_desc, part_list, submit_uid)
>     if job_desc.account == nil then
>         local getacct = io.popen("sacctmgr -n list USER $USER -o format=DefaultAccount")
>         local defacct = getacct:read("*a")
>         local account = defacct
>         slurm.log_info("slurm_job_submit: job from uid %u, setting default account value: %s",
>                        submit_uid, account)
>         job_desc.account = account

You can actually access the default account directly with the following:

> job_desc.default_account

I also wouldn't recommend making calls to user commands such as sacctmgr, as this has the potential to severely slow down slurmctld, since it will have to do this for every job submitted. If it is absolutely necessary to make such a call, I would suggest calling it less frequently (every day, week, etc.), caching the results, and then accessing them that way.

> This seems to use the current $USER to determine the default account if
> none is specified. However, the portion where it is supposed to obtain
> a partition list (part_list) seems to be giving me issues now. I tried
> hard-coding in a list of partitions just to see if it would work, but
> I'm uncertain as to what the format of this should be. I've tried:
>
> part_list="general-short,general-long,science"
> part_list="PartitionName=general-short,PartitionName=general-long,PartitionName=science"
> part_list="1:general-short,2:general-long,3:science"
>
> None of these seem to work. It seems to want some sort of table. The
> issue is I'm not sure how those things might be passed directly from
> SLURM into that script (without me using shell commands, hardcoded
> strings, or other kludges). Any insight you can provide would be
> appreciated.

It seems that you're trying to modify the part_list that is passed into the slurm_job_submit function. This parameter is actually an input, as seen in the job_submit plugin documentation (https://slurm.schedmd.com/job_submit_plugins.html#lua):

> part_list (input) List of pointer to partitions which this user is
> authorized to use.

If you want to modify the partitions that a job can land on, you should modify `job_desc.partition`. Format this in the same way that you would format a partition list for a job submitted from the command line. Example:

> $ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> A*           up   infinite     10   idle n-[1-10]
> B            up   infinite     10   idle n-[11-20]

Submitting from the command line:

> $ srun --partition="A,B" hostname

job_submit.lua:

> function slurm_job_submit(job_desc, part_list, submit_uid)
>     ...
>     job_desc.partition = "A,B"
>     ...

Let me know if you have any other questions. If you don't have any further questions, I'll close this bug out. Feel free to reopen it and reply if you have any questions related to the original topic of this bug.
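Putting the pieces of this thread together, a routing script along these lines might address the original requirement (route by time limit, with buy-in accounts sent to their own queue). This is only a sketch under assumptions from this ticket: the 4-hour cutoff and the partition names come from the site description, while the `buyin_partitions` table and the stubbed `slurm` table are hypothetical scaffolding, not part of the real plugin API.

```lua
-- Sketch of a job_submit.lua for the routing described in this ticket.
-- The stub below stands in for the `slurm` table that slurmctld provides,
-- so the sketch can be exercised standalone; inside slurmctld the real
-- table is used instead.
slurm = slurm or {
    SUCCESS = 0,
    log_info = function(...) end,
}

-- Hypothetical mapping of buy-in accounts to their dedicated partitions.
local buyin_partitions = {
    science = "science",
}

local FOUR_HOURS = 4 * 60  -- job_desc.time_limit is expressed in minutes

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- Respect an explicit partition request from the user.
    if job_desc.partition ~= nil then
        return slurm.SUCCESS
    end

    -- Use the default account rather than shelling out to sacctmgr.
    if job_desc.account == nil then
        job_desc.account = job_desc.default_account
    end

    -- Buy-in accounts go to their dedicated queue; everyone else is routed
    -- by time limit. (In a real plugin an unset time limit may arrive as
    -- slurm.NO_VAL rather than nil; either way it falls through to the
    -- long queue here.)
    local buyin = buyin_partitions[job_desc.account]
    if buyin ~= nil then
        job_desc.partition = buyin
    elseif job_desc.time_limit ~= nil and job_desc.time_limit <= FOUR_HOURS then
        job_desc.partition = "general-short"
    else
        job_desc.partition = "general-long"
    end

    slurm.log_info("slurm_job_submit: uid %u routed to partition %s",
                   submit_uid, job_desc.partition)
    return slurm.SUCCESS
end

-- A job_submit.lua script must also define slurm_job_modify.
function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```

With this approach nothing is read from or written to part_list; all routing happens through `job_desc` fields, which avoids the hard-coded partition strings attempted above.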
Created attachment 26251 [details]
slurm.conf file

We are trying to reconfigure our cluster partitions in a way that will permit auto-segregation of user-submitted jobs to the "most appropriate" queue. We are doing this chiefly to account for buy-in nodes (nodes purchased by research groups). Essentially, buy-in nodes can allow jobs to run for non-account holders, but only for a max time of 4 hours. So we have a "general-short" queue (all nodes, max runtime 4 hours), a "general-long" queue (all nodes EXCEPT buy-in nodes, max runtime 7 days), and account-based queues (all nodes including the respective buy-in nodes, max runtime 7 days). That's the essential background. Please note that we currently use just a single default partition.

Another cluster I have access to (but no admin rights on) does not appear to use a "Default" partition/queue. One doesn't need to specify the partition when submitting a job. The job appears to be tested against each partition in a particular order, and it is scheduled in the queue where it meets requirements and where resources are available (e.g., a job with time=3 hours should run in "general-short", not "general-long"). I've tried to replicate this setup without success.

I'm attaching a copy of our current slurm.conf file for your review. I tried to use the "PartitionName=DEFAULT" queue as the "Default", but jobs submitted without a partition specified just fail since there are no nodes assigned to DEFAULT. I CAN submit a job and specify a specific partition, and it works fine. I've also tried submitting a job by specifying all the queues (e.g. sbatch -p general-short,general-long,science jobscript.sh). This sometimes works, but strangely, the job will try to run in the queue/partition that is listed LAST in slurm.conf.

Please let me know how I might make this work.

Thanks,
John
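For reference, the partition layout described above might be sketched in slurm.conf roughly as follows. The node ranges and the "science" account name are placeholders, not taken from the attached file; the actual configuration is in attachment 26251.

```
# Sketch only: node lists and the buy-in account name are placeholders.
PartitionName=general-short Nodes=n-[1-20],buyin-[1-4] Default=YES MaxTime=04:00:00 State=UP
PartitionName=general-long  Nodes=n-[1-20] MaxTime=7-00:00:00 State=UP
PartitionName=science       Nodes=n-[1-20],buyin-[1-4] AllowAccounts=science MaxTime=7-00:00:00 State=UP
```

With a job_submit plugin doing the routing, the Default=YES marker matters less, since jobs submitted without a partition would have one assigned before scheduling; AllowAccounts keeps the buy-in queue restricted to its owning account.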