We are fairly new to SLURM, moving from SGE, and need to implement some form of priority scheme for scheduling to assure all users get a crack at compute nodes. At present we see resources getting swamped by either long-running jobs or massive numbers of relatively small short-running jobs. We have not yet done the "hard work" of discussing what we as an organization want in terms of real priorities, fairshare, preemption..., but we do need something in the interim.

We currently have 3 partitions:

PartitionName=queue0 AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=YES QoS=N/A DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=A[01-12] PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO PreemptMode=OFF State=UP TotalCPUs=576 TotalNodes=12 SelectTypeParameters=NONE DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
** default queue (if no queue name is specified)
** no job limits
** 12 nodes [RealMemory=500000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 TmpDisk=1500000] -- complete overlap with queue1's nodes
** purpose: provide a general-purpose "slow lane" for any jobs

PartitionName=queue1 AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=01:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=A[01-16] PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO PreemptMode=OFF State=UP TotalCPUs=768 TotalNodes=16 SelectTypeParameters=NONE DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
** max job execution time is 1 hr
** 16 nodes [RealMemory=500000 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 TmpDisk=1500000] -- first 12 nodes overlap with queue0, last 4 dedicated to this queue
** purpose: provide a "fast lane" for shorter jobs

PartitionName=queue2 AllowGroups=ALL AllowAccounts=seq AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=B[01-08] PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO PreemptMode=OFF State=UP TotalCPUs=128 TotalNodes=8 SelectTypeParameters=NONE DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
** reserved for production users
** 8 nodes [RealMemory=95000 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 TmpDisk=750000]
** purpose: provide dedicated resources for production jobs

What we are considering is implementing QOS limits in order to assure no one user can monopolize the full cluster. It looks to me like this might be implemented something like:

sacctmgr add qos 80pcnt grpcpus=615 grpnodes=13 grpmemory=6599411
sacctmgr add qos 60pcnt grpcpus=460 grpnodes=10 grpmemory=4949558
sacctmgr modify user name=user1 set qos=60pcnt defaultqos=60pcnt

And in slurm.conf:

AccountingStorageEnforce=associations,limits,qos

As I noted, we are new to SLURM, so maybe this is not the right way to go, even as an interim solution. Can you offer comments, suggestions, pointers?

Thank you,
--Bill
Are you running 16.05.0 as indicated? I'd highly recommend updating to 16.05.8 if possible; there are quite a few bug fixes you may want to have at some point.

(In reply to Bill from comment #0)
> What we are considering is implementing QOS limits in order to assure no one
> user can monopolize the full cluster. It looks to me like this might be
> implemented something like:
> sacctmgr add qos 80pcnt grpcpus=615 grpnodes=13 grpmemory=6599411
> sacctmgr add qos 60pcnt grpcpus=460 grpnodes=10 grpmemory=4949558
> sacctmgr modify user name=user1 set qos=60pcnt defaultqos=60pcnt
> And in slurm.conf
> AccountingStorageEnforce = associations,limits,qos
(snip)

Your QOS limits seem reasonable, although if you're going to make these apply throughout the cluster you may want to use a Partition QOS rather than just setting the same QOS on every user/account. The QOS is created in the same way; once created, you just set the QOS option on the partition definition (in slurm.conf) and run 'scontrol reconfigure'.

https://slurm.schedmd.com/qos.html#partition

I'd also suggest looking into using GrpRunMins or some similar limitation - I assume what matters more is limiting the total pool of resources a user can tie up at any given time, more so than the node counts themselves.
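As a minimal sketch of the Partition QOS approach (the QOS name 'part_limits' and the limit values here are illustrative, not recommendations):

```shell
# Create a QOS that carries the group limits
# (GrpTRES is the TRES-based spelling of grpcpus/grpnodes)
sacctmgr add qos part_limits set GrpTRES=cpu=460,node=10

# Then reference it from the partition definition in slurm.conf, e.g.:
#   PartitionName=queue0 Nodes=A[01-12] ... QOS=part_limits

# Pick up the slurm.conf change without restarting the daemons
scontrol reconfigure
```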
Thank you Tim,

I looked for documentation on GrpRunMins but found only an enhancement request ticket. Can you point me at anything descriptive?

As I look at applying the QOS to the partition, it is unclear to me whether the limits will throttle/limit usage *by* a user or usage *of* a partition. I am fine with the partition being fully utilized, just not by one user. It would be nice, however, not to have to manage this at the user level. Could you please clarify?

Thanks,
--Bill
> I looked for documentation on GrpRunMins but found only an enhancement
> request ticket. Can you point me at anything descriptive?

https://slurm.schedmd.com/resource_limits.html

I should have said to use 'GrpTRESRunMins' - when we extended the resource tracking and allocation system to cover any trackable resource ("TRES"), we moved away from limits on the individual elements. GrpTRESRunMins=cpu=1000 would limit a group to 1000 CPU-minutes of outstanding work under that QOS, from any combination of running jobs.

> As I look at applying the QOS to the partition, it is unclear to me if the
> limits will throttle/limit usage *by* a user or usage *of* a partition. I am
> fine with the partition being fully utilized, just not by one user. It would
> be nice however to not have to manage this at the user level. Could you
> please clarify?

The nomenclature is admittedly a bit confusing at times. Any of the "Grp" limits will track and constrain the account that's in use; when used as a Partition QOS, each account should be tracked separately. If you want to limit usage of the partition as a whole (this can be used to create "virtual partitions" that don't correspond to specific hardware, but are allowed to use only a subset of the system without regard to which specific nodes), you can use the MaxTRES / MaxTRESMins limits within the QOS.
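As a concrete sketch of the two kinds of limits discussed above (the QOS name '60pcnt' comes from the earlier comment; the values are illustrative):

```shell
# Cap outstanding work per association under this QOS: the sum over all
# running jobs of (allocated CPUs x remaining time limit) may not exceed
# 1000 CPU-minutes
sacctmgr modify qos 60pcnt set GrpTRESRunMins=cpu=1000

# When attached to a partition as its Partition QOS, a per-user cap keeps
# the partition open to everyone while stopping any one user from filling
# it (460 here is roughly 60% of queue1's 768 CPUs)
sacctmgr modify qos 60pcnt set MaxTRESPerUser=cpu=460
```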
Hey Bill -

I'm assuming that was enough to get you going, and am marking this as resolved/infogiven. Please reopen if there was anything else I could address.

- Tim