| Summary: | Basic Scheduling Assistance | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | IT <it-slurm> |
| Component: | Scheduling | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 16.05.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Altius Institute | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
Description
IT
2017-01-15 18:08:56 MST
Are you running 16.05.0 as indicated? I'd highly recommend updating to 16.05.8 if possible; there are quite a few bug fixes you may want to have at some point.

(In reply to Bill from comment #0)
> We are fairly new to SLURM, moving from SGE, and need to implement some form
> of priority scheme for scheduling to assure all users get a crack at compute
> nodes. At present we see resources getting swamped by either long-running
> jobs or massive numbers of relatively small short-running jobs. We have not
> yet done the "hard work" of discussing what we as an organization want in
> terms of real priorities, fairshare, preemption..., but we do need something
> in the interim.
>
> We currently have 3 partitions:
> PartitionName=queue0 (snip)
>
> What we are considering is implementing QOS limits in order to assure no one
> user can monopolize the full cluster. It looks to me like this might be
> implemented something like:
> sacctmgr add qos 80pcnt grpcpus=615 grpnodes=13 grpmemory=6599411
> sacctmgr add qos 60pcnt grpcpus=460 grpnodes=10 grpmemory=4949558
> sacctmgr modify user name=user1 set qos=60pcnt defaultqos=60pcnt
> And in slurm.conf:
> AccountingStorageEnforce=associations,limits,qos
>
> As I noted, we are new to SLURM, so maybe this is not the right way to go,
> even as an interim solution. Can you offer comments, suggestions, pointers?

Your QOS limits seem reasonable, although if you're going to apply these throughout the cluster you may want to use a Partition QOS rather than setting the same QOS on every user/account. The QOS is created in the same way; once created, you just set the QOS option on the partition to that QOS name (in slurm.conf) and run 'scontrol reconfigure'.

https://slurm.schedmd.com/qos.html#partition

I'd also suggest looking into GrpRunMins or a similar limit - I assume what matters more is limiting the total pool of resources a user can tie up at any given time, rather than the node counts themselves.
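As a minimal sketch of the Partition QOS setup described above (the limit values mirror the ticket; the QOS name `part_limit` and partition name `queue0` are illustrative, not confirmed configuration):

```
# Create a QOS carrying the group limits (same style as the ticket's commands):
sacctmgr add qos part_limit grpcpus=615 grpnodes=13 grpmemory=6599411

# In slurm.conf, attach it to the partition definition:
#   PartitionName=queue0 ... QOS=part_limit

# Apply the slurm.conf change:
scontrol reconfigure
```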
Thank you Tim,

I looked for documentation on GrpRunMins but found only an enhancement request ticket. Can you point me at anything descriptive?

As I look at applying the QOS to the partition, it is unclear to me if the limits will throttle/limit usage *by* a user or usage *of* a partition. I am fine with the partition being fully utilized, just not by one user. It would be nice, however, to not have to manage this at the user level. Could you please clarify?

Thanks,
--Bill

> I looked for documentation on GrpRunMins but found only an enhancement
> request ticket. Can you point me at anything descriptive?

https://slurm.schedmd.com/resource_limits.html

I should have said to use 'GrpTRESRunMins' - when we extended the resource tracking and allocation system to cover anything ("TRES"), we moved away from the individual elements. GrpTRESRunMins=cpu=1000 would limit a group to 1000 (cpus * minutes) under that QOS from any combination of jobs.

> As I look at applying the QOS to the partition, it is unclear to me if the
> limits will throttle/limit usage *by* a user or usage *of* a partition. I am
> fine with the partition being fully utilized, just not by one user. It would
> be nice however to not have to manage this at the user level. Could you
> please clarify?

The nomenclature is admittedly a bit confusing at times. Any of the "Grp" limits will track and constrain the account that's in use. When used as a partition QOS, each account should be tracked separately. If you want to limit usage of the partition as a whole (this can be used to create "virtual partitions" that don't correspond to specific hardware, but are allowed to use only a subset of the system without regard to which specific nodes), you can use the MaxTRES / MaxTRESMins limits within the QOS.

Hey Bill -

I'm assuming that was enough to get you going, and am marking this as resolved/infogiven. Please reopen if there is anything else I can address.

- Tim
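The cpu-minutes budget behind a limit like GrpTRESRunMins=cpu=1000 can be sketched as follows. This is only an illustration of the accounting arithmetic, not Slurm's implementation; the function names are made up for the example:

```python
# Illustration (not Slurm code): under GrpTRESRunMins=cpu=1000, each running
# job consumes (cpus * remaining minutes), and a new job only starts while
# the combined total stays within the budget.

GRP_TRES_RUN_MINS_CPU = 1000  # the example limit from the comment above

def cpu_run_mins(jobs):
    """Sum of cpus * remaining_minutes over all running jobs."""
    return sum(cpus * minutes for cpus, minutes in jobs)

def fits_under_limit(running, new_job, limit=GRP_TRES_RUN_MINS_CPU):
    """Would admitting new_job keep the group within its cpu*minutes budget?"""
    cpus, minutes = new_job
    return cpu_run_mins(running) + cpus * minutes <= limit

running = [(4, 100), (2, 200)]              # 400 + 400 = 800 cpu-minutes in flight
print(fits_under_limit(running, (1, 300)))  # 800 + 300 > 1000 -> False
print(fits_under_limit(running, (1, 200)))  # 800 + 200 <= 1000 -> True
```

Note that the limit caps the total pool of outstanding cpu-minutes rather than node or CPU counts directly, which is why a flood of many small jobs and a few long-running large jobs are constrained by the same budget.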