| Summary: | Slurm Evaluation: like to allocate resources after job is submitted | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | mn <mohammed.naseemuddin> |
| Component: | Limits | Assignee: | David Bigagli <david> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | da |
| Version: | 14.11.x | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | KAUST | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | pdf describing user requirement | | |
In Slurm, resources are allocated by partitions, and access to partitions can be controlled by user groups. Slurm has the concept of a submission plugin. This is similar to LSF's esub, but instead of being executed by the submission program (bsub/sbatch) it is run by slurmctld, which is the equivalent of mbatchd. The submission plugin has access to, and can modify, the job submission parameters.
For your case 1) you would define a partition per research group. For case 2) you would configure partitions with different runtime limits, then use the submission plugin to assign the correct partition. For case 3) you could configure group limits so that jobs running in the partition dedicated to group 1 get 1000 core hours, and the same for group 2.
David
David,
Thanks for the feedback.
> For your case 1) you would define a partition per research group.

We had this before, but we don't want to define so many queues (partitions); we would like to accomplish this using a few common queues. That's why I have a custom esub to take care of this. Could you provide some examples of plugins and configuration information?

> 2) you would configure partitions with different runtime limits then use the submission plugin to assign the correct partition.

I would like to have one "queue" (partition) with a default runtime of 2 hours and a max runtime of 72 hours, and use plugins to check the runtime and assign the shared resources and the reserved resources.

> 3) you could configure group limits so that jobs that will run in the partition dedicated to group 1 will get 1000 core hours, the same way for group 2.

How will this work with respect to how much they have consumed? I would like to have something like a bank account, and assign the resources only if they still have a balance.
Hi, there are examples of submission plugins in the source code tree under src/plugins/job_submit; they can easily be customized to your needs.
It seems that you require one queue where the jobs wait to be scheduled, plus sets of different attributes that apply to different groups. For example, jobs belonging to group A have a certain resource limit, jobs of group B a different run time, and so on. I think the concept of QOS is what you should look at: a QOS represents a logical grouping of jobs sharing a common set of limits and parameters. This is similar to LSF queues, except that LSF queues have jobs physically pending in them. Please see: http://slurm.schedmd.com/qos.html
Slurm has hierarchical fairshare in much the same way as LSF does. It uses the concept of an association, which is a tuple (cluster:account:user) from which shares are withdrawn. You can assign shares and then monitor their usage with the sshare command.
David
Hello,
do you have further questions under this ticket?
David
Hi David,
Thanks for the feedback. I was going through the QOS material, multifactor fairshare, etc. I need some more assistance to accomplish this; I am ready to apply it on a test system.
We are implementing a new cluster with 7000 cores. I would like to have the following queues: "long", "short", "idle", "gpu", "smp", and "interactive".
For the "long" queue, I would like 70% of total capacity to be assigned to the different research groups and users: group-A 10%, group-B 15%, etc. The remaining capacity will be available to all users as fairshare. I am thinking of accomplishing this using multifactor fairshare; please help me with how to take care of this.
"short", with a 2-hour run limit, should go to specific nodes assigned to this partition based on fairshare; usage of "long" shouldn't impact "short" usage. The "short" queue with a 24- to 48-hour run limit should go to other sets of nodes.
"gpu" and "smp" should be for the GPU nodes and the large-memory SMP nodes.
The "idle" queue is a low-priority preemptable queue that should run jobs on all the nodes, but jobs in this queue will be terminated by higher-priority "long" or "short" jobs.
Best Regards
Hi, in order to use fairshare you first have to enable Slurm accounting using the database: http://slurm.schedmd.com/accounting.html
Then, using the sacctmgr command, you create your user hierarchy and assign different shares to users. The fairshare priority of jobs has multiple factors, as described in the PriorityWeight paragraphs here: http://slurm.schedmd.com/slurm.conf.html
Which weights to use is your choice, based on how you would like to prioritize your workload. Note that the fairshare tree is global, meaning the usage of resources is accounted across all partitions.
Does this help to answer your question?
David
Yes, that answers the question about fairshare not being available per partition.
I would appreciate it if you could guide me to the best configuration options to accomplish this:
1) I would like a queue/partition called "long"; 70% of its capacity will be pre-allocated to different research groups.
2) The remaining 30% of capacity will be available to all users, both those who have an allocation and those who don't.
3) Another queue/partition called "short" should be available to all users; some nodes will be dedicated to this queue and some will be used from the reserved allocation capacity.
4) A preemptable queue/partition "idle" should be available to all users, should use the entire cluster, and should have its jobs terminated by higher-priority queues.
5) A "gpu" queue/partition for all users on the GPU nodes, with fairshare.
6) An "smp" queue/partition on the SMP nodes for all users, with fairshare.
When I created all these queues, created accounts for the groups grp1, grp2, and grp3 with fairshare values 20, 30, and 10, and assigned the users, I got the following error when trying to submit jobs:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
So do I have to associate each of these users with a partition? And if so, how can I accomplish the above configuration? See my slurm.conf below:
ControlMachine=ci267
BackupController=ci266
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/usr/local/slurm/14.03.7/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/usr/local/slurm/14.03.7/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/usr/local/slurm/14.03.7/spool/slurmd
SlurmUser=rcslmadm
StateSaveLocation=/ibxadm/slurm/smc_test/14.03.7/state
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityCalcPeriod=5
PriorityFavorSmall=YES
PriorityMaxAge=7
PriorityWeightAge=10000
PriorityWeightFairshare=100000
PriorityWeightJobSize=10000
PriorityWeightPartition=10000
PriorityWeightQOS=0
AccountingStorageEnforce=limits
AccountingStorageHost=ci267
AccountingStorageLoc=slurm_acct_db
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompLoc=/usr/local/slurm/14.03.7/work/logdir/slurm_job.log
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/slurmsched.log
SlurmSchedLogLevel=3
NodeName=ci267,ci[260-266] Sockets=2 CoresPerSocket=8 State=UNKNOWN
PartitionName=long Nodes=ci[260-266] Default=NO DefaultTime=10110 MaxTime=20415 Priority=10 State=UP
PartitionName=idle Nodes=ci[260-266] Default=NO MaxTime=20415 Priority=5 PreemptMode=CANCEL Shared=FORCE State=UP
PartitionName=short Nodes=ci[265-266] Default=NO DefaultTime=130 MaxTime=4330 Priority=10 PreemptMode=CANCEL Shared=FORCE State=UP
PartitionName=interactive Nodes=ci267 Default=NO DefaultTime=740 MaxTime=740 Priority=10 PreemptMode=CANCEL Shared=FORCE State=UP
Hi,
let me try to address your questions.
1) Fairshare.
In order to assign capacity using fairshare you have to create
accounts using the sacctmgr command and then add users to those
accounts. Once this has been done, you assign them the desired
shares. Note that assigned shares are not directly percentage values;
they are normalized by Slurm. In other words, say you define shares
for 3 groups as 3, 2, and 1. Slurm sums these numbers and then normalizes,
so the shares will be 3/6 -> 50%, 2/6 -> 33%, 1/6 -> 17%.
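This normalization can be reproduced with a small shell snippet (the share values 3, 2, 1 are just the illustrative ones from above):

```shell
# Normalize raw shares the way Slurm does: share_i / sum(all shares)
shares="3 2 1"
total=0
for s in $shares; do total=$((total + s)); done
for s in $shares; do
  awk -v s="$s" -v t="$total" 'BEGIN { printf "share %s -> %.1f%%\n", s, 100*s/t }'
done
```

which prints 50.0%, 33.3%, and 16.7% for shares 3, 2, and 1 respectively.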
The error:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
happens because you have AccountingStorageEnforce=limits set in your
slurm.conf, but the user running srun or sbatch does not have a valid
account. You have to add all users explicitly to accounting:
->sacctmgr add account zebra
->sacctmgr add user david account zebra
->sacctmgr show assoc format=account,user
Account User
---------- ----------
root
root root
zebra
zebra david
Now user david can submit jobs.
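Once the association exists, a job can be submitted against the account explicitly (the account name here is the illustrative one from above):

```
sbatch --account=zebra --wrap="hostname"
```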
2) Partition structure.
It seems that you want to configure partitions, long, short etc.
with users having different shares per partition. This is similar
to the LSF queue based fairshare. This is possible in Slurm by
defining the partition when creating accounts.
->sacctmgr add account izar
->sacctmgr add user david account=izar partition=markab
However, we don't recommend this, as it is hard to maintain: every time
you add or remove a partition, the accounting configuration needs to change.
Our recommendation is to replace partitions with QOS, implementing
what we call a 'floating partition'. You configure several QOS with
different job limits, scheduling priorities, and preemption rules.
You can then keep partitions for very specific hardware, like GPUs.
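A minimal sketch of the floating-partition idea, using one shared partition and per-workload QOS (the QOS names, priorities, and limits are illustrative, not a tested configuration):

```
# Per-workload behavior comes from QOS, not from partitions.
->sacctmgr create qos short priority=200 MaxWall=02:00:00
->sacctmgr create qos long  priority=100 MaxWall=72:00:00
->sacctmgr create qos idle  priority=10
->sacctmgr update qos long set preempt=idle
```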
Let us know your thoughts so we can discuss these options further.
David
David,
Whatever you think is good, we will use. Please walk through one QOS example: a "short" QOS with a 2-hour run limit for all users, and a preempt QOS for all users whose jobs can be preempted by high-priority "long" jobs.
2) I have defined the accounts and users; can I now run the jobs? I don't know whether this is the correct way to allocate the capacity for each group.
In this configuration I have assigned 20% of the total capacity to acct2, 5% to acct3, and 20% to acct5.
I would like to make sure that each of these accounts cannot exceed its limit; if it does, its jobs should pend. This should apply to the "long" queue only,
but these users should still be able to run jobs using the 2-hour-run-limit QOS and the preempt QOS.
sshare
Account User Raw Shares Norm Shares Raw Usage Effectv Usage FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root 1.000000 165409 1.000000 0.500000
root root 1 0.016667 0 0.000000 1.000000
acct1 1 0.016667 0 0.000000 1.000000
acct2 20 0.333333 67304 0.394754 0.440049
acct3 5 0.083333 32382 0.002838 0.976668
none 1 0.016667 0 0.000000 1.000000
acct4 10 0.166667 0 0.000000 1.000000
test 1 0.016667 0 0.000000 1.000000
acct5 20 0.333333 0 0.000000 1.000000
zebra 1 0.016667 65722 0.602407 0.000000
Hi, I will provide a detailed example of QOS. First, I would like a clearer idea of what you mean by limiting users. Do you have in mind an absolute limit, say on CPU usage, such that when it is hit the users cannot run anymore? Or rather capping the resource usage, as a QOS does, for example:
GrpCpus: The total count of CPUs able to be used at any given time by jobs running from this QOS. If this limit is reached, new jobs will be queued but only allowed to run after resources have been relinquished from this group.
Could you please clarify this for me.
David
David, I sent the resource allocation document; could you please take a look and let me know the best way to accomplish this.
Best Regards
NM
Created attachment 1353 [details]
pdf describing user requirement
Mohammed,
here are some recommendations how to configure your cluster.
Hope this helps to get you going.
o) In your slurm.conf configure preemption based on qos.
PreemptMode=requeue
PreemptType=preempt/qos
o) Create the appropriate accounts:
->sacctmgr show assoc format=cluster,account,user,share,qos
Cluster Account User Share QOS
---------- ---------- ---------- --------- --------------------
canis_maj+ root 1 normal
canis_maj+ root root 1 normal
canis_maj+ hyppo 1 normal
canis_maj+ hyppo david 1 normal
canis_maj+ hyppo cesare 1 normal
o) Create the QOS.
Create 2 qos with different priorities.
->sacctmgr create qos zebra priority=100
->sacctmgr create qos crock priority=50
After the qos have been created, you have to tell Slurm which users
are authorized to use them. In our example the users are david and cesare,
and we give them access to all qos:
->sacctmgr update user david set qos=normal,zebra,crock
->sacctmgr update user cesare set qos=normal,zebra,crock
->sacctmgr show assoc format=cluster,account,user,share,qos
Cluster Account User Share QOS
---------- ---------- ---------- --------- --------------------
canis_maj+ root 1 normal
canis_maj+ root root 1 normal
canis_maj+ hyppo 1 normal
canis_maj+ hyppo cesare 1 crock,normal,zebra
canis_maj+ hyppo david 1 crock,normal,zebra
A qos allows you to configure several limits for its users.
Based on what you said, we suggest configuring GrpCPURunMins and
GrpNodes, which allow you to control the amount of CPU time and the number of
nodes used by the qos.
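For example, such limits can be attached to a qos like this (the values are illustrative only):

```
->sacctmgr update qos zebra set GrpCPURunMins=100000 GrpNodes=10
```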
o) The next step is to configure the preemption relationship between
the qos. Let's say qos zebra can preempt qos crock.
->sacctmgr update qos zebra set preempt=crock
->sacctmgr show qos format=name,priority,preempt
Name Priority Preempt
---------- ---------- ----------
normal 2
zebra 100 crock
crock 50
You can test preemption by submitting jobs to the lower-priority qos
until the cluster is full, then submitting one job to the higher-priority
qos; you will see a lower-priority job getting preempted right away.
o) Finally in the account tree you configure your hierarchy and your
shares based on your requirement. In this example we have left the shares
to be at the default value 1. The fairshare priority will further prioritize
user jobs inside the qos.
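Users then select a qos at submission time, for example (the qos name is the illustrative one from above):

```
sbatch --qos=crock --wrap="sleep 60"
```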
David
David,
Thanks. I have created a default queue, created a qos "project", and set the shares for each account. This is working as designed.

PARTITION     AVAIL  TIMELIMIT   NODES  STATE  NODELIST
defaultq*     up     14-04:15:0  7      idle   ci[260-266]
idle          up     14-04:15:0  7      idle   ci[260-266]
interactive   up     12:20:00    1      idle   ci267
gpu           up     12:20:00    1      idle   ci267
smp           up     12:20:00    1      idle   ci267

Cluster    Account    User    Share    QOS        GrpCPURunMins  Partition
---------- ---------- ------- -------- ---------- -------------- ----------
cluster    root               1        normal
cluster    root       root    1        normal
cluster    grp1               20       projects   665
cluster    grp1       a1      1        projects
cluster    grp1       a2      1        projects
cluster    grp2               25       projects   1024
cluster    grp2       b1      1        projects
cluster    grp2       b2      1        projects

So far it is working for the "project"-based requirement. In addition to this, I would like the following:
1) I would still like to maintain a separate partition for preemption called "idle", for ease of user understanding and so that these jobs are easy to recognize. The idle queue's priority is lower than "defaultq", and its jobs will be terminated by "defaultq" jobs. I would like a 512-core limit per user for the "idle" queue, and this limit should be independent of the "project" qos and not counted against it. It should allow all users; do I need to associate all the users with this partition using sacctmgr?
2) I would also like to give high priority to 20-minute to 2-hour jobs and reserve some nodes for short jobs. Allow all users, and assign a max core limit to each user. What is the best way to configure this: create qos=short and assign all the users? How do I define the per-user core limit?
Define a preemptable QOS and set the preempt action to requeue. The parameter MaxCpusPerUser will limit the number of CPUs per user in a given QOS.
David
David,
How about option 2 from my last comment?
> 2) Also I like to give high priority to "20 minutes and 2hrs" jobs and reserve some nodes for short jobs. Allow all users, assign max core limit to each user. what is the best way to configure this, create qos=short and assign all the users how to define user cores limit.
Best Regards
Hi Mohammed,
as we discussed previously, this can also be done with qos.
The idea is to have one partition shared by all and to control access to resources with qos. A qos allows you to set priorities and all sorts of per-user, per-job, and per-group limits. So the answer to your question is yes: create a qos to satisfy the requirement of high priority and a max core limit per user.
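Concretely, such a qos could be sketched like this (the name, priority, and limit values are illustrative):

```
->sacctmgr create qos short priority=200 MaxWall=02:00:00 MaxCpusPerUser=512
->sacctmgr update user david set qos+=short
```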
David
Info given. David
Hi, I am new to Slurm; we are testing Slurm in our environment. I have the following questions: how do I assign resources based on userid and/or groupid, on job runtime, and on used CPU hours? For example, we have the following situations:
1) Groups of nodes belong to different research groups. When a user submits a job to the "long" queue, I would like to check the submitter's group membership and assign only those resources belonging to his group.
2) When a user submits a job to the short queue with a RUNTIME limit of 2 hours, I assign all the resources; if the RUNTIME limit is > 2 hours, I need to assign only those resources belonging to the user's group.
3) I would like to assign resources based on CPU hours used. For example, group1 and group2 each have 1000 core hours; I would like to assign resources from the reserved resources to group1 and group2 users until they have consumed all 1000 core hours.
Currently I accomplish this in LSF using esub, modifying the job resources based on the above criteria with my own custom scripts for resource allocation.
Appreciate your response.
Best Regards
--Naseem