Hi, I am new to Slurm; we are testing it in our environment and I have the following questions. How do I assign resources based on user ID and/or group ID, job runtime, and consumed CPU hours? For example:

1) Nodes belong to different research groups. When a user submits a job to the "long" queue, I would like to check the submitter's group membership and assign only the resources that belong to his group.

2) When a user submits a job to the "short" queue with a runtime limit of 2 hours, I want to make all resources available; if the runtime limit is over 2 hours, only the resources belonging to the user's group should be assigned.

3) I would like to assign resources based on CPU hours used. For example, group1 and group2 each have 1000 core hours, and I assign resources from the reserved pool to group1 and group2 users until they consume all 1000 core hours.

Currently I accomplish this in LSF using esub, modifying the job resource requirements based on the above criteria with my own custom allocation scripts.

Appreciate your response.

Best Regards
--Naseem
In Slurm, resources are allocated by partitions, and access to partitions can be controlled by user groups. Slurm has the concept of a submission plugin; this is similar to LSF esub, but instead of being executed by the submission program (bsub/sbatch) it is run by slurmctld, which is the equivalent of mbatchd. The submission plugin has access to, and can modify, the job submission parameters.

For your case 1) you would define a partition per research group. In case 2) you would configure partitions with different runtime limits, then use the submission plugin to assign the correct partition. For case 3) you could configure group limits so that jobs running in the partition dedicated to group1 get 1000 core hours, and the same for group2.

David
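As an illustrative slurm.conf sketch for cases 1) and 2) (node ranges, partition names and group names here are hypothetical, not from your site):

```
# One partition per research group, restricted via AllowGroups
PartitionName=long_groupA Nodes=node[001-064] AllowGroups=groupA MaxTime=72:00:00 State=UP
PartitionName=long_groupB Nodes=node[065-128] AllowGroups=groupB MaxTime=72:00:00 State=UP
# Shared 2-hour partition open to everyone (case 2)
PartitionName=short Nodes=node[001-128] MaxTime=02:00:00 DefaultTime=02:00:00 State=UP
```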
David, thanks for the feedback.

> For your case 1) you would define a partition per research group.

We had this before, but we don't want to define so many queues (partitions); we would like to accomplish this with a few common queues. That's why I have a custom esub to take care of it. Could you provide some examples of plugins and configuration information?

> In case 2) you would configure partitions with different runtime limits, then use the submission plugin to assign the correct partition.

I would like to have one "queue" (partition) with a default runtime of 2 hours and a max runtime of 72 hours, and use plugins to check the runtime and assign the shared resources and reserved resources.

> For case 3) you could configure group limits so that jobs running in the partition dedicated to group1 get 1000 core hours, and the same for group2.

How will this track how much they have consumed? I would like something like a bank account, assigning resources only while they have a balance.
Hi, there are examples of submission plugins in the source code tree under src/plugins/job_submit; there are many examples that can easily be customized to your needs.

It seems that you require one queue where jobs wait to be scheduled, plus sets of different attributes that apply to different groups. For example, jobs belonging to group A have a certain resource limit, jobs of group B a different run time, and so on. I think the concept of QOS is what you should look at: a QOS represents a logical grouping of jobs sharing a common set of limits and parameters. This is similar to LSF queues, except that LSF queues have jobs physically pending in them. Please see: http://slurm.schedmd.com/qos.html

Slurm has hierarchical fairshare, pretty much the same way LSF does. It uses the concept of an association, which is a tuple (cluster:account:user) from which shares are withdrawn. You can assign shares and then monitor their usage with the sshare command.

David
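To illustrate the kind of decision such a plugin makes, here is a minimal shell sketch of the routing logic only. The group and partition names are made up; a real job_submit plugin runs inside slurmctld (typically written in C or Lua), not as a shell script.

```shell
#!/bin/sh
# Sketch: route a job to a partition based on requested runtime and the
# submitter's group. Names like "short" and "long_chem" are hypothetical.
partition_for_group() {  # usage: partition_for_group GROUP RUNTIME_MINUTES
  group="$1"; runtime="$2"
  if [ "$runtime" -le 120 ]; then
    echo "short"            # jobs of 2h or less may use the shared pool
  else
    echo "long_${group}"    # longer jobs are confined to the group's nodes
  fi
}

partition_for_group chem 90    # -> short
partition_for_group chem 300   # -> long_chem
```

The real plugin would read job_desc fields (time limit, submitting UID) and set the partition field instead of printing it.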
Hello, do you have any further questions on this ticket? David
Hi David, thanks for the feedback. I was going through the QOS material, multifactor fairshare, etc. I need some more assistance; I am ready to apply this in a test. We are implementing a new cluster with 7000 cores, and I would like to have the following queues: "long", "short", "idle", "gpu", "smp" and "interactive".

For the "long" queue, I would like 70% of the total capacity to be assigned to different research groups and users (group-A 10%, group-B 15%, etc.); the remaining capacity would be available to all users as fairshare. I am thinking of accomplishing this using multifactor fairshare; please help me with how to set this up.

"short", with a 2 hour run limit, should go to specific nodes assigned to that partition based on fairshare. Usage of "long" shouldn't impact "short" usage. "short" queue jobs with a 24 to 48 hour run limit should go to another set of nodes.

"gpu" and "smp" should be for the gpu nodes and the large-memory smp nodes.

The "idle" queue is a low priority, preemptable queue that should run jobs on all nodes, but jobs in this queue will be terminated by higher priority "long" or "short" jobs.

Best Regards
Hi, in order to use fairshare you first have to enable Slurm accounting using the database: http://slurm.schedmd.com/accounting.html

Then, using the sacctmgr command, you create your user hierarchy and assign different shares to users. The fairshare priority of jobs has multiple factors, described in the PriorityWeight paragraphs here: http://slurm.schedmd.com/slurm.conf.html. Which weights to use is your choice, based on how you would like to prioritize your workload.

Note that the fairshare tree is global, meaning the usage of resources is accounted across all partitions.

Does this help to answer your question?

David
Yes, that answers my question about fairshare not being available per partition. I would appreciate guidance on the best configuration options to accomplish the following:

1) A queue/partition called "long": 70% of capacity pre-allocated to different research groups.
2) The remaining 30% of capacity available to all users, both those with and without an allocation.
3) Another queue/partition called "short" available to all users; some nodes dedicated to this queue and some drawn from the reserved allocation capacity.
4) A preemptable queue/partition "idle" available to all users, able to use the entire cluster, with its jobs terminated by higher priority queues.
5) A "gpu" queue/partition for all users on the gpu nodes, with fairshare.
6) An "smp" queue/partition for all users on the smp nodes, with fairshare.

I created all these queues, created accounts for the groups grp1, grp2 and grp3 with fairshare values 20, 30 and 10, and assigned the users. When I try to submit jobs I get the following error:

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

Do I have to associate each of these users with a partition? If so, how can I accomplish the above configuration? See my slurm.conf below:

ControlMachine=ci267
BackupController=ci266
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/usr/local/slurm/14.03.7/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/usr/local/slurm/14.03.7/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/usr/local/slurm/14.03.7/spool/slurmd
SlurmUser=rcslmadm
StateSaveLocation=/ibxadm/slurm/smc_test/14.03.7/state
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityCalcPeriod=5
PriorityFavorSmall=YES
PriorityMaxAge=7
PriorityWeightAge=10000
PriorityWeightFairshare=100000
PriorityWeightJobSize=10000
PriorityWeightPartition=10000
PriorityWeightQOS=0
AccountingStorageEnforce=limits
AccountingStorageHost=ci267
AccountingStorageLoc=slurm_acct_db
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompLoc=/usr/local/slurm/14.03.7/work/logdir/slurm_job.log
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/slurmsched.log
SlurmSchedLogLevel=3
NodeName=ci267,ci[260-266] Sockets=2 CoresPerSocket=8 State=UNKNOWN
PartitionName=long Nodes=ci[260-266] Default=NO DefaultTime=10110 MaxTime=20415 Priority=10 State=UP
PartitionName=idle Nodes=ci[260-266] Default=NO MaxTime=20415 Priority=5 PreemptMode=CANCEL Shared=FORCE State=UP
PartitionName=short Nodes=ci[265-266] Default=NO DefaultTime=130 MaxTime=4330 Priority=10 PreemptMode=CANCEL Shared=FORCE State=UP
PartitionName=interactive Nodes=ci267 Default=NO DefaultTime=740 MaxTime=740 Priority=10 PreemptMode=CANCEL Shared=FORCE State=UP
Hi, let me try to address your questions.

1) Fairshare. In order to assign capacity using fairshare you have to create accounts with the sacctmgr command and then add users to those accounts. Once this has been done, you assign them the desired shares. Note that assigned shares are not direct percentage values; they are normalized by Slurm. In other words, if you define shares for 3 groups as 3, 2 and 1, Slurm sums these numbers and normalizes, so the shares become 3/6 -> 50%, 2/6 -> 33% and 1/6 -> 16%.

The error:

Batch job submission failed: Invalid account or account/partition combination specified

happens because you have AccountingStorageEnforce=limits set in your slurm.conf but the user running srun or sbatch does not have a valid account. You have to add all users explicitly to accounting:

->sacctmgr add account zebra
->sacctmgr add user david account=zebra
->sacctmgr show assoc format=account,user
   Account       User
---------- ----------
      root       root
      root
     zebra
     zebra      david

Now user david can submit jobs.

2) Partition structure. It seems that you want to configure partitions, long, short etc., with users having different shares per partition. This is similar to LSF queue-based fairshare. It is possible in Slurm by specifying the partition when creating the association:

->sacctmgr add account izar
->sacctmgr add user david account=izar partition=markab

However, we don't recommend this, as it is hard to maintain: every time you add or remove a partition the accounting configuration needs to change. Our recommendation is to replace partitions with QOS, implementing what we call a 'floating partition'. You configure several QOS with different job limits, scheduling priorities and preemption rules. You can then reserve partitions for very specific hardware like gpus.

Let us know your thoughts so we can discuss these options further.

David
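The normalization described above can be checked with plain arithmetic; this sketch is not a Slurm command, just the formula raw_share / sum(raw_shares):

```shell
# Normalize a raw share the way Slurm does, truncated to whole percent
# to match the 3/6 -> 50%, 2/6 -> 33%, 1/6 -> 16% example above.
norm_share() {  # usage: norm_share RAW SUM_OF_ALL_RAW_SHARES
  awk -v r="$1" -v t="$2" 'BEGIN { printf "%d%%\n", int(100 * r / t) }'
}

norm_share 3 6   # 50%
norm_share 2 6   # 33%
norm_share 1 6   # 16%
```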
David, I agree; whatever is best we will use. Please walk me through one QOS example with a 2 hour run limit for all users, and one preempt QOS for all users whose jobs can be preempted by high priority "long" jobs.

2) I have defined the accounts and users and can now run jobs, but I don't know if this is the correct way to allocate capacity to each group. In this configuration I have assigned 20% of the total capacity to acct2, 5% to acct3 and 20% to acct5. I want to make sure each account cannot exceed its limit; if it does, its jobs should pend. This should apply to the "long" queue only, but these users should still be able to run jobs using the 2 hour runtime QOS and the preempt QOS.

sshare
             Account       User Raw Shares Norm Shares   Raw Usage Effectv Usage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root                                          1.000000      165409      1.000000   0.500000
 root                      root          1    0.016667           0      0.000000   1.000000
 acct1                                   1    0.016667           0      0.000000   1.000000
 acct2                                  20    0.333333       67304      0.394754   0.440049
 acct3                                   5    0.083333       32382      0.002838   0.976668
 none                                    1    0.016667           0      0.000000   1.000000
 acct4                                  10    0.166667           0      0.000000   1.000000
 test                                    1    0.016667           0      0.000000   1.000000
 acct5                                  20    0.333333           0      0.000000   1.000000
 zebra                                   1    0.016667       65722      0.602407   0.000000
Hi, I will provide a detailed example of QOS. First, I would like a clearer idea of what you mean by limiting users. Do you have in mind an absolute limit, say on cpu usage, such that when it is hit the users cannot run anymore, or rather a cap on concurrent resource usage, like a QOS provides? For example:

GrpCpus: The total count of cpus able to be used at any given time by jobs running from this QOS. If this limit is reached, new jobs will be queued but only allowed to run after resources have been relinquished from this group.

Could you please clarify this for me.

David
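For reference, a cap of the GrpCpus kind is attached to a QOS with sacctmgr; a sketch (the qos name "myqos" and the value are examples):

```
# Cap the qos at 512 CPUs in use at once; jobs over the cap stay
# queued until CPUs are relinquished by the group.
sacctmgr modify qos myqos set GrpCPUs=512
```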
David, I sent the resource allocation document. Could you please take a look and let me know the best way to accomplish this? Best Regards NM
Created attachment 1353 [details] pdf describing user requirement
Mohammed, here are some recommendations on how to configure your cluster. Hope this helps to get you going.

o) In your slurm.conf, configure preemption based on qos:

PreemptMode=requeue
PreemptType=preempt/qos

o) Create the appropriate accounts:

->sacctmgr show assoc format=cluster,account,user,share,qos
   Cluster    Account       User     Share                  QOS
---------- ---------- ---------- --------- --------------------
canis_maj+       root                    1               normal
canis_maj+       root       root         1               normal
canis_maj+      hyppo                    1               normal
canis_maj+      hyppo      david         1               normal
canis_maj+      hyppo     cesare         1               normal

o) Create the QOS. Create 2 qos with different priorities:

->sacctmgr create qos zebra priority=100
->sacctmgr create qos crock priority=50

After the qos have been created, you have to tell Slurm which users are authorized to use them. In our example the users are david and cesare, and we give them access to all qos:

->sacctmgr update user david set qos=normal,zebra,crock
->sacctmgr show assoc format=cluster,account,user,share,qos
   Cluster    Account       User     Share                  QOS
---------- ---------- ---------- --------- --------------------
canis_maj+       root                    1               normal
canis_maj+       root       root         1               normal
canis_maj+      hyppo                    1               normal
canis_maj+      hyppo     cesare         1   crock,normal,zebra
canis_maj+      hyppo      david         1   crock,normal,zebra

A qos allows you to configure several limits for its users. Based on what you said, we suggest configuring GrpCPURunMins and GrpNodes, which let you control the amount of CPU time and the number of nodes used by the qos.

o) The next step is to configure the preemption relationship between the qos. Let's say qos zebra can preempt qos crock:

->sacctmgr update qos zebra set preempt=crock
->sacctmgr show qos format=name,priority,preempt
      Name   Priority    Preempt
---------- ---------- ----------
    normal          2
     zebra        100      crock
     crock         50

You can test preemption by submitting jobs to the lower priority qos until the cluster is full, then submitting one job to the higher priority qos; you will see a lower priority job get preempted right away.
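Users then pick a qos at submission time; for example (the job script names are hypothetical):

```
sbatch --qos=crock low_prio.sh   # preemptable job
sbatch --qos=zebra urgent.sh     # may preempt crock jobs
squeue -O jobid,qos,state        # observe the preemption
```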
o) Finally, in the account tree you configure your hierarchy and your shares based on your requirements. In this example we have left the shares at the default value of 1. The fairshare priority will further prioritize user jobs inside the qos.

David
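Shares can be adjusted later without recreating the tree; for example, using the accounts from above:

```
# Raise account hyppo's raw share in the fairshare tree
sacctmgr modify account where name=hyppo set fairshare=20
# Inspect the resulting normalized shares and usage
sshare -a
```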
David, thanks. I have created a default queue, created a qos "projects", and set the share for each account. This is working as designed.

PARTITION   AVAIL  TIMELIMIT NODES STATE NODELIST
defaultq*      up 14-04:15:0     7  idle ci[260-266]
idle           up 14-04:15:0     7  idle ci[260-266]
interactive    up   12:20:00     1  idle ci267
gpu            up   12:20:00     1  idle ci267
smp            up   12:20:00     1  idle ci267

   Cluster    Account       User     Share                  QOS GrpCPURunMins  Partition
---------- ---------- ---------- --------- -------------------- ------------- ----------
   cluster       root                    1               normal
   cluster       root       root         1               normal
   cluster       grp1                   20             projects           665
   cluster       grp1         a1         1             projects
   cluster       grp1         a2         1             projects
   cluster       grp2                   25             projects          1024
   cluster       grp2         b1         1             projects
   cluster       grp2         b2         1             projects

So far this works for the "projects" based requirement. In addition, I would like the following:

1) I would still like to maintain a separate partition for preemption called "idle", for ease of user understanding and so these jobs are easy to recognize. The idle queue priority is lower than "defaultq" and its jobs will be terminated by "defaultq" jobs. I would like a 512 core limit per user for the "idle" queue, and this limit should not count against, and should be independent from, the qos "projects". It should be open to all users; do I need to associate all users with this partition using sacctmgr?

2) I would also like to give high priority to "20 minutes to 2 hours" jobs and reserve some nodes for short jobs: allow all users and assign a max core limit per user. What is the best way to configure this? Create qos=short and assign it to all users? How do I define the per-user core limit?
Define a preemptable QOS and set its preempt action to requeue. The MaxCpusPerUser parameter limits the number of cpus per user in a given QOS. David
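A sketch for the "idle" requirement along these lines (the priority value is an example):

```
# Low-priority, preemptable qos with a per-user CPU cap
sacctmgr create qos idle priority=5 MaxCpusPerUser=512
# Let the "projects" workload preempt it
sacctmgr modify qos projects set preempt=idle
```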
How about option 2 from my last comment?

2) I would also like to give high priority to "20 minutes to 2 hours" jobs and reserve some nodes for short jobs: allow all users and assign a max core limit per user. What is the best way to configure this? Create qos=short and assign it to all users? How do I define the per-user core limit?

Best Regards
Hi Mohammed, as we discussed previously, this can also be done with qos. The idea is to have one partition shared by all and to control access to resources with qos. A qos lets you set priorities and all sorts of limits per user, per job and per group. So the answer to your question is yes: create a qos that satisfies the requirement of high priority with a max core limit per user. David
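One possible sketch (the priority, wall time and cap values are examples, and "a1" is one of the users from your earlier listing):

```
# High-priority qos for short jobs: 2 hour wall limit, per-user core cap
sacctmgr create qos short priority=200 MaxWall=02:00:00 MaxCpusPerUser=256
# Add it to a user's qos list without replacing the existing entries
sacctmgr modify user where name=a1 set qos+=short
```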
Info given. David