| Summary: | Slurm Evaluation: like to allocate resources after job is submitted | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | mn <mohammed.naseemuddin> |
| Component: | Limits | Assignee: | David Bigagli <david> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | da |
| Version: | 14.11.x | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | KAUST | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | pdf describing user requirement | | |
In Slurm, resources are allocated by partitions, and access to partitions can be controlled by user groups. Slurm has the concept of a submission plugin. This is similar to LSF's esub, but instead of being executed by the submission program (bsub/sbatch) it is run by slurmctld, which is the equivalent of mbatchd. The submission plugin has access to, and can modify, the job submission parameters.
For your case 1) you would define a partition per research group. For case 2) you would configure partitions with different runtime limits, then use the submission plugin to assign the correct partition. For case 3) you could configure group limits so that jobs running in the partition dedicated to group 1 get 1000 core hours, and the same for group 2.
David
David,
Thanks for the feedback.
> For your case 1) you would define a partition per research group.

We had this before, but we don't want to define so many queues (partitions); we would like to accomplish this using a few common queues. That's why I have a custom esub to take care of this. Could you provide some examples of plugins and configuration information?

> 2) you would configure partitions with different runtime limits then use the submission plugin to assign the correct partition.

I would like to have one "queue" (partition) with a default runtime of 2 hours and a max runtime of 72 hours, and use plugins to check the runtime and assign the shared resources and the reserved resources.

> 3) you could configure group limits so that jobs that will run in the partition dedicated to group 1 will get 1000 core hours, the same way for group 2.

How will this work with respect to how much they have consumed? I would like to have something like a bank account, and assign the resources only if they still have a balance.
Hi, there are examples of submission plugins in the source code tree under src/plugins/job_submit; they can easily be customized to your needs.
It seems that you require one queue where the jobs wait to be scheduled, plus sets of different attributes that apply to different groups. For example, jobs belonging to group A have a certain resource limit, jobs of group B a different run time, and so on. I think the concept of QOS is what you should look at: a QOS represents a logical grouping of jobs sharing a common set of limits and parameters. This is similar to LSF queues, except that LSF queues have jobs physically pending in them. Please see: http://slurm.schedmd.com/qos.html
Slurm has hierarchical fairshare in much the same way as LSF does. It uses the concept of an association, which is a tuple (cluster:account:user) from which shares are withdrawn. You can assign shares and then monitor their usage with the sshare command.
David
Hello,
do you have further questions under this ticket?
David
Hi David,
Thanks for the feedback. I was going through the QOS material, multifactor fairshare, etc. I need some more assistance to accomplish this; I am ready to apply it on a test system.
We are implementing a new cluster with 7000 cores. I would like to have the following queues: "long", "short", "idle", "gpu", "smp", and "interactive".
For the "long" queue, I would like 70% of total capacity to be assigned to the different research groups and users: group-A 10%, group-B 15%, etc. The remaining capacity will be available to all users as fairshare. I am thinking of accomplishing this using multifactor fairshare; please help me with how to take care of this.
"short", with a 2-hour run limit, should go to specific nodes assigned to this partition based on fairshare; usage of "long" shouldn't impact "short" usage. The "short" queue with a 24- to 48-hour run limit should go to other sets of nodes.
"gpu" and "smp" should be for the GPU nodes and the large-memory SMP nodes.
The "idle" queue is a low-priority preemptable queue that should run jobs on all the nodes, but jobs in this queue will be terminated by higher-priority "long" or "short" jobs.
Best Regards
Hi, in order to use fairshare you first have to enable Slurm accounting using the database: http://slurm.schedmd.com/accounting.html
Then, using the sacctmgr command, you create your user hierarchy and assign different shares to users. The fairshare priority of jobs has multiple factors, as described in the PriorityWeight paragraphs here: http://slurm.schedmd.com/slurm.conf.html
Which weights to use is your choice, based on how you would like to prioritize your workload. Note that the fairshare tree is global, meaning the usage of resources is accounted across all partitions.
Does this help to answer your question?
David
Yes, that answers the question about fairshare not being available per partition.
I would appreciate it if you could guide me to the best configuration options to accomplish this:
1) I would like a queue/partition called "long"; 70% of its capacity will be pre-allocated to different research groups.
2) The remaining 30% of capacity will be available to all users, both those who have an allocation and those who don't.
3) Another queue/partition called "short" should be available to all users; some nodes will be dedicated to this queue and some will be used from the reserved allocation capacity.
4) A preemptable queue/partition "idle" should be available to all users, should use the entire cluster, and should have its jobs terminated by higher-priority queues.
5) A "gpu" queue/partition for all users on the GPU nodes, with fairshare.
6) An "smp" queue/partition on the SMP nodes for all users, with fairshare.
When I created all these queues, created accounts for the groups grp1, grp2, and grp3 with fairshare values 20, 30, and 10, and assigned the users, I got the following error when trying to submit jobs:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
So do I have to associate each of these users with a partition? And if so, how can I accomplish the above configuration? See my slurm.conf below:
ControlMachine=ci267
BackupController=ci266
AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/usr/local/slurm/14.03.7/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/usr/local/slurm/14.03.7/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/usr/local/slurm/14.03.7/spool/slurmd
SlurmUser=rcslmadm
StateSaveLocation=/ibxadm/slurm/smc_test/14.03.7/state
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityCalcPeriod=5
PriorityFavorSmall=YES
PriorityMaxAge=7
PriorityWeightAge=10000
PriorityWeightFairshare=100000
PriorityWeightJobSize=10000
PriorityWeightPartition=10000
PriorityWeightQOS=0
AccountingStorageEnforce=limits
AccountingStorageHost=ci267
AccountingStorageLoc=slurm_acct_db
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompLoc=/usr/local/slurm/14.03.7/work/logdir/slurm_job.log
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/slurmsched.log
SlurmSchedLogLevel=3
NodeName=ci267,ci[260-266] Sockets=2 CoresPerSocket=8 State=UNKNOWN
PartitionName=long Nodes=ci[260-266] Default=NO DefaultTime=10110 MaxTime=20415 Priority=10 State=UP
PartitionName=idle Nodes=ci[260-266] Default=NO MaxTime=20415 Priority=5 PreemptMode=CANCEL Shared=FORCE State=UP
PartitionName=short Nodes=ci[265-266] Default=NO DefaultTime=130 MaxTime=4330 Priority=10 PreemptMode=CANCEL Shared=FORCE State=UP
PartitionName=interactive Nodes=ci267 Default=NO DefaultTime=740 MaxTime=740 Priority=10 PreemptMode=CANCEL Shared=FORCE State=UP
Hi,
let me try to address your questions.
1) Fairshare.
In order to assign capacity using fairshare you have to create
accounts using the sacctmgr command and then add users to those
accounts. Once this has been done, you assign them the desired
shares. Note that assigned shares are not directly percentage values;
they are normalized by Slurm. In other words, say you define shares
for 3 groups as 3, 2, and 1. Slurm sums these numbers and then normalizes,
so the shares will be 3/6 -> 50%, 2/6 -> 33%, 1/6 -> 17%.
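This normalization can be reproduced with a small shell snippet (the share values 3, 2, 1 are just the illustrative ones from above):

```shell
# Normalize raw shares the way Slurm does: share_i / sum(all shares)
shares="3 2 1"
total=0
for s in $shares; do total=$((total + s)); done
for s in $shares; do
  awk -v s="$s" -v t="$total" 'BEGIN { printf "share %s -> %.1f%%\n", s, 100*s/t }'
done
```

which prints 50.0%, 33.3%, and 16.7% for shares 3, 2, and 1 respectively.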
The error:
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
happens because you have AccountingStorageEnforce=limits set in your
slurm.conf, but the user running srun or sbatch does not have a valid
account. You have to add all users explicitly to accounting:
->sacctmgr add account zebra
->sacctmgr add user david account zebra
->sacctmgr show assoc format=account,user
Account User
---------- ----------
root
root root
zebra
zebra david
Now user david can submit jobs.
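Once the association exists, a job can be submitted against the account explicitly (the account name here is the illustrative one from above):

```
sbatch --account=zebra --wrap="hostname"
```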
2) Partition structure.
It seems that you want to configure partitions, long, short etc.
with users having different shares per partition. This is similar
to the LSF queue based fairshare. This is possible in Slurm by
defining the partition when creating accounts.
->sacctmgr add account izar
->sacctmgr add user david account=izar partition=markab
However, we don't recommend this, as it is hard to maintain: every time
you add or remove a partition, the accounting configuration needs to change.
Our recommendation is to replace partitions with QOS, implementing
what we call a 'floating partition'. You configure several QOS with
different job limits, scheduling priorities, and preemption rules.
You can then keep partitions for very specific hardware, like GPUs.
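A minimal sketch of the floating-partition idea, using one shared partition and per-workload QOS (the QOS names, priorities, and limits are illustrative, not a tested configuration):

```
# Per-workload behavior comes from QOS, not from partitions.
->sacctmgr create qos short priority=200 MaxWall=02:00:00
->sacctmgr create qos long  priority=100 MaxWall=72:00:00
->sacctmgr create qos idle  priority=10
->sacctmgr update qos long set preempt=idle
```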
Let us know your thoughts so we can discuss these options further.
David
David,
Whatever you think is good, we will use. Please walk through one QOS example: a "short" QOS with a 2-hour run limit for all users, and a preempt QOS for all users whose jobs can be preempted by high-priority "long" jobs.
2) I have defined the accounts and users; can I now run the jobs? I don't know whether this is the correct way to allocate the capacity for each group.
In this configuration I have assigned 20% of the total capacity to acct2, 5% to acct3, and 20% to acct5.
I would like to make sure that each of these accounts cannot exceed its limit; if it does, its jobs should pend. This should apply to the "long" queue only,
but these users should still be able to run jobs using the 2-hour-run-limit QOS and the preempt QOS.
sshare
Account User Raw Shares Norm Shares Raw Usage Effectv Usage FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root 1.000000 165409 1.000000 0.500000
root root 1 0.016667 0 0.000000 1.000000
acct1 1 0.016667 0 0.000000 1.000000
acct2 20 0.333333 67304 0.394754 0.440049
acct3 5 0.083333 32382 0.002838 0.976668
none 1 0.016667 0 0.000000 1.000000
acct4 10 0.166667 0 0.000000 1.000000
test 1 0.016667 0 0.000000 1.000000
acct5 20 0.333333 0 0.000000 1.000000
zebra 1 0.016667 65722 0.602407 0.000000
Hi, I will provide a detailed example of QOS. First, I would like a clearer idea of what you mean by limiting users. Do you have in mind an absolute limit, say on CPU usage, such that when it is hit the users cannot run anymore? Or rather capping the resource usage, as a QOS does, for example:
GrpCpus: The total count of CPUs able to be used at any given time by jobs running from this QOS. If this limit is reached, new jobs will be queued but only allowed to run after resources have been relinquished from this group.
Could you please clarify this for me.
David
David, I sent the resource allocation document; could you please take a look and let me know the best way to accomplish this.
Best Regards
NM
Created attachment 1353 [details]
pdf describing user requirement
Mohammed,
here are some recommendations how to configure your cluster.
Hope this helps to get you going.
o) In your slurm.conf configure preemption based on qos.
PreemptMode=requeue
PreemptType=preempt/qos
o) Create the appropriate accounts:
->sacctmgr show assoc format=cluster,account,user,share,qos
Cluster Account User Share QOS
---------- ---------- ---------- --------- --------------------
canis_maj+ root 1 normal
canis_maj+ root root 1 normal
canis_maj+ hyppo 1 normal
canis_maj+ hyppo david 1 normal
canis_maj+ hyppo cesare 1 normal
o) Create the QOS.
Create 2 qos with different priorities.
->sacctmgr create qos zebra priority=100
->sacctmgr create qos crock priority=50
After the qos have been created, you have to tell Slurm which users
are authorized to use them. In our example the users are david and cesare,
and we give them access to all qos:
->sacctmgr update user david set qos=normal,zebra,crock
->sacctmgr update user cesare set qos=normal,zebra,crock
->sacctmgr show assoc format=cluster,account,user,share,qos
Cluster Account User Share QOS
---------- ---------- ---------- --------- --------------------
canis_maj+ root 1 normal
canis_maj+ root root 1 normal
canis_maj+ hyppo 1 normal
canis_maj+ hyppo cesare 1 crock,normal,zebra
canis_maj+ hyppo david 1 crock,normal,zebra
A qos allows you to configure several limits for its users.
Based on what you said, we suggest configuring GrpCPURunMins and
GrpNodes, which allow you to control the amount of CPU time and the number of
nodes used by the qos.
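For example, such limits can be attached to a qos like this (the values are illustrative only):

```
->sacctmgr update qos zebra set GrpCPURunMins=100000 GrpNodes=10
```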
o) The next step is to configure the preemption relationship between
the qos. Let's say qos zebra can preempt qos crock.
->sacctmgr update qos zebra set preempt=crock
->sacctmgr show qos format=name,priority,preempt
Name Priority Preempt
---------- ---------- ----------
normal 2
zebra 100 crock
crock 50
You can test preemption by submitting jobs to the lower-priority qos
until the cluster is full, then submitting one job to the higher-priority
qos; you will see a lower-priority job getting preempted right away.
o) Finally in the account tree you configure your hierarchy and your
shares based on your requirement. In this example we have left the shares
to be at the default value 1. The fairshare priority will further prioritize
user jobs inside the qos.
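Users then select a qos at submission time, for example (the qos name is the illustrative one from above):

```
sbatch --qos=crock --wrap="sleep 60"
```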
David
David,
Thanks. I have created a default queue, created a qos "project", and set the shares for each account. This is working as designed.

PARTITION     AVAIL  TIMELIMIT   NODES  STATE  NODELIST
defaultq*     up     14-04:15:0  7      idle   ci[260-266]
idle          up     14-04:15:0  7      idle   ci[260-266]
interactive   up     12:20:00    1      idle   ci267
gpu           up     12:20:00    1      idle   ci267
smp           up     12:20:00    1      idle   ci267

Cluster    Account    User    Share    QOS        GrpCPURunMins  Partition
---------- ---------- ------- -------- ---------- -------------- ----------
cluster    root               1        normal
cluster    root       root    1        normal
cluster    grp1               20       projects   665
cluster    grp1       a1      1        projects
cluster    grp1       a2      1        projects
cluster    grp2               25       projects   1024
cluster    grp2       b1      1        projects
cluster    grp2       b2      1        projects

So far it is working for the "project"-based requirement. In addition to this, I would like the following:
1) I would still like to maintain a separate partition for preemption called "idle", for ease of user understanding and so that these jobs are easy to recognize. The idle queue's priority is lower than "defaultq", and its jobs will be terminated by "defaultq" jobs. I would like a 512-core limit per user for the "idle" queue, and this limit should be independent of the "project" qos and not counted against it. It should allow all users; do I need to associate all the users with this partition using sacctmgr?
2) I would also like to give high priority to 20-minute to 2-hour jobs and reserve some nodes for short jobs. Allow all users, and assign a max core limit to each user. What is the best way to configure this: create qos=short and assign all the users? How do I define the per-user core limit?
Define a preemptable QOS and set the preempt action to requeue. The parameter MaxCpusPerUser will limit the number of CPUs per user in a given QOS.
David
David,
How about option 2 from my last comment?
> 2) Also I like to give high priority to "20 minutes and 2hrs" jobs and reserve some nodes for short jobs. Allow all users, assign max core limit to each user. what is the best way to configure this, create qos=short and assign all the users how to define user cores limit.
Best Regards
Hi Mohammed,
as we discussed previously, this can also be done with qos.
The idea is to have one partition shared by all and to control access to resources with qos. A qos allows you to set priorities and all sorts of per-user, per-job, and per-group limits. So the answer to your question is yes: create a qos to satisfy the requirement of high priority and a max core limit per user.
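Concretely, such a qos could be sketched like this (the name, priority, and limit values are illustrative):

```
->sacctmgr create qos short priority=200 MaxWall=02:00:00 MaxCpusPerUser=512
->sacctmgr update user david set qos+=short
```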
David
Info given. David
Hi, I am new to Slurm; we are testing Slurm in our environment. I have the following questions: how do I assign resources based on userid and/or groupid, on job runtime, and on used CPU hours? For example, we have the following situations:
1) Groups of nodes belong to different research groups. When a user submits a job to the "long" queue, I would like to check the submitter's group membership and assign only those resources belonging to his group.
2) When a user submits a job to the short queue with a RUNTIME limit of 2 hours, I assign all the resources; if the RUNTIME limit is > 2 hours, I need to assign only those resources belonging to the user's group.
3) I would like to assign resources based on CPU hours used. For example, group1 and group2 each have 1000 core hours; I would like to assign resources from the reserved resources to group1 and group2 users until they have consumed all 1000 core hours.
Currently I accomplish this in LSF using esub, modifying the job resources based on the above criteria with my own custom scripts for resource allocation.
Appreciate your response.
Best Regards
--Naseem