Ticket 17091

Summary: Require fairshare configuration to prevent one user from monopolizing queue
Product: Slurm Reporter: Clay Fandre <clay.fandre>
Component: Configuration Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 22.05.2   
Hardware: Linux   
OS: Linux   
Site: Honeywell HPC

Description Clay Fandre 2023-06-30 09:55:46 MDT
We have a small cluster with 4 compute nodes, each with 24 cores.

The jobs run on this cluster are all single-core jobs, and most users do not adjust the walltime from the default of 3 hours. Jobs typically take 1-2 hours to complete.

Lately a single user has been submitting hundreds of jobs, causing others to wait days for their own jobs to run. Understandably, this has left some users a bit disgruntled.

Please provide a configuration that gives jobs from users with no active jobs a higher priority than the pending jobs of users who already have jobs running.

Thanks.

Clay Fandre
Comment 1 Jason Booth 2023-06-30 10:53:20 MDT
There are a few options available to you. You can set a maximum number of jobs (or CPUs) a user may have running at any one time, or you can lower a user's job priority based on their recent usage/consumption.

Would you please attach your current slurm.conf so that we can review what you have configured?

[1] https://slurm.schedmd.com/priority_multifactor.html#fairshare
[2] https://slurm.schedmd.com/sacctmgr.html#OPT_FairShare=
[3] https://slurm.schedmd.com/sacctmgr.html#SECTION_EXAMPLES
[4] https://slurm.schedmd.com/sacctmgr.html#OPT_GrpTRESRunMins
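
For example (a rough sketch only; the user name and values below are placeholders, and this assumes the user already has an association in the accounting database), a per-user cap could be set with sacctmgr:

  # Limit a user to 24 running jobs at any one time (placeholder name/value)
  sacctmgr modify user where name=someuser set MaxJobs=24
  # Or cap the running CPU-minutes a user's jobs may occupy at once
  sacctmgr modify user where name=someuser set GrpTRESRunMins=cpu=8640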
Comment 2 Clay Fandre 2023-06-30 11:09:25 MDT
[root@asic-az97n204 slurm]# cat slurm.conf
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
# $Revision: 110 $
#
ClusterName=asic-az97-hpc
SlurmctldHost=asic-az97n204
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
LaunchParameters=use_interactive_step
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
SrunPortRange=60001-63000
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
PriorityWeightFairshare=10000
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
AccountingStorageHost=asic-az97n204
#AccountingStoragePass=
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageUser=
#AccountingStoreFlags=
#JobCompHost=
JobCompLoc=/var/log/slurm/slurm.jobcomp.log
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/filetxt
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=blazecomp[1-4] CPUs=24 RealMemory=257151 Sockets=1 CoresPerSocket=24 ThreadsPerCore=1 Weight=1 State=UNKNOWN
NodeName=blazeuser[1-4] CPUs=24 RealMemory=257151 Sockets=1 CoresPerSocket=24 ThreadsPerCore=1 Weight=3 State=UNKNOWN
PartitionName=asic Nodes=blazecomp[1-4] Default=YES MaxTime=1440 DefaultTime=180 State=UP
Comment 4 Ben Roberts 2023-06-30 14:21:43 MDT
Hi Clay,

It looks like you have multifactor priority enabled, with Fairshare as the only weighted factor:
PriorityType=priority/multifactor
PriorityWeightFairshare=10000

This should be enough to get you started using Fairshare to prevent one user from dominating the queue.  Did you recently enable these settings?  When you have several jobs queued, could you run sprio and send what it shows for the priority of the queued jobs?
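
For reference, a slightly fuller priority block in slurm.conf might look something like this (the values here are only illustrative, not a recommendation for your site):

  PriorityType=priority/multifactor
  PriorityDecayHalfLife=7-0        # how quickly past usage stops counting against a user
  PriorityWeightFairshare=10000    # fairshare dominates the priority calculation
  PriorityWeightAge=1000           # small bonus for time already spent in the queue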

Thanks,
Ben
Comment 5 Clay Fandre 2023-06-30 14:27:59 MDT
Yes, I just added those two options and haven't tested them yet, as I wasn't sure that was the best way to do it.

Unfortunately the queue seems to be empty now. I will do some testing when I can to simulate the jobs to see if it solves the problem.

Clay
Comment 6 Clay Fandre 2023-06-30 17:20:21 MDT
So some jobs were submitted, but sprio doesn't seem to be working.

[root@asic-az97n204 ~]# squeue | head
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            573801      asic test_tc_  e503866 PD       0:00      1 (Resources)
            573802      asic test_tt_  e503866 PD       0:00      1 (Priority)
            573803      asic test_tt_  e503866 PD       0:00      1 (Priority)
            573804      asic test_tt_  e503866 PD       0:00      1 (Priority)
            573805      asic test_wdt  e503866 PD       0:00      1 (Priority)
            573806      asic test_wdt  e503866 PD       0:00      1 (Priority)
            573807      asic test_wdt  e503866 PD       0:00      1 (Priority)
            573808      asic test_wdt  e503866 PD       0:00      1 (Priority)
            573809      asic test_wdt  e503866 PD       0:00      1 (Priority)
[root@asic-az97n204 ~]# sprio
          JOBID PARTITION   PRIORITY       SITE  FAIRSHARE
[root@asic-az97n204 ~]#




[root@asic-az97n204 ~]# /app/slurm/bin/showuserjobs
Batch job status for cluster asic-az97-hpc at Fri Jun 30 16:18:26 MST 2023

Node states summary:
allocated      4 nodes (100.00%)     96 CPUs (100.00%)
Total          4 nodes (100.00%)     96 CPUs (100.00%)

Job summary: 1391 jobs total (max=10000) in all partitions.

Username/           Running        Limit Pending
Totals      Account   Jobs   CPUs   CPUs   Jobs   CPUs Further info
=========== ======= ====== ====== ====== ====== ====== =============================
ACCT_TOTAL  (null)      96     96    Inf   1295   1295 Running+Pending=1391 CPUs, 3 users
GRAND_TOTAL ALL         96     96    Inf   1295   1295 Running+Pending=1391 CPUs, 3 users
e503866     (null)      96     96    Inf     18     18
h523709     (null)       0      0    Inf   1252   1252
h359520     (null)       0      0    Inf     25     25
Comment 7 Ben Roberts 2023-07-03 09:00:12 MDT
Hi Clay,

That is strange that sprio doesn't show anything for the queued jobs with the multifactor priority plugin enabled.  Can I have you verify that it is recognized correctly by running:
scontrol show config | grep -i priority

If you have jobs queued right now I'd also like to see the show job output for one of them.  Could you run the following command with the appropriate job id in place of <jobid>:
scontrol show job <jobid>

Thanks,
Ben
Comment 8 Clay Fandre 2023-07-03 09:28:28 MDT
[root@asic-az97n204 ~]# scontrol show config | grep -i priority
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 7-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           =
PriorityMaxAge          = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 0
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 10000
PriorityWeightJobSize   = 0
PriorityWeightPartition = 0
PriorityWeightQOS       = 0
PriorityWeightTRES      = (null)


[root@asic-az97n204 ~]# sprio
          JOBID PARTITION   PRIORITY       SITE  FAIRSHARE
[root@asic-az97n204 ~]# squeue | wc
   5084   40672  401643

Comment 9 Ben Roberts 2023-07-03 10:48:34 MDT
Thanks for verifying that the multifactor plugin is recognized correctly.  Since that's the case, can I have you send a copy of your slurm.conf along with any other conf files in the same directory?

I would also still like to see the job details for a job that's pending:
scontrol show job <jobid>

Thanks,
Ben
Comment 10 Clay Fandre 2023-07-03 12:14:26 MDT
Created attachment 31053 [details]
Slurm conf files

Slurm conf files
Comment 11 Ben Roberts 2023-07-03 14:18:22 MDT
Thanks for sending all your config files for me to review.  I apologize that I forgot to have you remove the database password in your slurmdbd.conf file.  I've marked the attachment as private now so that only SchedMD employees can view the file, but you will probably want to update your password.

It's still not clear what might be preventing sprio from showing priority information about the queued jobs.  I would like to have you enable debug logs that show information about priority calculations long enough to submit a test job.  You can do this without restarting the cluster like this:
scontrol setdebugflags +priority

Once that is enabled you can submit a test job and then turn the debug flag back off like this:
scontrol setdebugflags -priority

Then if you would send the slurmctld log file I'll take a look at what's happening with the priority calculation.
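
Once the flag has been on long enough to cover a test submission, the relevant lines can usually be pulled out of the controller log with something like this (the log path is taken from your slurm.conf; the grep pattern is just a simple filter, not an exact match for the messages):

  grep -i priority /var/log/slurm/slurmctld.log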

Thanks,
Ben
Comment 12 Clay Fandre 2023-07-06 09:22:09 MDT
There were no jobs running this weekend, so I stopped and restarted slurmctld and the slurmds. sprio is now working.



[root@asic-az97n204 slurm]# sprio
          JOBID PARTITION   PRIORITY       SITE  FAIRSHARE
         655117 asic               1          0          0
         655118 asic               1          0          0
         655119 asic               1          0          0
         655120 asic               1          0          0
         655121 asic               1          0          0
         655122 asic               1          0          0
         655123 asic               1          0          0
         655124 asic               1          0          0
Comment 13 Ben Roberts 2023-07-06 09:32:24 MDT
I'm glad that sprio is working after a restart.  It doesn't show any fairshare priority for the jobs, though.  It's possible these are all from a high-utilization user/account, but I'd like to make sure.  Can you also send the output from:
squeue
sshare -a

Thanks,
Ben
Comment 14 Clay Fandre 2023-07-06 09:35:01 MDT

[root@asic-az97n204 slurm]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            655401      asic tp_max_p  h523709  R      10:06      1 blazecomp2
            655395      asic tp_max_p  h523709  R      10:15      1 blazecomp2
            652252      asic tp_ssp_c  h508001  R    1:31:22      1 blazecomp4
            652256      asic tp_ssp_c  h508001  R    1:31:22      1 blazecomp4
            652258      asic tp_ssp_c  h508001  R    1:31:22      1 blazecomp2
[root@asic-az97n204 slurm]# sshare -a
Account                    User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root                                          0.000000           0      1.000000
 root                      root          1    1.000000           0      0.000000   1.000000
[root@asic-az97n204 slurm]#
Comment 15 Ben Roberts 2023-07-06 10:19:54 MDT
Thanks for sending that output.  It looks like some of the initial configuration that needs to happen for Fairshare to work hasn't been done yet.  My apologies that I didn't check for that earlier on.

When you first configure a cluster you can add accounts and users to create a hierarchy that matches your internal organization.  This allows you to track usage by different departments as well as for individual users.  Once you create accounts you can create user associations, which are the combination of the cluster, account, username and optionally the partition they're allowed to use with that account.  Here's an example of how that might look:
$ sacctmgr show assoc tree format=cluster,account,user,partition
   Cluster Account                    User  Partition 
---------- -------------------- ---------- ---------- 
    knight root                                       
    knight  root                      root            
    knight  a1                                        
    knight   sub1                                     
    knight    sub1                     ben            
    knight    sub1                   user1            
    knight    sub1                   user2            
    knight   sub2                                     
    knight    sub2                     ben            
    knight    sub2                   user2            
    knight   sub3                                     
    knight    sub3                     ben            
    knight    sub3                   user3            
    knight  a2                                        
    knight   sub4                                     
    knight    sub4                     ben            
    knight    sub4                   user1            
    knight    sub4                   user4            
    knight   sub5                                     
    knight    sub5                   user5            
    knight   sub6                                     
    knight    sub6                     ben            
    knight    sub6                   user1            
    knight    sub6                   user2            
    knight    sub6                   user3            

You can see that there are two top-level accounts: a1 and a2.  Beneath the a1 account I have sub1, sub2 and sub3 accounts, each with different users.  The a2 account similarly has different sub-accounts that are children of that account.  I don't have a partition associated with any of these user associations, but that is an option, as I mentioned.

I'll show a few examples of how creating these accounts and user associations might look.  To create the a1 account I would run this:
  sacctmgr add account a1

To create the sub1 account as a child of the a1 account I would run this:
  sacctmgr add account sub1 parent=a1

To create my user in the sub1 account I would run this:
  sacctmgr add user ben account=sub1 
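
If you wanted to tie an association to a particular partition, or give one account a larger share of the fairshare tree, that might look something like this (the partition and share value are just examples, not a recommendation):
  sacctmgr add user ben account=sub1 partition=asic
  sacctmgr modify account where name=a1 set fairshare=10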

You can find more information on creating accounts and users as well as setting limits on those entities in the sacctmgr documentation.
https://slurm.schedmd.com/sacctmgr.html

These accounts and user associations have to exist in order for fairshare to track the usage of the different users and adjust the fairshare priority values accordingly.

There is also an option to require that a user have an association created with sacctmgr before they are able to submit jobs.  Right now any user can submit a job to the system because it is not enforcing any kind of account hierarchy.  Once you have created the hierarchy you want, you can enable the AccountingStorageEnforce option in your slurm.conf to turn that on.
https://slurm.schedmd.com/slurm.conf.html#OPT_AccountingStorageEnforce
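
As a sketch, enforcing associations (and optionally the limits set on them) would be a single line in slurm.conf; exactly which flags you enforce is up to you:
  AccountingStorageEnforce=associations,limits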

Let me know if you have any questions about any of this configuration.  

Thanks,
Ben
Comment 16 Clay Fandre 2023-07-06 11:54:47 MDT
Ahhhh, ok. That makes sense. I went ahead and created the accounting data for all of the users. The queue is currently empty but once they submit jobs I will verify things are working.
Comment 17 Ben Roberts 2023-07-10 10:04:42 MDT
Hi Clay,

Were you able to verify that things work as expected after creating users with sacctmgr?  Let me know if you still need help with this ticket or if it's ok to close.

Thanks,
Ben
Comment 18 Clay Fandre 2023-07-10 10:34:24 MDT
I believe things are working as expected. Thanks for checking back and please feel free to close out this ticket.

Clay

Comment 19 Ben Roberts 2023-07-10 10:37:22 MDT
I'm glad to hear things are working.  Closing now.