| Summary: | remaining questions to complete our setup | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Benoit Marchand <benoit.marchand> |
| Component: | Other | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 15.08.10 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NYU Abu Dhabi | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 15.08.10 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
(In reply to Benoit Marchand from comment #0)
> We have 5 remaining questions regarding management, configuration,
> operation, and limits for our recent deployment. Can you please assist in
> getting answers?
>
> • iao213 issue with job hung on GrpCPULimit
>
> We have a user who is part of an account "chemistryny_par", which is itself
> a sub-account of "chemistryny", which is a sub-account of "nyuny", part of
> the "others" account. We limit the number of CPUs each account can use.
> However, we have cases where a specific user from an account requires
> special limits, so we apply limits to that user. We understood that the
> precedence is QOS < USER, ACCOUNT, PARTITION, so we thought that setting
> user-specific limits could override the limits of the account he/she
> belongs to. What are we missing here?
>
> [root@slurm1 USERS]# sacct -j 6562
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 6562          Ant_010_a    par_ext chemistry+        560    PENDING      0:0
>
> [root@slurm1 USERS]# ./show-account.sh chemistryny_par
> Account          GrpJobs  GrpTRES  GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall
> chemistryny_par           cpu=112             20       cpu=56   50         12:00:00
>
> [root@slurm1 USERS]# ./show-account.sh chemistryny
> Account          GrpJobs  GrpTRES  GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall
> chemistryny               cpu=252             20                50
>
> [root@slurm1 USERS]# ./show-account.sh nyuny
> Account          GrpJobs  GrpTRES   GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall
> nyuny                     cpu=1540             20                50
>
> [root@slurm1 USERS]# ./show-account.sh others
> Account  GrpJobs  GrpTRES   GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall
> others   200      cpu=1680  500        20                50
>
> [root@slurm1 USERS]# ./show-qos.sh par_ext
> Name     GrpJobs  GrpTRES  GrpSubmit  MaxJobs  MaxSubmit  MaxWall  MaxNodesPU  MinTRES  MaxTRES  MaxTRESPU
> ser_std
> ser_ext
> par_std
> par_ext
>
> PartitionName=par_ext
>    AllowGroups=ALL AllowAccounts=ALL AllowQos=par_ext
>    AllocNodes=login-0-[1-4] Default=NO QoS=par_ext
>    DefaultTime=06:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>    MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=2 LLN=NO MaxCPUsPerNode=UNLIMITED
>    Nodes=compute-[1-13]-[1-18],compute-14-[1-2]
>    Priority=25 RootOnly=NO ReqResv=NO Shared=EXCLUSIVE PreemptMode=OFF
>    State=DOWN TotalCPUs=6608 TotalNodes=236 SelectTypeParameters=N/A
>    DefMemPerCPU=4096 MaxMemPerNode=UNLIMITED

Slurm's hierarchical limits are enforced in the following order, with the Job QOS and Partition QOS order being reversible by using the QOS flag 'OverPartQOS':

1. Partition QOS limit
2. Job QOS limit
3. User association
4. Account association(s), ascending the hierarchy
5. Root/Cluster association
6. Partition limit
7. None

Could you show me the user-specific limit that you describe? I don't see that in your logs above, just account limits.
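One way to surface the per-user limit being asked about here is to query the associations directly with sacctmgr. A sketch, assuming the `iao213` user, the `dalma` cluster, and the account names shown in the logs above:

```shell
# The user's own association (user-level limits). The row with a
# non-empty User column carries the user-specific limits.
sacctmgr show assoc where user=iao213 cluster=dalma \
    format=Cluster,Account,User,Partition,GrpTRES,MaxTRES,MaxJobs,MaxWall

# The enclosing account associations (account-level limits);
# rows with an empty User column carry the account limits.
sacctmgr show assoc where account=chemistryny_par,chemistryny,nyuny,others \
    cluster=dalma \
    format=Cluster,Account,User,GrpTRES,MaxTRES,MaxJobs,MaxWall
```

Comparing the GrpTRES column across the two outputs shows which level each cpu limit is attached to.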
More information online here: http://slurm.schedmd.com/resource_limits.html

==========================================================================

> • how to find limits currently used by a group, account, etc
>
> It would be great if we could interrogate in real-time the current resource
> usage per account / user.

It is not easy to parse, but the output of "scontrol show cache" does this. Here is an example showing an association that has a submit job limit of 4, with 2 jobs already submitted against that limit:

ClusterName=linux Account=root UserName=jette(1000) Partition= ID=7
...
GrpSubmitJobs=4(2)

==========================================================================

> • showscript doesn’t work on terminated jobs
>
> how to do postmortem analysis? We can see a script while a job is running.
> But if a job fails we can't see the script...

The job script is not archived in the job accounting record; it is only available while the job information is available from the slurmctld daemon. The job information will be purged after the job completes and the configured MinJobAge period has passed. You might consider increasing MinJobAge, although doing so may adversely impact performance.

==========================================================================

> • node condos
>
> some research groups have their own nodes added to the cluster
> how to enforce a policy so that their jobs first get dispatched on their
> nodes before they start to allocate nodes from the cluster
> (with other workload management tools we can set the order in which nodes
> are scanned for job allocation)

There are a couple of ways to do this: one based upon Slurm partitions/queues, and a second using Slurm QOS (Quality Of Service).

Probably the easiest way is to configure Slurm partitions/queues for the groups which own specific nodes. These nodes can also be configured to be part of a global partition/queue having a lower priority. The node Weight values should be higher so they are not used if other nodes (not in a condo) are available for use. You can also configure preemption rules so that the condo owners can cause jobs from others using their resources to be requeued. Here is what this might look like in the slurm.conf configuration file:

PreemptType=preempt/partition_prio
PreemptMode=requeue
NodeName=chem[0-127] Weight=100 ...
NodeName=nid[0-511] Weight=1 ...
PartitionName=chem Priority=100 Nodes=chem[0-127] AllowGroups=chemistry Default=no ...
PartitionName=batch Priority=1 Nodes=chem[0-127],nid[0-511] Default=yes ...

Users in the "chemistry" group would either need to submit their jobs with the option "--partition=chem" or "-p chem", or a job_submit plugin could route their jobs automatically to that partition. Jobs can also be placed into multiple partitions/queues, so something like this is valid: "sbatch -p chem,batch ..."

Using QOS can be more flexible in that specific nodes would not need to be configured for the research group, only a specific node count. Let us know if you would prefer that model.

For more information see these documents:
http://slurm.schedmd.com/slurm.conf.html
http://slurm.schedmd.com/sbatch.html
http://slurm.schedmd.com/preempt.html

==========================================================================

> • qrun equivalent
>
> how to force a job to run regardless of the cpu, node limits

Slurm's salloc, sbatch and srun commands let a Slurm administrator or operator set a job's priority to any desired value using the --priority option. The highest supported priority is 4294967293, so "srun --priority=4294967293 ..." would do this. I also just added the option "--priority=top", which is equivalent and easier to remember. That option will be in version 15.05.5 when released.

> Please have a look at the attached file in the original email to help us
> find why that user's jobs hang on GrpCPULimit.
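As a reference for the MinJobAge suggestion in the showscript answer above, retaining completed-job records longer would be a slurm.conf change along these lines (86400 is an illustrative value of one day in seconds; the Slurm default is 300):

```shell
# slurm.conf fragment — keep completed-job records in slurmctld memory for
# 24 hours so scontrol/showscript can still query them; larger values
# increase slurmctld memory use.
MinJobAge=86400
```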
I'm seeing a job that is requesting 560 CPUs while the account has a 112 CPU limit. You imply there is a separate per-user limit, but I don't see any sign of it in what you provided. Please provide the sacctmgr output for that user association and attach the output of "scontrol show cache", which will show what the slurmctld daemon believes the limit is and how much of that limit has been consumed.
Here's the "scontrol show cache" output for the user whose jobs are held on "GrpCPULimit".
Also, we are looking for the Slurm equivalent of PBS "qrun", which forces a job to run regardless of job priority or resource usage and limits.
================
Current Association Manager state
User Records
UserName=iao213(2036919) DefAccount=chemistryny_ser DefWckey= AdminLevel=None
Association Records
ClusterName=dalma Account=chemistryny_par UserName=iao213(2036919) Partition=par_ext ID=778
SharesRaw/Norm/Level/Factor=1/0.00/49/0.00
UsageRaw/Norm/Efctv=1971468.05/1.00/1.00
ParentAccount= Lft=902 DefAssoc=No
GrpJobs=N(0)
GrpSubmitJobs=N(2) GrpWall=N(586.74)
GrpTRES=cpu=560(0),mem=N(0),energy=N(0),node=N(0)
GrpTRESMins=cpu=N(32857),mem=N(134585551),energy=N(0),node=N(1173)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
MaxJobs=20(0) MaxSubmitJobs=50(2) MaxWallPJ=1440
MaxTRESPJ=cpu=560
MaxTRESPN=
MaxTRESMinsPJ=
ClusterName=dalma Account=others UserName= Partition= ID=64
SharesRaw/Norm/Level/Factor=1/0.25/4/0.00
UsageRaw/Norm/Efctv=1975047.24/0.00/0.00
ParentAccount=root(1) Lft=2 DefAssoc=No
GrpJobs=200(0)
GrpSubmitJobs=500(2) GrpWall=N(588.35)
GrpTRES=cpu=1680(0),mem=N(0),energy=N(0),node=N(0)
GrpTRESMins=cpu=N(32917),mem=N(134823577),energy=N(0),node=N(1176)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
MaxJobs=20(0) MaxSubmitJobs=50(2) MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESMinsPJ=
ClusterName=dalma Account=nyuny UserName= Partition= ID=65
SharesRaw/Norm/Level/Factor=1/0.06/4/0.00
UsageRaw/Norm/Efctv=1975047.24/0.00/0.00
ParentAccount=others(64) Lft=445 DefAssoc=No
GrpJobs=N(0)
GrpSubmitJobs=N(2) GrpWall=N(588.35)
GrpTRES=cpu=1540(0),mem=N(0),energy=N(0),node=N(0)
GrpTRESMins=cpu=N(32917),mem=N(134823577),energy=N(0),node=N(1176)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
MaxJobs=20(0) MaxSubmitJobs=50(2) MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESMinsPJ=
ClusterName=dalma Account=chemistryny UserName= Partition= ID=66
SharesRaw/Norm/Level/Factor=1/0.01/7/0.00
UsageRaw/Norm/Efctv=1974988.40/0.00/0.00
ParentAccount=nyuny(65) Lft=830 DefAssoc=No
GrpJobs=N(0)
GrpSubmitJobs=N(2) GrpWall=N(588.35)
GrpTRES=cpu=252(0),mem=N(0),energy=N(0),node=N(0)
GrpTRESMins=cpu=N(32916),mem=N(134819560),energy=N(0),node=N(1176)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
MaxJobs=20(0) MaxSubmitJobs=50(2) MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESMinsPJ=
ClusterName=dalma Account=chemistryny_par UserName= Partition= ID=80
SharesRaw/Norm/Level/Factor=1/0.00/2/0.00
UsageRaw/Norm/Efctv=1974891.42/0.00/0.00
ParentAccount=chemistryny(66) Lft=831 DefAssoc=No
GrpJobs=N(0)
GrpSubmitJobs=N(2) GrpWall=N(587.67)
GrpTRES=cpu=112(0),mem=N(0),energy=N(0),node=N(0)
GrpTRESMins=cpu=N(32914),mem=N(134812940),energy=N(0),node=N(1175)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
MaxJobs=20(0) MaxSubmitJobs=50(2) MaxWallPJ=720
MaxTRESPJ=cpu=56
MaxTRESPN=
MaxTRESMinsPJ=
QOS Records
QOS=par_ext(6)
UsageRaw=2082239.217032
GrpJobs=N(0) GrpSubmitJobs=N(2) GrpWall=N(618.78)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
GrpTRESMins=cpu=N(34703),mem=N(142141216),energy=N(0),node=N(1239)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
# squeue -j 6562
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6562 par_ext Ant_010_ iao213 PD 0:00 20 (AssocGrpCpuLimit)
# qstat -f 6562
Job Id: 6562
Job_Name = Ant_010_a
Job_Owner = iao213@login-0-3
job_state = Q
queue = par_ext
qtime = Mon Sep 19 21:55:39 2016
ctime = Mon Sep 26 11:45:45 2016
Account_Name = chemistryny_par
Priority = 34
euser = iao213(2036919)
egroup = users(100)
Resource_List.walltime = 24:00:00
Resource_List.nodect = 20
Resource_List.ncpus = 560
# scontrol show jobid=6562
JobId=6562 JobName=Ant_010_a
UserId=iao213(2036919) GroupId=users(100)
Priority=34 Nice=0 Account=chemistryny_par QOS=par_ext
JobState=PENDING Reason=AssocGrpCpuLimit Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2016-09-19T21:55:39 EligibleTime=2016-09-19T21:55:39
StartTime=Unknown EndTime=2016-09-28T12:00:44
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=par_ext AllocNode:Sid=login-0-3:3622
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=compute-2-15,compute-6-[1-8,17-18],compute-9-[6-13,18]
NumNodes=20-20 NumCPUs=560 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=560,mem=2293760,node=20
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/scratch/iao213/Anthracene/Screw_010/1BV_ac/run.slurm
WorkDir=/scratch/iao213/Anthracene/Screw_010/1BV_ac
Comment=stdout=/scratch/iao213/Anthracene/Screw_010/1BV_ac/slurm-6562.out
StdErr=/scratch/iao213/Anthracene/Screw_010/1BV_ac/slurm-6562.out
StdIn=/dev/null
StdOut=/scratch/iao213/Anthracene/Screw_010/1BV_ac/slurm-6562.out
Power= SICP=0
Benoit, it appears what you are looking for is a QOS with really high limits, as that would nullify the limits set on the associations.

Reading your assoc manager output, it appears you have a hierarchy of limits:

Account chemistryny_par GrpTRES=cpu=112(0)
- User iao213 GrpTRES=cpu=560(0)

The Group limits are treated as individual limits, so the lowest one will always be enforced regardless of the hierarchy. This isn't the case for the Max limits, where the first one found going up the tree is the only one that is looked at. I'll see if I can make this more clear in the documentation.

If you are looking for a qrun-like option, you should look into adding a QOS called "qrun" with GrpTRES=cpu=100000 Priority=1000000 (or larger), which will override the limits given in the associations and give a large priority boost, hopefully putting the job at the front of the queue. If you want it to preempt things, that can happen as well. It depends on how big of a hammer you are looking for.

Let me know if this works for you or not.

==========================================================================

Benoit, is there anything else needed on this? Or can we close this?

==========================================================================

Thanks Danny. Indeed the documentation doesn't reflect that the CPU limit property isn't scanned through the hierarchy, picking up the first set value, as other properties are. We will rework the entire account strategy next week, set the limits directly at the user association level, and remove the account limits. You can close this ticket, thanks.
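The qrun-style QOS described in the answer above could be set up with sacctmgr commands along these lines (a sketch; the QOS name, limit values, and the iao213 user are illustrative, taken from this ticket):

```shell
# Create a QOS with an effectively unlimited CPU cap and a large
# priority boost; it overrides the association Grp limits when used.
sacctmgr add qos qrun GrpTRES=cpu=100000 Priority=1000000

# Allow the user to submit jobs against the new QOS.
sacctmgr modify user iao213 set qos+=qrun

# The stuck job can then be resubmitted with:
sbatch --qos=qrun ...
```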
We have 5 remaining questions regarding management, configuration, operation, and limits for our recent deployment. Can you please assist in getting answers?

• iao213 issue with job hung on GrpCPULimit

We have a user who is part of an account "chemistryny_par", which is itself a sub-account of "chemistryny", which is a sub-account of "nyuny", part of the "others" account. We limit the number of CPUs each account can use. However, we have cases where a specific user from an account requires special limits, so we apply limits to that user. We understood that the precedence is QOS < USER, ACCOUNT, PARTITION, so we thought that setting user-specific limits could override the limits of the account he/she belongs to. What are we missing here?

[root@slurm1 USERS]# sacct -j 6562
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
6562          Ant_010_a    par_ext chemistry+        560    PENDING      0:0

[root@slurm1 USERS]# ./show-account.sh chemistryny_par
Account          GrpJobs  GrpTRES  GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall
chemistryny_par           cpu=112             20       cpu=56   50         12:00:00

[root@slurm1 USERS]# ./show-account.sh chemistryny
Account          GrpJobs  GrpTRES  GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall
chemistryny               cpu=252             20                50

[root@slurm1 USERS]# ./show-account.sh nyuny
Account          GrpJobs  GrpTRES   GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall
nyuny                     cpu=1540             20                50

[root@slurm1 USERS]# ./show-account.sh others
Account  GrpJobs  GrpTRES   GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall
others   200      cpu=1680  500        20                50

[root@slurm1 USERS]# ./show-qos.sh par_ext
Name     GrpJobs  GrpTRES  GrpSubmit  MaxJobs  MaxSubmit  MaxWall  MaxNodesPU  MinTRES  MaxTRES  MaxTRESPU
ser_std
ser_ext
par_std
par_ext

PartitionName=par_ext
   AllowGroups=ALL AllowAccounts=ALL AllowQos=par_ext
   AllocNodes=login-0-[1-4] Default=NO QoS=par_ext
   DefaultTime=06:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=2 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=compute-[1-13]-[1-18],compute-14-[1-2]
   Priority=25 RootOnly=NO ReqResv=NO Shared=EXCLUSIVE PreemptMode=OFF
   State=DOWN TotalCPUs=6608 TotalNodes=236 SelectTypeParameters=N/A
   DefMemPerCPU=4096 MaxMemPerNode=UNLIMITED

• how to find limits currently used by a group, account, etc.

It would be great if we could interrogate in real time the current resource usage per account / user.

• showscript doesn’t work on terminated jobs

How to do postmortem analysis? We can see a script while a job is running, but if a job fails we can't see the script...

• node condos

Some research groups have their own nodes added to the cluster. How do we enforce a policy so that their jobs first get dispatched on their own nodes before they start to allocate nodes from the cluster? (With other workload management tools we can set the order in which nodes are scanned for job allocation.)

• qrun equivalent

How to force a job to run regardless of the CPU and node limits?