| Summary: | remaining questions to complete our setup | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Benoit Marchand <benoit.marchand> |
| Component: | Other | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 15.08.10 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NYU Abu Dhabi | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 15.08.10 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
(In reply to Benoit Marchand from comment #0)
> We have 5 remaining questions regarding management, configuration,
> operation, and limits for our recent deployment. Can you please assist in
> getting answers?
>
> • iao213 issue with job hung on GrpCPULimit
>
> We have a user who is part of an account "chemistryny_par", which is itself
> a sub-account of "chemistryny", which is a sub-account of "nyuny", part of
> the "others" account. We limit the number of CPUs each account can use.
> However, we have cases where a specific user from an account requires
> special limits, so we apply limits to that user. We understood that the
> precedence is QOS < USER, ACCOUNT, PARTITION, so we thought that setting
> user-specific limits could override the limits of the account he/she
> belongs to. What are we missing here?
>
> [root@slurm1 USERS]# sacct -j 6562
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 6562          Ant_010_a    par_ext chemistry+        560    PENDING      0:0
>
> [root@slurm1 USERS]# ./show-account.sh chemistryny_par
> Account          GrpJobs  GrpTRES  GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall
> chemistryny_par           cpu=112             20       cpu=56   50         12:00:00
>
> [root@slurm1 USERS]# ./show-account.sh chemistryny
> Account          GrpJobs  GrpTRES  GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall
> chemistryny               cpu=252             20                50
>
> [root@slurm1 USERS]# ./show-account.sh nyuny
> Account          GrpJobs  GrpTRES   GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall
> nyuny                     cpu=1540             20                50
>
> [root@slurm1 USERS]# ./show-account.sh others
> Account  GrpJobs  GrpTRES   GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall
> others   200      cpu=1680  500        20                50
>
> [root@slurm1 USERS]# ./show-qos.sh par_ext
> Name     GrpJobs  GrpTRES  GrpSubmit  MaxJobs  MaxSubmit  MaxWall  MaxNodesPU  MinTRES  MaxTRES  MaxTRESPU
> ser_std
> ser_ext
> par_std
> par_ext
>
> PartitionName=par_ext
>    AllowGroups=ALL AllowAccounts=ALL AllowQos=par_ext
>    AllocNodes=login-0-[1-4] Default=NO QoS=par_ext
>    DefaultTime=06:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>    MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=2 LLN=NO MaxCPUsPerNode=UNLIMITED
>    Nodes=compute-[1-13]-[1-18],compute-14-[1-2]
>    Priority=25 RootOnly=NO ReqResv=NO Shared=EXCLUSIVE PreemptMode=OFF
>    State=DOWN TotalCPUs=6608 TotalNodes=236 SelectTypeParameters=N/A
>    DefMemPerCPU=4096 MaxMemPerNode=UNLIMITED

Slurm's hierarchical limits are enforced in the following order, with the Job QOS and Partition QOS order being reversible by using the QOS flag 'OverPartQOS':

1. Partition QOS limit
2. Job QOS limit
3. User association
4. Account association(s), ascending the hierarchy
5. Root/Cluster association
6. Partition limit
7. None

Could you show me the user-specific limit that you describe? I don't see that in your logs above, just account limits.
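One way to surface the per-user limit being asked about here is to query the associations directly with sacctmgr. A sketch, assuming the `iao213` user, the `dalma` cluster, and the account names shown in the logs above:

```shell
# The user's own association (user-level limits). The row with a
# non-empty User column carries the user-specific limits.
sacctmgr show assoc where user=iao213 cluster=dalma \
    format=Cluster,Account,User,Partition,GrpTRES,MaxTRES,MaxJobs,MaxWall

# The enclosing account associations (account-level limits);
# rows with an empty User column carry the account limits.
sacctmgr show assoc where account=chemistryny_par,chemistryny,nyuny,others \
    cluster=dalma \
    format=Cluster,Account,User,GrpTRES,MaxTRES,MaxJobs,MaxWall
```

Comparing the GrpTRES column across the two outputs shows which level each cpu limit is attached to.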
More information online here: http://slurm.schedmd.com/resource_limits.html

==========================================================================

> • how to find limits currently used by a group, account, etc
>
> It would be great if we could interrogate in real-time the current resource
> usage per account / user.

It is not easy to parse, but the output of "scontrol show cache" does this. Here is an example showing an association that has a submit job limit of 4, with 2 jobs already submitted against that limit:

ClusterName=linux Account=root UserName=jette(1000) Partition= ID=7
...
GrpSubmitJobs=4(2)

==========================================================================

> • showscript doesn’t work on terminated jobs
>
> how to do postmortem analysis? We can see a script while a job is running.
> But if a job fails we can't see the script...

The job script is not archived in the job accounting record; it is only available while the job information is available from the slurmctld daemon. The job information will be purged after the job completes and the configured MinJobAge period has passed. You might consider increasing MinJobAge, although doing so may adversely impact performance.

==========================================================================

> • node condos
>
> some research groups have their own nodes added to the cluster
> how to enforce a policy so that their jobs first get dispatched on their
> nodes before they start to allocate nodes from the cluster
> (with other workload management tools we can set the order in which nodes
> are scanned for job allocation)

There are a couple of ways to do this: one based upon Slurm partitions/queues, and a second using Slurm QOS (Quality Of Service).

Probably the easiest way is to configure Slurm partitions/queues for the groups which own specific nodes. These nodes can also be configured to be part of a global partition/queue having a lower priority. The node Weight values should be higher so they are not used if other nodes (not in a condo) are available for use. You can also configure preemption rules so that the condo owners can cause jobs from others using their resources to be requeued. Here is what this might look like in the slurm.conf configuration file:

PreemptType=preempt/partition_prio
PreemptMode=requeue
NodeName=chem[0-127] Weight=100 ...
NodeName=nid[0-511] Weight=1 ...
PartitionName=chem Priority=100 Nodes=chem[0-127] AllowGroups=chemistry Default=no ...
PartitionName=batch Priority=1 Nodes=chem[0-127],nid[0-511] Default=yes ...

Users in the "chemistry" group would either need to submit their jobs with the option "--partition=chem" or "-p chem", or a job_submit plugin could route their jobs automatically to that partition. Jobs can also be placed into multiple partitions/queues, so something like this is valid: "sbatch -p chem,batch ..."

Using QOS can be more flexible in that specific nodes would not need to be configured for the research group, only a specific node count. Let us know if you would prefer that model.

For more information see these documents:
http://slurm.schedmd.com/slurm.conf.html
http://slurm.schedmd.com/sbatch.html
http://slurm.schedmd.com/preempt.html

==========================================================================

> • qrun equivalent
>
> how to force a job to run regardless of the cpu, node limits

Slurm's salloc, sbatch and srun commands let a Slurm administrator or operator set a job's priority to any desired value using the --priority option. The highest supported priority is 4294967293, so "srun --priority=4294967293 ..." would do this. I also just added the option "--priority=top", which is equivalent and easier to remember. That option will be in version 15.05.5 when released.

> Please have a look at the attached file in the original email to help us
> find why that user's jobs hang on GrpCPULimit.
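As a reference for the MinJobAge suggestion in the showscript answer above, retaining completed-job records longer would be a slurm.conf change along these lines (86400 is an illustrative value of one day in seconds; the Slurm default is 300):

```shell
# slurm.conf fragment — keep completed-job records in slurmctld memory for
# 24 hours so scontrol/showscript can still query them; larger values
# increase slurmctld memory use.
MinJobAge=86400
```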
I'm seeing a job that is requesting 560 CPUs while the account has a 112 CPU limit. You imply there is a separate per-user limit, but I don't see any sign of it in what you provided. Please provide the sacctmgr output for that user association and attach the output of "scontrol show cache", which will show what the slurmctld daemon believes the limit is and how much of that limit has been consumed.
Here's the "scontrol show cache" output for the user whose jobs are held on "GrpCPULimit".
Also, we are looking for the Slurm equivalent of PBS "qrun", which forces a job to run regardless of job priority or resource usage and limits.
================
Current Association Manager state
User Records
UserName=iao213(2036919) DefAccount=chemistryny_ser DefWckey= AdminLevel=None
Association Records
ClusterName=dalma Account=chemistryny_par UserName=iao213(2036919) Partition=par_ext ID=778
SharesRaw/Norm/Level/Factor=1/0.00/49/0.00
UsageRaw/Norm/Efctv=1971468.05/1.00/1.00
ParentAccount= Lft=902 DefAssoc=No
GrpJobs=N(0)
GrpSubmitJobs=N(2) GrpWall=N(586.74)
GrpTRES=cpu=560(0),mem=N(0),energy=N(0),node=N(0)
GrpTRESMins=cpu=N(32857),mem=N(134585551),energy=N(0),node=N(1173)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
MaxJobs=20(0) MaxSubmitJobs=50(2) MaxWallPJ=1440
MaxTRESPJ=cpu=560
MaxTRESPN=
MaxTRESMinsPJ=
ClusterName=dalma Account=others UserName= Partition= ID=64
SharesRaw/Norm/Level/Factor=1/0.25/4/0.00
UsageRaw/Norm/Efctv=1975047.24/0.00/0.00
ParentAccount=root(1) Lft=2 DefAssoc=No
GrpJobs=200(0)
GrpSubmitJobs=500(2) GrpWall=N(588.35)
GrpTRES=cpu=1680(0),mem=N(0),energy=N(0),node=N(0)
GrpTRESMins=cpu=N(32917),mem=N(134823577),energy=N(0),node=N(1176)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
MaxJobs=20(0) MaxSubmitJobs=50(2) MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESMinsPJ=
ClusterName=dalma Account=nyuny UserName= Partition= ID=65
SharesRaw/Norm/Level/Factor=1/0.06/4/0.00
UsageRaw/Norm/Efctv=1975047.24/0.00/0.00
ParentAccount=others(64) Lft=445 DefAssoc=No
GrpJobs=N(0)
GrpSubmitJobs=N(2) GrpWall=N(588.35)
GrpTRES=cpu=1540(0),mem=N(0),energy=N(0),node=N(0)
GrpTRESMins=cpu=N(32917),mem=N(134823577),energy=N(0),node=N(1176)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
MaxJobs=20(0) MaxSubmitJobs=50(2) MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESMinsPJ=
ClusterName=dalma Account=chemistryny UserName= Partition= ID=66
SharesRaw/Norm/Level/Factor=1/0.01/7/0.00
UsageRaw/Norm/Efctv=1974988.40/0.00/0.00
ParentAccount=nyuny(65) Lft=830 DefAssoc=No
GrpJobs=N(0)
GrpSubmitJobs=N(2) GrpWall=N(588.35)
GrpTRES=cpu=252(0),mem=N(0),energy=N(0),node=N(0)
GrpTRESMins=cpu=N(32916),mem=N(134819560),energy=N(0),node=N(1176)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
MaxJobs=20(0) MaxSubmitJobs=50(2) MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESMinsPJ=
ClusterName=dalma Account=chemistryny_par UserName= Partition= ID=80
SharesRaw/Norm/Level/Factor=1/0.00/2/0.00
UsageRaw/Norm/Efctv=1974891.42/0.00/0.00
ParentAccount=chemistryny(66) Lft=831 DefAssoc=No
GrpJobs=N(0)
GrpSubmitJobs=N(2) GrpWall=N(587.67)
GrpTRES=cpu=112(0),mem=N(0),energy=N(0),node=N(0)
GrpTRESMins=cpu=N(32914),mem=N(134812940),energy=N(0),node=N(1175)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
MaxJobs=20(0) MaxSubmitJobs=50(2) MaxWallPJ=720
MaxTRESPJ=cpu=56
MaxTRESPN=
MaxTRESMinsPJ=
QOS Records
QOS=par_ext(6)
UsageRaw=2082239.217032
GrpJobs=N(0) GrpSubmitJobs=N(2) GrpWall=N(618.78)
GrpTRES=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
GrpTRESMins=cpu=N(34703),mem=N(142141216),energy=N(0),node=N(1239)
GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0)
MaxJobs= MaxSubmitJobs= MaxWallPJ=
MaxTRESPJ=
MaxTRESPN=
MaxTRESPU=
MaxTRESMinsPJ=
MinTRESPJ=
# squeue -j 6562
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
6562 par_ext Ant_010_ iao213 PD 0:00 20 (AssocGrpCpuLimit)
# qstat -f 6562
Job Id: 6562
Job_Name = Ant_010_a
Job_Owner = iao213@login-0-3
job_state = Q
queue = par_ext
qtime = Mon Sep 19 21:55:39 2016
ctime = Mon Sep 26 11:45:45 2016
Account_Name = chemistryny_par
Priority = 34
euser = iao213(2036919)
egroup = users(100)
Resource_List.walltime = 24:00:00
Resource_List.nodect = 20
Resource_List.ncpus = 560
# scontrol show jobid=6562
JobId=6562 JobName=Ant_010_a
UserId=iao213(2036919) GroupId=users(100)
Priority=34 Nice=0 Account=chemistryny_par QOS=par_ext
JobState=PENDING Reason=AssocGrpCpuLimit Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2016-09-19T21:55:39 EligibleTime=2016-09-19T21:55:39
StartTime=Unknown EndTime=2016-09-28T12:00:44
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=par_ext AllocNode:Sid=login-0-3:3622
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=compute-2-15,compute-6-[1-8,17-18],compute-9-[6-13,18]
NumNodes=20-20 NumCPUs=560 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=560,mem=2293760,node=20
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/scratch/iao213/Anthracene/Screw_010/1BV_ac/run.slurm
WorkDir=/scratch/iao213/Anthracene/Screw_010/1BV_ac
Comment=stdout=/scratch/iao213/Anthracene/Screw_010/1BV_ac/slurm-6562.out
StdErr=/scratch/iao213/Anthracene/Screw_010/1BV_ac/slurm-6562.out
StdIn=/dev/null
StdOut=/scratch/iao213/Anthracene/Screw_010/1BV_ac/slurm-6562.out
Power= SICP=0
Benoit, it appears what you are looking for is a QOS with really high limits, as that would nullify the limits set on the associations.

Reading your assoc manager output, it appears you have a hierarchy of limits:

Account chemistryny_par GrpTRES=cpu=112(0)
- User iao213 GrpTRES=cpu=560(0)

The Group limits are treated as individual limits, so the lowest one will always be enforced regardless of the hierarchy. This isn't the case for the Max limits, where the first one found going up the tree is the only one that is looked at. I'll see if I can make this more clear in the documentation.

If you are looking for a qrun-like option, you should look into adding a QOS called "qrun" with GrpTRES=cpu=100000 Priority=1000000 (or larger), which will override the limits given in the associations and give a large priority boost, hopefully putting the job at the front of the queue. If you want it to preempt things, that can happen as well. It depends on how big of a hammer you are looking for.

Let me know if this works for you or not.

==========================================================================

Benoit, is there anything else needed on this? Or can we close this?

==========================================================================

Thanks Danny. Indeed the documentation doesn't reflect that the CPU limit property isn't scanned through the hierarchy, picking up the first set value, as other properties are. We will rework the entire account strategy next week, set the limits directly at the user association level, and remove the account limits. You can close this ticket, thanks.
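The qrun-style QOS described in the answer above could be set up with sacctmgr commands along these lines (a sketch; the QOS name, limit values, and the iao213 user are illustrative, taken from this ticket):

```shell
# Create a QOS with an effectively unlimited CPU cap and a large
# priority boost; it overrides the association Grp limits when used.
sacctmgr add qos qrun GrpTRES=cpu=100000 Priority=1000000

# Allow the user to submit jobs against the new QOS.
sacctmgr modify user iao213 set qos+=qrun

# The stuck job can then be resubmitted with:
sbatch --qos=qrun ...
```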
We have 5 remaining questions regarding management, configuration, operation, and limits for our recent deployment. Can you please assist in getting answers?

• iao213 issue with job hung on GrpCPULimit

We have a user who is part of an account "chemistryny_par", which is itself a sub-account of "chemistryny", which is a sub-account of "nyuny", part of the "others" account. We limit the number of CPUs each account can use. However, we have cases where a specific user from an account requires special limits, so we apply limits to that user. We understood that the precedence is QOS < USER, ACCOUNT, PARTITION, so we thought that setting user-specific limits could override the limits of the account he/she belongs to. What are we missing here?

[root@slurm1 USERS]# sacct -j 6562
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
6562          Ant_010_a    par_ext chemistry+        560    PENDING      0:0

[root@slurm1 USERS]# ./show-account.sh chemistryny_par
Account          GrpJobs  GrpTRES  GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall
chemistryny_par           cpu=112             20       cpu=56   50         12:00:00

[root@slurm1 USERS]# ./show-account.sh chemistryny
Account          GrpJobs  GrpTRES  GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall
chemistryny               cpu=252             20                50

[root@slurm1 USERS]# ./show-account.sh nyuny
Account          GrpJobs  GrpTRES   GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall
nyuny                     cpu=1540             20                50

[root@slurm1 USERS]# ./show-account.sh others
Account  GrpJobs  GrpTRES   GrpSubmit  MaxJobs  MaxTRES  MaxSubmit  MaxWall
others   200      cpu=1680  500        20                50

[root@slurm1 USERS]# ./show-qos.sh par_ext
Name     GrpJobs  GrpTRES  GrpSubmit  MaxJobs  MaxSubmit  MaxWall  MaxNodesPU  MinTRES  MaxTRES  MaxTRESPU
ser_std
ser_ext
par_std
par_ext

PartitionName=par_ext
   AllowGroups=ALL AllowAccounts=ALL AllowQos=par_ext
   AllocNodes=login-0-[1-4] Default=NO QoS=par_ext
   DefaultTime=06:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=2 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=compute-[1-13]-[1-18],compute-14-[1-2]
   Priority=25 RootOnly=NO ReqResv=NO Shared=EXCLUSIVE PreemptMode=OFF
   State=DOWN TotalCPUs=6608 TotalNodes=236 SelectTypeParameters=N/A
   DefMemPerCPU=4096 MaxMemPerNode=UNLIMITED

• how to find limits currently used by a group, account, etc.

It would be great if we could interrogate in real time the current resource usage per account / user.

• showscript doesn’t work on terminated jobs

How to do postmortem analysis? We can see a script while a job is running, but if a job fails we can't see the script...

• node condos

Some research groups have their own nodes added to the cluster. How do we enforce a policy so that their jobs first get dispatched on their own nodes before they start to allocate nodes from the cluster? (With other workload management tools we can set the order in which nodes are scanned for job allocation.)

• qrun equivalent

How to force a job to run regardless of the CPU and node limits?