Ticket 5193

Summary: Advice on setting up a high priority "test" QOS
Product: Slurm Reporter: David Baker <d.j.baker>
Component: Configuration    Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: --- CC: alex
Version: 17.02.8   
Hardware: Linux   
OS: Linux   
Site: OCF
OCF Sites: Southampton University

Description David Baker 2018-05-21 08:35:43 MDT
Hello,

Could you please advise me on setting up a QOS in SLURM.

We have a SLURM cluster with 3 partitions defined. They are:

PartitionName=batch  nodes=red[001-464] ExclusiveUser=YES Default=Yes MaxCPUsPerNode=40 DefaultTime=120 MaxTime=48:00:00 State=UP

PartitionName=gpu nodes=indigo[51-60] ExclusiveUser=YES MaxCPUsPerNode=40 DefaultTime=120 MaxTime=48:00:00 QOS=gpu State=UP

PartitionName=gtx1080 nodes=pink[51-60] ExclusiveUser=YES MaxCPUsPerNode=54 DefaultTime=120 MaxTime=48:00:00 QOS=gpu State=UP

What I would like to do is define an additional QOS called "test" which allows users to run test jobs for a short time and at a higher priority. I would like users to be able to apply this QOS to their jobs using "--qos test". 

The idea of the test QOS is for users to run test jobs in the above partitions -- let's say for a max wall time of 2 hours over 2 compute nodes. 
-- Users can only run one job at once and have one job in the queue (for example). 
-- The global total number of test jobs also needs to be limited. 
-- Furthermore, of course, the test QOS needs to be able to override the partition QOS on gpu and gtx1080 (Using OverPartQOS ?).

Could you please advise me on setting up our test QOS.

Best regards,
David
Comment 1 Alejandro Sanchez 2018-05-22 06:21:57 MDT
(In reply to David Baker from comment #0)
> What I would like to do is define an additional QOS called "test" which
> allows users to run test jobs for a short time and at a higher priority. I
> would like users to be able to apply this QOS to their jobs using "--qos
> test".

QOS are defined in Slurm through the sacctmgr command:

https://slurm.schedmd.com/sacctmgr.html#lbAW

$ sacctmgr create qos test <... parameters>

If you want this QOS to have a higher priority than other QOSes, then set its 'Priority' option higher than the rest. You could also consider setting preempt/qos if you want jobs from higher priority QOSes to preempt jobs from lower ones:

https://slurm.schedmd.com/preempt.html
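
For illustration only, a sketch of what QOS-based preemption could look like; the preempt mode and the choice of preempting the "gpu" QOS are assumptions, not recommendations for your site:

```
# slurm.conf (assumed settings)
PreemptType=preempt/qos
PreemptMode=REQUEUE

# Allow jobs in the "test" QOS to preempt jobs running under the "gpu" QOS:
$ sacctmgr modify qos test set preempt=gpu
```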

> The idea of the test QOS is for users to run test jobs in the above
> partitions -- let's say for a max wall time of 2 hours over 2 compute nodes.

These are the related QOS options to address this:

MaxWall Maximum wall clock time each job is able to use.

GrpWall Maximum wall clock time running jobs are able to be allocated in aggregate for this QOS. If this limit is reached, submission requests will be denied and the running jobs will be killed.

MaxTRESPerUser=node=X

and/or

GrpTRES=node=Y

> -- Users can only run one job at once and have one job in the queue (for
> example). 

MaxJobsPerUser

and/or

MaxSubmitJobsPerUser

> -- The global total number of test jobs also needs to be limited.

GrpJobs Maximum number of running jobs in aggregate for this QOS.

> -- Furthermore, of course, the test QOS needs to be able to override the
> partition QOS on gpu and gtx1080 (Using OverPartQOS ?).

Flags
  PartitionMaxNodes If set, jobs using this QOS will be able to override the requested partition's MaxNodes limit.

  PartitionMinNodes If set, jobs using this QOS will be able to override the requested partition's MinNodes limit.

  OverPartQOS If set, jobs using this QOS will be able to override any limits set by the requested partition's QOS.

  PartitionTimeLimit If set, jobs using this QOS will be able to override the requested partition's TimeLimit.
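
Putting the options above together, a hypothetical "test" QOS matching your requirements (2-hour wall time, 2 nodes per user, 1 running plus 1 queued job per user, a global job cap, and partition-QOS override) could be created roughly as follows. The numeric values and the account name are placeholders to adjust for your site:

```
$ sacctmgr create qos test \
    Priority=1000 \
    MaxWall=02:00:00 \
    MaxTRESPerUser=node=2 \
    MaxJobsPerUser=1 \
    MaxSubmitJobsPerUser=2 \
    GrpJobs=20 \
    Flags=OverPartQOS

# Users can only request "--qos=test" if the QOS is in their
# association's QOS list, e.g.:
$ sacctmgr modify account some_account set qos+=test

# Verify the settings:
$ sacctmgr show qos test
```

Note that MaxSubmitJobsPerUser counts both running and pending jobs, so a value of 2 corresponds to "one running plus one queued".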

Additional info:

https://slurm.schedmd.com/resource_limits.html


PS: you opened the bug with Slurm version 17.02.8. I'd take this opportunity to encourage you to upgrade to the latest stable version as soon as possible for two reasons:

1. Many bugs have been fixed since 17.02.8 (and I remember fixes related to the enforcement of all these limits).
2. Version 18.08 is planned to be released in August, and 17.02 will then be unsupported.

Please, let me know if you have further questions. Thank you.
Comment 2 David Baker 2018-05-24 09:57:36 MDT
Hello,

Thank you for your reply. I’ve defined a test QOS with some initial settings and I’m experimenting. So far so good.

Thank you for your note regarding the future version of SLURM and the timescale. We’ll bear that in mind over the next few weeks. We are still developing the cluster, so upgrading Slurm is better done now than later (as well as for the support issue).

Could I ask an additional question in this ticket, or should I really open a new report? I’m just thinking that this issue may be related to the version of Slurm. It looks like I cannot easily get the memory usage from jobs (completed or running). For example…

Running job..

[root@blue53 ~]# sacct -X -j 84448 --format=MaxRSS   ==> no value reported for MaxRSS

Completed jobs.

[root@blue53 ~]# seff 83169
Job ID: 83169
Cluster: i5
User/Group: yc11e14/mm
State: CANCELLED (exit code 0)
Nodes: 2
Cores per node: 40
CPU Utilized: 101-14:45:11
CPU Efficiency: 98.20% of 103-11:29:20 core-walltime
Memory Utilized: 46.22 GB (estimated maximum)
Memory Efficiency: 0.00% of 0.00 MB (0.00 MB/node)

Feel free to start a new report/conversation. It is odd that I am getting undefined memory usage figures (e.g. MaxRSS) from sacct.

JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/slurmdbd

Best regards,
David

Comment 3 David Baker 2018-05-25 07:45:50 MDT
Hello,

Thank you again for your advice regarding a “test” (high priority) QOS.

One interesting question comes to mind, please. We are testing the QOS at the moment, and will be applying it to “exclusive” partitions in our cluster. Can a QOS override the “exclusive” setting on a partition so that users’ (small) test jobs are potentially grouped together on a small subset of the nodes in the cluster? In other words, making the partition act with a “shared” setting, if that makes sense.

I’m on leave next week, but please feel free to cc in my colleagues David Hempston & Keith Daly who may have time to continue the testing next week.

Best regards,
David

Comment 4 Alejandro Sanchez 2018-05-28 04:48:42 MDT
(In reply to David Baker from comment #2)
> Hello,
> 
> Thank you for your reply. I’ve defined a test QOS with some initial settings
> and I’m experimenting. So far so good.
> 
> Thank you for your note regarding the future version of SLURM and the
> timescale. We’ll bear that in mind over the next weeks. We are still
> developing the cluster and so upgrading SURM is better done now than later
> (as well as the support issue).
> 
> Could I ask an additional question in this ticket or should I really only a
> new report? I’m just thinking that this issue may be related to the version
> of slurm. It looks like I cannot easily get the memory usage from jobs
> (completed or running). For example…
> 
> Running job..
> 
> [root@blue53 ~]# sacct -X -j 84448 --format=MaxRSS   ==> no value reported
> for MaxRSS
> 
> Completed jobs.
> 
> [root@blue53 ~]# seff 83169
> Job ID: 83169
> Cluster: i5
> User/Group: yc11e14/mm
> State: CANCELLED (exit code 0)
> Nodes: 2
> Cores per node: 40
> CPU Utilized: 101-14:45:11
> CPU Efficiency: 98.20% of 103-11:29:20 core-walltime
> Memory Utilized: 46.22 GB (estimated maximum)
> Memory Efficiency: 0.00% of 0.00 MB (0.00 MB/node)
> 
> Free feel to start a new report/conversation. It is odd that I am getting
> undefined memory usage figures (eg MaxRSS) from sacct.
> 
> JobAcctGatherType=jobacct_gather/linux
> AccountingStorageType=accounting_storage/slurmdbd
> 
> Best regards,
> David

In order to display different statistics and status information of a running job/step, you can use the 'sstat' command. Please make sure to read the "-j" and "-a" options in the sstat man page:

alex@ibiza:~/t$ sbatch --mem=4096 --wrap "srun ./mem_eater"
Submitted batch job 20006
alex@ibiza:~/t$ sstat -j 20006 -o jobid,maxrss
       JobID     MaxRSS 
------------ ---------- 
20006.0        1536116K 
alex@ibiza:~/t$ sstat -j 20006 -a -o jobid,maxrss
       JobID     MaxRSS 
------------ ---------- 
20006.0        2048120K 
alex@ibiza:~/t$ sstat -j 20006.batch -o jobid,maxrss
       JobID     MaxRSS 
------------ ---------- 
20006.batch       4084K 
alex@ibiza:~/t$

Note that if your application runs directly inside an allocation without running inside a step (i.e. it is not launched via srun), then you'll need to request the status information of your <jobid>.batch step. Otherwise, you can query all the steps with "-a" or a specific <jobid>[.<stepid>].

For finished jobs you can use 'sacct':

alex@ibiza:~/t$ sacct -j 20002 -o jobid,maxrss
       JobID     MaxRSS 
------------ ---------- 
20002                   
20002.batch       2891K 
20002.0        4051185K 
alex@ibiza:~/t$ sacct -j 20002 -o jobid,maxrss -X
       JobID     MaxRSS 
------------ ---------- 
20002                   
alex@ibiza:~/t$

Note that if you use -X, only statistics relevant to the job allocation itself will be shown, and since MaxRSS is a step metric, nothing will be displayed for it with this option. If you query the disaggregated information about all the job steps, each step will display its own MaxRSS.

Does it make sense?

(In reply to David Baker from comment #3)
> Hello,
> 
> Thank you again for your advice regarding a “test” (high priority) QOS.
> 
> One interesting question comes to mind, please. We are testing the QOS at
> the moment, and will be applying the QOS to “exclusive” partitions in our
> cluster. Can a QOS override the “exclusive” setting on a partition so that
> user  (small) test jobs are potentially grouped together a small subset of
> the nodes in the cluster? In other words, making the partition act with a
> “shared” setting, if that makes sense.

There's no flag at the QOS level that can override a partition's OverSubscribe behavior.

At most, what you can do is change/force a job's --oversubscribe/--exclusive option at submit/update time, based on the job's QOS, through a Job Submit Plugin:

https://slurm.schedmd.com/job_submit_plugins.html
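
For illustration only, a sketch of what such a plugin might look like using the job_submit/lua interface; the QOS name "test" and the exact field values are assumptions to verify against the job_submit documentation for your Slurm version:

```lua
-- job_submit.lua (sketch): let jobs in the "test" QOS share nodes
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.qos == "test" then
        -- shared=1 roughly corresponds to the --oversubscribe behavior
        -- (the job does not request whole nodes exclusively)
        job_desc.shared = 1
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```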

> I’m on leave next week, but please feel free to cc in my colleagues David
> Hempston & Keith Daly who may have time to continue the testing next week.

I don't have these two people listed among our SchedMD-supported users, and I don't know their e-mail addresses.
 
> Best regards,
> David
Comment 5 David Baker 2018-06-05 05:16:33 MDT
Hello,

Thank you again. My apologies that I haven’t followed things up for a while. I have just got back from leave and I’m sorting through my emails.
I’m running a few tests this morning, and I can see that your advice about “seeing” memory requirements is really useful, thank you.

I’m still musing over the QOS issue. I’ll get back on that if I need to. Thank you for that.

Best regards,
David

Comment 6 Alejandro Sanchez 2018-06-20 04:06:39 MDT
Hi David. Can we go ahead and close this bug? thanks.
Comment 7 David Baker 2018-06-20 08:03:56 MDT
Hi,

Apologies for having left this ticket open. We have sorted out our high priority “test” QOS and it is working as expected.

Best regards,
David

Comment 8 Alejandro Sanchez 2018-06-20 08:22:33 MDT
(In reply to David Baker from comment #7)
> Hi,
> 
> Apologies to have left this ticket open. We have sorted out our high
> priority “test” QOS and it is working as expected.
> 
> Best regards,
> David

Glad to know. Thanks.