Hello,

Could you please advise me on setting up a QOS in Slurm? We have a Slurm cluster with three partitions defined:

PartitionName=batch nodes=red[001-464] ExclusiveUser=YES Default=Yes MaxCPUsPerNode=40 DefaultTime=120 MaxTime=48:00:00 State=UP
PartitionName=gpu nodes=indigo[51-60] ExclusiveUser=YES MaxCPUsPerNode=40 DefaultTime=120 MaxTime=48:00:00 QOS=gpu State=UP
PartitionName=gtx1080 nodes=pink[51-60] ExclusiveUser=YES MaxCPUsPerNode=54 DefaultTime=120 MaxTime=48:00:00 QOS=gpu State=UP

What I would like to do is define an additional QOS called "test" which allows users to run test jobs for a short time and at a higher priority. I would like users to be able to apply this QOS to their jobs using "--qos test". The idea of the test QOS is for users to run test jobs in the above partitions -- let's say for a max wall time of 2 hours over 2 compute nodes.

-- Users can only run one job at once and have one job in the queue (for example).
-- The global total number of test jobs also needs to be limited.
-- Furthermore, of course, the test QOS needs to be able to override the partition QOS on gpu and gtx1080 (using OverPartQOS?).

Could you please advise me on setting up our test QOS.

Best regards,
David
(In reply to David Baker from comment #0)
> What I would like to do is define an additional QOS called "test" which
> allows users to run test jobs for a short time and at a higher priority. I
> would like users to be able to apply this QOS to their jobs using "--qos
> test".

QOS are defined in Slurm through the sacctmgr command:
https://slurm.schedmd.com/sacctmgr.html#lbAW

$ sacctmgr create qos test <... parameters>

If you want this QOS to have a higher priority than other QOS, set its 'Priority' option higher than the rest. You could also consider setting PreemptType=preempt/qos if you want jobs from higher-priority QOS to preempt jobs from lower ones:
https://slurm.schedmd.com/preempt.html

> The idea of the test QOS is for users to run test jobs in the above
> partitions -- let's say for a max wall time of 2 hours over 2 compute nodes.

These are the related QOS options to address this:

MaxWall
    Maximum wall clock time each job is able to use.
GrpWall
    Maximum wall clock time running jobs are able to be allocated in aggregate for this QOS. If this limit is reached, submission requests will be denied and the running jobs will be killed.

plus MaxTRESPerUser=node=X and/or GrpTRES=node=Y.

> -- Users can only run one job at once and have one job in the queue (for
> example).

MaxJobsPerUser and/or MaxSubmitJobsPerUser

> -- The global total number of test jobs also needs to be limited.

GrpJobs
    Maximum number of running jobs in aggregate for this QOS.

> -- Furthermore, of course, the test QOS needs to be able to override the
> partition QOS on gpu and gtx1080 (Using OverPartQOS ?).

Flags
    PartitionMaxNodes
        If set, jobs using this QOS will be able to override the requested partition's MaxNodes limit.
    PartitionMinNodes
        If set, jobs using this QOS will be able to override the requested partition's MinNodes limit.
    OverPartQOS
        If set, jobs using this QOS will be able to override any limits used by the requested partition's QOS limits.
    PartitionTimeLimit
        If set, jobs using this QOS will be able to override the requested partition's TimeLimit.

Additional info: https://slurm.schedmd.com/resource_limits.html

PS: you opened the bug with Slurm version 17.02.8. I'd take this opportunity to encourage you to upgrade to the latest stable version as soon as possible, for two reasons:

1. Many bugs have been fixed since 17.02.8 (and I remember fixes related to the enforcement of all these limits).
2. Version 18.08 is planned to be released in August, and 17.02 will then be unsupported.

Please let me know if you have further questions. Thank you.
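Putting the options above together, a test QOS along the lines David describes could be created roughly as follows. This is a sketch: the Priority value, the limit values (2 hours, 2 nodes, 20 aggregate jobs) and the user name "someuser" are illustrative, and the exact sacctmgr syntax should be checked against your Slurm version's man page.

```shell
# Create the QOS with per-job, per-user and aggregate limits
# (values are examples -- adjust to taste):
sacctmgr add qos test \
    Priority=1000 \
    MaxWall=02:00:00 \
    MaxTRESPerUser=node=2 \
    MaxJobsPerUser=1 \
    MaxSubmitJobsPerUser=1 \
    GrpJobs=20 \
    Flags=OverPartQOS

# Grant the QOS to a user (hypothetical name) so --qos works:
sacctmgr modify user someuser set qos+=test

# Users then submit test jobs with:
sbatch --qos=test --time=01:00:00 --nodes=2 job.sh
```

With Flags=OverPartQOS set, the limits attached to the partition's own QOS (the gpu QOS on the gpu and gtx1080 partitions) are overridden by the limits of the test QOS for jobs that request it.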
Hello,

Thank you for your reply. I've defined a test QOS with some initial settings and I'm experimenting. So far so good.

Thank you for your note regarding the future version of Slurm and the timescale. We'll bear that in mind over the next weeks. We are still developing the cluster, so upgrading Slurm is better done now than later (as well as the support issue).

Could I ask an additional question in this ticket, or should I really open a new report? I'm just thinking that this issue may be related to the version of Slurm. It looks like I cannot easily get the memory usage from jobs (completed or running). For example...

Running job:

[root@blue53 ~]# sacct -X -j 84448 --format=MaxRSS   ==> no value reported for MaxRSS

Completed job:

[root@blue53 ~]# seff 83169
Job ID: 83169
Cluster: i5
User/Group: yc11e14/mm
State: CANCELLED (exit code 0)
Nodes: 2
Cores per node: 40
CPU Utilized: 101-14:45:11
CPU Efficiency: 98.20% of 103-11:29:20 core-walltime
Memory Utilized: 46.22 GB (estimated maximum)
Memory Efficiency: 0.00% of 0.00 MB (0.00 MB/node)

Feel free to start a new report/conversation. It is odd that I am getting undefined memory usage figures (e.g. MaxRSS) from sacct.

JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/slurmdbd

Best regards,
David
Hello,

Thank you again for your advice regarding a "test" (high priority) QOS.

One interesting question comes to mind, please. We are testing the QOS at the moment, and will be applying the QOS to "exclusive" partitions in our cluster. Can a QOS override the "exclusive" setting on a partition, so that user (small) test jobs are potentially grouped together on a small subset of the nodes in the cluster? In other words, making the partition act with a "shared" setting, if that makes sense.

I'm on leave next week, but please feel free to cc in my colleagues David Hempston & Keith Daly, who may have time to continue the testing next week.

Best regards,
David
(In reply to David Baker from comment #2)
> It looks like I cannot easily get the memory usage from jobs
> (completed or running). For example…
>
> [root@blue53 ~]# sacct -X -j 84448 --format=MaxRSS ==> no value reported
> for MaxRSS
>
> It is odd that I am getting undefined memory usage figures (eg MaxRSS)
> from sacct.

In order to display statistics and status information for a running job/step, you can use the 'sstat' command. Please make sure to read the "-j" and "-a" options in the sstat man page:

alex@ibiza:~/t$ sbatch --mem=4096 --wrap "srun ./mem_eater"
Submitted batch job 20006
alex@ibiza:~/t$ sstat -j 20006 -o jobid,maxrss
       JobID     MaxRSS
------------ ----------
20006.0        1536116K
alex@ibiza:~/t$ sstat -j 20006 -a -o jobid,maxrss
       JobID     MaxRSS
------------ ----------
20006.0        2048120K
alex@ibiza:~/t$ sstat -j 20006.batch -o jobid,maxrss
       JobID     MaxRSS
------------ ----------
20006.batch       4084K
alex@ibiza:~/t$

Note that if your application runs directly inside an allocation, without running inside a step (i.e. launched via srun), then you'll need to request the status information of your <jobid>.batch step. Otherwise, you can query all the steps with "-a" or the specific <jobid>[.<stepid>].

For finished jobs you can use 'sacct':

alex@ibiza:~/t$ sacct -j 20002 -o jobid,maxrss
       JobID     MaxRSS
------------ ----------
20002
20002.batch       2891K
20002.0        4051185K
alex@ibiza:~/t$ sacct -j 20002 -o jobid,maxrss -X
       JobID     MaxRSS
------------ ----------
20002
alex@ibiza:~/t$

Note that if you use -X, only statistics relevant to the job allocation itself are shown, and since MaxRSS is a step metric, nothing will be displayed for it with this option. If you query the disaggregated information about all the job steps, then each step will display its own MaxRSS. Does it make sense?

(In reply to David Baker from comment #3)
> Can a QOS override the "exclusive" setting on a partition so that
> user (small) test jobs are potentially grouped together a small subset of
> the nodes in the cluster? In other words, making the partition act with a
> "shared" setting, if that makes sense.

There's no flag at the QOS level that can override a partition's OverSubscribe behavior. At most, what you can do is change/force a job's --oversubscribe/--exclusive option at submit/update time, based on the job's QOS, through a Job Submit Plugin:
https://slurm.schedmd.com/job_submit_plugins.html

> I'm on leave next week, but please feel free to cc in my colleagues David
> Hempston & Keith Daly who may have time to continue the testing next week.

I don't have these two people listed among SchedMD's supported users, and I don't know their e-mail addresses.

> Best regards,
> David
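The job-submit-plugin route could look roughly like this (a sketch only, assuming the job_submit/lua plugin is built for your installation; the file path, QOS name and field values are illustrative, and the job_desc field names should be verified against your version's job_submit/lua documentation before use):

```shell
# slurm.conf fragment -- enable the Lua job submit plugin:
#   JobSubmitPlugins=lua
#
# Install a minimal job_submit.lua alongside slurm.conf:
cat > /etc/slurm/job_submit.lua <<'EOF'
-- Force jobs submitted with --qos=test to oversubscribe nodes,
-- overriding any --exclusive request (sketch; verify field names).
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.qos == "test" then
        job_desc.shared = 1  -- 1 == oversubscribe ("shared")
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
EOF

# Make slurmctld pick up the change:
scontrol reconfigure
```

This only changes the per-job exclusive/oversubscribe request; the partition's ExclusiveUser and OverSubscribe settings themselves are untouched.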
Hello,

Thank you again. My apologies that I haven't followed things up for a while. I have just got back from leave and I'm sorting through my emails.

I'm running a few tests this morning, and I see that your advice about "seeing" memory requirements is really useful, thank you. I'm still to muse over the QOS issue. I'll get back on that if I need to. Thank you for that.

Best regards,
David
Hi David. Can we go ahead and close this bug? Thanks.
Hi,

Apologies for having left this ticket open. We have sorted out our high priority "test" QOS and it is working as expected.

Best regards,
David
(In reply to David Baker from comment #7) > Hi, > > Apologies to have left this ticket open. We have sorted out our high > priority “test” QOS and it is working as expected. > > Best regards, > David Glad to know. Thanks.