Ticket 5435 - Issue with jobs in a free QOS queue getting counted against a user's CPU minute balance
Summary: Issue with jobs in a free QOS queue getting counted against a user's CPU minu...
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.11.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Jason Booth
QA Contact:
URL:
Duplicates: 5944
Depends on:
Blocks:
 
Reported: 2018-07-16 14:26 MDT by Lee Reynolds
Modified: 2019-06-12 10:05 MDT
1 user

See Also:
Site: ASU
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 19.05.rc1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (11.20 KB, text/plain)
2018-07-16 14:26 MDT, Lee Reynolds
Details
mybalance (1.42 KB, application/octet-stream)
2018-07-20 17:23 MDT, Lee Reynolds
Details

Description Lee Reynolds 2018-07-16 14:26:51 MDT
Created attachment 7317 [details]
slurm.conf

We’re running Slurm version 17.11.7-1 on CentOS 7.4.

I’m including our slurm.conf file as an attachment.

Here is some background information to understand the nature of our problem.

Our system is configured so that every customer receives 150000 CPU minutes, and this allocation refreshes every month.

These CPU minutes are consumed in our default QOS queue.

We also have a secondary QOS queue called Wildfire where users can run jobs.  When a job runs in the wildfire QOS queue, it does not count against the user’s CPU minute allocation because the QOS’s UsageFactor parameter is set to 0.0.  Jobs in this QOS queue are preemptable by jobs running in the default queue.

When a user submits a job to the cluster in the normal QOS queue and there are not enough CPU minutes available to run it, the job is put into a pending state with AssocGrpCPUMinutesLimit as the reason code.  A shell script running in the background then automatically moves these jobs into the wildfire QOS queue instead.

All of this is working.
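For illustration, a minimal sketch of such a background mover script might look like the following. This is a hypothetical reconstruction, not our actual script; the `list_stuck_jobs` helper name is made up, and the squeue/scontrol invocation is left as a comment since it only makes sense on a live cluster:

```shell
#!/bin/sh
# Hypothetical sketch of the background mover script (not the site's real
# one). list_stuck_jobs reads "JOBID|REASON" lines and prints the job IDs
# whose pending reason is AssocGrpCPUMinutesLimit.
list_stuck_jobs() {
    awk -F'|' '$2 == "AssocGrpCPUMinutesLimit" { print $1 }'
}

# On a live cluster this would be driven by squeue and scontrol, e.g.:
#   squeue -h -t PD -o '%i|%r' | list_stuck_jobs | while read -r jobid; do
#       scontrol update JobId="$jobid" QOS=wildfire
#   done
```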

Here is the problem we’re having:

Let’s say a user has jobs running on the cluster in the wildfire QOS:

JOBID PARTITION    QOS      NAME               USER       STATE      TIME     TIME_LIMIT   CPUS   NODELIST(REASON)     GRES
316385 parallel     wildfire Production         tmarianc   RUNNING    21:06:03 4-00:00:00   64     cg15-[3,7-8,14]      (null)              
315945 parallel     wildfire Pull               tmarianc   RUNNING    23:13:14 4-00:00:00   32     cg13-[14,16],cg14-2  (null)              
315926 parallel     wildfire Pull               tmarianc   RUNNING    23:18:07 4-00:00:00   32     cg11-[9,13],cg12-6   (null)

This user also has quite a few minutes remaining in his or her allocation:

CPU_MINUTES_ALLOCATED          : 1500010             
CPU_MINUTES_USED               : 1277570             
CPU_MINUTES_AVAILABLE          : 222440

If this user submits a new job requesting a single core with a five-minute run time, the job is flagged as pending due to AssocGrpCPUMinutesLimit and then moved to the wildfire QOS queue (job ID 318309):

JOBID PARTITION    QOS      NAME               USER       STATE      TIME     TIME_LIMIT   CPUS   NODELIST(REASON)     GRES
318309 serial       wildfire _interactive       tmarianc   RUNNING    0:00     5:00         1      cg17-11              (null)              
316385 parallel     wildfire Production         tmarianc   RUNNING    21:16:59 4-00:00:00   64     cg15-[3,7-8,14]      (null)              
315945 parallel     wildfire Pull               tmarianc   RUNNING    23:24:10 4-00:00:00   32     cg13-[14,16],cg14-2  (null)              
315926 parallel     wildfire Pull               tmarianc   RUNNING    23:29:03 4-00:00:00   32     cg11-[9,13],cg12-6   (null)

However, when a user with a low CPU minute balance who is not running any jobs submits a new job, that job is not put into a pending state but runs normally.

CPU_MINUTES_ALLOCATED          : 100                 
CPU_MINUTES_USED               : 11                  
CPU_MINUTES_AVAILABLE          : 89

JOBID PARTITION    QOS      NAME               USER       STATE      TIME     TIME_LIMIT   CPUS   NODELIST(REASON)     GRES
318303 serial       normal   _interactive       leereyno   RUNNING    0:14     5:00         1      cg17-9               (null)

Our suspicion is that Slurm’s accounting mechanisms see the jobs running in the wildfire QOS queue and count the time those jobs are expected to consume against the user’s available balance of CPU minutes, even though the UsageFactor for this QOS queue is set to 0.0.
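The arithmetic is consistent with this suspicion. Assuming the limit check counts each running job’s full TIME_LIMIT against the association (our assumption, not something we have confirmed), the wildfire jobs alone reserve far more than the 222440 CPU minutes the user has left:

```shell
# CPU minutes implied by TIME_LIMIT for the running wildfire jobs above.
# A 4-day limit is 4 * 24 * 60 = 5760 minutes per CPU.
echo $(( 64 * 5760 ))               # job 316385 alone: 368640
echo $(( (64 + 32 + 32) * 5760 ))   # all three jobs:   737280
```

Either figure exceeds the 222440 CPU minutes available, which would explain why every new submission is flagged AssocGrpCPUMinutesLimit.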

What can we do to resolve this issue?


-- 
Lee Reynolds
Systems Analyst Principal
ASU Research Computing
https://rcstatus.asu.edu


GWC-558
480.965.9460 (Office)
480.516.7622 (Mobile)
Comment 1 Jason Booth 2018-07-16 16:50:06 MDT
Hi Lee Reynolds,

 I have looked over your slurm.conf and the information you provided. I do have some questions. I would like to know how you are querying/reporting these metric values.

CPU_MINUTES_ALLOCATED                      
CPU_MINUTES_USED                       
CPU_MINUTES_AVAILABLE  

Are you looking at "scontrol show assoc" to determine these values?

Would you also send the output of the following:

sacctmgr show assoc user=tmarianc
sacctmgr show assoc user=leereyno

Kind regards,
Jason
Comment 2 Lee Reynolds 2018-07-20 15:46:28 MDT
We’re using sshare to get the metrics.

I’m including the shell script that reports these metric values as an attachment.  In case it doesn’t come through, we’re doing the following:

CPU_MINUTES_ALLOCATED=$(sshare -n -P -A $DEFAULT_ACCOUNT -u $USER -o GrpTRESMins | sed 1d | awk -F= '{ print $2 }')

CPU_MINUTES_USED=$(sshare -n -P -A $DEFAULT_ACCOUNT -u $USER -o GrpTRESRaw | sed 1d | awk -F'cpu=' '{ print $2 }' | awk -F, '{ print $1 }')

CPU_MINUTES_AVAILABLE=$(( CPU_MINUTES_ALLOCATED - CPU_MINUTES_USED ))
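To show what the first pipeline extracts, here it is run against a canned two-line sample (the sample is made up; real `sshare -P` output has more columns, but the GrpTRESMins field parses the same way):

```shell
# Feed a canned GrpTRESMins column through the same sed/awk stages as the
# script above: sed 1d drops the header row, awk keeps the value after '='.
sample='GrpTRESMins
cpu=1500010'
printf '%s\n' "$sample" | sed 1d | awk -F= '{ print $2 }'   # prints 1500010
```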

sacctmgr -p show assoc user=tmarianc | pp

Cluster                cluster
Account             sozkan
User                     tmarianc
Partition
Share                   1
GrpJobs
GrpTRES
GrpSubmit
GrpWall
GrpTRESMins                   cpu=1500010
MaxJobs             50
MaxTRES
MaxTRESPerNode
MaxSubmit                       100
MaxWall
MaxTRESMins
QOS                     normal,sulcgpu1,wildfire
Def QOS             normal
GrpTRESRunMins



sacctmgr -p show assoc user=leereyno | pp

Cluster                cluster
Account             arc
User                     leereyno
Partition
Share                   1
GrpJobs
GrpTRES
GrpSubmit
GrpWall
GrpTRESMins                   cpu=1500000
MaxJobs
MaxTRES
MaxTRESPerNode
MaxSubmit
MaxWall
MaxTRESMins
QOS                   admin,cidsegpu1,debug,loantest,normal,physicsgpu1,polite,sulcgpu1,wildfire
Def QOS             normal
GrpTRESRunMins





Comment 3 Lee Reynolds 2018-07-20 17:23:29 MDT
Created attachment 7367 [details]
mybalance

Comment 6 Jason Booth 2018-08-16 15:23:23 MDT
Hi Lee,

 I wanted to send you an update about this case. I have a patch which we are reviewing internally, and I will let you know when this is reviewed/accepted.

-Jason
Comment 10 Lee Reynolds 2018-10-03 14:53:01 MDT
Anything new on this issue?



Comment 11 Jason Booth 2018-10-03 15:06:05 MDT
Hi Lee,

 I have been discussing this internally to make sure I have the right solution to the issue. My previous suggestion revealed another aspect to this issue that I did not notice at first. Unfortunately, we believe the changes that will be needed will only be suitable for 19.05+. What I have found is that the UsageFactor should be applied to the internal wall TRES usage used for limits and determining priority. Currently, it is only used for determining priority.

 I am currently looking into modifying the accounting policy to take into account UsageFactor.

-Jason
Comment 13 Lee Reynolds 2018-10-09 15:05:57 MDT
I will be out of the office and in a remote location without internet access from Friday October 4th until Sunday October 14th.
If you need help with Research Computing resources such as Agave, Saguaro or Ocotillo, please submit a service request through our support portal:
https://rcstatus.asu.edu/servicerequest/
Comment 21 Jason Booth 2018-11-14 15:30:20 MST
Hi Lee,
 I wanted to give you an update. I have a partial patch completed for this issue which addresses the problem you are seeing. I also ran into another instance where UsageFactor could be considered, so I am looking into that. As mentioned previously, this will be in the 19.05 release since it changes functionality.
Comment 29 Jason Booth 2019-04-01 22:39:05 MDT
*** Ticket 5944 has been marked as a duplicate of this ticket. ***
Comment 34 Jason Booth 2019-04-30 10:43:01 MDT
Hi Lee Reynolds,

This has been checked in:

https://github.com/SchedMD/slurm/commit/43ef4f7535d89250670f7ab047d93d45be6e1308

    Expanded usagefactor to match the documentation
    
    Usagefactor matches the documentation and now multiplies TRES time
    limits and usage.

I am closing this out as resolved. Please feel free to update the ticket if you have further questions. Note that with this change a UsageFactor of 0 also applies to TRES time, so an association running under such a QOS can start jobs even when its TRES limit would otherwise prevent them from running.
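For admins applying this on 19.05+, the relevant knob is the QOS-level UsageFactor. A sketch of the sacctmgr commands (shown for illustration; these need a live Slurm installation and assume the QOS is named wildfire, as in this ticket):

```shell
# Not runnable outside a Slurm cluster; illustration only.
# Confirm the current usage factor on the wildfire QOS:
sacctmgr show qos wildfire format=Name,UsageFactor
# Set it to zero so wildfire usage neither charges the association
# nor counts toward its TRES limits (behavior as of 19.05):
sacctmgr modify qos wildfire set UsageFactor=0.0
```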

-Jason