| Summary: | Issue with jobs in a free QOS queue getting counted against a user's CPU minute balance | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Lee Reynolds <Lee.Reynolds> |
| Component: | Scheduling | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | anthony.delsorbo |
| Version: | 17.11.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=7215 | ||
| Site: | ASU | Alineos Sites: | --- |
| Version Fixed: | 19.05.rc1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, mybalance |
||
|
Description
Lee Reynolds
2018-07-16 14:26:51 MDT
Hi Lee Reynolds,
I have looked over your slurm.conf and the information you provided. I do have some questions. I would like to know how you are querying/reporting these metric values.
CPU_MINUTES_ALLOCATED
CPU_MINUTES_USED
CPU_MINUTES_AVAILABLE
Are you looking at "scontrol show assoc" to determine these values?
Would you also send the output of the following:
sacctmgr show assoc user=tmarianc
sacctmgr show assoc user=leereyno
Kind regards,
Jason

We're using sshare to get the metrics.
I’m including the shell script that reports these metric values as an attachment. In case it doesn’t come through, we’re doing the following:
CPU_MINUTES_ALLOCATED=$(sshare -n -P -A $DEFAULT_ACCOUNT -u $USER -o GrpTRESMins | sed 1d | awk -F= '{ print $2 }')
CPU_MINUTES_USED=$(sshare -n -P -A $DEFAULT_ACCOUNT -u $USER -o GrpTRESRaw | sed 1d | awk -F'cpu=' '{ print $2 }' | awk -F, '{ print $1 }')
CPU_MINUTES_AVAILABLE=$(( CPU_MINUTES_ALLOCATED - CPU_MINUTES_USED ))
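To make the field extraction above concrete, here is a minimal sketch that runs the same awk steps against sample strings instead of live sshare output. The helper names `extract_limit` and `extract_used` and all sample values are made up for illustration; the assumed field shapes (`cpu=N` for GrpTRESMins, a comma-separated `cpu=N,mem=N,...` list for GrpTRESRaw) are inferred from the pipelines above, not copied from real output.

```shell
#!/bin/sh
# Hypothetical helpers mirroring the awk steps in the script above.

# GrpTRESMins is assumed to look like "cpu=1500010": take the text after "=".
extract_limit() {
    printf '%s\n' "$1" | awk -F= '{ print $2 }'
}

# GrpTRESRaw is assumed to look like "cpu=1234,mem=5678,...":
# take the text after "cpu=", then cut at the first comma.
extract_used() {
    printf '%s\n' "$1" | awk -F'cpu=' '{ print $2 }' | awk -F, '{ print $1 }'
}

# Made-up sample values, not real sshare output.
limit=$(extract_limit "cpu=1500000")
used=$(extract_used "cpu=250000,mem=1024000,node=400")
echo "available: $(( limit - used ))"
```

With these sample values the sketch prints `available: 1250000`. If GrpTRESMins ever listed more than one TRES, the plain `-F=` split would return `1500000,mem` rather than a number, so the cpu-specific split used for GrpTRESRaw is likely the safer pattern for both fields.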
sacctmgr -p show assoc user=tmarianc | pp
Cluster cluster
Account sozkan
User tmarianc
Partition
Share 1
GrpJobs
GrpTRES
GrpSubmit
GrpWall
GrpTRESMins cpu=1500010
MaxJobs 50
MaxTRES
MaxTRESPerNode
MaxSubmit 100
MaxWall
MaxTRESMins
QOS normal,sulcgpu1,wildfire
Def QOS normal
GrpTRESRunMins
sacctmgr -p show assoc user=leereyno | pp
Cluster cluster
Account arc
User leereyno
Partition
Share 1
GrpJobs
GrpTRES
GrpSubmit
GrpWall
GrpTRESMins cpu=1500000
MaxJobs
MaxTRES
MaxTRESPerNode
MaxSubmit
MaxWall
MaxTRESMins
QOS admin,cidsegpu1,debug,loantest,normal,physicsgpu1,polite,sulcgpu1,wildfire
Def QOS normal
GrpTRESRunMins
--
Lee Reynolds
https://rcstatus.asu.edu
rchelp@asu.edu
Created attachment 7367 [details]
mybalance

Hi Lee,
I wanted to send you an update about this case. I have a patch which we are reviewing internally, and I will let you know when this is reviewed/accepted.
-Jason

Anything new on this issue?
Lee Reynolds
Systems Analyst Principal
ASU Research Computing

Hi Lee,
I have been discussing this internally to make sure I have the right solution to the issue. My previous suggestion revealed another aspect to this issue that I did not notice at first. Unfortunately, we believe the changes that will be needed will only be suitable for 19.05+.
What I have found is that the UsageFactor should be applied to the internal wall TRES usage used for limits and determining priority. Currently, it is only used for determining priority. I am currently looking into modifying the accounting policy to take into account UsageFactor.
-Jason

I will be out of the office and in a remote location without internet access from Friday October 4th until Sunday October 14th. If you need help with Research Computing resources such as Agave, Saguaro or Ocotillo, please submit a service request through our support portal: https://rcstatus.asu.edu/servicerequest/

Hi Lee,
I wanted to give you an update. I have a partial patch completed for this issue which does address the problem you are seeing. I did run into another instance where usage factor could be considered so I am looking into this. As mentioned previously, this will be in the 19.05 release since it changes functionality.

*** Ticket 5944 has been marked as a duplicate of this ticket. ***

Hi Lee Reynolds,
This has been checked in:
https://github.com/SchedMD/slurm/commit/43ef4f7535d89250670f7ab047d93d45be6e1308
Expanded usagefactor to match the documentation
Usagefactor matches the documentation and now multiplies TRES time limits and usage.
I am closing this out as resolved. Please feel free to update the ticket if you have further questions.
Note that you can also use the usagefactor of 0 which will also apply to the TRES time and allow a credential under this QOS to run when their TRES limit would otherwise prevent the job from running.
-Jason |
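The effect of the fix can be illustrated with a little arithmetic. This is a simplified sketch, not Slurm's accounting code: it assumes the usage charged against a TRES-minutes limit is the raw CPU minutes multiplied by the QOS UsageFactor, and the function name `charged_cpu_minutes` and all numbers are made up.

```shell
#!/bin/sh
# Hypothetical model: charged usage = raw CPU minutes * QOS UsageFactor.
charged_cpu_minutes() {
    awk -v raw="$1" -v factor="$2" 'BEGIN { printf "%d\n", raw * factor }'
}

raw=6000   # made-up job: 100 CPUs for 60 minutes

echo "UsageFactor=1   -> $(charged_cpu_minutes "$raw" 1)"
echo "UsageFactor=0.5 -> $(charged_cpu_minutes "$raw" 0.5)"
echo "UsageFactor=0   -> $(charged_cpu_minutes "$raw" 0)"
```

Under this model a QOS with a UsageFactor of 0 charges nothing against the balance, which is the behavior the reporter expected from a free QOS. On 19.05 or later the factor would presumably be set with something like `sacctmgr modify qos <name> set UsageFactor=0`, where the QOS name is site-specific and not given in this ticket.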