Created attachment 7317 [details]
slurm.conf

We're running Slurm version 17.11.7-1 on CentOS 7.4. I'm including our slurm.conf file as an attachment.

Here is some background to understand the nature of our problem. Our system is configured so that every customer receives 150000 CPU minutes each month, and these minutes refresh every month. The CPU minutes are consumed in our default QOS. We also have a secondary QOS called wildfire where users can run jobs. When a job runs in the wildfire QOS, it does not count against the user's CPU minute allocation: that QOS's UsageFactor parameter is set to 0.0. Jobs in the wildfire QOS are preemptable by jobs running in the default QOS.

When a user submits a job in the normal QOS and there are not enough CPU minutes available to run it, the job is put into a pending state with AssocGrpCPUMinutesLimit as the reason code. These jobs are then automatically updated by a shell script running in the background so that they run in the wildfire QOS instead. All of this is working.
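The background requeue step could be sketched roughly as follows. This is a hypothetical sketch, not our actual script; for illustration it echoes the scontrol command it would run rather than executing it:

```shell
# Hypothetical sketch of the background step: read "JOBID REASON" pairs
# (as produced by: squeue -h -t PENDING -o "%i %r") and move jobs that
# are blocked on AssocGrpCPUMinutesLimit to the wildfire QOS.
requeue_to_wildfire() {
    awk '$2 == "AssocGrpCPUMinutesLimit" { print $1 }' |
    while read -r jobid; do
        # On the real cluster this would execute:
        #   scontrol update JobId="$jobid" QOS=wildfire
        echo "scontrol update JobId=$jobid QOS=wildfire"
    done
}
```

In practice this would be fed live data, e.g. `squeue -h -t PENDING -o "%i %r" | requeue_to_wildfire`, from cron or a loop.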
Here is the problem we're having. Let's say a user has jobs running on the cluster in the wildfire QOS:

JOBID  PARTITION QOS      NAME       USER     STATE   TIME     TIME_LIMIT CPUS NODELIST(REASON)    GRES
316385 parallel  wildfire Production tmarianc RUNNING 21:06:03 4-00:00:00 64   cg15-[3,7-8,14]     (null)
315945 parallel  wildfire Pull       tmarianc RUNNING 23:13:14 4-00:00:00 32   cg13-[14,16],cg14-2 (null)
315926 parallel  wildfire Pull       tmarianc RUNNING 23:18:07 4-00:00:00 32   cg11-[9,13],cg12-6  (null)

This user also has quite a few minutes remaining in his or her allocation:

CPU_MINUTES_ALLOCATED : 1500010
CPU_MINUTES_USED      : 1277570
CPU_MINUTES_AVAILABLE : 222440

If this user submits a new job requesting a single core with a run time of 5 minutes, the job will be flagged as pending due to AssocGrpCPUMinutesLimit and then moved to the wildfire QOS (job id 318309):

JOBID  PARTITION QOS      NAME         USER     STATE   TIME     TIME_LIMIT CPUS NODELIST(REASON)    GRES
318309 serial    wildfire _interactive tmarianc RUNNING 0:00     5:00       1    cg17-11             (null)
316385 parallel  wildfire Production   tmarianc RUNNING 21:16:59 4-00:00:00 64   cg15-[3,7-8,14]     (null)
315945 parallel  wildfire Pull         tmarianc RUNNING 23:24:10 4-00:00:00 32   cg13-[14,16],cg14-2 (null)
315926 parallel  wildfire Pull         tmarianc RUNNING 23:29:03 4-00:00:00 32   cg11-[9,13],cg12-6  (null)

However, when a user with a low CPU minute balance who is not running any jobs submits a new job, the new job will not be put into a pending state, but will run normally:
CPU_MINUTES_ALLOCATED : 100
CPU_MINUTES_USED      : 11
CPU_MINUTES_AVAILABLE : 89

JOBID  PARTITION QOS    NAME         USER     STATE   TIME TIME_LIMIT CPUS NODELIST(REASON) GRES
318303 serial    normal _interactive leereyno RUNNING 0:14 5:00       1    cg17-9           (null)

Our suspicion is that Slurm's accounting mechanism sees the jobs running in the wildfire QOS and counts the time those jobs are expected to consume against the user's available balance of CPU minutes, even though the UsageFactor value for that QOS is set to 0.0. What can we do to resolve this issue?

--
Lee Reynolds
Systems Analyst Principal
ASU Research Computing
https://rcstatus.asu.edu
GWC-558
480.965.9460 (Office)
480.516.7622 (Mobile)
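As a back-of-envelope check of that suspicion (my own arithmetic, using the CPU counts and 4-day time limits from the squeue output above): if Slurm charges the running wildfire jobs their full time limits, the projected usage by itself dwarfs the 222440 CPU minutes available, which would explain why even a 1-core, 5-minute job pends.

```shell
# Projected CPU-minutes if the three running wildfire jobs are charged
# their full time limits (4-00:00:00 each, at 64 + 32 + 32 CPUs).
limit_min=$(( 4 * 24 * 60 ))   # 4 days in minutes = 5760
projected=$(( 64 * limit_min + 32 * limit_min + 32 * limit_min ))
available=222440
echo "projected=$projected available=$available"
# projected=737280 available=222440
```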
Hi Lee Reynolds,

I have looked over your slurm.conf and the information you provided. I do have some questions. I would like to know how you are querying/reporting these metric values:

CPU_MINUTES_ALLOCATED
CPU_MINUTES_USED
CPU_MINUTES_AVAILABLE

Are you looking at "scontrol show assoc" to determine these values? Would you also send the output of the following?

sacctmgr show assoc user=tmarianc
sacctmgr show assoc user=leereyno

Kind regards,
Jason
We're using sshare to get the metrics. I'm including the shell script that reports these metric values as an attachment. In case it doesn't come through, we're doing the following:

CPU_MINUTES_ALLOCATED=$(sshare -n -P -A $DEFAULT_ACCOUNT -u $USER -o GrpTRESMins | sed 1d | awk -F= '{ print $2 }')
CPU_MINUTES_USED=$(sshare -n -P -A $DEFAULT_ACCOUNT -u $USER -o GrpTRESRaw | sed 1d | awk -F'cpu=' '{ print $2 }' | awk -F, '{ print $1 }')
CPU_MINUTES_AVAILABLE=$(( CPU_MINUTES_ALLOCATED - CPU_MINUTES_USED ))

sacctmgr -p show assoc user=tmarianc | pp
Cluster          cluster
Account          sozkan
User             tmarianc
Partition
Share            1
GrpJobs
GrpTRES
GrpSubmit
GrpWall
GrpTRESMins      cpu=1500010
MaxJobs          50
MaxTRES
MaxTRESPerNode
MaxSubmit        100
MaxWall
MaxTRESMins
QOS              normal,sulcgpu1,wildfire
Def QOS          normal
GrpTRESRunMins

sacctmgr -p show assoc user=leereyno | pp
Cluster          cluster
Account          arc
User             leereyno
Partition
Share            1
GrpJobs
GrpTRES
GrpSubmit
GrpWall
GrpTRESMins      cpu=1500000
MaxJobs
MaxTRES
MaxTRESPerNode
MaxSubmit
MaxWall
MaxTRESMins
QOS              admin,cidsegpu1,debug,loantest,normal,physicsgpu1,polite,sulcgpu1,wildfire
Def QOS          normal
GrpTRESRunMins

--
Lee Reynolds
https://rcstatus.asu.edu
rchelp@asu.edu
Created attachment 7367 [details]
mybalance

--
Lee Reynolds
https://rcstatus.asu.edu
rchelp@asu.edu
Hi Lee,

I wanted to send you an update on this case. I have a patch that we are reviewing internally, and I will let you know when it is reviewed/accepted.

-Jason
Anything new on this issue?

Lee Reynolds
Systems Analyst Principal
ASU Research Computing
T 480-965-9460 | E Lee.Reynolds@asu.edu
researchcomputing.asu.edu | research.asu.edu | rcstats.asu.edu
Hi Lee,

I have been discussing this internally to make sure I have the right solution to the issue. My previous suggestion revealed another aspect of the problem that I did not notice at first. Unfortunately, we believe the changes that will be needed will only be suitable for 19.05+.

What I have found is that the UsageFactor should be applied to the internal wall TRES usage used both for enforcing limits and for determining priority. Currently, it is only applied when determining priority. I am now looking into modifying the accounting policy code to take UsageFactor into account.

-Jason
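To illustrate the intended behavior with a simplified model of my own (not Slurm's actual code): once UsageFactor also scales the usage charged for limit enforcement, a QOS with UsageFactor=0.0 contributes nothing toward GrpTRESMins, while the default QOS charges at full rate.

```shell
# charged = cpus * minutes * usage_factor, computed via awk since the
# shell's arithmetic is integer-only and UsageFactor is a float.
charged_cpu_minutes() {
    awk -v c="$1" -v m="$2" -v f="$3" 'BEGIN { print c * m * f }'
}

# A 64-CPU, 4-day wildfire job (UsageFactor 0.0) should charge nothing;
# a 1-CPU, 5-minute job in the normal QOS (UsageFactor 1.0) charges 5.
charged_cpu_minutes 64 $(( 4 * 24 * 60 )) 0.0   # prints 0
charged_cpu_minutes 1 5 1.0                     # prints 5
```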
Hi Lee,

I wanted to give you an update. I have a partial patch completed for this issue which does address the problem you are seeing. I did run into another place where the usage factor could be considered, so I am looking into that as well. As mentioned previously, this fix will be in the 19.05 release, since it changes functionality.
*** Ticket 5944 has been marked as a duplicate of this ticket. ***
Hi Lee Reynolds,

This has been checked in:

https://github.com/SchedMD/slurm/commit/43ef4f7535d89250670f7ab047d93d45be6e1308
"Expanded usagefactor to match the documentation"

UsageFactor now matches the documentation and multiplies both TRES time limits and usage. I am closing this out as resolved. Please feel free to update the ticket if you have further questions.

Note that you can also use a UsageFactor of 0, which will likewise apply to the TRES time and allow an association under that QOS to run when its TRES limit would otherwise prevent the job from running.

-Jason
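For reference, a QOS's UsageFactor can be set and verified with sacctmgr. This is a generic administrative example (the wildfire QOS name is taken from the site's setup above); it is an untested sketch, not output from a live cluster:

```shell
# Set UsageFactor to 0.0 on the wildfire QOS (-i skips the confirmation
# prompt), then display it to confirm.
sacctmgr -i modify qos wildfire set UsageFactor=0.0
sacctmgr show qos wildfire format=Name,UsageFactor
```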