Ticket 9828 - time QOS charged incorrectly
Summary: time QOS charged incorrectly
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 19.05.6
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Ben Roberts
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-09-15 14:26 MDT by Todd Merritt
Modified: 2020-09-23 10:25 MDT

See Also:
Site: U of AZ
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmdbd log (7.34 KB, text/plain)
2020-09-18 13:33 MDT, Todd Merritt

Description Todd Merritt 2020-09-15 14:26:48 MDT
I've configured our accounts/qos as suggested in ticket 9163. However, now when I run a job using the higher priority QoS, it's charging the account rather than the QoS, viz:

tmerritt@junonia:~/puma $ scontrol show assoc flags=assoc account=parent_49
Current Association Manager state

Association Records

ClusterName=puma Account=parent_49 UserName= Partition= Priority=0 ID=5806
    SharesRaw/Norm/Level/Factor=1/0.00/1175/0.00
    UsageRaw/Norm/Efctv=0.00/0.00/0.00
    ParentAccount=uarizona(40) Lft=8645 DefAssoc=No
    GrpJobs=2000(19) GrpJobsAccrue=N(0)
    GrpSubmitJobs=2000(19) GrpWall=N(0.00)
    GrpTRES=cpu=3290(1824),mem=10485760(9800770),energy=N(0),node=N(19),billing=N(1824),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=8(0)
    GrpTRESMins=cpu=4200000(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)
    GrpTRESRunMins=cpu=N(1444560),mem=N(7761951925),energy=N(0),node=N(15047),billing=N(1444560),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)
    MaxJobs=500(19) MaxJobsAccrue= MaxSubmitJobs=3000(19) MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESMinsPJ=
    MinPrioThresh=

Then I run a job using the QoS:

tmerritt@junonia:~/puma $ scontrol show job 131817
JobId=131817 JobName=slurm-standard-test
   UserId=tmerritt(7862) GroupId=hpcteam(30001) MCS_label=N/A
   Priority=2 Nice=0 Account=tmerritt QOS=user_qos_tmerritt
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:03:57 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2020-09-15T13:13:38 EligibleTime=2020-09-15T13:13:38
   AccrueTime=2020-09-15T13:13:38
   StartTime=2020-09-15T13:13:39 EndTime=2020-09-15T13:23:39 Deadline=N/A
   PreemptEligibleTime=2020-09-15T13:13:39 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-15T13:13:39
   Partition=standard AllocNode:Sid=junonia:16811
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=r3u07n1
   BatchHost=r3u07n1
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/u11/tmerritt/puma/puma-standard.scr
   WorkDir=/home/u11/tmerritt/puma
   StdErr=/home/u11/tmerritt/puma/slurm-standard-test.out
   StdIn=/dev/null
   StdOut=/home/u11/tmerritt/puma/slurm-standard-test.out
   Power=

When the job ends, I would expect the QoS to be charged, but instead the account is charged, and the amount of time charged does not seem to match 1 CPU for 5 minutes:

tmerritt@junonia:~/puma $ scontrol show assoc flags=assoc account=parent_49
Current Association Manager state

Association Records

ClusterName=puma Account=parent_49 UserName= Partition= Priority=0 ID=5806
    SharesRaw/Norm/Level/Factor=1/0.00/1175/0.00
    UsageRaw/Norm/Efctv=1641909.00/0.00/0.33
    ParentAccount=uarizona(40) Lft=8645 DefAssoc=No
    GrpJobs=2000(19) GrpJobsAccrue=N(0)
    GrpSubmitJobs=2000(19) GrpWall=N(290.15)
    GrpTRES=cpu=3290(1824),mem=10485760(9800770),energy=N(0),node=N(19),billing=N(1824),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=8(0)
    GrpTRESMins=cpu=4200000(27365),mem=N(147016823),energy=N(0),node=N(290),billing=N(27365),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)
    GrpTRESRunMins=cpu=N(1417200),mem=N(7614940375),energy=N(0),node=N(14762),billing=N(1417200),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)
    MaxJobs=500(19) MaxJobsAccrue= MaxSubmitJobs=3000(19) MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESMinsPJ=
    MinPrioThresh=

The QoS, though, does seem to be charged appropriately:

tmerritt@junonia:~/puma $ scontrol show assoc flags=qos qos=user_qos_tmerritt
Current Association Manager state

QOS Records

QOS=user_qos_tmerritt(14)
    UsageRaw=309.000000
    GrpJobs=2000(0) GrpJobsAccrue=N(0) GrpSubmitJobs=2000(0) GrpWall=N(5.15)
    GrpTRES=cpu=2112(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)
    GrpTRESMins=cpu=21000000(5),mem=N(5273),energy=N(0),node=N(5),billing=N(5),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)
    GrpTRESRunMins=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)
    MaxWallPJ=
    MaxTRESPJ=
    MaxTRESPN=
    MaxTRESMinsPJ=
    MinPrioThresh= 
    MinTRESPJ=
    PreemptMode=OFF
    Priority=5
    Account Limits
      tmerritt
        MaxJobsPA=N(0) MaxJobsAccruePA=N(0) MaxSubmitJobsPA=N(0)
        MaxTRESPA=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)
    User Limits
      7862
        MaxJobsPU=N(0) MaxJobsAccruePU=N(0) MaxSubmitJobsPU=N(0)
        MaxTRESPU=cpu=N(0),mem=N(0),energy=N(0),node=N(0),billing=N(0),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu:volta=N(0)

Thanks,
Todd
Comment 1 Ben Roberts 2020-09-15 16:28:10 MDT
Hi Todd,

If I understand this correctly, it looks like you're seeing that the job you're looking at is generating usage for both the Account and QOS.  Is that right?  If so, that is expected behavior.  You can define limits on the number of GrpTRESMins for either an Account or QOS and those limits should be enforced correctly, but Slurm will still keep track of the usage for both the Account and QOS since they are both valid credentials on the job.  
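For reference, both kinds of limits are set with sacctmgr. Using the names and values visible in your scontrol output above, the account-level and QOS-level caps would be set along these lines:

$ sacctmgr modify account parent_49 set grptresmins=cpu=4200000
$ sacctmgr modify qos user_qos_tmerritt set grptresmins=cpu=21000000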

Let me know if this helps or if there's something I'm missing in what you're trying to show here.

Thanks,
Ben
Comment 2 Todd Merritt 2020-09-16 06:41:33 MDT
Thanks Ben,
This is counter-intuitive though. The next job that the user submits against their time bank is not going to run, because the job using the QoS used up all of the account's time. Is there a way that I can make that not happen and have it just charge against the QoS?

The use case is that each research group has an account with CPU time attached. Some faculty contribute money and get additional high-priority time, which we allocate to them through the QoS. We use the account to report on usage through xdmod, so I was trying to keep them using the same account. It seemed to behave as I expected on 19.05 when I set this up, but maybe there was something else that I was missing.

Thanks,
Todd

From: "bugs@schedmd.com" <bugs@schedmd.com>
Date: Tuesday, September 15, 2020 at 3:28 PM
To: "Merritt, Todd R - (tmerritt)" <tmerritt@arizona.edu>
Subject: [EXT][Bug 9828] time QOS charged incorrectly


External Email
Comment # 1<https://bugs.schedmd.com/show_bug.cgi?id=9828#c1> on bug 9828<https://bugs.schedmd.com/show_bug.cgi?id=9828> from Ben Roberts<mailto:ben@schedmd.com>

Hi Todd,



If I understand this correctly, it looks like you're seeing that the job you're

looking at is generating usage for both the Account and QOS.  Is that right?

If so, that is expected behavior.  You can define limits on the number of

GrpTRESMins for either an Account or QOS and those limits should be enforced

correctly, but Slurm will still keep track of the usage for both the Account

and QOS since they are both valid credentials on the job.



Let me know if this helps or if there's something I'm missing in what you're

trying to show here.



Thanks,

Ben

________________________________
You are receiving this mail because:

  *   You reported the bug.
Comment 3 Ben Roberts 2020-09-16 15:47:09 MDT
I did some testing with 19.05 today to confirm that the behavior is the same as you're seeing in 20.02, and it is.  When you have a job that is unable to run because of a limit, it should make it pretty clear which limit is being hit in the Reason field.  For example, if there is a GrpTRESMins limit on the QOS it will show "QOSGrpCPUMinutesLimit", or if it's a similar limit on the account or user association it will show "AssocGrpCPUMinutesLimit".  
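If it's easier to watch from the command line, the same Reason shows up in squeue output; for example (the format string here is just illustrative):

$ squeue --account=parent_49 --format="%.10i %.10u %.8T %r"

The %r field prints the Reason, so a job blocked by one of these limits would show QOSGrpCPUMinutesLimit or AssocGrpCPUMinutesLimit in the last column.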

In any case, I would expect the usage for the different credentials to increase as jobs run with those credentials.  You can define the TRESBillingWeights to affect the impact usage has on the Fairshare calculation.  
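TRESBillingWeights is a per-partition setting in slurm.conf. As a sketch (the weights here are made up for illustration, not a recommendation):

PartitionName=standard Nodes=<nodelist> TRESBillingWeights="CPU=1.0,Mem=0.25G"

With weights like these, the billing TRES that feeds the fairshare calculation would count each GB of allocated memory as a quarter of a CPU.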

The thing that I've been puzzling over is the 27365 you got for GrpTRESMins on the account after running a single CPU 5 minute job.  This looks related to the issue we're working on in bug 9811.  I'm still looking into that ticket, but wanted to provide a quick update here.  

Thanks,
Ben
Comment 4 Todd Merritt 2020-09-17 07:05:17 MDT
Thanks Ben, I guess I somehow missed that when I was testing. I can clearly tell why the job is not running; it just doesn't make any sense to me that it would be accounted for in both places. It certainly doesn't make any sense from a GAAP financial accounting standpoint anyway :)

I'll need to find a workaround for my situation, I guess, though at the moment I can't think of a viable one. I have multiple accounts that need different limits on resource usage applied to them, each with its own GrpTRESMins for CPU time. I don't want to have to push out updates to slurm.conf every time we get a new high-priority account. I could create a separate high-priority account with unlimited time and attach all of the high-priority users to it, I guess, but is there a way to enforce that they use a user QoS when they submit? Do you have any thoughts on what the best way to achieve this might be?
Comment 5 Ben Roberts 2020-09-17 09:34:37 MDT
Hi Todd,

I can understand why you would want it to work that way; unfortunately, I don't have a way to make that happen, and it's a change that I don't think is likely to be made either.

I can help you come up with a workaround though.  The nice thing about accounts is that you don't actually need to update your slurm.conf when adding them.  You add the account with sacctmgr, which writes the information to the database directly, and the change is communicated back to slurmctld.  Then you can add the user to the account (also with sacctmgr) so they have access to it.  You can define the QOS(s) the account or user has access to, along with the default QOS.  If you need to have a single QOS used with a certain account, you can limit that user/account association's access to that single QOS and make it the default, so the user doesn't need to remember to request it when using the account.  You also have the option of defining a certain account as the default for a user.

I'll show an example scenario that follows what I described above.  First I'll add a qos named 'user_qos_user1':
$ sacctmgr add qos user_qos_user1 grptresmins=cpu=300


Then I'll add an account for the user, associating the above QOS with the account and making it the default:
$ sacctmgr add account user1_owned qos=user_qos_user1 defaultqos=user_qos_user1


Then I'll add the user to the account:
$ sacctmgr add user user1 account=user1_owned
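If you'd rather pin the QOS on the user's association itself instead of having it inherited from the account, the same settings can be applied there too, along these lines:

$ sacctmgr modify user where name=user1 account=user1_owned set qos=user_qos_user1 defaultqos=user_qos_user1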


And finally I'll make that account the default for the user (this step is optional for the scenario you describe):
$ sacctmgr modify user user1 set defaultaccount=user1_owned



Below you can see the information from sacctmgr:

$ sacctmgr show assoc tree account=user1_owned format=cluster,account,user,qos,defaultqos%15
   Cluster              Account       User                  QOS         Def QOS 
---------- -------------------- ---------- -------------------- --------------- 
    knight user1_owned                           user_qos_user1  user_qos_user1 
    knight  user1_owned              user1       user_qos_user1  user_qos_user1 



$ sacctmgr show qos user_qos_user1 format=name%15,grptresmins
           Name   GrpTRESMins 
--------------- ------------- 
 user_qos_user1       cpu=300 



$ sacctmgr show user user1
      User   Def Acct     Admin 
---------- ---------- --------- 
     user1 user1_own+      None 



Then without any modification of the slurm.conf or restart of the controller I can become user1 and submit a job, having it use the Account and QOS combination I just created as the default:
$ sbatch -n1 -t1:00 --wrap='srun sleep 30'
Submitted batch job 104

$ scontrol show job 104 | head -n3
JobId=104 JobName=wrap
   UserId=user1(1001) GroupId=user1(1001) MCS_label=user1_owned
   Priority=11810 Nice=0 Account=user1_owned QOS=user_qos_user1



I hope this helps.  Let me know if you have questions about it.

Thanks,
Ben
Comment 6 Todd Merritt 2020-09-17 10:07:30 MDT
Thanks, Ben. That's what I'm already doing for the standard queue, where everyone has the same priority and limits. I guess a more detailed description is in order.

I have a partition, standard, with a partition QOS. Every research group gets an account that is allowed access to this partition/QOS. That's the account they use for submission, and I need that account to match the PI so that it rolls up into xdmod, where we report on usage by PI. That's why I want the account to be the same for high-priority usage.

We have another partition with unlimited access but low priority, called windfall, that uses a shared windfall account. I re-map that account onto the PI's group when I ingest it into xdmod, and I could do a similar thing if I create a shared high-priority account, but getting users attached to that account in an automated fashion, like I'm doing for the other accounts, would be challenging. I don't have to deal with that as it is right now, because I can just create the high-priority QOS manually, attach it to the PI's account, and let the automated process manage the users attached to the account exactly as it does now.
Comment 7 Ben Roberts 2020-09-17 15:00:11 MDT
Hi Todd,

Thanks for the additional detail.  When you talk about it being challenging to automate associating the users with a different account, it sounds like you're not talking about the process of creating the user association.  And it sounds like you have a working method of getting just their usage from a shared account.  So I think you're talking about making sure they request the correct account/partition combination when they want to have access to the higher priority account.  Is that right?  

If that's the case, we do allow for a job submit filter that you can configure to look for certain attributes and take action based on conditions you define.  You could create a filter that looks for a job requesting the high-priority account with the wrong partition and sends a message that they need to use a different partition, and you could cover the opposite case of the high-priority partition with their normal account.  You could also configure the script to fix the request for them, but if you have a lot of users with unique accounts that would become unmanageable quickly.

You can read more about the job submit filter here:
https://slurm.schedmd.com/job_submit_plugins.html
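To give a rough idea, a filter like that written for the lua plugin (JobSubmitPlugins=lua in slurm.conf) might look something like the sketch below. The partition and account names are made up for illustration, and a real script would need to handle your site's naming and the cases where fields are unset:

-- job_submit.lua sketch: reject mismatched account/partition pairs
function slurm_job_submit(job_desc, part_list, submit_uid)
    local hi_acct = "hipri_shared"   -- hypothetical shared high-priority account
    local hi_part = "high_pri"       -- hypothetical high-priority partition
    -- partition may be nil if the user relies on the default partition
    local part = job_desc.partition or ""
    if job_desc.account == hi_acct and part ~= hi_part then
        slurm.log_user(string.format("Account %s must be used with partition %s", hi_acct, hi_part))
        return slurm.ERROR
    end
    if part == hi_part and job_desc.account ~= hi_acct then
        slurm.log_user(string.format("Partition %s requires account %s", hi_part, hi_acct))
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end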

Let me know if this sounds like something that would work for you or if I'm trying to address the wrong problem.

Thanks,
Ben
Comment 8 Todd Merritt 2020-09-18 08:42:17 MDT
Hey Ben, I'm talking about the associations. I don't mean to imply that Slurm makes it difficult to create the associations; I just mean that it would be difficult to manage with my automated process, because the information about which groups should have high-priority access is not available to that script presently. We're already using a job submit plugin to massage submissions, but I don't think that would be helpful here. I think my specific question is:

Is there any way for a user QoS to override tracking of grptresmins from the account?

It sounds like there's not. If that's the case, then I think I need to create a separate partition and separate accounts for this purpose. And if I need to do that, I don't think there's any value to the user QoS, since I can set all of those same limits on the account, right?

Also, not to entirely derail this ticket :) Any update on why the charge to the underlying account seems so far out of whack (27365 for 5 minutes of walltime)?

Thanks!
Comment 9 Ben Roberts 2020-09-18 13:13:08 MDT
You're correct, there isn't a way to have the QoS override the tracking of GrpTRESMins for the account.  You're also correct that in your scenario, where you're creating separate accounts to have the usage tracked separately, adding a QOS doesn't allow you to limit the GrpTRESMins in a different way.  

In order to figure out how the usage increased by so much we're going to have to dig deeper.  I would like to have you send me the result of a database query:
select id_job,id_user,tres_alloc,tres_req from puma_job_table where id_job='131817';

If you don't remember the credentials to access the database, you can find them in your slurmdbd.conf file.  I would also like to state that making alterations to the database directly can render it inoperable (as far as Slurm is concerned) and is not supported.   

If you can reproduce this behavior then it would be useful to see some of these entries in the slurmdbd.log file by enabling the DB_JOB debug flag in your slurmdbd.conf.
DebugFlags=DB_JOB

Let me know if you have problems gathering this information.

Thanks,
Ben
Comment 10 Todd Merritt 2020-09-18 13:31:30 MDT
Ok, thanks, I'll look at a different setup for managing these buy-in accounts. I suppose a potential workaround might be to just allocate additional time to their standard accounts to match the high-priority time.
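That is, bumping the account's cap with something like:

$ sacctmgr modify account parent_49 set grptresmins=cpu=<new total>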

MariaDB [slurm_acct_db]> select id_job,id_user,tres_alloc,tres_req from puma_job_table where id_job='131817';
+--------+---------+--------------------+--------------------+
| id_job | id_user | tres_alloc         | tres_req           |
+--------+---------+--------------------+--------------------+
| 131817 |    7862 | 1=1,2=1024,4=1,5=1 | 1=1,2=1024,4=1,5=1 |
+--------+---------+--------------------+--------------------+
1 row in set (0.00 sec)


I can reproduce. Here's the result for a new job, 136277. I'll attach the slurmdbd log.

MariaDB [slurm_acct_db]> select id_job,id_user,tres_alloc,tres_req from puma_job_table where id_job='136277';
+--------+---------+--------------------+--------------------+
| id_job | id_user | tres_alloc         | tres_req           |
+--------+---------+--------------------+--------------------+
| 136277 |    7862 | 1=1,2=1024,4=1,5=1 | 1=1,2=1024,4=1,5=1 |
+--------+---------+--------------------+--------------------+
1 row in set (0.00 sec)

Thanks!
Comment 11 Todd Merritt 2020-09-18 13:33:29 MDT
Created attachment 15957 [details]
slurmdbd log
Comment 12 Ben Roberts 2020-09-21 09:29:17 MDT
Thanks for gathering this input.  It looks like these jobs were charged for a reasonable amount of time.  The codes for the different TRES types are:
1 = CPU
2 = Memory
4 = Node
5 = Billing

The logs line up with what you show in the database for job 136277 just being charged for 1 CPU minute (the tres_alloc of 1=1,2=1024,4=1,5=1 decodes to cpu=1, mem=1024M, node=1, billing=1).

I think the most likely explanation for the large difference in GrpTRESMins shown is that there was another job (or multiple jobs) finishing around the same time that caused the large amount of time to be registered.  

We should be able to confirm this with a test: gather the value of GrpTRESMins before a test job along with the epoch timestamp, run the job, and then look at GrpTRESMins again after it's done.  Here are the commands I'd like to have you run:

date +'%s'; sshare -Aparent_49; scontrol show assoc flags=assoc account=parent_49 | egrep 'ClusterName|GrpTRESMins'

<submit test job>

date +'%s'; sshare -Aparent_49; scontrol show assoc flags=assoc account=parent_49 | egrep 'ClusterName|GrpTRESMins'


Once that is done, I would like to have you go into the database and get a list of jobs that finished in that window of time.

select mod_time,account,id_job,id_user,tres_alloc,tres_req from puma_job_table where mod_time>(first time stamp) and mod_time<(second time stamp);
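
For example, with made-up timestamps of 1600804500 and 1600805100 from the date commands, that would be:

select mod_time,account,id_job,id_user,tres_alloc,tres_req from puma_job_table where mod_time>1600804500 and mod_time<1600805100;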


You would want to use the time stamps that you got from the date +'%s' commands you ran earlier.  I'll review that information once you're able to collect it.  

Thanks,
Ben
Comment 13 Todd Merritt 2020-09-23 10:05:51 MDT
Thanks Ben,

I had forgotten about an idle-cycle background job that also runs against that same account. After I disabled it, the amount charged matches what was charged to the QoS. You can close this ticket out.
Comment 14 Ben Roberts 2020-09-23 10:25:00 MDT
I'm glad we were able to narrow down the source of the extra charge.  Let us know if there's anything else we can do to help.

Thanks,
Ben