Ticket 14027

Summary: Setting/updating GrpTRESMinutes based on TRESBillingWeights
Product: Slurm Reporter: hpc-ops
Component: Limits Assignee: Ben Roberts <ben>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 20.11.9   
Hardware: Linux   
OS: Linux   
Site: Ghent

Description hpc-ops 2022-05-09 08:31:50 MDT
Hello, 

I am looking for feedback regarding the following scheme I have in mind. 

Our setup requires us to restrict the computation time accounts (projects) are allowed to use. For this we have assigned a QoS to each account, with limits set through GrpTRESMins. This works fine, as long as the jobs stick with the default combinations of TRES, i.e., they use no more than the default memory per core, use the correct number of cpus for each gpu they reserve in the job, etc.

However, if they deviate from this and, say, request all the RAM and a single core, they still occupy a full node, meaning that node is obviously not available to other jobs. Hence, we rely on MAX_TRES and TRESBillingWeights to derive the correct billing.

Each of these works, but TRESBillingWeights does not (afaik; see also my other question https://bugs.schedmd.com/show_bug.cgi?id=13878) affect the GrpTRESMins accumulated by jobs, since that value _only_ looks at cpu time (wall clock time of the job * number of cores, if I understood correctly) -- and similarly at gpu time for the GPU limits.


We do have external means of tracking actual usage, derived from sacct data, where we would update the usage according to the billing, not to the cpu/gpu usage themselves. So for example, suppose we have set the QoS GrpTRESMins at 100 hours for a job with 

billing=24,cpu=8,gres/gpu=2,mem=20G,node=1

running 1 hour, I would count usage as

cpu time: 24 hours
gpu time: 2 hours

Yet, the accumulated GrpTRESMins will only increase by 8 hours for cpu, so the QoS would have 92 hours of cpu compute time left. Would that be correct?
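For what it's worth, the billing figure above is consistent with the MAX_TRES rule, where the billed amount is the largest of the weighted TRES requests. A minimal sketch, with hypothetical weights chosen so the example job bills at 24:

```python
# Simplified sketch of the MAX_TRES billing rule:
# billing = max over the weighted per-node TRES requests.
# The weights here are hypothetical, chosen so the example job
# (cpu=8, gres/gpu=2, mem=20G) bills at 24.

def billing_max_tres(request, weights):
    """Return the billed TRES amount for one job under MAX_TRES."""
    return max(weights[tres] * amount for tres, amount in request.items())

weights = {"cpu": 1.0, "gres/gpu": 12.0, "mem_gb": 0.25}
job = {"cpu": 8, "gres/gpu": 2, "mem_gb": 20}

billing = billing_max_tres(job, weights)
print(billing)  # 24.0 -- gres/gpu dominates: 2 * 12.0
```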

If yes, then I'd need to take care of updating the QoS myself to reflect the remaining compute time, reducing it to 100 - 24 = 76 hours. I was considering doing that on a daily basis, i.e., get the sacct data of all finished (completed, failed, ...) jobs, derive their billing, retrieve from our tool how much compute time is left, adjust the QoS accordingly, and reset the QoS usage.
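The daily reconciliation I have in mind would look roughly like this (a sketch only; the budget and job values are the example numbers from above, and in practice the billing and elapsed figures would be parsed from sacct's AllocTRES and Elapsed output):

```python
# Sketch of the proposed daily reconciliation: sum the billing
# minutes of finished jobs and work out how much of the QoS
# budget remains. Job list is illustrative.

BUDGET_MINUTES = 100 * 60  # GrpTRESMins budget: 100 hours

# (billing, elapsed_minutes) per finished job, derived from sacct
# output (AllocTRES contains the billing value for each job).
finished_jobs = [(24, 60)]  # one 1-hour job billed at 24

used = sum(billing * minutes for billing, minutes in finished_jobs)
remaining_hours = (BUDGET_MINUTES - used) / 60
print(remaining_hours)  # 76.0
```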

Am I missing something here? Is there a different/better/other way of achieving this goal?


Kind regards,
-- Andy
Comment 1 Ben Roberts 2022-05-10 12:14:56 MDT
Hi Andy,

Thanks for the detail in your description, I think I have a good understanding of the problem you are trying to solve.  You are right about the behavior when users submit jobs that request an uneven ratio of different types of Trackable Resources.

Your solution is a viable option, though I understand it is probably not ideal.  Running daily scripts introduces fragility and a potential for errors that wouldn't exist if this were handled directly in the code.

One potential alternative I thought might help would be to set a limit on the 'billing' TRES directly.  When a job runs it will calculate the billing amount based on the TRES with the highest amount requested (since you have MAX_TRES defined).  This 'billing' amount is tracked by slurmctld like the other TRES values.  Here's an example of how this might look.

$ sacctmgr show qos member format=name,grptresmins%20,flags
      Name          GrpTRESMins                Flags 
---------- -------------------- -------------------- 
    member          billing=600              NoDecay 
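For reference, a limit like the one above can be set with something along these lines (a sketch; 'member' and the 600-minute value are just from my example):

```shell
# Sketch: add a billing-based GrpTRESMins limit to the 'member' QoS.
# NoDecay keeps the accumulated usage from being aged out over time.
sacctmgr modify qos member set GrpTRESMins=billing=600 Flags=NoDecay
```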

After running a few short test jobs you can see that the billing TRES shows the 27 units of usage out of the 600 I have defined.

$ scontrol show assoc_mgr flags=qos qos=member | grep GrpTRESMins
    GrpTRESMins=cpu=N(9),mem=N(18500),energy=N(0),node=N(3),billing=600(27),fs/disk=N(0),vmem=N(0),pages=N(0),gres/asdf=N(0),gres/gpu=N(0),gres/gpu:tesla=N(0),gres/test=N(0),license/local=N(0),license/testlic=N(0)

Does this sound like something that might work for your needs, or is there something that would require you to keep the limits split out by the different types of resource?

Thanks,
Ben
Comment 2 hpc-ops 2022-05-11 10:29:27 MDT
Hi Ben,

Thanks for the swift response. 

Just to confirm: I can simply add this billing= limit and the currently tracked values will be kept? So if an account were already over the limit due to a discrepancy between billing and cpu usage, this would not wreak havoc?


-- Andy
Comment 3 Ben Roberts 2022-05-11 11:46:45 MDT
No, it would not cause problems for jobs that are already running.  Limits defined with sacctmgr will be applied when the scheduler looks at starting new jobs, but it won't look at jobs that are already running and try to enforce newly introduced limits.

Thanks,
Ben
Comment 4 hpc-ops 2022-05-13 03:01:43 MDT
This seems to have worked as indicated. Thanks.