| Summary: | Setting/updating GrpTRESMinutes based on TRESBillingWeights | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | hpc-admin |
| Component: | Limits | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 20.11.9 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Ghent | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
hpc-admin
2022-05-09 08:31:50 MDT
Hi Andy,
Thanks for the detail in your description; I think I have a good understanding of the problem you are trying to solve. You are right about the behavior when users submit jobs that request an uneven ratio of different types of Trackable Resources.
Your solution is a viable option, though I understand it is probably not ideal. Relying on daily scripts introduces fragility and a potential for errors that wouldn't exist if this were handled directly in the code.
One potential alternative that might help would be to set a limit on the 'billing' TRES directly. When a job runs, the billing amount is calculated from the TRES with the highest weighted request (since you have MAX_TRES defined). This 'billing' amount is tracked by slurmctld like the other TRES values. Here's an example of how this might look.
$ sacctmgr show qos member format=name,grptresmins%20,flags
Name GrpTRESMins Flags
---------- -------------------- --------------------
member billing=600 NoDecay
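(For reference, a limit like the one shown above could be set with a command along these lines; this is a sketch reusing the qos name and value from this example, so adjust them to your site:
$ sacctmgr modify qos member set GrpTRESMins=billing=600 Flags=NoDecay
)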
After running a few short test jobs you can see that the billing TRES shows 27 units of usage out of the 600 I have defined.
$ scontrol show assoc_mgr flags=qos qos=member | grep GrpTRESMins
GrpTRESMins=cpu=N(9),mem=N(18500),energy=N(0),node=N(3),billing=600(27),fs/disk=N(0),vmem=N(0),pages=N(0),gres/asdf=N(0),gres/gpu=N(0),gres/gpu:tesla=N(0),gres/test=N(0),license/local=N(0),license/testlic=N(0)
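For background on where that billing number comes from: with PriorityFlags=MAX_TRES, the billing value is derived from TRESBillingWeights by taking the largest weighted per-node resource rather than the sum of all of them. Here is a minimal Python sketch of that calculation; the function and dictionary names are illustrative, not Slurm's actual code, and globally weighted TRES such as licenses (which MAX_TRES adds on top of the per-node maximum) are omitted for simplicity.

```python
def billing_units(request, weights, max_tres=True):
    """Simplified model of Slurm's 'billing' TRES calculation.

    request and weights are dicts keyed by TRES name,
    e.g. {'cpu': 4, 'mem': 16} and {'cpu': 1.0, 'mem': 0.25}.

    With PriorityFlags=MAX_TRES, billing is the maximum weighted
    per-node TRES; otherwise it is the sum of all weighted TRES.
    """
    weighted = [request.get(tres, 0) * w for tres, w in weights.items()]
    return max(weighted) if max_tres else sum(weighted)


# A job requesting an uneven ratio of CPU and memory: under MAX_TRES
# the larger weighted component (memory here) dominates the charge.
print(billing_units({"cpu": 1, "mem": 64}, {"cpu": 1.0, "mem": 0.25}))  # 16.0
# Without MAX_TRES the components would simply be summed.
print(billing_units({"cpu": 8, "mem": 8}, {"cpu": 1.0, "mem": 0.25}, max_tres=False))  # 10.0
```

This is why a single billing=600 limit can stand in for separate per-TRES limits: whichever resource a job over-requests, that resource drives the billing charge.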
Does this sound like something that might work for your needs, or is there something that would require you to keep the limits split out by the different types of resource?
Thanks,
Ben
hpc-admin
Hi Ben,
Thanks for the swift response. It is correct that I can simply add this billing= limit, and it will keep the currently tracked values? So if an association were already over the limit due to a discrepancy between billing and cpu, this will not wreak havoc?
-- Andy

Ben Roberts
No, it would not cause problems for jobs that are already running. Limits defined with sacctmgr are applied when the scheduler looks at starting new jobs, but it won't look at jobs that are already running and try to enforce newly introduced limits.
Thanks,
Ben

hpc-admin
This seems to have worked as indicated. Thanks.