| Summary: | sreport -t percent -T node user topusers doesn't give percentages | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Ryan Day <day36> |
| Component: | User Commands | Assignee: | Broderick Gardner <broderick> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | sts |
| Version: | 17.11.9 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | LLNL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 18.08.5 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Ryan Day
2018-10-30 17:33:06 MDT
I'm investigating this for you. I see the problem with node tres usage, I'll figure out why that is not correct. What do you expect to see for Tres Minutes?

Sorry, I guess I wasn't clear on that. I don't think there's anything wrong with the -T minutes report. My first thought on the -t percent -T node output was that it might be reporting the TRES minutes instead of the percent, and so I included that output to show that it wasn't.

Thanks for the clarification, that makes more sense. It turns out that it is showing a percent, but the percent is so large that the decimals and % are truncated. So the total time it is divided by is wrong somehow. I have replicated the issue, so I'll let you know when I have a fix.

The percent doesn't work because cluster node usage is no longer tracked, so there is no total usage to normalize the user node usage. The value shown is actually the number of node seconds * 100.

Cluster node usage isn't tracked because nodes are no longer the smallest unit of trackable computing resources; that is the CPU. User node usage is also not particularly useful because nodes are double counted if the user has multiple jobs on a single node. TRES node in general will likely be fully reworked in the future to fix this, though it is not yet clear what node usage accounting should mean. CPU usage is the proper TRES for accounting purposes. Because of this, I will likely add a warning to sreport output indicating that TRES node usage is invalid.

Were you able to find the runaway job usage? Was it missing usage due to jobs fixed by `sacctmgr show runawayjobs`?

we did sort out the runaway job usage by completely re-rolling the various assoc_usage tables. The cleanup from 'sacctmgr show runaways' wasn't enough by itself. See bug 5924.

Okay. There have been some fixes to `sacctmgr show runawayjobs` in version 18.08, both to identify some even more rare runaway job cases and to fix rollup issues. Hopefully in the future that is enough.

Sounds good.
Thank you.

This has now been patched; there is now an error message when trying to request reports that include cluster node TRES utilization or percentages. This will be included in version 18.08.5 commit 58fb2de1e4d6

Thanks
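
For readers following the arithmetic in this thread, here is a rough Python sketch of the two accounting points made above: a percentage needs a cluster-wide total to divide by, and per-job node accounting counts a shared node more than once. This is not Slurm code. All function names are hypothetical, and modeling the untracked cluster total as a divisor of 1 is an assumption made only to reproduce the "node seconds * 100" symptom described in the thread.

```python
from typing import Iterable, Optional, Set, Tuple

# A job is (set of node names, start second, end second). Hypothetical shape.
Job = Tuple[Set[str], int, int]


def usage_percent(user_node_seconds: int,
                  cluster_node_seconds: Optional[int]) -> float:
    """Percent of total cluster usage attributed to one user.

    ASSUMPTION: when no cluster total is tracked, fall back to a divisor
    of 1, so the result degenerates to node seconds * 100.
    """
    total = cluster_node_seconds if cluster_node_seconds else 1
    return 100.0 * user_node_seconds / total


def per_job_node_seconds(jobs: Iterable[Job]) -> int:
    """Sum node-seconds job by job; a node shared by two jobs counts twice."""
    return sum(len(nodes) * (end - start) for nodes, start, end in jobs)


def occupied_node_seconds(jobs: Iterable[Job]) -> int:
    """Count each node once per busy second by merging overlapping intervals."""
    by_node = {}
    for nodes, start, end in jobs:
        for node in nodes:
            by_node.setdefault(node, []).append((start, end))

    total = 0
    for intervals in by_node.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or adjacent: extend the current busy interval.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


if __name__ == "__main__":
    # With a tracked total, one node-hour out of ten is 10 percent.
    print(usage_percent(3600, 36000))
    # Without a total, the "percent" is just node seconds * 100.
    print(usage_percent(3600, None))

    # Two one-hour jobs from the same user sharing node "n1".
    shared = [({"n1"}, 0, 3600), ({"n1"}, 0, 3600)]
    print(per_job_node_seconds(shared), occupied_node_seconds(shared))
```

With the sample jobs, the per-job sum reports two node-hours for a node that was busy for only one. That is the double counting described above, and part of why the thread recommends CPU as the TRES to use for accounting.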