| Summary: | understanding sacct output of job data wrt GRES (and an anomaly) | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Greg Wickham <greg.wickham> |
| Component: | Accounting | Assignee: | Scott Hilton <scott> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | albert.gil |
| Version: | 19.05.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | KAUST | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | Job 10342997 data | ||
|
Description
Greg Wickham
2020-06-03 20:04:10 MDT
Greg,

1) This issue looks like the same one as in 8024. There hasn't been any update on it either.

2) This one is interesting. It is also odd that ReqTRES asks for 1 node but 2 nodes were allocated. I'll look into it further and get back to you.

3) Batch isn't allocated any GPUs according to AllocTRES. But, as in Q2, it disagrees with AllocGRES. I'll have to get back to you.

-Scott

Greg,

Do you know how this job was created, i.e. (sbatch ...)? Is this an issue with just this one job, or are you getting similar discrepancies with many jobs? If so, are there any patterns that lead to it?

Could you also send me the output of

sacct -P -j 10342997 --format ALL

Thanks,
Scott

Created attachment 14644
Job 10342997 data
output of "sacct -P -j 10342997 --format ALL" attached.
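
For anyone reviewing this kind of output, a minimal sketch of reading `sacct -P` (parsable2) text into per-row dictionaries may help. It assumes the usual `|`-delimited layout with a header line; the function name and the sample rows are hypothetical and are not taken from the attached job data.

```python
import csv
import io


def parse_sacct_parsable2(text):
    """Parse '|'-delimited `sacct -P` output into a list of dicts keyed by header name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    return list(reader)


# Hypothetical sample for illustration only (NOT the data from job 10342997).
SAMPLE = (
    "JobID|ReqTRES|AllocTRES\n"
    "1234|billing=1,cpu=1,gres/gpu=2,node=1|billing=1,cpu=1,gres/gpu=2,node=1\n"
    "1234.batch||cpu=1,mem=4G,node=1\n"
    "1234.extern||billing=1,cpu=1,gres/gpu=2,node=1\n"
)

rows = parse_sacct_parsable2(SAMPLE)
for row in rows:
    # One row per job or step; empty fields come back as empty strings.
    print(row["JobID"], row["AllocTRES"])
```

In practice the text would come from running `sacct -P -j <jobid> --format JobID,ReqTRES,AllocTRES` and capturing its standard output.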
(In reply to Scott Hilton from comment #4)
> Greg,
>
> Do you know how this job was created? i.e. (sbatch ...) Is this just an
> issue with this one job or are you getting similar discrepancies with many
> jobs? If so are there any patterns that lead to it?

We only know what Slurm records. The anomaly was only discovered while reviewing the accounting records when creating this case.

-Greg

Greg,

For your second question, trust AllocTRES. AllocGRES (and ReqGRES) just print the job-level value, not the individual step values. In other words, as the code is currently written, each step row (.extern, .batch, .0) will always show the same value as the top row. I think this is confusing and will look into changing it.

-Scott

Greg,

We are now planning to remove the AllocGRES and ReqGRES options in the Slurm 20.11 release. Just focus on the output of ReqTRES and AllocTRES.

Does this answer all your questions?

-Scott

I am going to go ahead and close this bug as info given.

Thanks Scott. |
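
Following the advice above to rely on AllocTRES, the sketch below shows one way to pull a per-step GPU count out of a TRES string. It assumes the comma-separated `name=value` form (for example `cpu=1,gres/gpu=2,node=1`); the helper names and sample strings are hypothetical, and typed entries such as `gres/gpu:<type>` are deliberately not counted here.

```python
def parse_tres(tres):
    """Split a TRES string such as 'cpu=1,gres/gpu=2,node=1' into a dict of strings."""
    fields = {}
    for item in tres.split(","):
        if not item:
            continue  # an empty TRES field yields no entries
        name, _, value = item.partition("=")
        fields[name] = value
    return fields


def gpus_in(tres):
    """Return the generic 'gres/gpu' count in a TRES string, or 0 if it is absent."""
    return int(parse_tres(tres).get("gres/gpu", "0"))


# Hypothetical AllocTRES values for a job row and its steps (illustration only).
alloc_tres_by_row = {
    "1234": "billing=1,cpu=1,gres/gpu=2,node=1",
    "1234.batch": "cpu=1,mem=4G,node=1",
    "1234.extern": "billing=1,cpu=1,gres/gpu=2,node=1",
}

gpu_counts = {job_id: gpus_in(tres) for job_id, tres in alloc_tres_by_row.items()}
print(gpu_counts)
```

Comparing these per-row counts against the job-level row makes it easier to see which steps actually had GPUs allocated, without consulting AllocGRES.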