Summary: | problems with usage in 20.02.6 | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ryan Day <day36> |
Component: | slurmctld | Assignee: | Marshall Garey <marshall> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 2 - High Impact | ||
Priority: | --- | CC: | sts |
Version: | 20.02.6 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | LLNL | |
Description
Ryan Day
2021-02-22 09:38:00 MST
It looks like people who are actually running have reasonable values for RawUsage, but users who haven't run are getting the underflowed value, which is then propagated up the tree:

[day36@rztopaz194:~]$ sshare -a
             Account       User  RawShares  NormShares     RawUsage  EffectvUsage   FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root                                          0.000000  9223372036854775808      1.000000
root                       root          1    0.000010     33923250           nan    0.000477
...
wci                                   86917    0.966765  9223372036854775808           nan
cmetal                                   17    0.000196  9223372036854775808           nan
cbronze                                  10    0.588235  9223372036854775808           nan
cbronze                 abcbitz           1    0.004902            0           nan    0.934700
cbronze                 abdulla           1    0.004902  9223372036854775808           nan    0.936130
cbronze                adams106           1    0.004902  9223372036854775808           nan    0.971401
cbronze                  adler5           1    0.004902      7814080           nan    0.963775
cbronze                  afeyan           1    0.004902  9223372036854775808           nan    0.970448
cbronze                 agrusa1           1    0.004902            0           nan    0.960439
cbronze                   alan2           1    0.004902  9223372036854775808           nan    0.952812
cbronze                   alead           1    0.004902            0           nan    0.934223
cbronze                   ames6           1    0.004902        85760           nan    0.933746

Values for usage in the slurmdb appear to be okay as well.


Marshall Garey

Ryan, this might be a duplicate of bug#10824.

https://github.com/SchedMD/slurm/commit/c57311f19d2ec9a258162909699aba9505e368b8

commit c57311f19d2ec9a258162909699aba9505e368b8
Author:     Albert Gil <albert.gil@schedmd.com>
AuthorDate: Fri Feb 12 18:41:37 2021 +0100

    Work around glibc bug where "0" as a long double is printed as "nan".

    On broken glibc versions, the zeroes in the association state file will
    be saved as "nan" in packlongdouble(). Detect if this has happened in
    unpacklongdouble() and convert back to zero.

    https://bugzilla.redhat.com/show_bug.cgi?id=1925204
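As an aside, here is a minimal standalone sketch of the detect-and-convert idea the commit message describes, assuming the state file stores the long double as a formatted string. It is not Slurm's actual packlongdouble()/unpacklongdouble() code, and sketch_unpack_long_double() is a hypothetical name used only to illustrate mapping a serialized "nan" back to zero.

/*
 * Sketch only: treat a long double that a broken glibc serialized as the
 * string "nan" as zero when reading it back, so NaN does not propagate
 * into the usage tree.  Not Slurm's real pack/unpack API.
 */
#include <stdio.h>
#include <string.h>

static long double sketch_unpack_long_double(const char *buf)
{
	long double val = 0;

	/* Work around the glibc bug: "0" may have been written out as "nan". */
	if (!strncmp(buf, "nan", 3))
		return 0;

	if (sscanf(buf, "%Lf", &val) != 1)
		return 0;	/* unparsable -> fall back to zero */

	return val;
}

int main(void)
{
	printf("%Lf\n", sketch_unpack_long_double("nan"));      /* prints 0.000000 */
	printf("%Lf\n", sketch_unpack_long_double("33923250")); /* prints 33923250.000000 */
	return 0;
}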
Ryan Day

Yes. It does look like we also updated to the broken 322 build of glibc at the same time as we updated to 20.02.6. I'm not quite clear from the discussion of that bug whether just fixing glibc will be sufficient to completely fix this, or if we'll still have to do more to clean up the NaNs that were introduced by the 322 build of glibc.

Thanks,
Ryan


Marshall Garey

> [2021-02-17T17:13:09.046] error: _handle_assoc_tres_run_secs: job 4968110: assoc 1349 TRES cpu grp_used_tres_run_secs underflow, tried to remove 108000 seconds when only 107892 remained.
> [2021-02-17T17:13:09.046] error: _handle_assoc_tres_run_secs: job 4968110: assoc 1349 TRES node grp_used_tres_run_secs underflow, tried to remove 3000 seconds when only 2997 remained.
> [2021-02-17T17:13:09.046] error: _handle_assoc_tres_run_secs: job 4968110: assoc 1349 TRES billing grp_used_tres_run_secs underflow, tried to remove 108000 seconds when only 107892 remained.

This might also be a duplicate of bug#7375. We will let you know.

(In reply to Ryan Day from comment #3)
> Yes. It does look like we also updated to the broken 322 build of glibc at
> the same time as we updated to 20.02.6. I'm not quite clear from the
> discussion of that bug whether just fixing glibc will be sufficient to
> completely fix this, or if we'll still have to do more to clean up the NaNs
> that were introduced by the 322 build of glibc.

Since NaNs have now been introduced, I believe you'll need to update Slurm as well. Alternatively, you could cherry-pick Albert's commit and apply it to your 20.02 Slurm locally - it's a really small commit, so it should be easy enough to apply. That way you won't have to upgrade Slurm and deal with everything that goes along with an upgrade. You can always just upgrade glibc and try it out (see if the NaNs go away), but I think you'll also need the patch.

Can you let us know when you've upgraded glibc and/or applied the patch to Slurm and whether it works for you? I'll look into the assoc underflow errors and get back to you - they might be related to 7375, but they might be something else. Thanks.


Ryan Day

I've pulled Albert's commit and applied it to 20.02.6. We'll include that with the newer glibc and let you know if we still see any issues.


Marshall Garey

Hey Ryan, did everything work out with Albert's patch and the glibc update?


Ryan Day

(In reply to Marshall Garey from comment #7)
> Hey Ryan, did everything work out with Albert's patch and the glibc update?

Hey Marshall,

Yes. It's looking good. Thanks for the fast responses on everything. We're really glad we got the glibc issue caught before it could bite our users.

Ryan


Marshall Garey

Sounds good. I'll close this as a duplicate of bug 10824. If you keep seeing those association underflow errors, you can open a new bug report about it. I don't think this is a duplicate of 7375, which deals with QOS underflow errors.

*** This ticket has been marked as a duplicate of ticket 10824 ***
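As a footnote on the grp_used_tres_run_secs errors quoted above: the wording "tried to remove N seconds when only M remained" suggests a guard that logs the discrepancy and clamps the counter rather than letting an unsigned value wrap. Below is a hedged sketch of that log-and-clamp pattern; the struct and function names are hypothetical and this is not Slurm's actual _handle_assoc_tres_run_secs() implementation.

/*
 * Sketch only: underflow guard for an unsigned "used run seconds" counter.
 * Illustrates logging and clamping to zero instead of wrapping around.
 */
#include <stdint.h>
#include <stdio.h>

struct tres_usage {
	const char *tres_name;		/* e.g. "cpu", "node", "billing" */
	uint64_t grp_used_run_secs;	/* accumulated run seconds for this TRES */
};

/* Remove run seconds for a finished job, guarding against underflow. */
static void remove_run_secs(struct tres_usage *u, uint64_t secs, uint32_t job_id)
{
	if (secs > u->grp_used_run_secs) {
		fprintf(stderr,
			"error: job %u: TRES %s run_secs underflow, "
			"tried to remove %llu seconds when only %llu remained.\n",
			job_id, u->tres_name,
			(unsigned long long) secs,
			(unsigned long long) u->grp_used_run_secs);
		u->grp_used_run_secs = 0;	/* clamp to zero instead of wrapping */
	} else {
		u->grp_used_run_secs -= secs;
	}
}

int main(void)
{
	struct tres_usage cpu = { "cpu", 107892 };

	remove_run_secs(&cpu, 108000, 4968110);	/* logs and clamps to 0 */
	printf("remaining: %llu\n", (unsigned long long) cpu.grp_used_run_secs);
	return 0;
}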