Ticket 10836

Summary: RawUsage numbers suddenly impossibly high after upgrade
Product: Slurm
Reporter: Kaylea Nelson <kaylea.nelson>
Component: Accounting
Assignee: Albert Gil <albert.gil>
Status: RESOLVED DUPLICATE
Severity: 3 - Medium Impact
Priority: ---
CC: adam.munro
Version: 20.02.6
Hardware: Linux
OS: Linux
Site: Yale
Linux Distro: RHEL
Machine Name: Grace
Attachments: sshare output
current conf

Description Kaylea Nelson 2021-02-10 12:42:32 MST
Created attachment 17868 [details]
sshare output

We just updated our Grace cluster from 20.02.3 to 20.02.6 and are now seeing impossibly high RawUsage numbers and nan values for EffectvUsage in sshare (see attached).

During the update we also moved from cons_res to cons_tres (not sure if that is relevant, but it was one of the only configuration changes made).


We have also noticed a pattern: it appears that many (if not all) of the users with RawUsage=9223372036854775808 are the users who should have (i.e., used to have) RawUsage=0.
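A quick sanity check (an editorial sketch, not part of the original report) shows that this anomalous value is exactly 2^63, which is what the most negative signed 64-bit integer looks like when reinterpreted or printed as unsigned. That would be consistent with a usage counter that should be 0 going slightly negative and wrapping:

```python
import struct

# The anomalous RawUsage value from the report is exactly 2**63.
anomalous = 9223372036854775808
assert anomalous == 2**63

# Reinterpreting the most negative signed 64-bit integer (-2**63)
# as an unsigned 64-bit integer yields the same value.
as_unsigned = struct.unpack('<Q', struct.pack('<q', -(2**63)))[0]
assert as_unsigned == anomalous
```

Whether Slurm's internal usage accounting actually underflows this way is an assumption; the arithmetic above only shows the reported value matches that pattern.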

Our issue seems similar to this recently reported one, although we are on a different version: https://bugs.schedmd.com/show_bug.cgi?id=10824


Kaylea
Comment 1 Kaylea Nelson 2021-02-10 12:43:00 MST
Created attachment 17869 [details]
current conf
Comment 2 Kaylea Nelson 2021-02-10 12:51:37 MST
We also found that prior to a slurmctld and slurmdbd restart on 2/8, there were many errors similar to:

error: We have more time than is possible (634982400+582615985152+0)(583250967552) > 583027891200 for cluster grace(161952192) from 2021-02-02T22:00:00 - 2021-02-02T23:00:00 tres 2

error: We have more time than is possible (115200+62791336+0)(62906536) > 62802000 for cluster grace(17445) from 2021-02-03T13:00:00 - 2021-02-03T14:00:00 tres 5

The cluster was undergoing maintenance from 2/2-2/4, so there were no users on the system but Yale staff may have been running test jobs for some of that time.
Comment 3 Albert Gil 2021-02-12 02:35:23 MST
Hi Kaylea,

Yes, I'm already tracking your case in bug 10824; although you have a different version than Harvard and Princeton, the root error seems to be the same.

The fact that you also see those "more time than is possible" errors could be another clue, since Harvard also had them in bug 10753.

If this is OK with you, I'm closing this bug as a duplicate of bug 10824 to concentrate our investigation there.

If we eventually find that the problem is not shared between versions, I'll reopen this one.

Regards,
Albert
Comment 4 Albert Gil 2021-02-12 02:36:36 MST
Marking as duplicate of bug 10824.

*** This ticket has been marked as a duplicate of ticket 10824 ***