Ticket 10836

Summary: RawUsage numbers suddenly impossibly high after upgrade
Product: Slurm
Reporter: Kaylea Nelson <kaylea.nelson>
Component: Accounting
Assignee: Albert Gil <albert.gil>
Status: RESOLVED DUPLICATE
Severity: 3 - Medium Impact
Priority: ---
CC: adam.munro
Version: 20.02.6
Hardware: Linux
OS: Linux
Site: Yale
Linux Distro: RHEL
Machine Name: Grace
Attachments: sshare output
current conf

Description Kaylea Nelson 2021-02-10 12:42:32 MST
Created attachment 17868 [details]
sshare output

We just updated our Grace cluster from 20.02.3 to 20.02.6 and are now seeing impossibly high RawUsage numbers and nan values for EffectvUsage in sshare (see attached).

During the update we also moved from cons_res to cons_tres (not sure if that is relevant, but it was one of the only configuration changes made).


We have also noticed a pattern: it appears that many (if not all) of the users with RawUsage=9223372036854775808 are the users who should have (i.e., used to have) RawUsage=0.
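A quick sanity check (an editorial sketch, not part of the original report) shows that this anomalous value is exactly 2^63, which is what the most negative signed 64-bit integer looks like when reinterpreted or printed as unsigned. That would be consistent with a usage counter that should be 0 going slightly negative and wrapping:

```python
import struct

# The anomalous RawUsage value from the report is exactly 2**63.
anomalous = 9223372036854775808
assert anomalous == 2**63

# Reinterpreting the most negative signed 64-bit integer (-2**63)
# as an unsigned 64-bit integer yields the same value.
as_unsigned = struct.unpack('<Q', struct.pack('<q', -(2**63)))[0]
assert as_unsigned == anomalous
```

Whether Slurm's internal usage accounting actually underflows this way is an assumption; the arithmetic above only shows the reported value matches that pattern.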

Our issue seems similar to this recently reported one, although we are on a different version: https://bugs.schedmd.com/show_bug.cgi?id=10824


Kaylea
Comment 1 Kaylea Nelson 2021-02-10 12:43:00 MST
Created attachment 17869 [details]
current conf
Comment 2 Kaylea Nelson 2021-02-10 12:51:37 MST
We also found that prior to a slurmctld and slurmdbd restart on 2/8, there were many errors similar to:

error: We have more time than is possible (634982400+582615985152+0)(583250967552) > 583027891200 for cluster grace(161952192) from 2021-02-02T22:00:00 - 2021-02-02T23:00:00 tres 2

error: We have more time than is possible (115200+62791336+0)(62906536) > 62802000 for cluster grace(17445) from 2021-02-03T13:00:00 - 2021-02-03T14:00:00 tres 5

The cluster was undergoing maintenance from 2/2-2/4, so there were no users on the system but Yale staff may have been running test jobs for some of that time.
Comment 3 Albert Gil 2021-02-12 02:35:23 MST
Hi Kaylea,

Yes, I'm already tracking your case in bug 10824; although you have a different version than Harvard and Princeton, the root error seems to be the same.

The fact that you also see those "more time than is possible" errors could be another clue, since Harvard also had them in bug 10753.

If this is OK with you, I'm closing this bug as a duplicate of bug 10824 to concentrate our investigation there.

If we eventually find that the problem is not shared between versions, I'll reopen this one.

Regards,
Albert
Comment 4 Albert Gil 2021-02-12 02:36:36 MST
Marking as duplicate of bug 10824.

*** This ticket has been marked as a duplicate of ticket 10824 ***