Ticket 19310

Summary:	Jobs running during a power outage keep accumulating core*h in DB
Product:	Slurm	Reporter:	HHLR Admins <hhlr-admins>
Component:	Accounting	Assignee:	Jacob Jenson <jacob>
Status:	OPEN ---	QA Contact:
Severity:	6 - No support contract
Priority:	---
Version:	23.02.7
Hardware:	Linux
OS:	Linux
Site:	-Other-

Description HHLR Admins 2024-03-14 08:35:17 MDT

Dear SLURM team,

We had a short circuit in our power infrastructure and had to keep our cluster powered off for ~4 weeks of repairs, including both slurmctld instances.

After powering up, users noticed that jobs that were running when the power failed kept accumulating fictitious CPU runtime during the downtime, resulting in, e.g.:

sreport -t hours cluster userutilizationbyaccount account=p00xxxxx start=2024-02-09 end=2024-02-29
--------------------------------------------------------------------------------
Cluster/User/Account Utilization 2024-02-09T00:00:00 - 2024-02-28T23:59:59 (1728000 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account       Used   Energy
--------- --------- --------------- --------------- ---------- --------
 lcluster  userid                          p00xxxxx     399360        0
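
As a rough sanity check on the figure above (a sketch, not part of the sreport output): dividing the reported CPU hours by the length of the reporting window gives the average number of cores that would have had to be busy for the whole window. The variable names are illustrative; the two numbers are copied from the report.

```python
# Values copied from the sreport output above.
used_cpu_hours = 399360
window_secs = 1728000  # 2024-02-09T00:00:00 to 2024-02-28T23:59:59

window_hours = window_secs / 3600
avg_busy_cores = used_cpu_hours / window_hours

print(f"window: {window_hours:.0f} h, average busy cores: {avg_busy_cores:.0f}")
# prints: window: 480 h, average busy cores: 832
```

In other words, the report charges this account as if 832 cores had been running non-stop for the full 20 days, although the cluster was powered off.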

We measure energy consumption via AcctGatherEnergyType=acct_gather_energy/rapl, so the report correctly shows zero kWh, but incorrectly shows a large number of CPU*h (not that we would object to the infinite efficiency, though).

Job-end mails also wrongly report a runtime of 28 days (our maximum allowed runtime is 7 days).
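
A minimal sketch of how the affected jobs could be listed, assuming the accounting data is exported with something like `sacct -a -X -n -P -S <start> -E <end> -o JobID,Elapsed,Timelimit` (the time window is left open here and the helper names below are hypothetical). It flags every job whose recorded elapsed time exceeds its time limit. This is only a heuristic: jobs can legitimately run past their limit, for example when OverTimeLimit is configured.

```python
import re

# Matches sacct-style durations: [days-][hours:]minutes:seconds
_DURATION = re.compile(r"^(?:(\d+)-)?(?:(\d+):)?(\d+):(\d+)$")


def slurm_time_to_seconds(text):
    """Return the duration in seconds, or None for values such as UNLIMITED."""
    match = _DURATION.match(text.strip())
    if match is None:
        return None
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def jobs_over_limit(sacct_text):
    """Return (jobid, elapsed_secs, limit_secs) for jobs with elapsed > limit."""
    flagged = []
    for line in sacct_text.splitlines():
        if not line.strip():
            continue
        jobid, elapsed, limit = line.split("|")[:3]
        elapsed_secs = slurm_time_to_seconds(elapsed)
        limit_secs = slurm_time_to_seconds(limit)
        if elapsed_secs is None or limit_secs is None:
            continue  # no numeric limit to compare against
        if elapsed_secs > limit_secs:
            flagged.append((jobid, elapsed_secs, limit_secs))
    return flagged


# Made-up sample rows for illustration only, not real accounting data.
SAMPLE = """\
1001|28-00:00:00|7-00:00:00
1002|02:15:00|1-00:00:00
1003|10:00|UNLIMITED
"""

for jobid, elapsed_secs, limit_secs in jobs_over_limit(SAMPLE):
    print(jobid, elapsed_secs // 86400, "days elapsed vs", limit_secs // 86400, "days allowed")
```

In the sample, only the first row is flagged, which mirrors the 28-day runtime against a 7-day limit described above.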

We need to correct this, but since you (legitimately!) discourage fiddling with the active accounting database: is there a way to reduce all accounts' (and users') CPU minutes for the downtime to zero?

I tried to reset the rollup to before the downtime (as described in https://bugs.schedmd.com/show_bug.cgi?id=1706), but that didn't help.


Thanks in advance!

Comment 1 HHLR Admins 2024-03-14 08:37:06 MDT

... forgot to mention that there are no "sacctmgr show runaway" jobs.