19310 – Power Outage keeps running jobs accumulate core*h in DB

Status:	OPEN

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Accounting
Version:	23.02.7
Hardware:	Linux Linux

Severity:	6 - No support contract
Assignee:	Jacob Jenson
QA Contact:

URL:

Depends on:
Blocks:

Reported:	2024-03-14 08:35 MDT by HHLR Admins
Modified:	2024-03-14 08:37 MDT
CC List:	0 users

See Also:
Site:	-Other-

Description HHLR Admins 2024-03-14 08:35:17 MDT

Dear SLURM team,

We had a short circuit in our power infrastructure and had to keep the cluster powered off for ~4 weeks during repairs, including both slurmctld instances.

After powering back up, users noticed that jobs that were running when the power failed kept accumulating fictitious CPU runtime during the downtime, resulting in, e.g.:

sreport -t hours cluster userutilizationbyaccount account=p00xxxxx start=2024-02-09 end=2024-02-29
--------------------------------------------------------------------------------
Cluster/User/Account Utilization 2024-02-09T00:00:00 - 2024-02-28T23:59:59 (1728000 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account       Used   Energy
--------- --------- --------------- --------------- ---------- --------
 lcluster  userid                          p00xxxxx     399360        0
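
As a rough sanity check on the figure above (a sketch only, not part of the sreport output): the reported window is 1728000 seconds, i.e. 480 hours, so 399360 CPU hours corresponds to 832 cores being charged continuously for the entire window. The helper name `implied_cores` is made up for illustration.

```python
def implied_cores(cpu_hours: float, window_secs: int) -> float:
    """Average number of cores charged over the whole reporting window."""
    window_hours = window_secs / 3600  # 1728000 s -> 480 h
    return cpu_hours / window_hours


# Values copied from the sreport output above.
cores = implied_cores(cpu_hours=399360, window_secs=1728000)
print(cores)
```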

We measure energy consumption via AcctGatherEnergyType=acct_gather_energy/rapl, so it correctly counted zero kWh but incorrectly charged a large number of CPU*h (we wouldn't reject the infinite efficiency, though).

Job end mails also wrongly indicate a runtime of 28 days (we only allow 7 days).
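
To illustrate how the affected jobs could be identified read-only, without touching the database, here is a minimal sketch. It assumes pipe-separated accounting records in the order JobID|ElapsedRaw|AllocCPUS (e.g. from sacct's parsable output with those format fields); the job IDs and values in `sample` are invented, and `over_limit` is a hypothetical helper. Since no job can legitimately run longer than 7 days, the elapsed time beyond that limit is a lower bound on the fictitious time charged to a job.

```python
MAX_WALLTIME_SECS = 7 * 24 * 3600  # site limit of 7 days, as stated above


def over_limit(records: str, limit_secs: int = MAX_WALLTIME_SECS):
    """Return (jobid, excess_core_hours) for jobs whose elapsed time exceeds the limit."""
    flagged = []
    for line in records.strip().splitlines():
        jobid, elapsed_raw, alloc_cpus = line.split("|")
        excess_secs = int(elapsed_raw) - limit_secs
        if excess_secs > 0:
            # Core hours charged beyond the maximum permitted walltime.
            flagged.append((jobid, excess_secs * int(alloc_cpus) / 3600))
    return flagged


# Invented sample: one job "ran" 28 days on 96 cores, one ran 1 day on 48 cores.
sample = """\
1000001|2419200|96
1000002|86400|48
"""
print(over_limit(sample))
```

Only the first sample job is flagged; the second stays within the 7-day limit.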

We need to correct this, but since you (legitimately!) discourage fiddling with the active accounting database: is there a way to reduce all accounts' (and users') CPU minutes for the downtime to zero?

I tried to reset the rollup to before the downtime (as described in https://bugs.schedmd.com/show_bug.cgi?id=1706), but that didn't help.


Thanks in advance!

Comment 1 HHLR Admins 2024-03-14 08:37:06 MDT

... forgot to mention that there are no "sacctmgr show runaway" jobs.