Ticket 19310 - Jobs running at a power outage keep accumulating core*h in DB
Summary: Jobs running at a power outage keep accumulating core*h in DB
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 23.02.7
Hardware: Linux Linux
: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2024-03-14 08:35 MDT by HHLR Admins
Modified: 2024-03-14 08:37 MDT (History)

Description HHLR Admins 2024-03-14 08:35:17 MDT
Dear SLURM team,

We had a short circuit in our power infrastructure and had to keep the cluster powered off for ~4 weeks of repairs, including both slurmctld hosts.

After powering up, users noticed that jobs that were running when the power failed kept accumulating fictitious CPU runtime during the downtime, resulting in e.g.

sreport -t hours cluster userutilizationbyaccount account=p00xxxxx start=2024-02-09 end=2024-02-29
--------------------------------------------------------------------------------
Cluster/User/Account Utilization 2024-02-09T00:00:00 - 2024-02-28T23:59:59 (1728000 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account       Used   Energy
--------- --------- --------------- --------------- ---------- --------
 lcluster  userid                          p00xxxxx     399360        0
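
As a quick sanity check on the figures above, the following minimal sketch converts the reported usage into an average number of CPUs charged over the reporting window. It uses only the two numbers printed by sreport; the variable names are illustrative.

```python
# Figures copied from the sreport output above.
window_secs = 1_728_000      # length of the reporting window in seconds
used_cpu_hours = 399_360     # "Used" column for account p00xxxxx

window_hours = window_secs / 3600
avg_busy_cpus = used_cpu_hours / window_hours

print(f"window: {window_hours:.0f} h ({window_hours / 24:.0f} days)")
print(f"average CPUs charged over the whole window: {avg_busy_cpus:.0f}")
```

In other words, the account is charged as if 832 CPUs had been allocated around the clock for the entire 20-day window.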

We measure energy consumption via AcctGatherEnergyType=acct_gather_energy/rapl, so the energy was correctly counted as zero kWh, but a large number of CPU*h was incorrectly booked (not that we would reject the infinite efficiency, though).

Job end mails also wrongly indicate a runtime of 28 days (we only allow 7 days).

We need to correct this, but since you (legitimately!) discourage fiddling with the active accounting database: is there a way to reduce all accounts' (and users') CPU minutes for the downtime to zero?
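
To illustrate how much would need to be deducted per job, here is a rough sketch that estimates the CPU hours a job was charged for inside the downtime window. It only computes an estimate and does not touch the accounting database. The function name `overlap_cpu_hours`, the downtime dates, and the job values are hypothetical; real inputs could be taken, for example, from sacct's Start, End and AllocCPUS fields.

```python
from datetime import datetime


def overlap_cpu_hours(job_start, job_end, alloc_cpus, down_start, down_end):
    """Estimate the CPU hours a job was charged for during the downtime."""
    start = max(job_start, down_start)
    end = min(job_end, down_end)
    # A job that does not overlap the downtime contributes nothing.
    overlap_secs = max((end - start).total_seconds(), 0.0)
    return alloc_cpus * overlap_secs / 3600


# Hypothetical values, for illustration only.
down_start = datetime(2024, 2, 9, 0, 0)    # assumed start of the downtime
down_end = datetime(2024, 2, 29, 0, 0)     # assumed end of the downtime
job_start = datetime(2024, 2, 8, 12, 0)    # job started before the outage
job_end = datetime(2024, 3, 7, 12, 0)      # end time as recorded afterwards

fictitious = overlap_cpu_hours(job_start, job_end, 96, down_start, down_end)
print(f"fictitious CPU hours for this job: {fictitious:.0f}")
```

Summing this value over all affected jobs of an account would give the amount by which its reported usage is inflated, under the assumed downtime boundaries.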

I tried to reset the rollup to before the downtime (as described in https://bugs.schedmd.com/show_bug.cgi?id=1706), but that didn't help.


Thanks in advance!
Comment 1 HHLR Admins 2024-03-14 08:37:06 MDT
... forgot to mention that there are no "sacctmgr show runaway" jobs.