Ticket 19310

Summary: Power outage: running jobs keep accumulating core*h in DB
Product: Slurm Reporter: HHLR Admins <hhlr-admins>
Component: Accounting Assignee: Jacob Jenson <jacob>
Status: OPEN --- QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 23.02.7   
Hardware: Linux   
OS: Linux   
Site: -Other-

Description HHLR Admins 2024-03-14 08:35:17 MDT
Dear SLURM team,

We had a short circuit in our power infrastructure and had to keep our cluster powered off for ~4 weeks of repairs, including both slurmctld instances.

After powering up, users noticed that jobs that were running when the power failed kept accumulating fictitious CPU runtime during the downtime, resulting in, e.g.:

sreport -t hours cluster userutilizationbyaccount account=p00xxxxx start=2024-02-09 end=2024-02-29
--------------------------------------------------------------------------------
Cluster/User/Account Utilization 2024-02-09T00:00:00 - 2024-02-28T23:59:59 (1728000 secs)
Usage reported in CPU Hours
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account       Used   Energy
--------- --------- --------------- --------------- ---------- --------
 lcluster  userid                          p00xxxxx     399360        0
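
As a rough sanity check on the figure above, the reported usage can be converted into an implied average core count. This is only a sketch of the arithmetic: the two numbers are copied from the sreport output, while the variable names are illustrative and not part of any Slurm tool.

```python
# Numbers copied from the sreport output above.
window_secs = 1_728_000      # length of the reporting window in seconds
used_cpu_hours = 399_360     # "Used" column for account p00xxxxx

window_hours = window_secs / 3600          # reporting window in hours
avg_cores = used_cpu_hours / window_hours  # cores that would have to be busy nonstop

print(f"window: {window_hours:.0f} h, implied average: {avg_cores:.0f} cores")
```

In other words, the account is charged as if several hundred cores had been busy around the clock for the whole window, during which, as far as the downtime overlaps it, no job could actually run.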

We measure energy consumption via AcctGatherEnergyType=acct_gather_energy/rapl, so Slurm correctly counted zero kWh but incorrectly recorded a large number of CPU*h (not that we would object to infinite efficiency).

Job-end mails also wrongly report a runtime of 28 days (our maximum allowed runtime is 7 days).

We need to correct this, but since you (legitimately!) discourage tampering with the live accounting database: is there a way to reduce all accounts' (and users') CPU minutes for the downtime to zero?
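
To make the requested correction concrete: it amounts to subtracting, per job, the part of the recorded runtime that overlaps the downtime. The following is a minimal sketch of that arithmetic only. It does not touch the Slurm database, the function names are hypothetical, and the job times, CPU count and downtime window are made-up values, not our real ones.

```python
from datetime import datetime


def overlap_seconds(start, end, down_start, down_end):
    """Seconds of [start, end] that fall inside [down_start, down_end]."""
    lo = max(start, down_start)
    hi = min(end, down_end)
    return max((hi - lo).total_seconds(), 0.0)


def corrected_cpu_hours(start, end, cpus, down_start, down_end):
    """CPU hours of a job with the downtime overlap removed."""
    elapsed = (end - start).total_seconds()
    real = elapsed - overlap_seconds(start, end, down_start, down_end)
    return cpus * real / 3600


# Hypothetical downtime window and job record (illustrative values only).
down_start = datetime(2024, 2, 9, 0, 0)
down_end = datetime(2024, 3, 7, 0, 0)
job_start = datetime(2024, 2, 8, 12, 0)
job_end = datetime(2024, 3, 7, 12, 0)   # end time as recorded after power-up
cpus = 96

recorded = cpus * (job_end - job_start).total_seconds() / 3600
corrected = corrected_cpu_hours(job_start, job_end, cpus, down_start, down_end)
print(f"recorded: {recorded:.0f} CPU*h, corrected: {corrected:.0f} CPU*h")
```

For this made-up job, only the hours before and after the downtime window remain billable; what we are asking is whether the accounting data can be brought to that state without editing the database by hand.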

I tried to reset the rollup to before the downtime (as described in https://bugs.schedmd.com/show_bug.cgi?id=1706), but that didn't help.


Thanks in advance!
Comment 1 HHLR Admins 2024-03-14 08:37:06 MDT
I forgot to mention that "sacctmgr show runaway" reports no runaway jobs.