Dear SLURM team,

We had a short circuit in our power infrastructure and had to keep our cluster off for ~4 weeks of repair, including both slurmctlds.

After powering up, users noticed that jobs that were running when the power failed kept accumulating fictitious CPU runtime during the downtime, resulting in, e.g.:

    sreport -t hours cluster userutilizationbyaccount account=p00xxxxx start=2024-02-09 end=2024-02-29

    --------------------------------------------------------------------------------
    Cluster/User/Account Utilization 2024-02-09T00:00:00 - 2024-02-28T23:59:59 (1728000 secs)
    Usage reported in CPU Hours
    --------------------------------------------------------------------------------
      Cluster     Login     Proper Name         Account       Used   Energy
    --------- --------- --------------- --------------- ---------- --------
     lcluster    userid                        p00xxxxx     399360        0

We measure energy consumption via AcctGatherEnergyType=acct_gather_energy/rapl, so it correctly counted zero kWh, but incorrectly a large number of CPU hours (not that we would reject the infinite efficiency, though). Job end mails also wrongly indicate a runtime of 28 days (we only allow 7 days).

We need to correct this, but since you (legitimately!) discourage fiddling with the active accounting database: is there a way to reduce all accounts' (and users') CPU minutes for the downtime to zero?

I tried to reset the rollup to a point before the downtime (as described in https://bugs.schedmd.com/show_bug.cgi?id=1706), but that didn't help.

Thanks in advance!
P.S. I forgot to mention that "sacctmgr show runawayjobs" reports no runaway jobs.
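To put the reported figure into perspective, here is a rough sanity check that uses only the two numbers from the sreport output above (399360 CPU hours, 1728000 seconds). It assumes, purely for illustration, that the charge was spread evenly over the reporting window; the variable names are ours and not part of any Slurm tool.

```python
# Figures copied from the sreport output above.
reported_cpu_hours = 399360
window_seconds = 1728000  # 2024-02-09T00:00:00 to 2024-02-28T23:59:59

# Length of the reporting window in hours.
window_hours = window_seconds / 3600

# Average number of CPUs that would have to be busy for the whole
# window to produce the reported usage (assumes an even spread).
avg_cpus = reported_cpu_hours / window_hours

print(f"window: {window_hours:.0f} h, average CPUs charged: {avg_cpus:.0f}")
```

Under that assumption, the account was charged as if several hundred CPUs had been busy around the clock for the entire window, although the cluster was powered off.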