Ticket 14771

Summary: Deleting a nonexistent job
Product: Slurm Reporter: DRW GridOps <gridadm>
Component: slurmdbd Assignee: Jason Booth <jbooth>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: DRW Trading Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description DRW GridOps 2022-08-17 08:06:18 MDT
We have some jobs that show up as RUNNING in sacct but are actually long gone; presumably a crash at some point caused the accounting system to lose track of them. They cause no real harm other than producing false data in 'sacct', 'sreport cluster utilization', etc., but that in itself is a problem: it skews our billback model, resulting in mighty outrage amongst our users.

Running scancel on these jobs is ineffective:

root@sc06-08:48:24-/srv/slurm/etc# sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
33877821     backfill_+     shared        drw          1    RUNNING      0:0
33877821.ba+      batch                   drw          1    RUNNING      0:0
34942389     spot_px_c+       ficc        drw          1    RUNNING      0:0
34942389.ba+      batch                   drw          1    RUNNING      0:0
root@sc06-08:48:29-/srv/slurm/etc# scancel -v 34942389 33877821
scancel: Terminating job 34942389
scancel: Terminating job 33877821
scancel: error: Kill job error on job id 33877821: Invalid job id specified
scancel: error: Kill job error on job id 34942389: Invalid job id specified


The obvious thing to do is to just delete the appropriate rows from <clustername>_job_table in the accounting database, but if cascading deletes are not set up properly this could leave broken relations in the database, so I am reluctant to simply try it.

What is the best and safest way to clear a job that is ghosting in the accounting db?

NOTE: This is on a very old version of Slurm (v19) and we have not (yet!) seen this behavior on clusters running v21+.  We are not looking for a fix, just a band-aid to help us muddle through until we get this old cluster decommed.
Comment 1 Jason Booth 2022-08-17 10:53:47 MDT
What does the following command show for you? If it asks to fix runaways, please select yes.

> $ sacctmgr show runaway


https://slurm.schedmd.com/sacctmgr.html#OPT_RunawayJobs

This command sets each runaway job's end time to its start time, and the change should be reflected in reporting after the next roll-up.
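To make the mechanics concrete, here is a toy sketch (not actual slurmdbd code; all names are hypothetical) of the fix described above: a "runaway" is a job record still marked as running in the accounting database but unknown to the controller, and the fix sets its end time equal to its start time so it contributes zero elapsed time to utilization on the next roll-up.

```python
from dataclasses import dataclass

@dataclass
class JobRecord:
    """Hypothetical stand-in for a row in <clustername>_job_table."""
    job_id: int
    start: int  # epoch seconds
    end: int    # 0 means "still running" in the accounting DB

def find_runaways(jobs, live_job_ids):
    """Jobs the accounting DB thinks are running but the controller doesn't know."""
    return [j for j in jobs if j.end == 0 and j.job_id not in live_job_ids]

def fix_runaways(jobs, live_job_ids):
    """Set end = start for each runaway, mirroring the sacctmgr fix:
    elapsed time becomes zero, so the job drops out of utilization
    reports after the next roll-up."""
    runaways = find_runaways(jobs, live_job_ids)
    for job in runaways:
        job.end = job.start
    return runaways
```

With the two ghost job IDs from the sacct output above and an empty set of live jobs, both records would be selected and zeroed out, while a properly completed job would be left alone.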
Comment 2 DRW GridOps 2022-08-19 10:21:42 MDT
Ah, thanks!  That rings a bell now.
Comment 3 Jason Booth 2022-08-19 10:25:53 MDT
Resolving. Please feel free to re-open if you have further questions regarding this issue.