| Summary: | Deleting a nonexistent job | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | DRW GridOps <gridadm> |
| Component: | slurmdbd | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | DRW Trading | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
What does the following command show for you? If it asks to fix runaways, please select yes.

```
$ sacctmgr show runaway
```

https://slurm.schedmd.com/sacctmgr.html#OPT_RunawayJobs

This command will set the end time of each runaway job to its start time, and the change should be reflected in reporting after the next roll-up.

---

Ah, thanks! That rings a bell now.

---

Resolving. Please feel free to re-open if you have further questions regarding this issue.
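The reason setting the end time to the start time cleans up reporting is that a job's billed elapsed time is simply end minus start, so a fixed-up runaway contributes zero core-seconds at the next roll-up. A trivial sketch of that arithmetic (the `elapsed_seconds` helper is made up for illustration, not a Slurm tool):

```shell
# Illustration only: why the runaway fixup zeroes a ghost job's
# contribution to utilization. elapsed = time_end - time_start.
elapsed_seconds() {
  # $1 = time_start, $2 = time_end (Unix epoch seconds)
  echo $(( $2 - $1 ))
}

start=1700000000
elapsed_seconds "$start" "$start"   # end == start after fixup, prints 0
```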
We have some jobs that show up as RUNNING in sacct but are actually long gone; presumably some crash in the past caused the accounting system to lose track of them. They cause no real problems beyond making 'sacct', 'sreport cluster utilization', etc. report inaccurate data, but that in itself is a problem, as it impacts our billback model, resulting in mighty outrage amongst our users.

Running scancel against these jobs is ineffective:

```
root@sc06-08:48:24-/srv/slurm/etc# sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
33877821     backfill_+     shared        drw          1    RUNNING      0:0
33877821.ba+      batch                   drw          1    RUNNING      0:0
34942389     spot_px_c+       ficc        drw          1    RUNNING      0:0
34942389.ba+      batch                   drw          1    RUNNING      0:0
root@sc06-08:48:29-/srv/slurm/etc# scancel -v 34942389 33877821
scancel: Terminating job 34942389
scancel: Terminating job 33877821
scancel: error: Kill job error on job id 33877821: Invalid job id specified
scancel: error: Kill job error on job id 34942389: Invalid job id specified
```

The obvious thing to do is just delete the appropriate rows from <clustername>_job_table in the accounting database, but if cascading is not set up properly this could leave broken relations behind, so I am reluctant to just try it. What is the best and safest way to clear a job that is ghosting in the accounting db?

NOTE: This is on a very old version of Slurm (v19) and we have not (yet!) seen this behavior on clusters running v21+. We are not looking for a fix, just a band-aid to help us muddle through until we get this old cluster decommed.
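One way to enumerate candidate ghost jobs before fixing them up is to diff what accounting believes is RUNNING against what the controller actually knows about. This is only a sketch: the `find_ghosts` helper is hypothetical, and in practice its two inputs would come from something like `sacct -nX -s R -o jobid` and `squeue -h -o %A`; it is written here as a pure text filter so it can be exercised without a live cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Slurm tool): print job IDs that appear in
# the first list (accounting says RUNNING) but not in the second (jobs
# the controller still knows about). Those are runaway-fixup candidates.
find_ghosts() {
  # $1: file of job IDs sacct reports as RUNNING
  # $2: file of job IDs squeue currently shows
  comm -23 <(sort -u "$1") <(sort -u "$2")
}

# Using the IDs from this report: if the controller only knew about
# 34942389, then 33877821 would be flagged as a ghost.
find_ghosts <(printf '33877821\n34942389\n') <(printf '34942389\n')
```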