Ticket 5934 - Jobs that have completed show up as runaway jobs? When fixed no time is recorded for the job?
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 18.08.1
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-10-26 08:45 MDT by Steve
Modified: 2018-11-14 17:07 MST

See Also:
Site: NOAA
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ESRL
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name: ntest
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Steve 2018-10-26 08:45:31 MDT
I submitted a small parallel batch job yesterday.  The job e-mails me when completed.  Here is the e-mail subject line.

Slurm Job_id=713 Name=Parallel_small Ended, Run time 00:00:52, COMPLETED, ExitCode 0

But when I do a sacctmgr show runawayjobs:
NOTE: Runaway jobs are jobs that don't exist in the controller but are still considered pending, running or suspended in the database
          ID       Name  Partition    Cluster      State           TimeStart             TimeEnd 
------------ ---------- ---------- ---------- ---------- ------------------- ------------------- 
711          Parallel_+       ujet       njet    PENDING             Unknown             Unknown 
705          Parallel_+      ntest       njet    RUNNING 2018-10-25T21:49:49             Unknown 
706          Parallel_+      ntest       njet    RUNNING 2018-10-25T22:01:25             Unknown 
707          Parallel_+      ntest       njet    RUNNING 2018-10-25T22:03:52             Unknown 
708          Parallel_+      ntest       njet    RUNNING 2018-10-25T22:09:06             Unknown 
709          Parallel_+       ujet       njet    RUNNING 2018-10-25T22:09:09             Unknown 
710          Parallel_+       ujet       njet    RUNNING 2018-10-25T22:19:24             Unknown 
712          Parallel_+       njet       njet    RUNNING 2018-10-25T23:17:58             Unknown 
713          Parallel_+       njet       njet    RUNNING 2018-10-25T23:21:42             Unknown

So the job is thought to be still running?  I have stopped and restarted slurmdbd and slurmctld on their respective servers.  I don't see the job 713 mentioned in any logs on the server hosting slurmdbd.  

In the slurmSched.log on the host running slurmctld, I see the following:
slurmSched.log:sched: [2018-10-25T23:21:42.026] JobId=713 allocated resources: NodeList=(null)
slurmSched.log:sched: [2018-10-25T23:21:42.921] JobId=713 initiated
slurmSched.log:sched: [2018-10-25T23:21:42.921] Allocate JobId=713 NodeList=n[393-394] #CPUs=16 Partition=njet
Comment 1 Marshall Garey 2018-10-26 13:51:33 MDT
Hi Steve,

If the slurmdbd was down or unreachable at any point, it's quite possible you were hit by bug 5875, fixed in commit ea71e10d3ac. I recommend applying that patch locally right away, or upgrading to 18.08.3 ASAP. There was a regression introduced in 18.08.1 and 17.11.10 that can cause a loss of accounting records, possibly including these job completion records. Until you apply the patch or upgrade, avoid any slurmdbd downtime.

https://github.com/schedmd/slurm/commit/ea71e10d3ac

To fix these jobs, you have a couple of options:

(1) Let sacctmgr show runawayjobs fix those accounting records. It will set the state to completed and set the end time to the start time (0 usage). This means that the usage will be incorrect, but the jobs won't be runaway (and thus accruing usage) anymore.

(2) You can use those emails to see the run time of the jobs, and manually modify the job records in the database:
- ***Backup the database before making any changes by hand.***
- Change the state to 3 (completed)
- Change the time_end to time_start plus the run time reported by the email.

I recommend option (1) if you don't mind incorrect usage times for those jobs. Option (2) will give you the correct usage of all those jobs, but of course requires more work on your part, including manually changing the database.

If you decide to do option (2), I can provide mysql queries to fix the jobs.
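As an illustration only, the time_end arithmetic in option (2) can be sketched in Python. This is not the actual fix (that would be a MySQL UPDATE against the job records, which Marshall offers to provide); the function name `corrected_end` is hypothetical, and the sample values are job 713's start time from the runaway list and the run time from the email subject line:

```python
from datetime import datetime, timedelta

def corrected_end(start_str, runtime_str):
    """Compute time_end = time_start + run time reported in the email.

    start_str:   job start time, e.g. "2018-10-25T23:21:42"
    runtime_str: run time from the email subject, e.g. "00:00:52"
    """
    start = datetime.strptime(start_str, "%Y-%m-%dT%H:%M:%S")
    h, m, s = (int(x) for x in runtime_str.split(":"))
    return start + timedelta(hours=h, minutes=m, seconds=s)

# Job 713: started 2018-10-25T23:21:42, email reported "Run time 00:00:52"
print(corrected_end("2018-10-25T23:21:42", "00:00:52"))  # 2018-10-25 23:22:34
```

The resulting value, converted to a Unix epoch, is what would go into the time_end column, alongside setting the state to 3 (completed).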

Can you let me know which option you decide to use?

- Marshall
Comment 2 Steve 2018-10-26 14:05:10 MDT
Hi Marshall,
I will go ahead and upgrade to 18.08.3...
thanks,

Steve
Comment 3 Marshall Garey 2018-10-26 14:13:44 MDT
Sounds good. Upgrading won't fix those runaway jobs, so you'll still need to use sacctmgr or manually modify the database for that.
Comment 4 Steve 2018-10-26 14:15:20 MDT
I assume you mean it will stop new jobs from getting into that state, but
the existing jobs will have to be "fixed"?
Comment 5 Marshall Garey 2018-10-26 14:20:25 MDT
(In reply to Steve from comment #4)
> I assume you mean it will stop new jobs from getting into that state, but
> the existing jobs will have to be "fixed"?

Yes, that's correct.
Comment 6 Steve 2018-10-26 14:37:09 MDT
I see this over and over again in my slurmdbd log:

slurmdbd: debug2: DBD_MODIFY_RESV: called
slurmdbd: error: There is no reservation by id 317, time_start 1539273600, and cluster 'njet'

Can I clean that up?
Comment 7 Marshall Garey 2018-10-26 14:42:43 MDT
Yes, you definitely hit the bug I was talking about. You should upgrade to 18.08.3 ASAP.

Bug 2741 comment 11 describes steps to fix these reservation records manually in the database. You'll need to follow those directions, since we haven't yet included a patch that fixes this situation automatically, though we're working on one. Take care of this ASAP so you don't begin to lose more accounting records - keep an eye on the DBD agent queue size at the top of the sdiag output.
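A minimal sketch of watching that number, assuming the sdiag output contains a line like "Agent queue size: N" near the top (the exact label may vary by Slurm version, so treat the regex as an assumption; `dbd_agent_queue_size` is a hypothetical helper):

```python
import re

def dbd_agent_queue_size(sdiag_output):
    """Extract the DBD agent queue size from sdiag text output (assumed format)."""
    m = re.search(r"[Aa]gent queue size:\s*(\d+)", sdiag_output)
    return int(m.group(1)) if m else None

# In practice the input would come from running sdiag, e.g. via
# subprocess.run(["sdiag"], capture_output=True, text=True).stdout
sample = "Agent queue size: 12\n"  # hypothetical excerpt of sdiag output
print(dbd_agent_queue_size(sample))  # 12
```

A queue size that grows steadily instead of draining back toward zero is the warning sign that records are backing up and at risk of being lost.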
Comment 8 Marshall Garey 2018-10-29 10:08:58 MDT
Were you able to resolve the reservation issue? Were you able to upgrade to 18.08.3?
Comment 9 Marshall Garey 2018-11-05 09:43:09 MST
Just following up on comment 8.
Comment 10 Marshall Garey 2018-11-14 17:07:06 MST
I'm closing this ticket as infogiven. Please re-open it if you have additional questions.

If you haven't done so, you should upgrade to 18.08.3 ASAP.