I submitted a small parallel batch job yesterday. The job e-mails me when it completes; here is the e-mail subject line:

Slurm Job_id=713 Name=Parallel_small Ended, Run time 00:00:52, COMPLETED, ExitCode 0

But when I do a sacctmgr show runawayjobs:

NOTE: Runaway jobs are jobs that don't exist in the controller but are still considered pending, running or suspended in the database

ID           Name       Partition  Cluster    State      TimeStart           TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- -------------------
711          Parallel_+ ujet       njet       PENDING    Unknown             Unknown
705          Parallel_+ ntest      njet       RUNNING    2018-10-25T21:49:49 Unknown
706          Parallel_+ ntest      njet       RUNNING    2018-10-25T22:01:25 Unknown
707          Parallel_+ ntest      njet       RUNNING    2018-10-25T22:03:52 Unknown
708          Parallel_+ ntest      njet       RUNNING    2018-10-25T22:09:06 Unknown
709          Parallel_+ ujet       njet       RUNNING    2018-10-25T22:09:09 Unknown
710          Parallel_+ ujet       njet       RUNNING    2018-10-25T22:19:24 Unknown
712          Parallel_+ njet       njet       RUNNING    2018-10-25T23:17:58 Unknown
713          Parallel_+ njet       njet       RUNNING    2018-10-25T23:21:42 Unknown

So the job is thought to be still running? I have stopped and restarted slurmdbd and slurmctld on their respective servers. I don't see job 713 mentioned in any logs on the server hosting slurmdbd. In slurmSched.log on the host running slurmctld, I see the following:

slurmSched.log:sched: [2018-10-25T23:21:42.026] JobId=713 allocated resources: NodeList=(null)
slurmSched.log:sched: [2018-10-25T23:21:42.921] JobId=713 initiated
slurmSched.log:sched: [2018-10-25T23:21:42.921] Allocate JobId=713 NodeList=n[393-394] #CPUs=16 Partition=njet
Hi Steve,

If the slurmdbd was down at all or couldn't be communicated with, it's quite possible you were hit by bug 5875, fixed in commit ea71e10d3ac. I recommend applying that patch locally right away, or upgrading to 18.08.3 ASAP. A regression introduced in 18.08.1 and 17.11.10 can cause a loss of accounting records, possibly including these job completion records. Until you apply the patch locally or upgrade, avoid any slurmdbd down time.

https://github.com/schedmd/slurm/commit/ea71e10d3ac

To fix these jobs, you have a couple of options:

(1) Let sacctmgr show runawayjobs fix those accounting records. It will set the state to completed and set the end time to the start time (zero usage). This means that the usage will be incorrect, but the jobs won't be runaway (and thus accruing usage) anymore.

(2) Use those emails to find the run time of each job, and manually modify the job records in the database:
- ***Back up the database before making any changes by hand.***
- Change the state to 3 (completed).
- Change the time_end to time_start plus the run time reported by the email.

I recommend option (1) if you don't mind incorrect usage for those jobs. Option (2) will give you the correct usage for all of those jobs, but of course requires more work on your part, including manually changing the database. If you decide on option (2), I can provide mysql queries to fix the jobs.

Can you let me know which option you decide to use?

- Marshall
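For option (2), the arithmetic behind the time_end correction is simple. The following is a minimal sketch in plain Python (not an official Slurm tool) that turns the "Run time HH:MM:SS" from the completion email into the value to store; the epoch value used below is an illustrative placeholder, not taken from the actual database:

```python
def corrected_time_end(time_start, run_time):
    """Return time_end = time_start + run time.

    time_start: Unix epoch, as stored in the job table.
    run_time: the 'Run time' string from the completion email, HH:MM:SS.
    """
    h, m, s = (int(part) for part in run_time.split(":"))
    return time_start + h * 3600 + m * 60 + s

# Job 713's email reported 'Run time 00:00:52', so its time_end should
# land 52 seconds after its recorded time_start (placeholder epoch here).
time_start = 1540509702  # illustrative value only
print(corrected_time_end(time_start, "00:00:52") - time_start)  # → 52
```

The resulting number is what an UPDATE on the job record's time_end column would need to set.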
Hi Marshall,

I will go ahead and upgrade to 18.08.3... thanks,

Steve
Sounds good. Upgrading won't fix those runaway jobs, so you'll still need to use sacctmgr or manually modify the database for that.
I assume you mean it will stop new jobs from getting into that state, but the existing jobs will have to be "fixed"?
(In reply to Steve from comment #4)
> I assume you mean it will stop new jobs from getting into that state, but
> the existing jobs will have to be "fixed"?

Yes, that's correct.
I see this over and over again in my system:

slurmdbd: debug2: DBD_MODIFY_RESV: called
slurmdbd: error: There is no reservation by id 317, time_start 1539273600, and cluster 'njet'

Can I clean that up?
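As a side note, the time_start in that error is a Unix epoch, so a quick conversion makes it readable and easier to match against old reservation records (a generic Python one-liner, nothing Slurm-specific):

```python
from datetime import datetime, timezone

# time_start from the slurmdbd error, converted to a UTC timestamp.
ts = 1539273600
print(datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
# → 2018-10-11T16:00:00+00:00
```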
Yes, you definitely hit the bug I was talking about, so you should upgrade to 18.08.3 ASAP.

Bug 2741 comment 11 describes steps to fix these records manually in the database. You'll need to follow those directions, since we haven't yet included a patch that automatically fixes these situations (we're working on that). Take care of this ASAP so you don't begin to lose more accounting records, and keep an eye on the DBD agent queue size at the top of the sdiag output.
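If you want to monitor that value from a script, here is a small sketch; the label "DBD Agent queue size" is assumed and the exact wording of the sdiag line may differ between Slurm versions, so treat the regex as something to verify against your own sdiag output:

```python
import re
import subprocess

def dbd_agent_queue_size(sdiag_text):
    """Pull the DBD agent queue size out of sdiag's text output.

    Assumes a line of the form 'DBD Agent queue size: 42'; the exact
    label may vary between Slurm versions.
    """
    m = re.search(r"DBD Agent queue size:\s*(\d+)", sdiag_text)
    return int(m.group(1)) if m else None

if __name__ == "__main__":
    # Run sdiag on a host that can reach slurmctld and report the size.
    out = subprocess.run(["sdiag"], capture_output=True, text=True).stdout
    print(dbd_agent_queue_size(out))
```

A steadily growing queue size would suggest slurmdbd is not keeping up (or is unreachable) and that more records are at risk.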
Were you able to resolve the reservation issue? Were you able to upgrade to 18.08.3?
Just following up on comment 8
I'm closing this ticket as infogiven. Please re-open it if you have additional questions. If you haven't done so, you should upgrade to 18.08.3 ASAP.