Hi Sophie. Will you show me the output of sacct with --duplicates?

$ sacct -j 30535967 --duplicates
$ sacct -j 32070617 --duplicates

Are the RequeueExit jobs being 'scontrol update'd' right after the requeue happens? I'm wondering if you are hitting the same issue that was reported in bug 4855, for which we provided a fix:

https://github.com/SchedMD/slurm/commit/ed74a22abf7c9066a2
https://github.com/SchedMD/slurm/commit/ed74a22abf7c9066a2.patch (if needed)

that will be available since 17.11.6. The fact that a fixed runaway job's end_time is set to its start_time is normal, as advised by the sacctmgr show runawayjobs warning:

Would you like to fix these runaway jobs?
(This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed.
Once corrected, this will trigger the rollup to reroll usage from before the oldest runaway job.)
(You have 30 seconds to decide)
(N/y): n

Thanks.

Hello,

> $ sacct -j 30535967 --duplicates

Since this one has been requeued several times, I only put an excerpt:

       JobID JobName      State               Start                 End ExitCode
------------ ------- ---------- ------------------- ------------------- --------
30533734_19+   ez_50  COMPLETED 2018-03-30T19:43:03 2018-03-30T19:43:03     99:0
30533734_19+   batch     FAILED 2018-03-30T19:43:03 2018-03-30T20:08:14     99:0
30533734_19+   mcell  CANCELLED 2018-03-30T19:43:09 2018-03-30T20:09:02     0:15
30533734_19+   ez_50  COMPLETED 2018-03-30T20:33:11 2018-03-30T20:33:11     99:0
30533734_19+   batch     FAILED 2018-03-30T20:33:11 2018-03-30T20:58:01     99:0
30533734_19+   mcell  CANCELLED 2018-03-30T20:33:14 2018-03-30T20:58:07     0:15
[...]
30533734_19+   ez_50  COMPLETED 2018-04-08T14:57:22 2018-04-08T14:57:22     99:0
30533734_19+   batch     FAILED 2018-04-08T14:57:22 2018-04-08T15:22:09     99:0
30533734_19+   mcell  CANCELLED 2018-04-08T14:57:22 2018-04-08T15:22:10     0:15
30533734_19+   ez_50  COMPLETED 2018-04-08T15:28:51 2018-04-08T15:28:51     99:0
30533734_19+   batch     FAILED 2018-04-08T15:28:51 2018-04-08T15:53:46     99:0
30533734_19+   mcell  CANCELLED 2018-04-08T15:28:52 2018-04-08T15:54:05     0:15
30533734_19+   ez_50  COMPLETED 2018-04-08T16:11:25 2018-04-08T16:11:25      0:0
30533734_19+   batch  COMPLETED 2018-04-08T16:11:25 2018-04-08T16:32:02      0:0
30533734_19+   mcell  COMPLETED 2018-04-08T16:11:25 2018-04-08T16:32:02      0:0

> $ sacct -j 32070617 --duplicates

       JobID JobName      State               Start                 End ExitCo
------------ ------- ---------- ------------------- ------------------- ------
    32070617 D-15-0+  COMPLETED 2018-04-19T23:43:55 2018-04-19T23:43:55    0:0

> Are the RequeueExit jobs being 'scontrol update'd' right after the requeue
> happens?

No, we just let SLURM requeue the job automatically. We don't do anything else on them, manually or otherwise.

> I'm wondering if you are hitting the same issue that was reported in
> bug 4855, and we provided a fix:
>
> https://github.com/SchedMD/slurm/commit/ed74a22abf7c9066a2
> https://github.com/SchedMD/slurm/commit/ed74a22abf7c9066a2.patch (if needed)
>
> that will be available since 17.11.6. The fact that fixed runawayjobs end_time
> is set to the start_time is normal, as advised by the sacctmgr show runawayjobs
> warning.

Yes, I know, but it's a pity when the job has completed, like 30535967, and was correct in the db before sacctmgr show runawayjobs.

Thanks.

For job 30535967, which two commands are showing different states when querying the db? If sacctmgr show runawayjobs was showing it as PENDING, that is normal until you fix it; then sacct -j 30535967 --duplicates should be showing all the right states. Isn't that the case? Or were you comparing other tools?
Job 32070617's state seems to have been fixed to COMPLETED after a while, or after you fixed it with runawayjobs. Looking at the logs you provided:

2018-04-20T00:07:38+02:00 tars-master slurmctld[29480]: Resending TERMINATE_JOB request JobId=32070617 Nodelist=tars-464
2018-04-20T00:08:35+02:00 tars-master slurmctld[29480]: cleanup_completing: job 32070617 completion process took 208 seconds

The "Resending TERMINATE_JOB request" message indicates that when the ctld background thread executed the logic for purging old job records (see slurm.conf MinJobAge), the job was in COMPLETING state and Slurm tried to deallocate its nodes a second time, because the first attempt didn't deallocate them. The completion process took 208 seconds. Was this job also a RequeueExit one? Anyway, the output from the last sacct query on this job shows its state as COMPLETED.

(In reply to Sophie Créno from comment #2)
> Yes I know but it's a pity when the job has completed like 30535967 and is
> correct
> in the db before the sacctmgr show runawayjobs.

Yeah, but it would be incorrect to leave the end time unmodified since, by definition, runaway jobs are those that don't exist in the controller but are still considered running in the database, so you would end up with incorrect end time values.

> For job 30535967, which two commands are showing different states when
> querying the db?
It was sacctmgr show runawayjobs for the PENDING state:

          ID       Name  Partition    Cluster      State           TimeStart             TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- -------------------
    30535967 allocation                  tars    PENDING             Unknown             Unknown

and sacct for the COMPLETED state:

       JobID    JobName  Partition        QOS      State               Start    Elapsed                 End ExitCode
------------ ---------- ---------- ---------- ---------- ------------------- ---------- ------------------- --------
30533734_19+      ez_50  dedicated       fast  COMPLETED 2018-04-08T16:11:25   00:20:37 2018-04-08T16:32:02      0:0
30533734_19+      batch                        COMPLETED 2018-04-08T16:11:25   00:20:37 2018-04-08T16:32:02      0:0
30533734_19+      mcell                        COMPLETED 2018-04-08T16:11:25   00:20:37 2018-04-08T16:32:02      0:0

> If sacctmgr show runawayjobs was showing it as PENDING it is normal
> until you fix it,

Sorry, I don't understand why it was considered a runaway job when it was completed. I would have expected to see it as completed in the sacct output and not appearing in sacctmgr show runawayjobs.

We have

> then sacct -j 30535967 --duplicates should be showing all the right
> states. Isn't that the case? or you were comparing other tools?

They were already right, so they still remain; it's just the end time that was correct before being fixed by sacctmgr show runawayjobs, but I know that is the expected output of the command.

> Job 32070617 state seems to have been fixed to COMPLETED after a while
> or after you fixed it with runawayjobs.

Yes, it was fixed by sacctmgr show runawayjobs.
> Looking at the logs you provided:
>
> 2018-04-20T00:07:38+02:00 tars-master slurmctld[29480]: Resending TERMINATE_JOB
> request JobId=32070617 Nodelist=tars-464
> 2018-04-20T00:08:35+02:00 tars-master slurmctld[29480]: cleanup_completing: job
> 32070617 completion process took 208 seconds
>
> The "Resending TERMINATE_JOB request" message indicates that the ctld
> background thread, executing the logic for purging old job records (see
> slurm.conf MinJobAge),

which is set to 300 in our slurm.conf so that we have a few minutes to see a job in the output of scontrol even after it is over.

> the job was in COMPLETING state and Slurm tried to deallocate its nodes
> for a second time, because the first attempt didn't deallocate them.
> The completion process took 208 seconds.

I guess the completion process should be quicker and should have succeeded on the first attempt. Do you have an idea where the problem comes from? Can we change something to improve that?

> Was this job also a RequeueExit one?

No, it's a regular one.

> Anyway the output from the last sacct query on this job shows its state
> as COMPLETED.

Yes, but that's only due to sacctmgr show runawayjobs.
Currently, I also observe this type of runaway job, considered by everyone as RUNNING:

sacctmgr show runawayjobs

      ID       Name  Partition    Cluster      State           TimeStart     TimeEnd
-------- ---------- ---------- ---------- ---------- ------------------- -----------
32210221 D-15-0012+     common       tars    RUNNING 2018-04-20T21:01:59     Unknown

sacct -D -j 32210221

   JobID     JobName  Partition    QOS      State                  Start      Elapsed
-------- ----------- ---------- ------ ---------- ---------------------- ------------
32210221  D-15-0012+     common   fast    RUNNING    2018-04-20T21:01:59   5-14:27:17

but in reality it is over:

2018-04-20T20:21:17+02:00 tars-master slurmctld[8329]: _slurm_rpc_submit_batch_job: JobId=32210221 InitPrio=7248 usec=7847
2018-04-20T21:01:59+02:00 tars-master slurmctld[8329]: sched: Allocate JobID=32210221 NodeList=tars-960 #CPUs=1 Partition=common
2018-04-20T21:17:44+02:00 tars-master slurmctld[8329]: _job_complete: JobID=32210221 State=0x1 NodeCnt=1 WEXITSTATUS 0
2018-04-20T21:17:44+02:00 tars-master slurmctld[8329]: _job_complete: JobID=32210221 State=0x8003 NodeCnt=1 done

Again, since it is over, it should be COMPLETED in the sacct output and not appear in sacctmgr show runawayjobs. This one is of the same kind as job 32070617, not a requeued one. What are we doing wrong?

Thanks.

The problem is you can't expect the information for a runaway job to be correct before it is fixed. Runaway jobs are, by definition, jobs that don't exist in the controller but are still pending, running or suspended in the database. Both 30535967 and 32210221 are examples of this.

> The problem is you can't expect the information for a runaway job to be correct
> before it is fixed. Runawayjobs are, by definition, jobs that don't exist in
> the controller but are still pending, running or suspended in the database.
> Both 30535967 and 32210221 are examples of this.
Ok so what should we change in the configuration or elsewhere in order to
ensure that the database is updated when the jobs complete?
Thanks,
Sophie, could you verify whether the suggested patch (available since 17.11.6), or directly from here:

https://github.com/SchedMD/slurm/commit/ed74a22abf7c9066a2.patch

is applied to your system? If that's not the case, will you apply it and see if that helps improve things? Thank you.

Hello,

The patch is now in place. I'll be on vacation but I will ask my colleague to keep you informed.

Thanks,

I don't know if it is linked, but I see things like the following right now:
2018-05-04T15:18:12+02:00 tars-master slurmctld[23754]: error: Batch completion for job 34150382 sent from wrong node (tars-483 rather than tars-400). Was the job requeued due to node failure?
2018-05-04T16:39:07+02:00 tars-master slurmctld[23754]: error: Batch completion for job 34159943 sent from wrong node (tars-446 rather than tars-470). Was the job requeued due to node failure?
Here is the output of sacct for these jobs:
sacct --format=jobid,jobname,state,start,elapse,end,exitcode,nodelist -D -j 34150382
JobID JobName State Start Elapsed End ExitCode NodeList
------------ ---------- ---------- ------------------- ---------- ------------------- -------- ---------------
34150382 S-16-0133+ PREEMPTED 2018-05-04T15:01:53 00:00:00 2018-05-04T15:01:53 0:0 tars-483
34150382 S-16-0133+ FAILED 2018-05-04T15:05:16 00:13:05 2018-05-04T15:18:21 1:0 tars-400
34150382.ba+ batch FAILED 2018-05-04T15:05:16 00:13:05 2018-05-04T15:18:21 1:0 tars-400
sacct --format=jobid,jobname,state,start,elapse,end,exitcode,nodelist -D -j 34159943
JobID JobName State Start Elapsed End ExitCode NodeList
------------ ---------- ---------- ------------------- ---------- ------------------- -------- ---------------
34159943 S-16-0127+ PREEMPTED 2018-05-04T16:15:01 00:00:00 2018-05-04T16:15:01 0:0 tars-445
34159943 S-16-0127+ PREEMPTED 2018-05-04T16:17:22 00:00:00 2018-05-04T16:17:22 0:0 tars-452
34159943.ba+ batch CANCELLED 2018-05-04T16:17:04 00:00:20 2018-05-04T16:17:24 0:15 tars-452
34159943 S-16-0127+ PREEMPTED 2018-05-04T16:19:27 00:00:00 2018-05-04T16:19:27 0:0 tars-446
34159943 S-16-0127+ COMPLETED 2018-05-04T16:21:33 00:17:07 2018-05-04T16:38:40 0:0 tars-470
34159943.ba+ batch COMPLETED 2018-05-04T16:21:33 00:17:07 2018-05-04T16:38:40 0:0 tars-470
It's odd because, for job 34150382, it was the former node that sent
the completion information while slurmctld was waiting
for it from the new one. For job 34159943, it's the opposite.
Here are the complete logs from slurmctld about these jobs:
2018-05-04T14:58:24+02:00 tars-master slurmctld[23754]: _slurm_rpc_submit_batch_job: JobId=34150382 InitPrio=6186 usec=1216
2018-05-04T15:01:39+02:00 tars-master slurmctld[23754]: backfill: Started JobID=34150382 in dedicated on tars-483
2018-05-04T15:01:53+02:00 tars-master slurmctld[23754]: preempted job 34150382 has been requeued to reclaim resources for job 34150806
2018-05-04T15:01:56+02:00 tars-master slurmctld[23754]: Requeuing JobID=34150382 State=0x0 NodeCnt=0
2018-05-04T15:05:16+02:00 tars-master slurmctld[23754]: backfill: Started JobID=34150382 in dedicated on tars-400
2018-05-04T15:18:12+02:00 tars-master slurmctld[23754]: error: Batch completion for job 34150382 sent from wrong node (tars-483 rather than tars-400). Was the job requeued due to node failure?
2018-05-04T15:18:21+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34150382 State=0x1 NodeCnt=1 WEXITSTATUS 1
2018-05-04T15:18:21+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34150382 State=0x8005 NodeCnt=1 done
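As an aside, the hex State values in these _job_complete lines can be decoded by masking: the low byte is the base job state and 0x8000 is the COMPLETING flag. The enum values below are taken from slurm.h and should be treated as an assumption for your exact version; for the 0x8005 above:

```shell
# Decode the State field from a _job_complete log line.
# Base-state values (assumed from slurm.h): 3=JOB_COMPLETE, 5=JOB_FAILED.
state=0x8005
base=$(( state & 0xFF ))                  # low byte: base state (5 = JOB_FAILED here)
completing=$(( (state & 0x8000) != 0 ))   # 0x8000 = JOB_COMPLETING flag
echo "base=$base completing=$completing"  # → base=5 completing=1
```

This matches the FAILED state sacct reports for the second run of job 34150382, while 0x8003 (seen elsewhere in this ticket) decodes to base state 3, JOB_COMPLETE.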
2018-05-04T16:14:38+02:00 tars-master slurmctld[23754]: _slurm_rpc_submit_batch_job: JobId=34159943 InitPrio=6186 usec=1597
2018-05-04T16:14:56+02:00 tars-master slurmctld[23754]: backfill: Started JobID=34159943 in dedicated on tars-445
2018-05-04T16:15:01+02:00 tars-master slurmctld[23754]: preempted job 34159943 has been requeued to reclaim resources for job 34159387
2018-05-04T16:15:03+02:00 tars-master slurmctld[23754]: Requeuing JobID=34159943 State=0x0 NodeCnt=0
2018-05-04T16:15:03+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34159943 State=0x0 NodeCnt=0 WTERMSIG 15
2018-05-04T16:15:03+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34159943 State=0x0 NodeCnt=0 cancelled by interactive user
2018-05-04T16:17:04+02:00 tars-master slurmctld[23754]: sched: Allocate JobID=34159943 NodeList=tars-452 #CPUs=1 Partition=dedicated
2018-05-04T16:17:22+02:00 tars-master slurmctld[23754]: preempted job 34159943 has been requeued to reclaim resources for job 34160314
2018-05-04T16:17:24+02:00 tars-master slurmctld[23754]: Requeuing JobID=34159943 State=0x0 NodeCnt=0
2018-05-04T16:19:26+02:00 tars-master slurmctld[23754]: sched: Allocate JobID=34159943 NodeList=tars-446 #CPUs=1 Partition=dedicated
2018-05-04T16:19:27+02:00 tars-master slurmctld[23754]: preempted job 34159943 has been requeued to reclaim resources for job 34160314
2018-05-04T16:19:32+02:00 tars-master slurmctld[23754]: Requeuing JobID=34159943 State=0x0 NodeCnt=0
2018-05-04T16:21:33+02:00 tars-master slurmctld[23754]: sched: Allocate JobID=34159943 NodeList=tars-470 #CPUs=1 Partition=dedicated
2018-05-04T16:38:40+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34159943 State=0x1 NodeCnt=1 WEXITSTATUS 0
2018-05-04T16:38:40+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34159943 State=0x8003 NodeCnt=1 done
2018-05-04T16:39:07+02:00 tars-master slurmctld[23754]: error: Batch completion for job 34159943 sent from wrong node (tars-446 rather than tars-470). Was the job requeued due to node failure?
I've found in former logs that we have already seen such things over the previous
months, always on preempted jobs. But it seems that some of them do not have a correct
state in the db regarding the first run:
sacct --format=jobid,jobname,state,start,elapse,end,exitcode,nodelist -D -j 29249884
JobID JobName State Start Elapsed End ExitCode NodeList
------------ ---------- ---------- ------------------- ---------- ------------------- -------- ---------------
29249884 C260p.B2_+ PENDING 2018-03-09T10:54:27 00:00:00 2018-03-09T10:54:27 0:0 tars-516
29249884 C260p.B2_+ COMPLETED 2018-03-09T10:56:30 00:18:18 2018-03-09T11:14:48 0:0 tars-427
Thanks,
The "Batch completion for job" logged messages can be the result of the slurmd on the batch_host (host where batch script is executed) failing, but the slurmstepd continuing to run. Then the batch job is requeued and started on a different node. The end result is one batch complete RPC from each node. But this message should be harmless. Please also note the patches in 17.11.5 that fix the pending/completing state in the database will only work on new jobs. Are you seeing bad states in the database for preempted & requeued jobs started after the patches have been applied? Thanks. Hello Alejandro, Sophie applied the patch last friday at approx. noon (local time here in France). I cleaned the runaway jobs this morning. I didn't pay attention to whether they had started before of after applying the patch. Right now, there are no new runaway jobs but it is too soon to tell if the patch has solved the issue or not. I will get back to you in a few days. Regards, Véronique *** Ticket 5123 has been marked as a duplicate of this ticket. *** Hello Tim,
I found the following runaway jobs this morning:
vlegrand@tars-submit0 ~ $ sacctmgr show runawayjobs
NOTE: Runaway jobs are jobs that don't exist in the controller but are still considered pending, running or suspended in the database
ID Name Partition Cluster State TimeStart TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- -------------------
34830236 C020s.B_a+ dedicated tars RUNNING 2018-05-08T00:06:25 Unknown
34830237 C020s.B_a+ dedicated tars RUNNING 2018-05-08T00:06:25 Unknown
34830238 C020s.B_a+ dedicated tars RUNNING 2018-05-08T00:06:25 Unknown
34830239 C020s.B_a+ dedicated tars RUNNING 2018-05-08T00:06:25 Unknown
34830240 C020s.B_a+ dedicated tars RUNNING 2018-05-08T00:06:25 Unknown
Would you like to fix these runaway jobs?
(This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed.
Once corrected, this will trigger the rollup to reroll usage from before the oldest runaway job.)
(You have 30 seconds to decide)
(N/y): N
Changes Discarded
According to Slurm they started at 00:06:25 (our time) yesterday (May 8), so after the application of the patch.
Here is some other information about them:
vlegrand@tars-submit0 ~ $ sacct -j 34830236 -D
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
34830236 C020s.B_a+ dedicated hubbioit 1 RUNNING 0:0
vlegrand@tars-submit0 ~ $ sacct -j 34830236 --duplicates
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
34830236 C020s.B_a+ dedicated hubbioit 1 RUNNING 0:0
vlegrand@tars-submit0 ~ $ sacct -D -j 34830236 -o Start,Elapsed,nodelist
Start Elapsed NodeList
------------------- ---------- ---------------
2018-05-08T00:06:25 1-09:50:23 tars-629
So, it has not been requeued.
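The Elapsed value above is itself a clue: start plus elapsed equals the moment the query was run, i.e. the database is still accumulating run time for a job that actually finished within seconds. A quick sanity check with GNU date (the Europe/Paris timezone is an assumption based on the +02:00 offsets in the logs):

```shell
# Start (2018-05-08T00:06:25) plus the reported Elapsed (1-09:50:23)
# lands at the time the sacct query was issued.
TZ=Europe/Paris date -d '2018-05-08T00:06:25 1 day 9 hours 50 minutes 23 seconds' \
    '+%Y-%m-%dT%H:%M:%S'   # → 2018-05-09T09:56:48
```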
In slurmdbd.log, I have:
2018-05-08T00:06:27+02:00 tars-acct slurmdbd[18198]: error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction#012update "tars_job_table" set mod_time=UNIX_TIMESTAMP(), time_end=1525730787, state=3, derived_ec=0, tres_alloc='1=1,2=3072,3=18446744073709551614,4=1,5=1', exit_code=0, kill_requid=-1 where job_db_inx=74928152;
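(The time_end epoch in that failed UPDATE can be cross-checked against the slurmctld timestamps with date(1); the timezone is an assumption based on the +02:00 offsets in the logs:)

```shell
# 1525730787 is the time_end the failed UPDATE tried to store
TZ=Europe/Paris date -d @1525730787 '+%Y-%m-%dT%H:%M:%S%z'   # → 2018-05-08T00:06:27+0200
```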
In slurmctld.log, I have:
2018-05-08T00:06:24+02:00 tars-master slurmctld[23754]: _slurm_rpc_submit_batch_job: JobId=34830236 InitPrio=2984 usec=701
2018-05-08T00:06:25+02:00 tars-master slurmctld[23754]: sched: Allocate JobID=34830236 NodeList=tars-629 #CPUs=1 Partition=dedicated
2018-05-08T00:06:27+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34830236 State=0x1 NodeCnt=1 WEXITSTATUS 0
2018-05-08T00:06:27+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34830236 State=0x8003 NodeCnt=1 done
So, yes, it seems that the job is a "runaway" because slurmdbd wasn't able to update the job table in the database due to a deadlock.
Please let me know if you need any additional information. I am keeping the runaway jobs as is in case you would need to do further investigation.
Regards,
Véronique
Hi Véronique. Could you please upload the slurmdbd, slurmctld, and tars-629 slurmd logs from before job 34830236's submission until its completion? That can be helpful to narrow down the root cause of the deadlock. Let's keep the newly reported runaway jobs unfixed for now, just in case we need more information from you. Thank you.

Created attachment 6846 [details]
extract of slurmctld log
Created attachment 6847 [details]
extract of slurmdbd.log
Created attachment 6848 [details]
extract of slurmd.log for tars-629
Hi Alejandro,

Here they are, attached. Just to let you know: there are plenty of new runaway jobs since the last time I checked (on Wednesday).

Regards,
Véronique

Hi Véronique. Could you please attach them again fully, without filtering by jobid strings or similar? The end goal is to know what was going on while the jobs were active in the system, and potentially derive a set of events/operations leading to the deadlock. With the filtered logs I can't see such information. Thank you.

Hi Alejandro,

I didn't filter them. I just took the whole log from 1 minute before the job started to 1 minute after it ended for slurmctld. For slurmdbd, there was not much in the logs; you have it all. For slurmd, I took from 2 minutes before the start of the job to 3 minutes after it terminated. The whole files are huge. How many minutes before and after the job would you need?

Yes, sorry, I see you just had the debug level set at info and there isn't much to work off of these messages. Are the timestamps for the completion of the new runaway jobs coincident with these messages in the dbd.log reporting the deadlock?

Yes. In slurmctld.log, we have:

2018-05-08T00:06:27+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34830236 State=0x1 NodeCnt=1 WEXITSTATUS 0
2018-05-08T00:06:27+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34830236 State=0x8003 NodeCnt=1 done

and at the same time, in slurmdbd.log, you can see:

2018-05-08T00:06:27+02:00 tars-acct slurmdbd[18198]: error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction#012update "tars_job_table" set mod_time=UNIX_TIMESTAMP(), time_end=1525730787, state=3, derived_ec=0, tres_alloc='1=1,2=3072,3=18446744073709551614,4=1,5=1', exit_code=0, kill_requid=-1 where job_db_inx=74928152;

Isn't that suspicious?

Indeed, it looks like the deadlock prevents the as_mysql_job_complete() function from updating the job fields, thus generating new runaway jobs.
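MySQL's error 1213 message itself suggests the remedy ("try restarting transaction"). A generic retry wrapper in shell, as a sketch only: slurmdbd implements its own retry logic internally, this is not Slurm code, and the database name and statement here are placeholders:

```shell
# Retry a mysql statement a few times on failure (e.g. error 1213, deadlock).
# Illustrative sketch: 'slurm_acct_db' and the statement are placeholders.
run_sql() {
    tries=0
    until mysql slurm_acct_db -e "$1"; do
        tries=$((tries + 1))
        [ "$tries" -ge 3 ] && return 1
        sleep 1   # brief backoff before restarting the transaction
    done
}
```

InnoDB resolves a deadlock by rolling back one of the transactions, so a retry of the rolled-back statement normally succeeds, which is exactly what the later analysis in this ticket observed for job 34828611.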
I'm investigating the lock order for the potentially involved operations (complete, start, rollup, and so on). We'll come back once we find something useful. Thanks.

Véronique, will you give me the output of:

select * from tars_job_table where job_db_inx=74928152;

Thanks

Created attachment 6869 [details]
output of sql select command
Hi Alejandro,

Please find it in attachment.

Regards,
Véronique

Hi Véronique. Inspecting the attached select query output we requested, I notice a few things:

1. The mod_time is 1525730800. Epoch Converter translates that timestamp to Tuesday, May 8, 2018 12:06:40 AM GMT+02:00 DST (assuming France timezone).

2. The id_job is 34828611. This jobid isn't among those reported as runaway, at least until comment 17 (in comment 22 you mention there are plenty of new ones but didn't list them). These are the ctld.log messages related to this concrete jobid:

2018-05-08T00:06:27+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34828611 State=0x1 NodeCnt=1 WEXITSTATUS 0
2018-05-08T00:06:27+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34828611 State=0x8003 NodeCnt=1 done

The mod_time is updated ~13s after the _job_complete() messages are logged, which makes sense since during these seconds slurmctld was sending the RPC message to slurmdbd to process the mysql update for the completed job. The 'state' in the database for this specific jobid is 3 (JOB_COMPLETE) and the time_end corresponds to the _job_complete messages' timestamps. This means that this specific job is NOT a runaway one. The deadlock return code from mysql was reported back to slurmctld, which retried the update request for this job, and the next retry didn't fail. So the deadlock event was benign and seems to us unrelated to the runaway jobs.

We still need to figure out why you have runaway jobs in your system, though. So we're going to ask you for more information:

1. Attach slurm.conf and slurmdbd.conf.

2. Output of:

Current:
$ sacctmgr show runawayjobs

and

mysql> select * from tars_job_table where id_job=<runawayjobid_1> or ... or id_job=<runawayjobid_N>

where you 'or' all the runaway jobids reported by 'sacctmgr show runawayjobs' above.

3.
Temporarily increase, in production, the debug level to 'debug' in both slurmctld and slurmdbd, and add DebugFlags=DB_JOB,DB_STEP,DB_QUERY to slurmdbd.conf, so that the next time we request the logs we have more information to work from.

Thanks for your collaboration.

Created attachment 6887 [details]
slurm configuration file
Created attachment 6888 [details]
list of all runaway jobs
Created attachment 6889 [details]
database content for the runaway jobs
Hi Alejandro,

Thank you for your explanation. Please find attached the files containing the information you required. As for changing slurmdbd's debug level in production, I want to speak about it with our sysadmin first, because this will generate huge log files and I don't want to fill hard drives with them...

(In reply to Véronique Legrand from comment #39)
> Thank you for your explanation.
> Please find attached the files containing the information you required.
> As for changing slurmdbd's debug level in production, I want to speak about
> it to our sys admin first. Because this will generate huge log files and I
> don't want to fill hard drives with them...

That's fine. Only do it if you have an appropriate logrotate configuration and increasing the debug level/flags doesn't impact performance on the daemons. Otherwise just skip this step. Thanks.

Marking the slurmdbd.conf attachment as private, just in case the password is in it.

I've noticed that all the runaway jobs' TimeStart is close to midnight. Do you have any special process/workflow running at midnight that you think could be relevant and/or impacting Slurm functionality in some way?

Hi Alejandro,

At midnight there is the daily purge of job steps. That's the only thing. I am a little bit surprised by your question because, to me, the runaway job issue is due to Slurm not correctly "recording" the job's termination information; what does the job start time have to do with this?

I had a look at the slurmdbd logs. We had a few "deadlock error" messages. I put them in an attachment just in case they could be useful to you.

Regards,
Véronique

Created attachment 6891 [details]
deadlock error messages of the last days
As you can see, the deadlock error messages appear nearly every day soon after midnight.
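One way to confirm that pattern is to tally the deadlock messages per day directly from the log (the log filename here is an assumption; the timestamp format matches the excerpts in this ticket):

```shell
# Count "Deadlock found" messages per day in a slurmdbd log whose lines
# start with ISO timestamps like 2018-05-08T00:06:27+02:00.
grep 'Deadlock found' slurmdbd.log | cut -dT -f1 | sort | uniq -c
```

Each output line is a per-day count, making the "soon after midnight, nearly every day" pattern easy to see at a glance.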
Do you know what type of jobs the following ones were and how they were submitted? Do you have the related batch scripts, if any?

32178224
32178302
32187155
32187156
32187157
32187158
32187159
32187160
32187161
32187162
32187163
32187165
32187169
32187170

They all belong to id_user 25939 and they are all runaway jobs. They also all had DerivedExitCode 15. I'm curious to see if these ones were doing anything special/different that could shed some light on the runaway generation. Perhaps the output of sacct for one of them also:

sacct -D -j 32178224 -o jobid,user,account,partition,state,submit,start,end,elapsed,allocnodes

There are a bunch of runaway jobs that also belong to id_user 29763, like:

35657118
35657121
...

Thank you, and sorry to request so much information. I'm trying to find a common pattern or something different about these runaway jobs with respect to the rest.

Created attachment 6895 [details]
sacct -D output
Hi Alejandro,

The first series of jobs you are talking about belongs to a user who is doing particle movement simulations with a software called mcell. She has thousands of jobs running every week and only a few of them are runaway. I have asked her for the batch script.

I have put in attachment the output of:

sacct -D -j 32178224 -o jobid,user,account,partition,state,submit,start,end,elapsed,allocnodes

I just hid the user login.

The other series of jobs you are talking about belongs to a user who also has thousands of jobs every week. He is doing blast searches against databanks and queries to an sqlite database.

By the way, did you have a look at the slurmdbd error messages that I sent you? We see clearly with them that a deadlock occurs nearly every day soon after midnight, and you had noticed a surprising coincidence with the starting time of the runaway jobs.

Hope this helps,

Regards,
Véronique

Created attachment 6896 [details]
17.11 patch
Hi Véronique. We've found a potential source of issues. What I'd do now is fix your current runaway jobs (proceed with Yes after showing them). Then apply this patch and let's see if new runaway jobs appear next week among jobs started after the patch has been applied. Thanks.
Hi Véronique. Please hold off on applying the patch, since we internally agreed it requires some modifications. We'll send you a second version shortly. Thank you.

Created attachment 6907 [details]
17.11 patch
Véronique - will you fix your current runaway jobs, then apply this patch, let Slurm run for a couple of days, and finally check for runaway jobs again? We've added some lock protection around the purge/archive operation as well as around the job/step records at start and completion. We think this should fix the issue you've been experiencing here, but your feedback is needed to confirm it. Thanks again.
Hi Alejandro,

Thank you for the new patch. I applied it on Tuesday morning on our test cluster. I cannot say whether it fixed the problem there, since it is a small cluster and we don't have as many jobs on it as on the production cluster. As everything seemed OK anyway, I applied it this morning on the production cluster and cleaned the runaway jobs. Let's just wait and see now.

Regards,
Véronique

All right, thanks for the feedback. Please let us know how it goes after a couple of days have passed.

Created attachment 6936 [details]
runaway job 25/05/18
Created attachment 6937 [details]
slurmdbd log 25/05/18 from 00:00 to 01:00
Hello Alejandro,

Unfortunately, there is already a new runaway job this morning. Its start date is last night at approximately 00:03, so several hours after I applied the patch. I have put in attachment the slurmdbd logs from last night from 00:00 to 1 am, and the output of "sacctmgr show runawayjobs".

Regards,
Véronique

Thanks Véronique. It looks like at least there's no deadlock anymore, but there might be something still not fully fixed. We'll come back to you.

Hi Véronique. We think we either missed a spot or the patch wasn't applied correctly. Most probably we missed something, but the first thing we'd like to ask is to please double-check that the patch was properly applied and slurmdbd was restarted after recompiling.

I also didn't properly search for Deadlock in your latest slurmdbd log files (most probably due to a case-sensitivity thing, my fault). There was indeed a Deadlock again:

2018-05-25T00:05:00+02:00 tars-acct slurmdbd[6301]: error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction#012insert into "tars_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, array_task_str, array_task_pending) values (37819018, UNIX_TIMESTAMP(), 0, 4294967294, 0, 0, 133, 0, 0, 0, 'tars-63

Can you please upload the output of:

select * from tars_job_table where job_db_inx=78651763;
select * from tars_job_table where id_job=37819017;
select * from tars_job_table where id_job=37819018;
sacct --format=jobid,jobname,state,start,elapse,end,exitcode,nodelist -D -j 37819017
sacct --format=jobid,jobname,state,start,elapse,end,exitcode,nodelist -D -j 37819018

This is being a slippery one. Thanks again for your helpful collaboration.

Created attachment 6946 [details]
output of ls -l command on the directory that contains the patched files
Hi Alejandro,
I have put as attachment the output of ls -l on the src/plugins/accounting_storage/mysql directory.
As you can see, the two impacted files and the corresponding .o files were modified on May 24, so the patch has been applied.
I also checked the executables:
ls -l /XXX/slurm/cluster/slurm/sbin
total 19047
-rwxr-xr-x 1 root slurm 7766233 May 24 09:52 slurmctld
-rwxr-xr-x 1 root slurm 3083091 May 24 09:52 slurmd
-rwxr-xr-x 1 root slurm 1128985 May 24 09:52 slurmdbd
-rwxr-xr-x 1 root slurm 4105626 May 24 09:52 slurmstepd
So the executables are new ones generated after I applied the patch.
Then I checked the start date of slurmdbd on the accounting machine and realized
that it had not been restarted since May 23 (one day before the patch was applied):
ps -ef|grep slurmdbd
slurm 6301 1 0 May23 ? 00:09:41 /pasteur/services/policy01/slurm/tars/slurm/sbin/slurmdbd
vlegrand 32746 32634 0 16:01 pts/3 00:00:00 grep slurmdbd
I had simply forgotten to restart slurmdbd.
So, I cleaned the runaways:
vlegrand@cluster-acct ~ $ sacctmgr show runawayjobs
NOTE: Runaway jobs are jobs that don't exist in the controller but are still considered pending, running or suspended in the database
ID Name Partition Cluster State TimeStart TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- -------------------
37819017 S-16-0010+ dedicated tars RUNNING 2018-05-25T00:03:17 Unknown
37853853 FUCC.30.1+ pibnet tars RUNNING 2018-05-28T00:01:43 Unknown
37853858 FUCC.214.+ pibnet tars RUNNING 2018-05-28T00:01:43 Unknown
37853862 FUCC.104.+ pibnet tars RUNNING 2018-05-28T00:01:43 Unknown
Would you like to fix these runaway jobs?
(This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed.
Once corrected, this will trigger the rollup to reroll usage from before the oldest runaway job.)
(You have 30 seconds to decide)
(N/y): y
Then restarted slurmdbd.
vlegrand@tars-acct ~ $ sudo /etc/init.d/slurmdbd restart
stopping slurmdbd: [ OK ]
slurmdbd (pid 586) is running...
slurmdbd (pid 586) is running...
slurmdbd (pid 586) is running...
slurmdbd (pid 586) is running...
starting slurmdbd: [ OK ]
vlegrand@tars-acct ~ $ ps -ef|grep slurmdbd
slurm 647 1 0 16:03 ? 00:00:00 /pasteur/services/policy01/slurm/tars/slurm/sbin/slurmdbd
vlegrand 662 32634 0 16:03 pts/3 00:00:00 grep slurmdbd
Sorry about this. We will see whether we get deadlocks and runaway jobs over the next few days.
Regards,
Véronique
Created attachment 6950 [details]
slurmdbd logs, first hour of last night
Hi Alejandro,
It seems that the patch has put us back in the same situation we were in six months ago, when we had a previous version of Slurm and the job step table was purged once a month (see bug 4295). When I arrived this morning, I found that the slurmdbd daemon was down. Trying to run an sacct command resulted in the following error message:
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to tars-acct:6819: Connection refused
sacct: error: slurmdbd: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
The slurmdbd daemon crashed soon after midnight (see logs in attachment) with an "err_lock_wait_timeout", just as before. As the job step table is now purged every night (we followed your colleague's advice), it is likely that slurmdbd will crash every night, which is not acceptable for us. Is there a way to schedule the job purge during office hours so that we can keep an eye on the slurmdbd daemon and try to see with our DBA whether we can provide you more information to finally solve this problem?
Regards,
Véronique
Do you happen to have a core file generated by slurmdbd? If so, could you inspect it with gdb:
$ gdb /path/to/slurmdbd /path/to/Core
or
$ coredumpctl gdb
Then from inside gdb run 'thread apply all bt' and attach the output. Thanks!
Hi Alejandro,
I have checked with our system engineer. From the operating system point of view (CentOS 6), everything is configured so that we should have a core dump if the daemon crashed. Yet, contrary to what I thought, it didn't crash but exited in a proper way with return code 1. I hope that this information will be useful to you. It would really be helpful if we could change the hour at which the purge is done so that we can investigate the problem when it occurs...
Véronique
Inspecting the latest slurmdbd.log, the last two entries are:
2018-05-30T00:12:37+02:00 tars-acct slurmdbd[647]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012insert into "tars_step_table" (job_db_inx, id_step, time_start, step_name, state, tres_alloc, nodes_alloc, task_cnt, nodelist, node_inx, task_dist, req_cpufreq, req_cpufreq_min, req_cpufreq_gov) values (78942755, -2, 1527630246, 'batch', 1, '1=1,2=3072,4=1', 1, 1, 'tars-513', '186', 0, 0, 0, 0) on duplicate key update nodes_alloc=1, task_cnt=1, time_end=0, state=1, nodelist='tars-513', node_inx='186', task_dis
2018-05-30T00:12:37+02:00 tars-acct slurmdbd[647]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
so slurmdbd fatal()'ed; fatal()'s implementation doesn't generate a core file but exit(1)s the process.
Véronique, please check whether anything else was locking up the database. A common cause is a concurrent mysqldump: running mysqldump can cause all sorts of issues on a live database. If that is the situation, please make sure you use the following mysqldump options:
--single-transaction
--quick
--lock-tables=false
Without them the database will lock up and you will get this kind of issue. Hopefully this, or something like it, was locking the tables and messing things up.
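A sketch of the resulting invocation (the database name is a placeholder, and the wrapper only prints the command so it can be reviewed before running on the real host):

```shell
# Build (and print, rather than run) a locking-safe mysqldump command:
# --single-transaction takes a consistent InnoDB snapshot without read locks,
# --quick streams rows instead of buffering whole tables in memory,
# --lock-tables=false avoids LOCK TABLES against the live accounting DB.
build_dump_cmd() {
    printf 'mysqldump --single-transaction --quick --lock-tables=false %s\n' "$1"
}
build_dump_cmd slurm_acct_db
```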
Have you gone through the basic MySQL tuning instructions? They're documented here: https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-before-build
Relevant configuration options:
innodb_buffer_pool_size
innodb_lock_wait_timeout
innodb_log_file_size
Thanks
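A small helper to pull those three values in one go (a sketch: it only emits the SQL, which you would pipe into a `mysql` client on the database host):

```shell
# Emit one SHOW VARIABLES statement per InnoDB setting of interest.
# Usage on the DB host would be something like: emit_checks | mysql
emit_checks() {
    for v in innodb_buffer_pool_size innodb_lock_wait_timeout innodb_log_file_size; do
        printf "SHOW VARIABLES LIKE '%s';\n" "$v"
    done
}
emit_checks
```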
... I'd also suggest switching the Purge[Job|Step]After intervals to be expressed in 'hours' to get those more frequent purges.
I had already checked that with our DBA a few days/weeks ago. The database dumps are not running at midnight; they are scheduled at 10 pm and only last a few minutes. The options used for mysqldump are:
--all-databases --single-transaction --add-drop-database --triggers --routines --events --socket=$SOCKET --user=$USERNAME
With --lock-tables=false, we may lose database consistency, and --quick would take more memory. We may give --quick a try, but I must first talk again with our DBA.
Here are the values for the MySQL parameters you mention:
MariaDB [(none)]> SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
+-------------------------+--------------+
| Variable_name           | Value        |
+-------------------------+--------------+
| innodb_buffer_pool_size | 137438953472 |
+-------------------------+--------------+
1 row in set (0.00 sec)
MariaDB [(none)]> SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 600   |
+--------------------------+-------+
1 row in set (0.00 sec)
MariaDB [(none)]> SHOW VARIABLES LIKE 'innodb_log_file_size';
+----------------------+-------------+
| Variable_name        | Value       |
+----------------------+-------------+
| innodb_log_file_size | 16921919488 |
+----------------------+-------------+
1 row in set (0.00 sec)
I also changed in slurmdbd.conf:
#PurgeStepAfter=31days
PurgeStepAfter=744hours
to see if we can reproduce the err_lock_wait_timeout error during office hours.
Created attachment 6952 [details]
logs since config change steps purged every hour
Alejandro,
I did the config change and restarted slurmdbd at 13:59. Since then it has stopped twice: once at 14:15 and again at 14:35 after I restarted it. Here is the content of the archive directory for the steps:
ls -l /pasteur/services/policy01/slurm/tars/archives/|grep step
-rw------- 1 slurm slurm 1565 Mar 5 15:02 tars_step_archive_1970-01-01T00:00:00_1970-02-01T00:00:00
-rw------- 1 slurm slurm 1179 Mar 3 19:02 tars_step_archive_1970-01-01T00:00:00_1970-02-01T00:00:00.old
-rw------- 1 slurm slurm 43218407 May 1 2016 tars_step_archive_2016-03-01T00:00:00_2016-03-31T23:59:59
-rw------- 1 slurm slurm 59028107 Jun 1 2016 tars_step_archive_2016-04-01T00:00:00_2016-04-30T23:59:59
-rw------- 1 slurm slurm 1504793139 Sep 7 2016 tars_step_archive_2016-05-01T00:00:00_2016-07-31T23:59:59
-rw------- 1 slurm slurm 284054663 Oct 1 2016 tars_step_archive_2016-08-01T00:00:00_2016-08-31T23:59:59
-rw------- 1 slurm slurm 697896812 Nov 1 2016 tars_step_archive_2016-09-01T00:00:00_2016-09-30T23:59:59
-rw------- 1 slurm slurm 418874294 Dec 1 2016 tars_step_archive_2016-10-01T00:00:00_2016-10-31T23:59:59
-rw------- 1 slurm slurm 723510388 Jan 1 2017 tars_step_archive_2016-11-01T00:00:00_2016-11-30T23:59:59
-rw------- 1 slurm slurm 352682402 Feb 1 2017 tars_step_archive_2016-12-01T00:00:00_2016-12-31T23:59:59
-rw------- 1 slurm slurm 937407883 Mar 1 2017 tars_step_archive_2017-01-01T00:00:00_2017-01-31T23:59:59
-rw------- 1 slurm slurm 937407883 Mar 1 2017 tars_step_archive_2017-01-01T00:00:00_2017-01-31T23:59:59.old
-rw------- 1 slurm slurm 1028052287 Apr 2 2017 tars_step_archive_2017-02-01T00:00:00_2017-02-28T23:59:59.new
-rw------- 1 slurm slurm 1447011676 May 1 2017 tars_step_archive_2017-02-01T00:00:00_2017-03-31T23:59:59
-rw------- 1 slurm slurm 1315341329 Jun 1 2017 tars_step_archive_2017-04-01T00:00:00_2017-04-30T23:59:59
-rw------- 1 slurm slurm 1149182689 Jul 1 2017 tars_step_archive_2017-05-01T00:00:00_2017-05-31T23:59:59
-rw------- 1 slurm slurm 573523907 Aug 1 2017 tars_step_archive_2017-06-01T00:00:00_2017-06-30T23:59:59
-rw------- 1 slurm slurm 1950348993 Sep 1 2017 tars_step_archive_2017-07-01T00:00:00_2017-07-31T23:59:59
-rw------- 1 slurm slurm 2134 Nov 1 2017 tars_step_archive_2017-07-01T00:00:00_2017-08-01T00:00:00
-rw------- 1 slurm slurm 0 Nov 1 2017 tars_step_archive_2017-08-01T00:00:00_2017-09-01T00:00:00
-rw------- 1 slurm slurm 1437386927 Nov 1 2017 tars_step_archive_2017-09-01T00:00:00_2017-09-30T23:59:59
-rw------- 1 slurm slurm 1513625447 Dec 1 00:02 tars_step_archive_2017-10-01T00:00:00_2017-10-31T23:59:59
-rw------- 1 slurm slurm 476 Jan 4 16:19 tars_step_archive_2017-10-01T00:00:00_2017-11-30T23:59:59
-rw------- 1 slurm slurm 1822759587 Jan 1 00:02 tars_step_archive_2017-11-01T00:00:00_2017-11-30T23:59:59
-rw------- 1 slurm slurm 1372 Feb 2 06:01 tars_step_archive_2017-11-01T00:00:00_2017-12-31T23:59:59
-rw------- 1 slurm slurm 1540574311 Feb 1 00:03 tars_step_archive_2017-11-01T00:00:00_2017-12-31T23:59:59.old
-rw------- 1 slurm slurm 0 Mar 1 14:01 tars_step_archive_2017-12-01T00:00:00_2018-01-31T23:59:59
-rw------- 1 slurm slurm 2406 May 30 14:00 tars_step_table_archive_1970-01-01T00:00:00_1970-02-01T00:00:00
-rw------- 1 slurm slurm 6166 May 30 00:00 tars_step_table_archive_1970-01-01T00:00:00_1970-02-01T00:00:00.old
-rw------- 1 slurm slurm 0 Apr 3 11:19 tars_step_table_archive_2018-02-01T00:00:00_2018-02-28T23:59:59
-rw------- 1 slurm slurm 1273454685 Apr 17 00:04 tars_step_table_archive_2018-03-01T00:00:00_2018-03-16T23:59:59
-rw------- 1 slurm slurm 473 Apr 20 21:18 tars_step_table_archive_2018-03-12T00:00:00_2018-03-19T23:59:59
-rw------- 1 slurm slurm 118530126 Apr 18 00:00 tars_step_table_archive_2018-03-17T00:00:00_2018-03-17T23:59:59
-rw------- 1 slurm slurm 144090745 Apr 19 00:00 tars_step_table_archive_2018-03-18T00:00:00_2018-03-18T23:59:59
-rw------- 1 slurm slurm 74257924 Apr 20 00:03 tars_step_table_archive_2018-03-19T00:00:00_2018-03-19T23:59:59
-rw------- 1 slurm slurm 91347696 Apr 21 00:01 tars_step_table_archive_2018-03-20T00:00:00_2018-03-20T23:59:59
-rw------- 1 slurm slurm 201299297 Apr 22 00:01 tars_step_table_archive_2018-03-21T00:00:00_2018-03-21T23:59:59
-rw------- 1 slurm slurm 129974326 Apr 23 00:01 tars_step_table_archive_2018-03-22T00:00:00_2018-03-22T23:59:59
-rw------- 1 slurm slurm 68700552 Apr 24 00:01 tars_step_table_archive_2018-03-23T00:00:00_2018-03-23T23:59:59
-rw------- 1 slurm slurm 40997689 Apr 25 00:01 tars_step_table_archive_2018-03-24T00:00:00_2018-03-24T23:59:59
-rw------- 1 slurm slurm 128560550 Apr 26 00:03 tars_step_table_archive_2018-03-25T00:00:00_2018-03-25T23:59:59
-rw------- 1 slurm slurm 70845809 Apr 27 00:03 tars_step_table_archive_2018-03-26T00:00:00_2018-03-26T23:59:59
-rw------- 1 slurm slurm 24216285 Apr 28 00:01 tars_step_table_archive_2018-03-27T00:00:00_2018-03-27T23:59:59
-rw------- 1 slurm slurm 237993265 Apr 29 00:01 tars_step_table_archive_2018-03-28T00:00:00_2018-03-28T23:59:59
-rw------- 1 slurm slurm 19889146 Apr 30 00:01 tars_step_table_archive_2018-03-29T00:00:00_2018-03-29T23:59:59
-rw------- 1 slurm slurm 80251193 May 1 00:01 tars_step_table_archive_2018-03-30T00:00:00_2018-03-30T23:59:59
-rw------- 1 slurm slurm 215350730 May 2 00:01 tars_step_table_archive_2018-03-31T00:00:00_2018-03-31T23:59:59
-rw------- 1 slurm slurm 141028162 May 3 00:04 tars_step_table_archive_2018-04-01T00:00:00_2018-04-01T23:59:59
-rw------- 1 slurm slurm 141143648 May 4 00:03 tars_step_table_archive_2018-04-02T00:00:00_2018-04-02T23:59:59
-rw------- 1 slurm slurm 246160884 May 5 00:01 tars_step_table_archive_2018-04-03T00:00:00_2018-04-03T23:59:59
-rw------- 1 slurm slurm 139078760 May 6 00:03 tars_step_table_archive_2018-04-04T00:00:00_2018-04-04T23:59:59
-rw------- 1 slurm slurm 171546334 May 7 00:03 tars_step_table_archive_2018-04-05T00:00:00_2018-04-05T23:59:59
-rw------- 1 slurm slurm 179421306 May 8 00:03 tars_step_table_archive_2018-04-06T00:00:00_2018-04-06T23:59:59
-rw------- 1 slurm slurm 144518990 May 9 00:01 tars_step_table_archive_2018-04-07T00:00:00_2018-04-07T23:59:59
-rw------- 1 slurm slurm 79614618 May 10 00:04 tars_step_table_archive_2018-04-08T00:00:00_2018-04-08T23:59:59
-rw------- 1 slurm slurm 76785583 May 11 00:03 tars_step_table_archive_2018-04-09T00:00:00_2018-04-09T23:59:59
-rw------- 1 slurm slurm 54465917 May 12 00:03 tars_step_table_archive_2018-04-10T00:00:00_2018-04-10T23:59:59
-rw------- 1 slurm slurm 455256771 May 15 00:03 tars_step_table_archive_2018-04-10T00:00:00_2018-04-13T23:59:59
-rw------- 1 slurm slurm 49718259 May 13 00:03 tars_step_table_archive_2018-04-11T00:00:00_2018-04-11T23:59:59
-rw------- 1 slurm slurm 642908605 May 14 00:04 tars_step_table_archive_2018-04-12T00:00:00_2018-04-12T23:59:59
-rw------- 1 slurm slurm 256772336 May 19 00:03 tars_step_table_archive_2018-04-12T00:00:00_2018-04-17T23:59:59
-rw------- 1 slurm slurm 101480091 May 16 00:03 tars_step_table_archive_2018-04-14T00:00:00_2018-04-14T23:59:59
-rw------- 1 slurm slurm 111564174 May 17 00:03 tars_step_table_archive_2018-04-15T00:00:00_2018-04-15T23:59:59
-rw------- 1 slurm slurm 278337830 May 18 00:04 tars_step_table_archive_2018-04-16T00:00:00_2018-04-16T23:59:59
-rw------- 1 slurm slurm 380835382 May 20 00:01 tars_step_table_archive_2018-04-18T00:00:00_2018-04-18T23:59:59
-rw------- 1 slurm slurm 484623757 May 21 00:01 tars_step_table_archive_2018-04-19T00:00:00_2018-04-19T23:59:59
-rw------- 1 slurm slurm 171905543 May 22 00:00 tars_step_table_archive_2018-04-20T00:00:00_2018-04-20T23:59:59
-rw------- 1 slurm slurm 393856952 May 23 00:01 tars_step_table_archive_2018-04-21T00:00:00_2018-04-21T23:59:59
-rw------- 1 slurm slurm 64644183 May 24 00:00 tars_step_table_archive_2018-04-22T00:00:00_2018-04-22T23:59:59
-rw------- 1 slurm slurm 56756708 May 25 00:00 tars_step_table_archive_2018-04-23T00:00:00_2018-04-23T23:59:59
-rw------- 1 slurm slurm 295824629 May 26 00:00 tars_step_table_archive_2018-04-24T00:00:00_2018-04-24T23:59:59
-rw------- 1 slurm slurm 175179686 May 27 00:00 tars_step_table_archive_2018-04-25T00:00:00_2018-04-25T23:59:59
-rw------- 1 slurm slurm 147678960 May 28 00:00 tars_step_table_archive_2018-04-26T00:00:00_2018-04-26T23:59:59
-rw------- 1 slurm slurm 123730748 May 29 00:02 tars_step_table_archive_2018-04-27T00:00:00_2018-04-27T23:59:59
-rw------- 1 slurm slurm 150537896 May 30 09:24 tars_step_table_archive_2018-04-28T00:00:00_2018-04-28T23:59:59
-rw------- 1 slurm slurm 150537896 May 30 00:02 tars_step_table_archive_2018-04-28T00:00:00_2018-04-28T23:59:59.old
-rw------- 1 slurm slurm 72774314 May 30 14:02 tars_step_table_archive_2018-04-29T00:00:00_2018-04-29T12:59:59
-rw------- 1 slurm slurm 82015784 May 30 14:22 tars_step_table_archive_2018-04-29T00:00:00_2018-04-29T13:59:59
I put the logs in attachment. I have put back
PurgeStepAfter=31days
so that slurmdbd could start again without stopping right away because of the err_lock_wait_timeout (purge of the job steps table), but it is almost certain that it will stop again tonight...
It seems like the hourly purge operation on steps didn't deadlock, but it made slurmdbd fatal() with the ER_LOCK_WAIT_TIMEOUT twice. I think we should try setting Purge[Job|Step]After to the 'hourly' equivalent of the current days, but also increase the MySQL parameters and see if the purge operations can finish with that increase.
I'd increase the innodb_buffer_pool_size taking into account the documentation considerations:
https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool-resize.html
But especially I'd increase innodb_lock_wait_timeout and see if that lets the operation finish:
https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_lock_wait_timeout
Can we try 'hourly' purges with increased MySQL parameters and see if that helps for the next hour? Thanks.
Created attachment 6957 [details]
17.11 patch take3
Hi Véronique. See attached another modification of the fix to this issue. Sorry for uploading yet another version. Please remove all previous patches from this bug and apply this one. We also require you set DebugFlags=DB_ARCHIVE before restarting slurmdbd so we can discern exactly what's going on and the state of the locks. Thanks!
Just to make it clear, we ask you to append DB_ARCHIVE to what we already requested. So you should at least have these flags set after the slurmdbd restart with the applied patch: DebugFlags=DB_JOB,DB_STEP,DB_ARCHIVE. Thanks!
Hi Alejandro, just to keep you informed, I forwarded your suggestions about the database configuration changes to our DBA yesterday. He will look at them today and get back to us soon. I am going to have a look at the second patch and see if we can apply it today. slurmdbd did an exit(1) last night as I expected. I started it again this morning and it seems OK so far. Regards
Hi Alejandro,
Just to make sure: there were two patches applied for this bug: this one, https://github.com/SchedMD/slurm/commit/ed74a22abf7c9066a2.patch, which Sophie applied, and 5086-take2.patch, which I applied. You mean we should remove both before applying the new one. Am I right?
Véronique
Hi Véronique, this one should remain applied:
https://github.com/SchedMD/slurm/commit/ed74a22abf7c9066a2.patch
Do not remove it. It is already checked into the Slurm 17.11 branch since tag 17.11.6, so it only needs to be applied manually if you're on <= 17.11.5. Then I asked you to remove the patch attached in comment 52:
https://bugs.schedmd.com/attachment.cgi?id=6907&action=diff
and apply the new version of that one, attached in comment 84.
With regards to the configuration, before restarting slurmdbd with the new code, we ask you to have at least these flags set in slurmdbd.conf: DebugFlags=DB_JOB,DB_STEP,DB_ARCHIVE, and to set Purge[Job|Step]After to the 'hourly' equivalent of the current days, but also increase the MySQL parameters and see if the purge operations can finish with that increase, as suggested in comment 78. Please let me know if you have any more questions. Thank you.
Hello,
Do you think we could install the 17.11.7 release to start from something clean and then apply patch take3? Thanks, Sophie
Sure thing.
We always encourage sticking to the latest micro release when possible. So yes, you can upgrade to 17.11.7 and apply patch take3 plus the changes in the config.
Hi. Did slurmdbd survive midnight yesterday?
> Hi. Did slurmdbd survive midnight yesterday?
No, but that was because we hadn't had enough time to put everything in place
(the update + the changes to the innodb parameters + the patch).
It's all now in place, and the first restart of slurmdbd worked as expected,
with the creation of the archive:
28M Jun 1 17:11 tars_step_table_archive_2018-05-01T00:00:00_2018-05-01T16:59:59
We will see if the next round goes as smoothly as this one.
Thanks for your concern,
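For reference, the 'hourly equivalent' conversion suggested earlier in the thread is a straight multiply by 24 (31 days becomes the PurgeStepAfter=744hours value quoted before); a trivial sketch:

```shell
# Convert a Purge*After value given in days to the hourly equivalent
# string expected by slurmdbd.conf.
days_to_hours() { printf '%shours\n' "$(( $1 * 24 ))"; }

echo "PurgeStepAfter=$(days_to_hours 31)"
```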
Up to now, everything works as expected. The archives are generated without error or deadlock :-)
2.1M Jun 1 18:00 tars_step_table_archive_2018-05-01T17:00:00_2018-05-01T17:59:59
23M Jun 1 18:02 tars_job_table_archive_2018-03-01T00:00:00_2018-03-02T01:59:59
Created attachment 6987 [details]
slurmdbd logs of the 06/03 until the ER_LOCK_WAIT_TIMEOUT
Hello,
Unfortunately, slurmdbd didn't survive yesterday morning. Lock wait
timeout error again, even with:
MariaDB [slurm_acct_db]> SELECT @@innodb_lock_wait_timeout;
+----------------------------+
| @@innodb_lock_wait_timeout |
+----------------------------+
| 900 |
+----------------------------+
2018-06-03T02:17:32+02:00 tars-acct slurmdbd[32691]: DBD_CLUSTER_TRES: cluster not registered
2018-06-03T02:17:33+02:00 tars-acct slurmdbd[32691]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012insert into "tars_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, array_task_str, array_max_tasks, array_task_pending, tres_req, work_dir) values (38246018, UNIX_TIMESTAMP
2018-06-03T02:17:33+02:00 tars-acct slurmdbd[32691]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
The last archives seem to have worked as expected:
2.9M Jun 3 01:00 tars_step_table_archive_2018-05-03T00:00:00_2018-05-03T00:59:59
490K Jun 3 01:02 tars_job_table_archive_2018-03-02T16:00:00_2018-03-02T16:59:59
2.6M Jun 3 02:00 tars_step_table_archive_2018-05-03T01:00:00_2018-05-03T01:59:59
1.3M Jun 3 02:02 tars_job_table_archive_2018-03-02T17:00:00_2018-03-02T17:59:59
I've attached the slurmdbd logs between midnight and 02:17:33. Do you need
something else? Thanks,
Created attachment 6989 [details]
Decrease
At least there aren't any deadlocks anymore. What I'd try now is:
1. Increase innodb_lock_wait_timeout even more, perhaps to 1500.
2. Apply the attached patch to decrease MAX_PURGE_LIMIT to half its current defined value, so that locks can be released more frequently, letting other operations happen, like recording job/step starts and completions, which seem to be the operations timing out in the logs.
Thanks.
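Rough intuition for point 2 (the numbers below are invented for illustration; the old limit of 50000 is only an assumption, consistent with the LIMIT 25000 that shows up in the delete queries later in the thread): halving the per-transaction batch doubles the number of delete transactions but halves how long each one holds its row locks.

```shell
# Pure-shell arithmetic sketch of the MAX_PURGE_LIMIT trade-off.
rows_to_purge=1000000   # invented backlog size, illustration only
old_limit=50000         # assumed pre-patch MAX_PURGE_LIMIT
new_limit=$((old_limit / 2))

echo "before: $((rows_to_purge / old_limit)) delete transactions, each holding locks longer"
echo "after:  $((rows_to_purge / new_limit)) delete transactions, each holding locks half as long"
```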
> At least there aren't any deadlocks anymore.
Sorry to contradict you, but it has just happened:
2018-06-04T16:02:28+02:00 tars-acct slurmdbd[23558]: error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction#012insert into "tars_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, array_task_str, array_max_tasks, array_task_pending, tres_req, work_dir) values (38267288, UN
2018-06-04T16:02:28+02:00 tars-acct slurmdbd[23558]: error: It looks like the storage has gone away trying to reconnect
> 1. Increase innodb_lock_wait_timeout even more, perhaps to 1500.
> 2. Apply the attached patch to decrease MAX_PURGE_LIMIT to half its current
> defined value so that locks can more frequently be released to let other
> operations happen like recording job/step start/complete, which seem to be
> the operations timing out from the logs.
Ok, I'm going to do that and keep you informed.
Can you attach the full dbd log from both failures? Thanks.
Created attachment 6991 [details]
slurmdbd.log of today 06/04 until 17:17
As you will see, I had to restart slurmdbd after the ER_LOCK_WAIT_TIMEOUT that
happened at 16:17. This time, slurmdbd archived the jobs again but as usual
the .old file and the new one have the same size:
5.2M Jun 4 16:00 tars_step_table_archive_2018-05-04T15:00:00_2018-05-04T15:59:59
167K Jun 4 16:02 tars_job_table_archive_2018-03-04T07:00:00_2018-03-04T07:59:59.old
167K Jun 4 16:35 tars_job_table_archive_2018-03-04T07:00:00_2018-03-04T07:59:59
We tried applying the patch alone to see if it was enough, but the ER_LOCK_WAIT_TIMEOUT happened again, so we have done as you suggested and are now keeping an eye on it.
Created attachment 7005 [details]
log of slurmdbd 06/06 00:00 -> 10:00
Hello,
Even with the patch and
+----------------------------+
| @@innodb_lock_wait_timeout |
+----------------------------+
| 1500 |
+----------------------------+
slurmdbd exited again due to an ER_LOCK_WAIT_TIMEOUT at 02:27 this morning.
I have attached the slurmdbd.log. The archive itself seems to have worked as
expected:
4.3M Jun 6 02:00 tars_step_table_archive_2018-05-06T01:00:00_2018-05-06T01:59:59
116K Jun 6 02:02 tars_job_table_archive_2018-03-05T17:00:00_2018-03-05T17:59:59
but the purge/deletion in the job table didn't happen afterwards. It only
took place when slurmdbd was restarted, and it didn't take long:
2018-06-06T08:51:10+02:00 tars-acct slurmdbd[21547]: 0(as_mysql_archive.c:3218) query#012delete from "tars_job_table" where time_submit <= 1520290799 && time_end != 0 LIMIT 25000
2018-06-06T08:52:53+02:00 tars-acct slurmdbd[21547]: 9(as_mysql_job.c:696) query#012insert into "tars_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, array_task_str, array_task_pending, tres_req, work_dir) values (38322007, UNIX_TIMESTAMP(), 0, 4294967294, 0, 4294967294, 406, 1, 31540, 96848, 'None assigned', 0,
So I guess the question is why the purge of the job table didn't complete
the first time.
Thanks,
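The delete quoted above runs repeatedly in LIMIT-sized chunks until the range is drained; a pure-shell simulation of that chunking (the row count is invented, the chunk size matches the LIMIT 25000 in the quoted query):

```shell
# Simulate the batched purge: delete at most $limit rows per transaction,
# repeating until nothing is left, so locks are released between chunks.
rows=120000   # invented backlog, illustration only
limit=25000   # matches the LIMIT in the delete query quoted above
batches=0
while [ "$rows" -gt 0 ]; do
    if [ "$rows" -lt "$limit" ]; then chunk=$rows; else chunk=$limit; fi
    rows=$((rows - chunk))
    batches=$((batches + 1))
done
echo "drained in $batches chunks"
```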
I'm still working on troubleshooting this. Just out of curiosity: I guess from the config that the MySQL instance is co-located with slurmdbd on the tars-acct node. What's the underlying filesystem/hardware setup backing the database?
> I'm still working on troubleshooting this. Just out of curiosity,
> I guess from the config that the mysql instance is co-located with
> slurmdbd in tars-acct node.
Yes, you are right.
> What's the underlying filesystem/setup hardware backing up the database?
$ df -h /opt/data/mariadb
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 744G 155G 590G 21% /opt/data
$ mount | grep sdb
/dev/sdb on /opt/data type xfs (rw)
/dev/sdb corresponds to 3 SSD disks in RAID 5. tars-acct has 256 GB of RAM and an Intel Xeon E5-2680 v4 @ 2.40GHz. Is that what you wanted to know?
Yes, this is what I was asking for. And the filesystem/hardware setup for:
ArchiveDir=/pasteur/services/policy01/slurm/tars/archives/
I'm just wondering (however unlikely) if the underlying setup was somehow acting as a bottleneck, preventing the process from running at an effective performance. And another question: do you happen to use sreport?
/pasteur/services/policy01/ is an NFS file system.
> just wondering (despite unlikely) if the underlying setup was somehow
> acting as a bottleneck for the process to happen at an effective
> performance.
Since we archive every hour, files are rather small, less than 100M. When slurmdbd exited this morning, the job archive file was 116K.
There are several insert into "tars_job_table" statements between the as_mysql_archive.c lines and the final ER_LOCK_WAIT_TIMEOUT, while or just before the job archive file was written:
2018-06-06T02:02:23+02:00 tars-acct slurmdbd[15070]: 9(as_mysql_job.c:696) query#012insert into "tars_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, array_task_str, array_max_tasks, array_task_pending, tres_req, work_dir) values (38319860, UNIX_TIMESTAMP(), 38319860, 4294967294, 0, 4294967294, 264, 20, 22037, 31
$ stat /pasteur/services/policy01/slurm/tars/archives/tars_job_table_archive_2018-03-05T17\:00\:00_2018-03-05T17\:59\:59
Access: (0600/-rw-------) Uid: ( 497/ slurm) Gid: ( 493/ slurm)
Access: 2018-06-06 02:02:23.380203816 +0200
Modify: 2018-06-06 02:02:23.383714218 +0200
Change: 2018-06-06 02:02:29.206564861 +0200
Were these inserts waiting for a lock? Since the job archive file was written, it means that the following lines were executed:
if (debug_flags & DEBUG_FLAG_DB_ARCHIVE)
	DB_DEBUG(mysql_conn->conn, "query\n%s", query);
if (!(result = mysql_db_query_ret(mysql_conn, query, 0))) {
	xfree(query);
	return SLURM_ERROR;
}
xfree(query);
if (!(cnt = mysql_num_rows(result))) {
	mysql_free_result(result);
	return 0;
}
buffer = (*pack_func)(result, cluster_name, cnt, usage_info, &period_start);
mysql_free_result(result);
error_code = archive_write_file(buffer, cluster_name, period_start,
				period_end, arch_dir, sql_table,
				archive_period);
If we had problems with locks in mysql_db_query_ret, could the file have been written?
> And another question: do you happen to use sreport?
We use it sometimes when we want to see how jobs will be scheduled and understand where their priority comes from, but we don't have crons executing such a command, if that is your question.
Created attachment 7024 [details]
traces from mariadb (locks)
Hi Alejandro,
Please have a look at the latest file I attached: it contains the mariadb traces about locks, and there were interesting things going on at the time of the last slurmdbd crash. Our DBA has pointed out that the following "select ... for update" request:
select account, array_max_tasks, nodes_alloc, id_assoc, id_array_job, id_array_task, id_block, derived_ec, derived_es, exit_code, timelimit, time_eligible, time_end, id_group, job_db_inx, id_job, kill_requid, mcs_label, job_name, nodelist, node_inx, pack_job_id, pack_job_offset, `partition`, priority, id_qos, cpus_req, mem_req, id_resv, time_start, state, time_submit, time_suspended, track_steps, id_user, wckey, id_wckey, work_dir, tres_alloc, tres_req from "tars_job_table" where time_submit < 1520402399 && time_end != 0 order by time_submit asc for update
tries to lock the job table but can't get that lock because the table is already locked by another request. So slurmdbd waits for the lock and finally exits when the timeout is reached...
Thanks for the database trace and analysis. We're still evaluating further solutions to this; sorry it's taking so long to be properly fixed.
Created attachment 7062 [details]
remove for update patch when archiving
Hi. Will you apply the attached patch and see how it goes? We highly suspect your DB admin's report is pinpointing the problem, and this should fix it. Please let us know how it goes. Thanks, and sorry for the delayed response.
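The shape of the change, as the earlier comments describe it: the archive-read selects lose their "for update" clause. A sketch (the column list is abbreviated here; the timestamp is the one from the DBA's trace):

```shell
# Print a before/after comparison of the archive read query; purely
# illustrative text output, no database involved.
archive_queries() {
cat <<'EOF'
-- before the patch: takes row locks while archiving, can deadlock
-- against concurrent job-completion inserts
select ... from "tars_job_table" where time_submit < 1520402399
    && time_end != 0 order by time_submit asc for update
-- after the patch: plain consistent read, no row locks held
select ... from "tars_job_table" where time_submit < 1520402399
    && time_end != 0 order by time_submit asc
EOF
}
archive_queries
```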
Hello,
> Hi. Will you apply the attached patch and see how it goes? We highly suspect
> your db admin report is pinpointing to the problem and this should fix it.
Done.
> Please, let us know how it goes.
Yes, of course. For the moment, everything works well. The archives/purges have completed successfully without any lock problem. Thanks.
Created attachment 7069 [details]
last hour of slurmdbd logs before archive/purge of 7pm and ER_LOCK_WAIT_TIMEOUT
Sorry, ER_LOCK_WAIT_TIMEOUT again.
Hi. We were able to reproduce without the last patch applied. Once it was applied, recompiled, and slurmdbd restarted, we were not able to reproduce anymore. Sorry to ask, but we're interested in making sure you didn't forget to recompile and restart the daemon (it has been restarted now, since it fatal'ed). Thanks.

I would have asked the same of my own users, don't worry ;-) And you are right: unlike slurmctld and the slurmds, slurmdbd wasn't restarted after the update. I checked in the logs. My apologies for that oversight. Since the restart, no archive/purge problems.

Thanks, Sophie. We really think the real fix for this bug is just the last patch, which removes the "for update" clause from 4 select queries in the _archive_table function (attachment 7062 [details]). So we'd encourage you to apply just this one on a stock build, so that you have a clean build without the intermediate patches we've been attaching during the course of this bug. If after a day or two you confirm there are no more issues, we will check this last patch in to .8 and close the bug. Thanks for your collaboration.
17.11.7 reinstalled with only the "remove for update" patch and, of course, slurmdbd restarted like the other daemons. I'll inform you if something happens. No timeout has taken place since the last restart.

Good morning. How did slurmdbd behave overnight? We can leave it running during the day, and if everything is fine I'd check the patch in and close this. Thanks.

Hello,

No problem with slurmdbd up to now. I'm going to clean the runaway jobs to see if new ones appear.

Thanks,

Hello,

No problem over the weekend :-)

Do you think that even with that change we could face ER_LOCK_WAIT_TIMEOUT again in the future if the number of jobs increases a lot?

Thanks,

(In reply to Sophie Créno from comment #134)
> Hello,
>
> No problem over the weekend :-)

That is good, thanks for the feedback. We've pushed this commit:

    commit 4c448fd72c4f36ae309a46af8de2c9799f363822 (HEAD -> slurm-17.11, origin/slurm-17.11)
    Author:     Danny Auble <da@schedmd.com>
    AuthorDate: Mon Jun 18 15:43:31 2018 +0200
    Commit:     Alejandro Sanchez <alex@schedmd.com>
    CommitDate: Mon Jun 18 15:43:31 2018 +0200

        MySQL - Prevent deadlock caused by archive logic locking reads.
        Specifically due to SELECT ... FOR UPDATE ones. Bug 5086.

which will be available from 17.11.8 onwards.

> Do you think that even with that change, we could face ER_LOCK_WAIT_TIMEOUT
> again in the future if the number of jobs increase a lot?
>
> Thanks,

Theoretically not. But in practice there can always arise specific use-cases where (especially under heavy HTC conditions) this could happen. ER_LOCK_WAIT_TIMEOUT can be caused either by genuinely bad things like the deadlock reported in the MySQL traces here, or by heavy operations taking too long (the rollup process is prone to this), or similar. Although with this patch we've reduced the likelihood of this problem, unusual combinations of different kinds of operations accessing the same tables could still cause problems. We always look forward and strive to avoid all this. I'm gonna go ahead and close this bug.
Thanks again for your patience and collaboration.
Hello,

Since our last upgrade to 17.11.5, we have observed 2 types of runaway jobs:

* Some are seen simultaneously as PENDING:

    ID           Name       Partition  Cluster    State      TimeStart           TimeEnd
    ------------ ---------- ---------- ---------- ---------- ------------------- -------------------
    30535967     allocation            tars       PENDING    Unknown             Unknown

and COMPLETED:

    $ sacct -j 30535967
    JobID        JobName    Partition  QOS        State      Start               Elapsed    End                 ExitCode
    ------------ ---------- ---------- ---------- ---------- ------------------- ---------- ------------------- --------
    30533734_19+ ez_50      dedicated  fast       COMPLETED  2018-04-08T16:11:25 00:20:37   2018-04-08T16:32:02 0:0
    30533734_19+ batch                            COMPLETED  2018-04-08T16:11:25 00:20:37   2018-04-08T16:32:02 0:0
    30533734_19+ mcell                            COMPLETED  2018-04-08T16:11:25 00:20:37   2018-04-08T16:32:02 0:0

Note that these jobs requeue themselves automatically by exiting with the code set as RequeueExit in slurm.conf.

* Others are seen as RUNNING:

    ID           Name       Partition  Cluster    State      TimeStart           TimeEnd
    ------------ ---------- ---------- ---------- ---------- ------------------- -------------------
    32070617     D-15-0013+ dedicated  tars       RUNNING    2018-04-19T23:43:55 Unknown

    $ sacct -j 32070617
    JobID        JobName    Partition  QOS        State      Start               Elapsed    End
    ------------ ---------- ---------- ---------- ---------- ------------------- ---------- -------------------
    32070617     D-15-0013+ dedicated  fast       RUNNING    2018-04-19T23:43:55 15:45:52   Unknown

but have in fact been over for several hours:

    2018-04-19T22:55:33+02:00 tars-master slurmctld[29480]: _slurm_rpc_submit_batch_job: JobId=32070617 InitPrio=7029 usec=7829
    2018-04-19T23:43:55+02:00 tars-master slurmctld[29480]: sched: Allocate JobID=32070617 NodeList=tars-464 #CPUs=1 Partition=dedicated
    2018-04-20T00:05:07+02:00 tars-master slurmctld[29480]: _job_complete: JobID=32070617 State=0x1 NodeCnt=1 WEXITSTATUS 0
    2018-04-20T00:05:07+02:00 tars-master slurmctld[29480]: _job_complete: JobID=32070617 State=0x8003 NodeCnt=1 done
    2018-04-20T00:07:38+02:00 tars-master slurmctld[29480]: Resending TERMINATE_JOB request JobId=32070617 Nodelist=tars-464
    2018-04-20T00:08:35+02:00 tars-master slurmctld[29480]: cleanup_completing: job 32070617 completion process took 208 seconds

In both cases, we don't understand why they are runaway jobs:

* why the first type of job shows 2 different states depending on the command used to query the db,
* why the second type is not marked as completed in the db, whereas it seems that everything behaved as expected. Can it be due to the completion process time?

As a side effect, for the first type, the information displayed by sacct is correct, but when we fix the runaway jobs the end time is set to the start time and thus becomes wrong:

    $ sacct -j 30535967
    JobID        JobName    Partition  QOS        State      Start               Elapsed    End                 ExitCode
    ------------ ---------- ---------- ---------- ---------- ------------------- ---------- ------------------- --------
    30533734_19+ ez_50      dedicated  fast       COMPLETED  2018-04-08T16:11:25 00:00:00   2018-04-08T16:11:25
    30533734_19+ batch                            COMPLETED  2018-04-08T16:11:25 00:20:37   2018-04-08T16:32:02 0:0
    30533734_19+ mcell                            COMPLETED  2018-04-08T16:11:25 00:20:37   2018-04-08T16:32:02 0:0

What should we do to avoid these 2 situations? We would like to keep the accounting as accurate as possible. I guess runaway jobs should be rare and only appear in special circumstances, such as jobs terminating while the daemons are stopped because of a maintenance operation (such as patching Slurm).

Thanks in advance,
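Regarding the end time being rewritten when fixing runaways: per the sacctmgr warning quoted earlier in this thread, the fix sets the end time to the latest of the submit, eligible, and start times, which is why a job whose recorded times all fall at or before its start ends up with Elapsed 00:00:00. A minimal sketch of that rule follows; the submit and eligible times are hypothetical, only the start time is taken from the sacct output above:

```python
# Sketch of the rule sacctmgr applies when fixing a runaway job:
# end time := max(submit, eligible, start), state := COMPLETED.
# Field names and the submit/eligible values here are illustrative.
from datetime import datetime

submit = datetime(2018, 4, 8, 16, 10, 0)    # hypothetical
eligible = datetime(2018, 4, 8, 16, 10, 0)  # hypothetical
start = datetime(2018, 4, 8, 16, 11, 25)    # from the sacct output above

new_end = max(submit, eligible, start)
elapsed = new_end - start

print(new_end, elapsed)  # -> 2018-04-08 16:11:25 0:00:00
```

Since a job's start is normally the latest of those three timestamps, the corrected end equals the start and the elapsed time collapses to zero, matching the "ez_50" line in the fixed sacct output.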