Hi Sophie. Will you show me the output of sacct with --duplicates?

$ sacct -j 30535967 --duplicates
$ sacct -j 32070617 --duplicates

Are the RequeueExit jobs being 'scontrol update'd' right after the requeue happens? I'm wondering if you are hitting the same issue that was reported in bug 4855, for which we provided a fix:

https://github.com/SchedMD/slurm/commit/ed74a22abf7c9066a2
https://github.com/SchedMD/slurm/commit/ed74a22abf7c9066a2.patch (if needed)

that will be available since 17.11.6. The fact that a fixed runaway job's end_time is set to its start_time is normal, as advised by the sacctmgr show runawayjobs warning:

Would you like to fix these runaway jobs?
(This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed.
Once corrected, this will trigger the rollup to reroll usage from before the oldest runaway job.)
(You have 30 seconds to decide)
(N/y): n

Thanks.

Hello,

> $ sacct -j 30535967 --duplicates

Since this one has been requeued several times, I only put an excerpt:

       JobID JobName      State               Start                 End ExitCode
------------ ------- ---------- ------------------- ------------------- --------
30533734_19+   ez_50  COMPLETED 2018-03-30T19:43:03 2018-03-30T19:43:03     99:0
30533734_19+   batch     FAILED 2018-03-30T19:43:03 2018-03-30T20:08:14     99:0
30533734_19+   mcell  CANCELLED 2018-03-30T19:43:09 2018-03-30T20:09:02     0:15
30533734_19+   ez_50  COMPLETED 2018-03-30T20:33:11 2018-03-30T20:33:11     99:0
30533734_19+   batch     FAILED 2018-03-30T20:33:11 2018-03-30T20:58:01     99:0
30533734_19+   mcell  CANCELLED 2018-03-30T20:33:14 2018-03-30T20:58:07     0:15
[...]
30533734_19+   ez_50  COMPLETED 2018-04-08T14:57:22 2018-04-08T14:57:22     99:0
30533734_19+   batch     FAILED 2018-04-08T14:57:22 2018-04-08T15:22:09     99:0
30533734_19+   mcell  CANCELLED 2018-04-08T14:57:22 2018-04-08T15:22:10     0:15
30533734_19+   ez_50  COMPLETED 2018-04-08T15:28:51 2018-04-08T15:28:51     99:0
30533734_19+   batch     FAILED 2018-04-08T15:28:51 2018-04-08T15:53:46     99:0
30533734_19+   mcell  CANCELLED 2018-04-08T15:28:52 2018-04-08T15:54:05     0:15
30533734_19+   ez_50  COMPLETED 2018-04-08T16:11:25 2018-04-08T16:11:25      0:0
30533734_19+   batch  COMPLETED 2018-04-08T16:11:25 2018-04-08T16:32:02      0:0
30533734_19+   mcell  COMPLETED 2018-04-08T16:11:25 2018-04-08T16:32:02      0:0

> $ sacct -j 32070617 --duplicates

       JobID JobName      State               Start                 End ExitCo
------------ ------- ---------- ------------------- ------------------- ------
    32070617 D-15-0+  COMPLETED 2018-04-19T23:43:55 2018-04-19T23:43:55    0:0

> Are the RequeueExit jobs being 'scontrol update'd' right after the requeue
> happens?

No, we just let SLURM requeue the job automatically. We don't do anything else on them, manually or otherwise.

> I'm wondering if you are hitting the same issue that was reported in
> bug 4855, and we provided a fix:
>
> https://github.com/SchedMD/slurm/commit/ed74a22abf7c9066a2
> https://github.com/SchedMD/slurm/commit/ed74a22abf7c9066a2.patch (if needed)
>
> that will be available since 17.11.6. The fact that fixed runawayjobs end_time
> is set to the start_time is normal, as advised by the sacctmgr show runawayjobs
> warning.

Yes, I know, but it's a pity when the job has completed, like 30535967, and was correct in the db before sacctmgr show runawayjobs.

Thanks.

For job 30535967, which two commands are showing different states when querying the db? If sacctmgr show runawayjobs was showing it as PENDING, that is normal until you fix it; then sacct -j 30535967 --duplicates should be showing all the right states. Isn't that the case? Or were you comparing other tools?
Job 32070617's state seems to have been fixed to COMPLETED after a while, or after you fixed it with runawayjobs. Looking at the logs you provided:

2018-04-20T00:07:38+02:00 tars-master slurmctld[29480]: Resending TERMINATE_JOB request JobId=32070617 Nodelist=tars-464
2018-04-20T00:08:35+02:00 tars-master slurmctld[29480]: cleanup_completing: job 32070617 completion process took 208 seconds

The "Resending TERMINATE_JOB request" message indicates that when the ctld background thread executed the logic for purging old job records (see slurm.conf MinJobAge), the job was in COMPLETING state and Slurm tried to deallocate its nodes a second time, because the first attempt didn't deallocate them. The completion process took 208 seconds. Was this job also a RequeueExit one? Anyway, the output from the last sacct query on this job shows its state as COMPLETED.

(In reply to Sophie Créno from comment #2)
> Yes I know but it's a pity when the job has completed like 30535967 and is
> correct
> in the db before the sacctmgr show runawayjobs.

Yeah, but it would be incorrect to leave the end time unmodified since, by definition, runaway jobs are those that don't exist in the controller but are still considered running in the database, so you would end up with incorrect end time values.

> For job 30535967, which two commands are showing different states when
> querying the db?
It was sacctmgr show runawayjobs for the PENDING state:

          ID       Name  Partition    Cluster      State           TimeStart             TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- -------------------
    30535967 allocation                  tars    PENDING             Unknown             Unknown

and sacct for the COMPLETED state:

       JobID    JobName  Partition        QOS      State               Start    Elapsed                 End ExitCode
------------ ---------- ---------- ---------- ---------- ------------------- ---------- ------------------- --------
30533734_19+      ez_50  dedicated       fast  COMPLETED 2018-04-08T16:11:25   00:20:37 2018-04-08T16:32:02      0:0
30533734_19+      batch                        COMPLETED 2018-04-08T16:11:25   00:20:37 2018-04-08T16:32:02      0:0
30533734_19+      mcell                        COMPLETED 2018-04-08T16:11:25   00:20:37 2018-04-08T16:32:02      0:0

> If sacctmgr show runawayjobs was showing it as PENDING it is normal
> until you fix it,

Sorry, I don't understand why it was considered a runaway job when it was completed. I would have expected to see it as completed in the sacct output and not appearing in sacctmgr show runawayjobs.

We have

> then sacct -j 30535967 --duplicates should be showing all the right
> states. Isn't that the case? or you were comparing other tools?

They were already right, so they still remain; it's just the end time that was correct before being fixed by sacctmgr show runawayjobs, but I know that is the expected output of the command.

> Job 32070617 state seems to have been fixed to COMPLETED after a while
> or after you fixed it with runawayjobs.

Yes, it was fixed by sacctmgr show runawayjobs.
> Looking at the logs you provided:
>
> 2018-04-20T00:07:38+02:00 tars-master slurmctld[29480]: Resending TERMINATE_JOB
> request JobId=32070617 Nodelist=tars-464
> 2018-04-20T00:08:35+02:00 tars-master slurmctld[29480]: cleanup_completing: job
> 32070617 completion process took 208 seconds
>
> The "Resending TERMINATE_JOB request" message indicates that the ctld
> background thread, executing the logic for purging old job records (see
> slurm.conf MinJobAge),

which is set to 300 in our slurm.conf so that we have a few minutes to see a job in the output of scontrol even after it is over.

> the job was in COMPLETING state and Slurm tried to deallocate its nodes
> for a second time, because the first attempt didn't deallocate them.
> The completion process took 208 seconds.

I guess the completion process should be quicker and should have succeeded on the first attempt. Do you have an idea where the problem comes from? Can we change something to improve that?

> Was this job also a RequeueExit one?

No, it's a regular one.

> Anyway the output from the last sacct query on this job shows its state
> as COMPLETED.

Yes, but that's only due to sacctmgr show runawayjobs.
Currently, I also observe this type of runaway job, considered by everyone as RUNNING:

sacctmgr show runawayjobs

      ID       Name  Partition    Cluster      State           TimeStart     TimeEnd
-------- ---------- ---------- ---------- ---------- ------------------- -----------
32210221 D-15-0012+     common       tars    RUNNING 2018-04-20T21:01:59     Unknown

sacct -D -j 32210221

   JobID     JobName  Partition    QOS      State                  Start      Elapsed
-------- ----------- ---------- ------ ---------- ---------------------- ------------
32210221  D-15-0012+     common   fast    RUNNING    2018-04-20T21:01:59   5-14:27:17

but in reality it is over:

2018-04-20T20:21:17+02:00 tars-master slurmctld[8329]: _slurm_rpc_submit_batch_job: JobId=32210221 InitPrio=7248 usec=7847
2018-04-20T21:01:59+02:00 tars-master slurmctld[8329]: sched: Allocate JobID=32210221 NodeList=tars-960 #CPUs=1 Partition=common
2018-04-20T21:17:44+02:00 tars-master slurmctld[8329]: _job_complete: JobID=32210221 State=0x1 NodeCnt=1 WEXITSTATUS 0
2018-04-20T21:17:44+02:00 tars-master slurmctld[8329]: _job_complete: JobID=32210221 State=0x8003 NodeCnt=1 done

Again, since it is over, it should be COMPLETED in the sacct output and not appear in sacctmgr show runawayjobs. This one is of the same kind as job 32070617, not a requeued one. What are we doing wrong?

Thanks.

The problem is you can't expect the information for a runaway job to be correct before it is fixed. Runaway jobs are, by definition, jobs that don't exist in the controller but are still pending, running or suspended in the database. Both 30535967 and 32210221 are examples of this.

> The problem is you can't expect the information for a runaway job to be correct
> before it is fixed. Runawayjobs are, by definition, jobs that don't exist in
> the controller but are still pending, running or suspended in the database.
> Both 30535967 and 32210221 are examples of this.
Ok so what should we change in the configuration or elsewhere in order to
ensure that the database is updated when the jobs complete?
Thanks,
Sophie, could you verify whether the suggested patch (available since 17.11.6), or directly from here:

https://github.com/SchedMD/slurm/commit/ed74a22abf7c9066a2.patch

is applied to your system? If that's not the case, will you apply it and see if that helps improve things? Thank you.

Hello,

The patch is now in place. I'll be on vacation but I will ask my colleague to keep you informed.

Thanks,

I don't know if it is linked, but I see things like the following right now:
2018-05-04T15:18:12+02:00 tars-master slurmctld[23754]: error: Batch completion for job 34150382 sent from wrong node (tars-483 rather than tars-400). Was the job requeued due to node failure?
2018-05-04T16:39:07+02:00 tars-master slurmctld[23754]: error: Batch completion for job 34159943 sent from wrong node (tars-446 rather than tars-470). Was the job requeued due to node failure?
Here is the output of sacct for these jobs:
sacct --format=jobid,jobname,state,start,elapse,end,exitcode,nodelist -D -j 34150382
JobID JobName State Start Elapsed End ExitCode NodeList
------------ ---------- ---------- ------------------- ---------- ------------------- -------- ---------------
34150382 S-16-0133+ PREEMPTED 2018-05-04T15:01:53 00:00:00 2018-05-04T15:01:53 0:0 tars-483
34150382 S-16-0133+ FAILED 2018-05-04T15:05:16 00:13:05 2018-05-04T15:18:21 1:0 tars-400
34150382.ba+ batch FAILED 2018-05-04T15:05:16 00:13:05 2018-05-04T15:18:21 1:0 tars-400
sacct --format=jobid,jobname,state,start,elapse,end,exitcode,nodelist -D -j 34159943
JobID JobName State Start Elapsed End ExitCode NodeList
------------ ---------- ---------- ------------------- ---------- ------------------- -------- ---------------
34159943 S-16-0127+ PREEMPTED 2018-05-04T16:15:01 00:00:00 2018-05-04T16:15:01 0:0 tars-445
34159943 S-16-0127+ PREEMPTED 2018-05-04T16:17:22 00:00:00 2018-05-04T16:17:22 0:0 tars-452
34159943.ba+ batch CANCELLED 2018-05-04T16:17:04 00:00:20 2018-05-04T16:17:24 0:15 tars-452
34159943 S-16-0127+ PREEMPTED 2018-05-04T16:19:27 00:00:00 2018-05-04T16:19:27 0:0 tars-446
34159943 S-16-0127+ COMPLETED 2018-05-04T16:21:33 00:17:07 2018-05-04T16:38:40 0:0 tars-470
34159943.ba+ batch COMPLETED 2018-05-04T16:21:33 00:17:07 2018-05-04T16:38:40 0:0 tars-470
It's odd because, for job 34150382, it was the former node that sent
the completion information while slurmctld was waiting
for it from the new one. For job 34159943, it's the opposite.
Here are the complete logs from slurmctld about these jobs:
2018-05-04T14:58:24+02:00 tars-master slurmctld[23754]: _slurm_rpc_submit_batch_job: JobId=34150382 InitPrio=6186 usec=1216
2018-05-04T15:01:39+02:00 tars-master slurmctld[23754]: backfill: Started JobID=34150382 in dedicated on tars-483
2018-05-04T15:01:53+02:00 tars-master slurmctld[23754]: preempted job 34150382 has been requeued to reclaim resources for job 34150806
2018-05-04T15:01:56+02:00 tars-master slurmctld[23754]: Requeuing JobID=34150382 State=0x0 NodeCnt=0
2018-05-04T15:05:16+02:00 tars-master slurmctld[23754]: backfill: Started JobID=34150382 in dedicated on tars-400
2018-05-04T15:18:12+02:00 tars-master slurmctld[23754]: error: Batch completion for job 34150382 sent from wrong node (tars-483 rather than tars-400). Was the job requeued due to node failure?
2018-05-04T15:18:21+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34150382 State=0x1 NodeCnt=1 WEXITSTATUS 1
2018-05-04T15:18:21+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34150382 State=0x8005 NodeCnt=1 done
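As an aside, the hex State values in these _job_complete lines can be decoded by masking: the low byte is the base job state and 0x8000 is the COMPLETING flag. The enum values below are taken from slurm.h and should be treated as an assumption for your exact version; for the 0x8005 above:

```shell
# Decode the State field from a _job_complete log line.
# Base-state values (assumed from slurm.h): 3=JOB_COMPLETE, 5=JOB_FAILED.
state=0x8005
base=$(( state & 0xFF ))                  # low byte: base state (5 = JOB_FAILED here)
completing=$(( (state & 0x8000) != 0 ))   # 0x8000 = JOB_COMPLETING flag
echo "base=$base completing=$completing"  # → base=5 completing=1
```

This matches the FAILED state sacct reports for the second run of job 34150382, while 0x8003 (seen elsewhere in this ticket) decodes to base state 3, JOB_COMPLETE.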
2018-05-04T16:14:38+02:00 tars-master slurmctld[23754]: _slurm_rpc_submit_batch_job: JobId=34159943 InitPrio=6186 usec=1597
2018-05-04T16:14:56+02:00 tars-master slurmctld[23754]: backfill: Started JobID=34159943 in dedicated on tars-445
2018-05-04T16:15:01+02:00 tars-master slurmctld[23754]: preempted job 34159943 has been requeued to reclaim resources for job 34159387
2018-05-04T16:15:03+02:00 tars-master slurmctld[23754]: Requeuing JobID=34159943 State=0x0 NodeCnt=0
2018-05-04T16:15:03+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34159943 State=0x0 NodeCnt=0 WTERMSIG 15
2018-05-04T16:15:03+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34159943 State=0x0 NodeCnt=0 cancelled by interactive user
2018-05-04T16:17:04+02:00 tars-master slurmctld[23754]: sched: Allocate JobID=34159943 NodeList=tars-452 #CPUs=1 Partition=dedicated
2018-05-04T16:17:22+02:00 tars-master slurmctld[23754]: preempted job 34159943 has been requeued to reclaim resources for job 34160314
2018-05-04T16:17:24+02:00 tars-master slurmctld[23754]: Requeuing JobID=34159943 State=0x0 NodeCnt=0
2018-05-04T16:19:26+02:00 tars-master slurmctld[23754]: sched: Allocate JobID=34159943 NodeList=tars-446 #CPUs=1 Partition=dedicated
2018-05-04T16:19:27+02:00 tars-master slurmctld[23754]: preempted job 34159943 has been requeued to reclaim resources for job 34160314
2018-05-04T16:19:32+02:00 tars-master slurmctld[23754]: Requeuing JobID=34159943 State=0x0 NodeCnt=0
2018-05-04T16:21:33+02:00 tars-master slurmctld[23754]: sched: Allocate JobID=34159943 NodeList=tars-470 #CPUs=1 Partition=dedicated
2018-05-04T16:38:40+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34159943 State=0x1 NodeCnt=1 WEXITSTATUS 0
2018-05-04T16:38:40+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34159943 State=0x8003 NodeCnt=1 done
2018-05-04T16:39:07+02:00 tars-master slurmctld[23754]: error: Batch completion for job 34159943 sent from wrong node (tars-446 rather than tars-470). Was the job requeued due to node failure?
I've found in former logs that we have already seen such things over the previous
months, always on preempted jobs. But it seems that some of them do not have a correct
state in the db regarding the first run:
sacct --format=jobid,jobname,state,start,elapse,end,exitcode,nodelist -D -j 29249884
JobID JobName State Start Elapsed End ExitCode NodeList
------------ ---------- ---------- ------------------- ---------- ------------------- -------- ---------------
29249884 C260p.B2_+ PENDING 2018-03-09T10:54:27 00:00:00 2018-03-09T10:54:27 0:0 tars-516
29249884 C260p.B2_+ COMPLETED 2018-03-09T10:56:30 00:18:18 2018-03-09T11:14:48 0:0 tars-427
Thanks,
The "Batch completion for job" logged messages can be the result of the slurmd on the batch_host (host where batch script is executed) failing, but the slurmstepd continuing to run. Then the batch job is requeued and started on a different node. The end result is one batch complete RPC from each node. But this message should be harmless. Please also note the patches in 17.11.5 that fix the pending/completing state in the database will only work on new jobs. Are you seeing bad states in the database for preempted & requeued jobs started after the patches have been applied? Thanks. Hello Alejandro, Sophie applied the patch last friday at approx. noon (local time here in France). I cleaned the runaway jobs this morning. I didn't pay attention to whether they had started before of after applying the patch. Right now, there are no new runaway jobs but it is too soon to tell if the patch has solved the issue or not. I will get back to you in a few days. Regards, Véronique *** Ticket 5123 has been marked as a duplicate of this ticket. *** Hello Tim,
I found the following runaway jobs this morning:
vlegrand@tars-submit0 ~ $ sacctmgr show runawayjobs
NOTE: Runaway jobs are jobs that don't exist in the controller but are still considered pending, running or suspended in the database
ID Name Partition Cluster State TimeStart TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- -------------------
34830236 C020s.B_a+ dedicated tars RUNNING 2018-05-08T00:06:25 Unknown
34830237 C020s.B_a+ dedicated tars RUNNING 2018-05-08T00:06:25 Unknown
34830238 C020s.B_a+ dedicated tars RUNNING 2018-05-08T00:06:25 Unknown
34830239 C020s.B_a+ dedicated tars RUNNING 2018-05-08T00:06:25 Unknown
34830240 C020s.B_a+ dedicated tars RUNNING 2018-05-08T00:06:25 Unknown
Would you like to fix these runaway jobs?
(This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed.
Once corrected, this will trigger the rollup to reroll usage from before the oldest runaway job.)
(You have 30 seconds to decide)
(N/y): N
Changes Discarded
According to Slurm they started at 00:06:25 (our time) yesterday (May 8), so after the application of the patch.
Here is some other information about them:
vlegrand@tars-submit0 ~ $ sacct -j 34830236 -D
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
34830236 C020s.B_a+ dedicated hubbioit 1 RUNNING 0:0
vlegrand@tars-submit0 ~ $ sacct -j 34830236 --duplicates
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
34830236 C020s.B_a+ dedicated hubbioit 1 RUNNING 0:0
vlegrand@tars-submit0 ~ $ sacct -D -j 34830236 -o Start,Elapsed,nodelist
Start Elapsed NodeList
------------------- ---------- ---------------
2018-05-08T00:06:25 1-09:50:23 tars-629
So, it has not been requeued.
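The Elapsed value above is itself a clue: start plus elapsed equals the moment the query was run, i.e. the database is still accumulating run time for a job that actually finished within seconds. A quick sanity check with GNU date (the Europe/Paris timezone is an assumption based on the +02:00 offsets in the logs):

```shell
# Start (2018-05-08T00:06:25) plus the reported Elapsed (1-09:50:23)
# lands at the time the sacct query was issued.
TZ=Europe/Paris date -d '2018-05-08T00:06:25 1 day 9 hours 50 minutes 23 seconds' \
    '+%Y-%m-%dT%H:%M:%S'   # → 2018-05-09T09:56:48
```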
In slurmdbd.log, I have:
2018-05-08T00:06:27+02:00 tars-acct slurmdbd[18198]: error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction#012update "tars_job_table" set mod_time=UNIX_TIMESTAMP(), time_end=1525730787, state=3, derived_ec=0, tres_alloc='1=1,2=3072,3=18446744073709551614,4=1,5=1', exit_code=0, kill_requid=-1 where job_db_inx=74928152;
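(The time_end epoch in that failed UPDATE can be cross-checked against the slurmctld timestamps with date(1); the timezone is an assumption based on the +02:00 offsets in the logs:)

```shell
# 1525730787 is the time_end the failed UPDATE tried to store
TZ=Europe/Paris date -d @1525730787 '+%Y-%m-%dT%H:%M:%S%z'   # → 2018-05-08T00:06:27+0200
```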
In slurmctld.log, I have:
2018-05-08T00:06:24+02:00 tars-master slurmctld[23754]: _slurm_rpc_submit_batch_job: JobId=34830236 InitPrio=2984 usec=701
2018-05-08T00:06:25+02:00 tars-master slurmctld[23754]: sched: Allocate JobID=34830236 NodeList=tars-629 #CPUs=1 Partition=dedicated
2018-05-08T00:06:27+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34830236 State=0x1 NodeCnt=1 WEXITSTATUS 0
2018-05-08T00:06:27+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34830236 State=0x8003 NodeCnt=1 done
So, yes, it seems that the job is a "runaway" because slurmdbd wasn't able to update the job table in the database due to a deadlock.
Please let me know if you need any additional information. I am keeping the runaway jobs as is in case you would need to do further investigation.
Regards,
Véronique
Hi Véronique. Could you please upload the slurmdbd, slurmctld, and tars-629 slurmd logs from before job 34830236's submission until its completion? That can be helpful to narrow down the root cause of the deadlock. Let's keep the newly reported runaway jobs unfixed for now, just in case we need more information from you. Thank you.

Created attachment 6846 [details]
extract of slurmctld log
Created attachment 6847 [details]
extract of slurmdbd.log
Created attachment 6848 [details]
extract of slurmd.log for tars-629
Hi Alejandro,

Here they are, attached. Just to let you know: there are plenty of new runaway jobs since the last time I checked (on Wednesday).

Regards,
Véronique

Hi Véronique. Could you please attach them again fully, without filtering by jobid strings or similar? The end goal is to know what was going on while the jobs were active in the system, and potentially derive a set of events/operations leading to the deadlock. With the filtered logs I can't see such information. Thank you.

Hi Alejandro,

I didn't filter them. I just took the whole log from 1 minute before the job started to 1 minute after it ended for slurmctld. For slurmdbd, there was not much in the logs; you have it all. For slurmd, I took from 2 minutes before the start of the job to 3 minutes after it terminated. The whole files are huge. How many minutes before and after the job would you need?

Yes, sorry, I see you just had the debug level set at info and there isn't much to work off of these messages. Are the timestamps for the completion of the new runaway jobs coincident with these messages in the dbd.log reporting the deadlock?

Yes. In slurmctld.log, we have:

2018-05-08T00:06:27+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34830236 State=0x1 NodeCnt=1 WEXITSTATUS 0
2018-05-08T00:06:27+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34830236 State=0x8003 NodeCnt=1 done

and at the same time, in slurmdbd.log, you can see:

2018-05-08T00:06:27+02:00 tars-acct slurmdbd[18198]: error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction#012update "tars_job_table" set mod_time=UNIX_TIMESTAMP(), time_end=1525730787, state=3, derived_ec=0, tres_alloc='1=1,2=3072,3=18446744073709551614,4=1,5=1', exit_code=0, kill_requid=-1 where job_db_inx=74928152;

Isn't that suspicious?

Indeed, it looks like the deadlock prevents the as_mysql_job_complete() function from updating the job fields, thus generating new runaway jobs.
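MySQL's error 1213 message itself suggests the remedy ("try restarting transaction"). A generic retry wrapper in shell, as a sketch only: slurmdbd implements its own retry logic internally, this is not Slurm code, and the database name and statement here are placeholders:

```shell
# Retry a mysql statement a few times on failure (e.g. error 1213, deadlock).
# Illustrative sketch: 'slurm_acct_db' and the statement are placeholders.
run_sql() {
    tries=0
    until mysql slurm_acct_db -e "$1"; do
        tries=$((tries + 1))
        [ "$tries" -ge 3 ] && return 1
        sleep 1   # brief backoff before restarting the transaction
    done
}
```

InnoDB resolves a deadlock by rolling back one of the transactions, so a retry of the rolled-back statement normally succeeds, which is exactly what the later analysis in this ticket observed for job 34828611.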
I'm investigating the lock order for the potentially involved operations (complete, start, rollup, and so on). We'll come back once we find something useful. Thanks.

Véronique, will you give me the output of:

select * from tars_job_table where job_db_inx=74928152;

Thanks

Created attachment 6869 [details]
output of sql select command
Hi Alejandro,

Please find it in attachment.

Regards,
Véronique

Hi Véronique. Inspecting the attached select query output we requested, I notice a few things:

1. The mod_time is 1525730800. Epoch Converter translates that timestamp to Tuesday, May 8, 2018 12:06:40 AM GMT+02:00 DST (assuming France timezone).

2. The id_job is 34828611. This jobid isn't among those reported as runaway, at least until comment 17 (in comment 22 you mention there are plenty of new ones but didn't list them). These are the ctld.log messages related to this concrete jobid:

2018-05-08T00:06:27+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34828611 State=0x1 NodeCnt=1 WEXITSTATUS 0
2018-05-08T00:06:27+02:00 tars-master slurmctld[23754]: _job_complete: JobID=34828611 State=0x8003 NodeCnt=1 done

The mod_time is updated ~13s after the _job_complete() messages are logged, which makes sense since during these seconds slurmctld was sending the RPC message to slurmdbd to process the mysql update for the completed job. The 'state' in the database for this specific jobid is 3 (JOB_COMPLETE) and the time_end corresponds to the _job_complete messages' timestamps. This means that this specific job is NOT a runaway one. The deadlock return code from mysql was reported back to slurmctld, which retried the update request for this job, and the next retry didn't fail. So the deadlock event was benign and seems to us unrelated to the runaway jobs.

We still need to figure out why you have runaway jobs in your system, though. So we're going to ask you for more information:

1. Attach slurm.conf and slurmdbd.conf.

2. Output of:

Current:
$ sacctmgr show runawayjobs

and

mysql> select * from tars_job_table where id_job=<runawayjobid_1> or ... or id_job=<runawayjobid_N>

where you 'or' all the runaway jobids reported by 'sacctmgr show runawayjobs' above.

3.
Temporarily increase, in production, the debug level to 'debug' in both slurmctld and slurmdbd, and add DebugFlags=DB_JOB,DB_STEP,DB_QUERY to slurmdbd.conf, so that the next time we request the logs we have more information to work from.

Thanks for your collaboration.

Created attachment 6887 [details]
slurm configuration file
Created attachment 6888 [details]
list of all runaway jobs
Created attachment 6889 [details]
database content for the runaway jobs
Hi Alejandro,

Thank you for your explanation. Please find attached the files containing the information you required. As for changing slurmdbd's debug level in production, I want to speak about it with our sysadmin first, because this will generate huge log files and I don't want to fill hard drives with them...

(In reply to Véronique Legrand from comment #39)
> Thank you for your explanation.
> Please find attached the files containing the information you required.
> As for changing slurmdbd's debug level in production, I want to speak about
> it to our sys admin first. Because this will generate huge log files and I
> don't want to fill hard drives with them...

That's fine. Only do it if you have an appropriate logrotate configuration and increasing the debug level/flags doesn't impact performance on the daemons. Otherwise just skip this step. Thanks.

Marking the slurmdbd.conf attachment as private, just in case the password is in it.

I've noticed that all the runaway jobs' TimeStart is close to midnight. Do you have any special process/workflow running at midnight that you think could be relevant and/or impacting Slurm functionality in some way?

Hi Alejandro,

At midnight there is the daily purge of job steps. That's the only thing. I am a little bit surprised by your question because, to me, the runaway job issue is due to Slurm not correctly "recording" the job's termination information; what does the job start time have to do with this?

I had a look at the slurmdbd logs. We had a few "deadlock error" messages. I put them in an attachment just in case they could be useful to you.

Regards,
Véronique

Created attachment 6891 [details]
deadlock error messages of the last days
As you can see, the deadlock error messages appear nearly every day soon after midnight.
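One way to confirm that pattern is to tally the deadlock messages per day directly from the log (the log filename here is an assumption; the timestamp format matches the excerpts in this ticket):

```shell
# Count "Deadlock found" messages per day in a slurmdbd log whose lines
# start with ISO timestamps like 2018-05-08T00:06:27+02:00.
grep 'Deadlock found' slurmdbd.log | cut -dT -f1 | sort | uniq -c
```

Each output line is a per-day count, making the "soon after midnight, nearly every day" pattern easy to see at a glance.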
Do you know what type of jobs the following ones were and how they were submitted? Do you have the related batch scripts, if any?

32178224
32178302
32187155
32187156
32187157
32187158
32187159
32187160
32187161
32187162
32187163
32187165
32187169
32187170

They all belong to id_user 25939 and they are all runaway jobs. They also all had DerivedExitCode 15. I'm curious to see if these ones were doing anything special/different that could shed some light on the runaway generation. Perhaps the output of sacct for one of them also:

sacct -D -j 32178224 -o jobid,user,account,partition,state,submit,start,end,elapsed,allocnodes

There are a bunch of runaway jobs that also belong to id_user 29763, like:

35657118
35657121
...

Thank you, and sorry to request so much information. I'm trying to find a common pattern or something different about these runaway jobs with respect to the rest.

Created attachment 6895 [details]
sacct -D output
Hi Alejandro,

The first series of jobs you are talking about belongs to a user who is doing particle movement simulations with a software called mcell. She has thousands of jobs running every week and only a few of them are runaway. I have asked her for the batch script.

I have put in attachment the output of:

sacct -D -j 32178224 -o jobid,user,account,partition,state,submit,start,end,elapsed,allocnodes

I just hid the user login.

The other series of jobs you are talking about belongs to a user who also has thousands of jobs every week. He is doing blast searches against databanks and queries to an sqlite database.

By the way, did you have a look at the slurmdbd error messages that I sent you? We see clearly with them that a deadlock occurs nearly every day soon after midnight, and you had noticed a surprising coincidence with the starting time of the runaway jobs.

Hope this helps,

Regards,
Véronique

Created attachment 6896 [details]
17.11 patch
Hi Véronique. We've found a potential source of issues. What I'd do now is fix your current runaway jobs (proceed with Yes after showing them). Then apply this patch and let's see if new runaway jobs appear next week among jobs started after the patch has been applied. Thanks.
Hi Véronique. Please hold off on applying the patch, since we internally agreed it requires some modifications. We'll send you a second version shortly. Thank you.

Created attachment 6907 [details]
17.11 patch
Véronique - will you fix your current runaway jobs, then apply this patch, let Slurm run for a couple of days, and finally check for runaway jobs again? We've added some lock protection around the purge/archive operation as well as around the job/step records at start and completion. We think this should fix the issue you've been experiencing here, but your feedback is needed to confirm it. Thanks again.
Hi Alejandro,

Thank you for the new patch. I applied it on Tuesday morning on our test cluster. I cannot say whether it fixed the problem there, since it is a small cluster and we don't have as many jobs on it as on the production cluster. As everything seemed OK anyway, I applied it this morning on the production cluster and cleaned the runaway jobs. Let's just wait and see now.

Regards,
Véronique

All right, thanks for the feedback. Please let us know how it goes after a couple of days have passed.

Created attachment 6936 [details]
runaway job 25/05/18
Created attachment 6937 [details]
slurmdbd log 25/05/18 from 00:00 to 01:00
Hello Alejandro,

Unfortunately, there is already a new runaway job this morning. Its start date is last night at approximately 00:03, so several hours after I applied the patch. I have put in attachment the slurmdbd logs from last night from 00:00 to 1 am, and the output of "sacctmgr show runawayjobs".

Regards,
Véronique

Thanks Véronique. It looks like at least there's no deadlock anymore, but there might be something still not fully fixed. We'll come back to you.

Hi Véronique. We think we either missed a spot or the patch wasn't applied correctly. Most probably we missed something, but the first thing we'd like to ask is to please double-check that the patch was properly applied and slurmdbd was restarted after recompiling.

I also didn't properly search for Deadlock in your latest slurmdbd log files (most probably due to a case-sensitivity thing, my fault). There was indeed a Deadlock again:

2018-05-25T00:05:00+02:00 tars-acct slurmdbd[6301]: error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction#012insert into "tars_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, array_task_str, array_task_pending) values (37819018, UNIX_TIMESTAMP(), 0, 4294967294, 0, 0, 133, 0, 0, 0, 'tars-63

Can you please upload the output of:

select * from tars_job_table where job_db_inx=78651763;
select * from tars_job_table where id_job=37819017;
select * from tars_job_table where id_job=37819018;
sacct --format=jobid,jobname,state,start,elapse,end,exitcode,nodelist -D -j 37819017
sacct --format=jobid,jobname,state,start,elapse,end,exitcode,nodelist -D -j 37819018

This is being a slippery one. Thanks again for your helpful collaboration.

Created attachment 6946 [details]
output of ls -l command on the directory that contains the patched files
Hi Alejandro,
I have put as attachment the output of ls -l on the src/plugins/accounting_storage/mysql directory.
As you can see, the two impacted files and the corresponding .o files were modified on May 24, so the patch has been applied.
I also checked the executables:
ls -l /XXX/slurm/cluster/slurm/sbin
total 19047
-rwxr-xr-x 1 root slurm 7766233 May 24 09:52 slurmctld
-rwxr-xr-x 1 root slurm 3083091 May 24 09:52 slurmd
-rwxr-xr-x 1 root slurm 1128985 May 24 09:52 slurmdbd
-rwxr-xr-x 1 root slurm 4105626 May 24 09:52 slurmstepd
So the executables are new ones generated after I applied the patch.
Then I checked the start date of slurmdbd on the accounting machine and realized
that it had not been restarted since May 23 (one day before the patch was applied):
ps -ef|grep slurmdbd
slurm 6301 1 0 May23 ? 00:09:41 /pasteur/services/policy01/slurm/tars/slurm/sbin/slurmdbd
vlegrand 32746 32634 0 16:01 pts/3 00:00:00 grep slurmdbd
I had simply forgotten to restart slurmdbd.
So, I cleaned the runaways:
vlegrand@cluster-acct ~ $ sacctmgr show runawayjobs
NOTE: Runaway jobs are jobs that don't exist in the controller but are still considered pending, running or suspended in the database
ID Name Partition Cluster State TimeStart TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- -------------------
37819017 S-16-0010+ dedicated tars RUNNING 2018-05-25T00:03:17 Unknown
37853853 FUCC.30.1+ pibnet tars RUNNING 2018-05-28T00:01:43 Unknown
37853858 FUCC.214.+ pibnet tars RUNNING 2018-05-28T00:01:43 Unknown
37853862 FUCC.104.+ pibnet tars RUNNING 2018-05-28T00:01:43 Unknown
Would you like to fix these runaway jobs?
(This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed.
Once corrected, this will trigger the rollup to reroll usage from before the oldest runaway job.)
(You have 30 seconds to decide)
(N/y): y
Then restarted slurmdbd.
vlegrand@tars-acct ~ $ sudo /etc/init.d/slurmdbd restart
stopping slurmdbd: [ OK ]
slurmdbd (pid 586) is running...
slurmdbd (pid 586) is running...
slurmdbd (pid 586) is running...
slurmdbd (pid 586) is running...
starting slurmdbd: [ OK ]
vlegrand@tars-acct ~ $ ps -ef|grep slurmdbd
slurm 647 1 0 16:03 ? 00:00:00 /pasteur/services/policy01/slurm/tars/slurm/sbin/slurmdbd
vlegrand 662 32634 0 16:03 pts/3 00:00:00 grep slurmdbd
Sorry about this. We will see whether we get deadlocks and runaway jobs over the next few days.
Regards,
Véronique
Created attachment 6950 [details]
slurmdbd logs, first hour of last night
Hi Alejandro,
It seems that the patch has put us back in the same situation we were in six months ago, when we had a previous version of Slurm and the job step table was purged once a month (see bug 4295). When I arrived this morning, I found that the slurmdbd daemon was down. Trying to run an sacct command resulted in the following error message:
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to tars-acct:6819: Connection refused
sacct: error: slurmdbd: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
The slurmdbd daemon crashed soon after midnight (see logs in attachment) with an "err_lock_wait_timeout", just as before. As the job step table is now purged every night (we followed your colleague's advice), it is likely that slurmdbd will crash every night, which is not acceptable for us. Is there a way to schedule the job purge during office hours so that we can keep an eye on the slurmdbd daemon and try to see with our DBA whether we can provide you more information to finally solve this problem?
Regards,
Véronique
Do you happen to have a core file generated by slurmdbd? If so, could you inspect it with gdb:
$ gdb /path/to/slurmdbd /path/to/Core
or
$ coredumpctl gdb
Then from inside gdb run 'thread apply all bt' and attach the output. Thanks!
Hi Alejandro,
I have checked with our system engineer. From the operating system point of view (CentOS 6), everything is configured so that we should have a core dump if the daemon crashed. Yet, contrary to what I thought, it didn't crash but exited in a proper way with return code 1. I hope that this information will be useful to you. It would really be helpful if we could change the hour at which the purge is done so that we can investigate the problem when it occurs...
Véronique
Inspecting the latest slurmdbd.log, the last two entries are:
2018-05-30T00:12:37+02:00 tars-acct slurmdbd[647]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012insert into "tars_step_table" (job_db_inx, id_step, time_start, step_name, state, tres_alloc, nodes_alloc, task_cnt, nodelist, node_inx, task_dist, req_cpufreq, req_cpufreq_min, req_cpufreq_gov) values (78942755, -2, 1527630246, 'batch', 1, '1=1,2=3072,4=1', 1, 1, 'tars-513', '186', 0, 0, 0, 0) on duplicate key update nodes_alloc=1, task_cnt=1, time_end=0, state=1, nodelist='tars-513', node_inx='186', task_dis
2018-05-30T00:12:37+02:00 tars-acct slurmdbd[647]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
so slurmdbd fatal()'ed; fatal()'s implementation doesn't generate a core file but exit(1)s the process.
Véronique, please check whether anything else was locking up the database. A common cause is a concurrent mysqldump: running mysqldump can cause all sorts of issues on a live database. If that is the situation, please make sure you use the following mysqldump options:
--single-transaction
--quick
--lock-tables=false
Without them the database will lock up and you will get this kind of issue. Hopefully this, or something like it, was locking the tables and messing things up.
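A sketch of the resulting invocation (the database name is a placeholder, and the wrapper only prints the command so it can be reviewed before running on the real host):

```shell
# Build (and print, rather than run) a locking-safe mysqldump command:
# --single-transaction takes a consistent InnoDB snapshot without read locks,
# --quick streams rows instead of buffering whole tables in memory,
# --lock-tables=false avoids LOCK TABLES against the live accounting DB.
build_dump_cmd() {
    printf 'mysqldump --single-transaction --quick --lock-tables=false %s\n' "$1"
}
build_dump_cmd slurm_acct_db
```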
Have you gone through the basic MySQL tuning instructions? They're documented here: https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-before-build
Relevant configuration options:
innodb_buffer_pool_size
innodb_lock_wait_timeout
innodb_log_file_size
Thanks
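A small helper to pull those three values in one go (a sketch: it only emits the SQL, which you would pipe into a `mysql` client on the database host):

```shell
# Emit one SHOW VARIABLES statement per InnoDB setting of interest.
# Usage on the DB host would be something like: emit_checks | mysql
emit_checks() {
    for v in innodb_buffer_pool_size innodb_lock_wait_timeout innodb_log_file_size; do
        printf "SHOW VARIABLES LIKE '%s';\n" "$v"
    done
}
emit_checks
```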
... I'd also suggest switching the Purge[Job|Step]After intervals to be expressed in 'hours' to get those more frequent purges.
I had already checked that with our DBA a few days/weeks ago. The database dumps are not running at midnight; they are scheduled at 10 pm and only last a few minutes. The options used for mysqldump are:
--all-databases --single-transaction --add-drop-database --triggers --routines --events --socket=$SOCKET --user=$USERNAME
With --lock-tables=false, we may lose database consistency, and --quick would take more memory. We may give --quick a try, but I must first talk again with our DBA.
Here are the values for the MySQL parameters you mention:
MariaDB [(none)]> SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
+-------------------------+--------------+
| Variable_name           | Value        |
+-------------------------+--------------+
| innodb_buffer_pool_size | 137438953472 |
+-------------------------+--------------+
1 row in set (0.00 sec)
MariaDB [(none)]> SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 600   |
+--------------------------+-------+
1 row in set (0.00 sec)
MariaDB [(none)]> SHOW VARIABLES LIKE 'innodb_log_file_size';
+----------------------+-------------+
| Variable_name        | Value       |
+----------------------+-------------+
| innodb_log_file_size | 16921919488 |
+----------------------+-------------+
1 row in set (0.00 sec)
I also changed in slurmdbd.conf:
#PurgeStepAfter=31days
PurgeStepAfter=744hours
to see if we can reproduce the err_lock_wait_timeout error during office hours.
Created attachment 6952 [details]
logs since config change steps purged every hour
Alejandro,
I did the config change and restarted slurmdbd at 13:59. Since then it has stopped twice: once at 14:15 and again at 14:35 after I restarted it. Here is the content of the archive directory for the steps:
ls -l /pasteur/services/policy01/slurm/tars/archives/|grep step
-rw------- 1 slurm slurm 1565 Mar 5 15:02 tars_step_archive_1970-01-01T00:00:00_1970-02-01T00:00:00
-rw------- 1 slurm slurm 1179 Mar 3 19:02 tars_step_archive_1970-01-01T00:00:00_1970-02-01T00:00:00.old
-rw------- 1 slurm slurm 43218407 May 1 2016 tars_step_archive_2016-03-01T00:00:00_2016-03-31T23:59:59
-rw------- 1 slurm slurm 59028107 Jun 1 2016 tars_step_archive_2016-04-01T00:00:00_2016-04-30T23:59:59
-rw------- 1 slurm slurm 1504793139 Sep 7 2016 tars_step_archive_2016-05-01T00:00:00_2016-07-31T23:59:59
-rw------- 1 slurm slurm 284054663 Oct 1 2016 tars_step_archive_2016-08-01T00:00:00_2016-08-31T23:59:59
-rw------- 1 slurm slurm 697896812 Nov 1 2016 tars_step_archive_2016-09-01T00:00:00_2016-09-30T23:59:59
-rw------- 1 slurm slurm 418874294 Dec 1 2016 tars_step_archive_2016-10-01T00:00:00_2016-10-31T23:59:59
-rw------- 1 slurm slurm 723510388 Jan 1 2017 tars_step_archive_2016-11-01T00:00:00_2016-11-30T23:59:59
-rw------- 1 slurm slurm 352682402 Feb 1 2017 tars_step_archive_2016-12-01T00:00:00_2016-12-31T23:59:59
-rw------- 1 slurm slurm 937407883 Mar 1 2017 tars_step_archive_2017-01-01T00:00:00_2017-01-31T23:59:59
-rw------- 1 slurm slurm 937407883 Mar 1 2017 tars_step_archive_2017-01-01T00:00:00_2017-01-31T23:59:59.old
-rw------- 1 slurm slurm 1028052287 Apr 2 2017 tars_step_archive_2017-02-01T00:00:00_2017-02-28T23:59:59.new
-rw------- 1 slurm slurm 1447011676 May 1 2017 tars_step_archive_2017-02-01T00:00:00_2017-03-31T23:59:59
-rw------- 1 slurm slurm 1315341329 Jun 1 2017 tars_step_archive_2017-04-01T00:00:00_2017-04-30T23:59:59
-rw------- 1 slurm slurm 1149182689 Jul 1 2017 tars_step_archive_2017-05-01T00:00:00_2017-05-31T23:59:59
-rw------- 1 slurm slurm 573523907 Aug 1 2017 tars_step_archive_2017-06-01T00:00:00_2017-06-30T23:59:59
-rw------- 1 slurm slurm 1950348993 Sep 1 2017 tars_step_archive_2017-07-01T00:00:00_2017-07-31T23:59:59
-rw------- 1 slurm slurm 2134 Nov 1 2017 tars_step_archive_2017-07-01T00:00:00_2017-08-01T00:00:00
-rw------- 1 slurm slurm 0 Nov 1 2017 tars_step_archive_2017-08-01T00:00:00_2017-09-01T00:00:00
-rw------- 1 slurm slurm 1437386927 Nov 1 2017 tars_step_archive_2017-09-01T00:00:00_2017-09-30T23:59:59
-rw------- 1 slurm slurm 1513625447 Dec 1 00:02 tars_step_archive_2017-10-01T00:00:00_2017-10-31T23:59:59
-rw------- 1 slurm slurm 476 Jan 4 16:19 tars_step_archive_2017-10-01T00:00:00_2017-11-30T23:59:59
-rw------- 1 slurm slurm 1822759587 Jan 1 00:02 tars_step_archive_2017-11-01T00:00:00_2017-11-30T23:59:59
-rw------- 1 slurm slurm 1372 Feb 2 06:01 tars_step_archive_2017-11-01T00:00:00_2017-12-31T23:59:59
-rw------- 1 slurm slurm 1540574311 Feb 1 00:03 tars_step_archive_2017-11-01T00:00:00_2017-12-31T23:59:59.old
-rw------- 1 slurm slurm 0 Mar 1 14:01 tars_step_archive_2017-12-01T00:00:00_2018-01-31T23:59:59
-rw------- 1 slurm slurm 2406 May 30 14:00 tars_step_table_archive_1970-01-01T00:00:00_1970-02-01T00:00:00
-rw------- 1 slurm slurm 6166 May 30 00:00 tars_step_table_archive_1970-01-01T00:00:00_1970-02-01T00:00:00.old
-rw------- 1 slurm slurm 0 Apr 3 11:19 tars_step_table_archive_2018-02-01T00:00:00_2018-02-28T23:59:59
-rw------- 1 slurm slurm 1273454685 Apr 17 00:04 tars_step_table_archive_2018-03-01T00:00:00_2018-03-16T23:59:59
-rw------- 1 slurm slurm 473 Apr 20 21:18 tars_step_table_archive_2018-03-12T00:00:00_2018-03-19T23:59:59
-rw------- 1 slurm slurm 118530126 Apr 18 00:00 tars_step_table_archive_2018-03-17T00:00:00_2018-03-17T23:59:59
-rw------- 1 slurm slurm 144090745 Apr 19 00:00 tars_step_table_archive_2018-03-18T00:00:00_2018-03-18T23:59:59
-rw------- 1 slurm slurm 74257924 Apr 20 00:03 tars_step_table_archive_2018-03-19T00:00:00_2018-03-19T23:59:59
-rw------- 1 slurm slurm 91347696 Apr 21 00:01 tars_step_table_archive_2018-03-20T00:00:00_2018-03-20T23:59:59
-rw------- 1 slurm slurm 201299297 Apr 22 00:01 tars_step_table_archive_2018-03-21T00:00:00_2018-03-21T23:59:59
-rw------- 1 slurm slurm 129974326 Apr 23 00:01 tars_step_table_archive_2018-03-22T00:00:00_2018-03-22T23:59:59
-rw------- 1 slurm slurm 68700552 Apr 24 00:01 tars_step_table_archive_2018-03-23T00:00:00_2018-03-23T23:59:59
-rw------- 1 slurm slurm 40997689 Apr 25 00:01 tars_step_table_archive_2018-03-24T00:00:00_2018-03-24T23:59:59
-rw------- 1 slurm slurm 128560550 Apr 26 00:03 tars_step_table_archive_2018-03-25T00:00:00_2018-03-25T23:59:59
-rw------- 1 slurm slurm 70845809 Apr 27 00:03 tars_step_table_archive_2018-03-26T00:00:00_2018-03-26T23:59:59
-rw------- 1 slurm slurm 24216285 Apr 28 00:01 tars_step_table_archive_2018-03-27T00:00:00_2018-03-27T23:59:59
-rw------- 1 slurm slurm 237993265 Apr 29 00:01 tars_step_table_archive_2018-03-28T00:00:00_2018-03-28T23:59:59
-rw------- 1 slurm slurm 19889146 Apr 30 00:01 tars_step_table_archive_2018-03-29T00:00:00_2018-03-29T23:59:59
-rw------- 1 slurm slurm 80251193 May 1 00:01 tars_step_table_archive_2018-03-30T00:00:00_2018-03-30T23:59:59
-rw------- 1 slurm slurm 215350730 May 2 00:01 tars_step_table_archive_2018-03-31T00:00:00_2018-03-31T23:59:59
-rw------- 1 slurm slurm 141028162 May 3 00:04 tars_step_table_archive_2018-04-01T00:00:00_2018-04-01T23:59:59
-rw------- 1 slurm slurm 141143648 May 4 00:03 tars_step_table_archive_2018-04-02T00:00:00_2018-04-02T23:59:59
-rw------- 1 slurm slurm 246160884 May 5 00:01 tars_step_table_archive_2018-04-03T00:00:00_2018-04-03T23:59:59
-rw------- 1 slurm slurm 139078760 May 6 00:03 tars_step_table_archive_2018-04-04T00:00:00_2018-04-04T23:59:59
-rw------- 1 slurm slurm 171546334 May 7 00:03 tars_step_table_archive_2018-04-05T00:00:00_2018-04-05T23:59:59
-rw------- 1 slurm slurm 179421306 May 8 00:03 tars_step_table_archive_2018-04-06T00:00:00_2018-04-06T23:59:59
-rw------- 1 slurm slurm 144518990 May 9 00:01 tars_step_table_archive_2018-04-07T00:00:00_2018-04-07T23:59:59
-rw------- 1 slurm slurm 79614618 May 10 00:04 tars_step_table_archive_2018-04-08T00:00:00_2018-04-08T23:59:59
-rw------- 1 slurm slurm 76785583 May 11 00:03 tars_step_table_archive_2018-04-09T00:00:00_2018-04-09T23:59:59
-rw------- 1 slurm slurm 54465917 May 12 00:03 tars_step_table_archive_2018-04-10T00:00:00_2018-04-10T23:59:59
-rw------- 1 slurm slurm 455256771 May 15 00:03 tars_step_table_archive_2018-04-10T00:00:00_2018-04-13T23:59:59
-rw------- 1 slurm slurm 49718259 May 13 00:03 tars_step_table_archive_2018-04-11T00:00:00_2018-04-11T23:59:59
-rw------- 1 slurm slurm 642908605 May 14 00:04 tars_step_table_archive_2018-04-12T00:00:00_2018-04-12T23:59:59
-rw------- 1 slurm slurm 256772336 May 19 00:03 tars_step_table_archive_2018-04-12T00:00:00_2018-04-17T23:59:59
-rw------- 1 slurm slurm 101480091 May 16 00:03 tars_step_table_archive_2018-04-14T00:00:00_2018-04-14T23:59:59
-rw------- 1 slurm slurm 111564174 May 17 00:03 tars_step_table_archive_2018-04-15T00:00:00_2018-04-15T23:59:59
-rw------- 1 slurm slurm 278337830 May 18 00:04 tars_step_table_archive_2018-04-16T00:00:00_2018-04-16T23:59:59
-rw------- 1 slurm slurm 380835382 May 20 00:01 tars_step_table_archive_2018-04-18T00:00:00_2018-04-18T23:59:59
-rw------- 1 slurm slurm 484623757 May 21 00:01 tars_step_table_archive_2018-04-19T00:00:00_2018-04-19T23:59:59
-rw------- 1 slurm slurm 171905543 May 22 00:00 tars_step_table_archive_2018-04-20T00:00:00_2018-04-20T23:59:59
-rw------- 1 slurm slurm 393856952 May 23 00:01 tars_step_table_archive_2018-04-21T00:00:00_2018-04-21T23:59:59
-rw------- 1 slurm slurm 64644183 May 24 00:00 tars_step_table_archive_2018-04-22T00:00:00_2018-04-22T23:59:59
-rw------- 1 slurm slurm 56756708 May 25 00:00 tars_step_table_archive_2018-04-23T00:00:00_2018-04-23T23:59:59
-rw------- 1 slurm slurm 295824629 May 26 00:00 tars_step_table_archive_2018-04-24T00:00:00_2018-04-24T23:59:59
-rw------- 1 slurm slurm 175179686 May 27 00:00 tars_step_table_archive_2018-04-25T00:00:00_2018-04-25T23:59:59
-rw------- 1 slurm slurm 147678960 May 28 00:00 tars_step_table_archive_2018-04-26T00:00:00_2018-04-26T23:59:59
-rw------- 1 slurm slurm 123730748 May 29 00:02 tars_step_table_archive_2018-04-27T00:00:00_2018-04-27T23:59:59
-rw------- 1 slurm slurm 150537896 May 30 09:24 tars_step_table_archive_2018-04-28T00:00:00_2018-04-28T23:59:59
-rw------- 1 slurm slurm 150537896 May 30 00:02 tars_step_table_archive_2018-04-28T00:00:00_2018-04-28T23:59:59.old
-rw------- 1 slurm slurm 72774314 May 30 14:02 tars_step_table_archive_2018-04-29T00:00:00_2018-04-29T12:59:59
-rw------- 1 slurm slurm 82015784 May 30 14:22 tars_step_table_archive_2018-04-29T00:00:00_2018-04-29T13:59:59
I put the logs in attachment. I have put back
PurgeStepAfter=31days
so that slurmdbd could start again without stopping right away because of the err_lock_wait_timeout (purge of the job steps table), but it is almost certain that it will stop again tonight...
It seems like the hourly purge operation on steps didn't deadlock, but it made slurmdbd fatal() with the ER_LOCK_WAIT_TIMEOUT twice. I think we should try setting Purge[Job|Step]After to the 'hourly' equivalent of the current days, but also increase the MySQL parameters and see if the purge operations can finish with that increase.
I'd increase the innodb_buffer_pool_size taking into account the documentation considerations:
https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool-resize.html
But especially I'd increase innodb_lock_wait_timeout and see if that lets the operation finish:
https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_lock_wait_timeout
Can we try 'hourly' purges with increased MySQL parameters and see if that helps for the next hour? Thanks.
Created attachment 6957 [details]
17.11 patch take3
Hi Véronique. See attached another modification of the fix to this issue. Sorry for uploading yet another version. Please remove all previous patches from this bug and apply this one. We also require you set DebugFlags=DB_ARCHIVE before restarting slurmdbd so we can discern exactly what's going on and the state of the locks. Thanks!
Just to make it clear, we ask you to append DB_ARCHIVE to what we already requested. So you should at least have these flags set after the slurmdbd restart with the applied patch: DebugFlags=DB_JOB,DB_STEP,DB_ARCHIVE. Thanks!
Hi Alejandro, just to keep you informed, I forwarded your suggestions about the database configuration changes to our DBA yesterday. He will look at them today and get back to us soon. I am going to have a look at the second patch and see if we can apply it today. slurmdbd did an exit(1) last night as I expected. I started it again this morning and it seems OK so far. Regards
Hi Alejandro,
Just to make sure: there were two patches applied for this bug: this one, https://github.com/SchedMD/slurm/commit/ed74a22abf7c9066a2.patch, which Sophie applied, and 5086-take2.patch, which I applied. You mean we should remove both before applying the new one. Am I right?
Véronique
Hi Véronique, this one should remain applied:
https://github.com/SchedMD/slurm/commit/ed74a22abf7c9066a2.patch
Do not remove it. It is already checked into the Slurm 17.11 branch since tag 17.11.6, so it only needs to be applied manually if you're on <= 17.11.5. Then I asked you to remove the patch attached in comment 52:
https://bugs.schedmd.com/attachment.cgi?id=6907&action=diff
and apply the new version of that one, attached in comment 84.
With regards to the configuration, before restarting slurmdbd with the new code, we ask you to have at least these flags set in slurmdbd.conf: DebugFlags=DB_JOB,DB_STEP,DB_ARCHIVE, and to set Purge[Job|Step]After to the 'hourly' equivalent of the current days, but also increase the MySQL parameters and see if the purge operations can finish with that increase, as suggested in comment 78. Please let me know if you have any more questions. Thank you.
Hello,
Do you think we could install the 17.11.7 release to start from something clean and then apply patch take3? Thanks, Sophie
Sure thing.
We always encourage sticking to the latest micro release when possible. So yes, you can upgrade to 17.11.7 and apply patch take3 plus the changes in the config.
Hi. Did slurmdbd survive midnight yesterday?
> Hi. Did slurmdbd survive midnight yesterday?
No, but that was because we hadn't had enough time to put everything in place
(the update + the changes to the innodb parameters + the patch).
It's all now in place, and the first restart of slurmdbd worked as expected,
with the creation of the archive:
28M Jun 1 17:11 tars_step_table_archive_2018-05-01T00:00:00_2018-05-01T16:59:59
We will see if the next round goes as smoothly as this one.
Thanks for your concern,
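For reference, the 'hourly equivalent' conversion suggested earlier in the thread is a straight multiply by 24 (31 days becomes the PurgeStepAfter=744hours value quoted before); a trivial sketch:

```shell
# Convert a Purge*After value given in days to the hourly equivalent
# string expected by slurmdbd.conf.
days_to_hours() { printf '%shours\n' "$(( $1 * 24 ))"; }

echo "PurgeStepAfter=$(days_to_hours 31)"
```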
Up to now, everything works as expected. The archives are generated without error or deadlock :-)
2.1M Jun 1 18:00 tars_step_table_archive_2018-05-01T17:00:00_2018-05-01T17:59:59
23M Jun 1 18:02 tars_job_table_archive_2018-03-01T00:00:00_2018-03-02T01:59:59
Created attachment 6987 [details]
slurmdbd logs of the 06/03 until the ER_LOCK_WAIT_TIMEOUT
Hello,
Unfortunately, slurmdbd didn't survive yesterday morning. Lock wait
timeout error again, even with:
MariaDB [slurm_acct_db]> SELECT @@innodb_lock_wait_timeout;
+----------------------------+
| @@innodb_lock_wait_timeout |
+----------------------------+
| 900 |
+----------------------------+
2018-06-03T02:17:32+02:00 tars-acct slurmdbd[32691]: DBD_CLUSTER_TRES: cluster not registered
2018-06-03T02:17:33+02:00 tars-acct slurmdbd[32691]: error: mysql_query failed: 1205 Lock wait timeout exceeded; try restarting transaction#012insert into "tars_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, array_task_str, array_max_tasks, array_task_pending, tres_req, work_dir) values (38246018, UNIX_TIMESTAMP
2018-06-03T02:17:33+02:00 tars-acct slurmdbd[32691]: fatal: mysql gave ER_LOCK_WAIT_TIMEOUT as an error. The only way to fix this is restart the calling program
The last archives seem to have worked as expected:
2.9M Jun 3 01:00 tars_step_table_archive_2018-05-03T00:00:00_2018-05-03T00:59:59
490K Jun 3 01:02 tars_job_table_archive_2018-03-02T16:00:00_2018-03-02T16:59:59
2.6M Jun 3 02:00 tars_step_table_archive_2018-05-03T01:00:00_2018-05-03T01:59:59
1.3M Jun 3 02:02 tars_job_table_archive_2018-03-02T17:00:00_2018-03-02T17:59:59
I've attached the slurmdbd logs between midnight and 02:17:33. Do you need
something else? Thanks,
Created attachment 6989 [details]
Decrease
At least there aren't any deadlocks anymore. What I'd try now is:
1. Increase innodb_lock_wait_timeout even more, perhaps to 1500.
2. Apply the attached patch to decrease MAX_PURGE_LIMIT to half its current defined value, so that locks can be released more frequently, letting other operations happen, like recording job/step starts and completions, which seem to be the operations timing out in the logs.
Thanks.
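Rough intuition for point 2 (the numbers below are invented for illustration; the old limit of 50000 is only an assumption, consistent with the LIMIT 25000 that shows up in the delete queries later in the thread): halving the per-transaction batch doubles the number of delete transactions but halves how long each one holds its row locks.

```shell
# Pure-shell arithmetic sketch of the MAX_PURGE_LIMIT trade-off.
rows_to_purge=1000000   # invented backlog size, illustration only
old_limit=50000         # assumed pre-patch MAX_PURGE_LIMIT
new_limit=$((old_limit / 2))

echo "before: $((rows_to_purge / old_limit)) delete transactions, each holding locks longer"
echo "after:  $((rows_to_purge / new_limit)) delete transactions, each holding locks half as long"
```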
> At least there aren't any deadlocks anymore.
Sorry to contradict you, but it has just happened:
2018-06-04T16:02:28+02:00 tars-acct slurmdbd[23558]: error: mysql_query failed: 1213 Deadlock found when trying to get lock; try restarting transaction#012insert into "tars_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, array_task_str, array_max_tasks, array_task_pending, tres_req, work_dir) values (38267288, UN
2018-06-04T16:02:28+02:00 tars-acct slurmdbd[23558]: error: It looks like the storage has gone away trying to reconnect
> 1. Increase innodb_lock_wait_timeout even more, perhaps to 1500.
> 2. Apply the attached patch to decrease MAX_PURGE_LIMIT to half its current
> defined value so that locks can more frequently be released to let other
> operations happen like recording job/step start/complete, which seem to be
> the operations timing out from the logs.
Ok, I'm going to do that and keep you informed.
Can you attach the full dbd log from both failures? Thanks.
Created attachment 6991 [details]
slurmdbd.log of today 06/04 until 17:17
As you will see, I had to restart slurmdbd after the ER_LOCK_WAIT_TIMEOUT that
happened at 16:17. This time, slurmdbd archived the jobs again but as usual
the .old file and the new one have the same size:
5.2M Jun 4 16:00 tars_step_table_archive_2018-05-04T15:00:00_2018-05-04T15:59:59
167K Jun 4 16:02 tars_job_table_archive_2018-03-04T07:00:00_2018-03-04T07:59:59.old
167K Jun 4 16:35 tars_job_table_archive_2018-03-04T07:00:00_2018-03-04T07:59:59
We tried applying the patch alone to see if it was enough, but the ER_LOCK_WAIT_TIMEOUT happened again, so we have done as you suggested and are now keeping an eye on it.
Created attachment 7005 [details]
log of slurmdbd 06/06 00:00 -> 10:00
Hello,
Even with the patch and
+----------------------------+
| @@innodb_lock_wait_timeout |
+----------------------------+
| 1500 |
+----------------------------+
slurmdbd exited again due to an ER_LOCK_WAIT_TIMEOUT at 02:27 this morning.
I have attached the slurmdbd.log. The archive itself seems to have worked as
expected:
4.3M Jun 6 02:00 tars_step_table_archive_2018-05-06T01:00:00_2018-05-06T01:59:59
116K Jun 6 02:02 tars_job_table_archive_2018-03-05T17:00:00_2018-03-05T17:59:59
but the purge/deletion in the job table didn't happen afterwards. It only
took place when slurmdbd was restarted, and it didn't take long:
2018-06-06T08:51:10+02:00 tars-acct slurmdbd[21547]: 0(as_mysql_archive.c:3218) query#012delete from "tars_job_table" where time_submit <= 1520290799 && time_end != 0 LIMIT 25000
2018-06-06T08:52:53+02:00 tars-acct slurmdbd[21547]: 9(as_mysql_job.c:696) query#012insert into "tars_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, array_task_str, array_task_pending, tres_req, work_dir) values (38322007, UNIX_TIMESTAMP(), 0, 4294967294, 0, 4294967294, 406, 1, 31540, 96848, 'None assigned', 0,
So I guess the question is why the purge of the job table didn't complete
the first time.
Thanks,
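The delete quoted above runs repeatedly in LIMIT-sized chunks until the range is drained; a pure-shell simulation of that chunking (the row count is invented, the chunk size matches the LIMIT 25000 in the quoted query):

```shell
# Simulate the batched purge: delete at most $limit rows per transaction,
# repeating until nothing is left, so locks are released between chunks.
rows=120000   # invented backlog, illustration only
limit=25000   # matches the LIMIT in the delete query quoted above
batches=0
while [ "$rows" -gt 0 ]; do
    if [ "$rows" -lt "$limit" ]; then chunk=$rows; else chunk=$limit; fi
    rows=$((rows - chunk))
    batches=$((batches + 1))
done
echo "drained in $batches chunks"
```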
I'm still working on troubleshooting this. Just out of curiosity: I guess from the config that the MySQL instance is co-located with slurmdbd on the tars-acct node. What's the underlying filesystem/hardware setup backing the database?
> I'm still working on troubleshooting this. Just out of curiosity,
> I guess from the config that the mysql instance is co-located with
> slurmdbd in tars-acct node.
Yes, you are right.
> What's the underlying filesystem/setup hardware backing up the database?
$ df -h /opt/data/mariadb
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 744G 155G 590G 21% /opt/data
$ mount | grep sdb
/dev/sdb on /opt/data type xfs (rw)
/dev/sdb corresponds to 3 SSD disks in RAID 5. tars-acct has 256 GB of RAM and an Intel Xeon E5-2680 v4 @ 2.40GHz. Is that what you wanted to know?
Yes, this is what I was asking for. And the filesystem/hardware setup for:
ArchiveDir=/pasteur/services/policy01/slurm/tars/archives/
I'm just wondering (however unlikely) if the underlying setup was somehow acting as a bottleneck, preventing the process from running at an effective performance. And another question: do you happen to use sreport?
/pasteur/services/policy01/ is an NFS file system.
> just wondering (despite unlikely) if the underlying setup was somehow
> acting as a bottleneck for the process to happen at an effective
> performance.
Since we archive every hour, files are rather small, less than 100M. When slurmdbd exited this morning, the job archive file was 116K.
There are several insert into "tars_job_table" statements between the as_mysql_archive.c lines and the final ER_LOCK_WAIT_TIMEOUT, while or just before the job archive file was written:
2018-06-06T02:02:23+02:00 tars-acct slurmdbd[15070]: 9(as_mysql_job.c:696) query#012insert into "tars_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, array_task_str, array_max_tasks, array_task_pending, tres_req, work_dir) values (38319860, UNIX_TIMESTAMP(), 38319860, 4294967294, 0, 4294967294, 264, 20, 22037, 31
$ stat /pasteur/services/policy01/slurm/tars/archives/tars_job_table_archive_2018-03-05T17\:00\:00_2018-03-05T17\:59\:59
Access: (0600/-rw-------) Uid: ( 497/ slurm) Gid: ( 493/ slurm)
Access: 2018-06-06 02:02:23.380203816 +0200
Modify: 2018-06-06 02:02:23.383714218 +0200
Change: 2018-06-06 02:02:29.206564861 +0200
Were these inserts waiting for a lock? Since the job archive file was written, it means that the following lines were executed:
if (debug_flags & DEBUG_FLAG_DB_ARCHIVE)
	DB_DEBUG(mysql_conn->conn, "query\n%s", query);
if (!(result = mysql_db_query_ret(mysql_conn, query, 0))) {
	xfree(query);
	return SLURM_ERROR;
}
xfree(query);
if (!(cnt = mysql_num_rows(result))) {
	mysql_free_result(result);
	return 0;
}
buffer = (*pack_func)(result, cluster_name, cnt, usage_info, &period_start);
mysql_free_result(result);
error_code = archive_write_file(buffer, cluster_name, period_start,
				period_end, arch_dir, sql_table,
				archive_period);
If we had problems with locks in mysql_db_query_ret, could the file have been written?
> And another question: do you happen to use sreport?
We use it sometimes when we want to see how jobs will be scheduled and understand where their priority comes from, but we don't have crons executing such a command, if that is your question.
Created attachment 7024 [details]
traces from mariadb (locks)
Hi Alejandro,
Please have a look at the latest file I attached: it contains the mariadb traces about locks, and there were interesting things going on at the time of the last slurmdbd crash. Our DBA has pointed out that the following "select ... for update" request:
select account, array_max_tasks, nodes_alloc, id_assoc, id_array_job, id_array_task, id_block, derived_ec, derived_es, exit_code, timelimit, time_eligible, time_end, id_group, job_db_inx, id_job, kill_requid, mcs_label, job_name, nodelist, node_inx, pack_job_id, pack_job_offset, `partition`, priority, id_qos, cpus_req, mem_req, id_resv, time_start, state, time_submit, time_suspended, track_steps, id_user, wckey, id_wckey, work_dir, tres_alloc, tres_req from "tars_job_table" where time_submit < 1520402399 && time_end != 0 order by time_submit asc for update
tries to lock the job table but can't get that lock because the table is already locked by another request. So slurmdbd waits for the lock and finally exits when the timeout is reached...
Thanks for the database trace and analysis. We're still evaluating further solutions to this; sorry it's taking so long to be properly fixed.
Created attachment 7062 [details]
remove for update patch when archiving
Hi. Will you apply the attached patch and see how it goes? We highly suspect your DB admin's report is pinpointing the problem, and this should fix it. Please let us know how it goes. Thanks, and sorry for the delayed response.
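The shape of the change, as the earlier comments describe it: the archive-read selects lose their "for update" clause. A sketch (the column list is abbreviated here; the timestamp is the one from the DBA's trace):

```shell
# Print a before/after comparison of the archive read query; purely
# illustrative text output, no database involved.
archive_queries() {
cat <<'EOF'
-- before the patch: takes row locks while archiving, can deadlock
-- against concurrent job-completion inserts
select ... from "tars_job_table" where time_submit < 1520402399
    && time_end != 0 order by time_submit asc for update
-- after the patch: plain consistent read, no row locks held
select ... from "tars_job_table" where time_submit < 1520402399
    && time_end != 0 order by time_submit asc
EOF
}
archive_queries
```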
Hello,
> Hi. Will you apply the attached patch and see how it goes? We highly suspect
> your db admin report is pinpointing to the problem and this should fix it.
Done.
> Please, let us know how it goes.
Yes, of course. For the moment, everything works well. The archives/purges have completed successfully without any lock problem. Thanks.
Created attachment 7069 [details]
last hour of slurmdbd logs before archive/purge of 7pm and ER_LOCK_WAIT_TIMEOUT
Sorry, ER_LOCK_WAIT_TIMEOUT again.
Hi. We were able to reproduce without the last patch applied. Once it was applied, recompiled, and slurmdbd restarted, we were not able to reproduce anymore. Sorry to ask, but we're interested in making sure you didn't forget to recompile and restart the daemon (it has been restarted now, since it fatal'ed). Thanks.

I would have asked the same of my own users, don't worry ;-) And you are right: unlike slurmctld and the slurmds, slurmdbd wasn't restarted after the update. I checked in the logs. My apologies for that oversight. Since the restart, no archive/purge problems.

Thanks, Sophie. We really think the real fix for this bug is just the last patch, which removes the "for update" clause from 4 select queries in the _archive_table function (attachment 7062 [details]). So we'd encourage you to apply just this one on a stock build, so that you have a clean build without the intermediate patches we've been attaching during the course of this bug. If after a day or two you confirm there are no more issues, we will check this last patch in to .8 and close the bug. Thanks for your collaboration.
17.11.7 reinstalled with only the "remove for update" patch and, of course, slurmdbd restarted like the other daemons. I'll inform you if something happens. No timeout has taken place since the last restart.

Good morning. How did slurmdbd behave overnight? We can leave it running during the day, and if everything is fine I'd check the patch in and close this. Thanks.

Hello,

No problem with slurmdbd up to now. I'm going to clean the runaway jobs to see if new ones appear.

Thanks,

Hello,

No problem over the weekend :-)

Do you think that even with that change we could face ER_LOCK_WAIT_TIMEOUT again in the future if the number of jobs increases a lot?

Thanks,

(In reply to Sophie Créno from comment #134)
> Hello,
>
> No problem over the weekend :-)

That is good, thanks for the feedback. We've pushed this commit:

    commit 4c448fd72c4f36ae309a46af8de2c9799f363822 (HEAD -> slurm-17.11, origin/slurm-17.11)
    Author:     Danny Auble <da@schedmd.com>
    AuthorDate: Mon Jun 18 15:43:31 2018 +0200
    Commit:     Alejandro Sanchez <alex@schedmd.com>
    CommitDate: Mon Jun 18 15:43:31 2018 +0200

        MySQL - Prevent deadlock caused by archive logic locking reads.
        Specifically due to SELECT ... FOR UPDATE ones. Bug 5086.

which will be available from 17.11.8 onwards.

> Do you think that even with that change, we could face ER_LOCK_WAIT_TIMEOUT
> again in the future if the number of jobs increase a lot?
>
> Thanks,

Theoretically not. But in practice there can always arise specific use-cases where (especially under heavy HTC conditions) this could happen. ER_LOCK_WAIT_TIMEOUT can be caused either by genuinely bad things like the deadlock reported in the MySQL traces here, or by heavy operations taking too long (the rollup process is prone to this), or similar. Although with this patch we've reduced the likelihood of this problem, unusual combinations of different kinds of operations accessing the same tables could still cause problems. We always look forward and strive to avoid all this. I'm gonna go ahead and close this bug.
Thanks again for your patience and collaboration.
Hello,

Since our last upgrade to 17.11.5, we have observed 2 types of runaway jobs:

* Some are seen simultaneously as PENDING:

    ID           Name       Partition  Cluster    State      TimeStart           TimeEnd
    ------------ ---------- ---------- ---------- ---------- ------------------- -------------------
    30535967     allocation            tars       PENDING    Unknown             Unknown

and COMPLETED:

    $ sacct -j 30535967
    JobID        JobName    Partition  QOS        State      Start               Elapsed    End                 ExitCode
    ------------ ---------- ---------- ---------- ---------- ------------------- ---------- ------------------- --------
    30533734_19+ ez_50      dedicated  fast       COMPLETED  2018-04-08T16:11:25 00:20:37   2018-04-08T16:32:02 0:0
    30533734_19+ batch                            COMPLETED  2018-04-08T16:11:25 00:20:37   2018-04-08T16:32:02 0:0
    30533734_19+ mcell                            COMPLETED  2018-04-08T16:11:25 00:20:37   2018-04-08T16:32:02 0:0

Note that these jobs requeue themselves automatically by exiting with the code set as RequeueExit in slurm.conf.

* Others are seen as RUNNING:

    ID           Name       Partition  Cluster    State      TimeStart           TimeEnd
    ------------ ---------- ---------- ---------- ---------- ------------------- -------------------
    32070617     D-15-0013+ dedicated  tars       RUNNING    2018-04-19T23:43:55 Unknown

    $ sacct -j 32070617
    JobID        JobName    Partition  QOS        State      Start               Elapsed    End
    ------------ ---------- ---------- ---------- ---------- ------------------- ---------- -------------------
    32070617     D-15-0013+ dedicated  fast       RUNNING    2018-04-19T23:43:55 15:45:52   Unknown

but have in fact been over for several hours:

    2018-04-19T22:55:33+02:00 tars-master slurmctld[29480]: _slurm_rpc_submit_batch_job: JobId=32070617 InitPrio=7029 usec=7829
    2018-04-19T23:43:55+02:00 tars-master slurmctld[29480]: sched: Allocate JobID=32070617 NodeList=tars-464 #CPUs=1 Partition=dedicated
    2018-04-20T00:05:07+02:00 tars-master slurmctld[29480]: _job_complete: JobID=32070617 State=0x1 NodeCnt=1 WEXITSTATUS 0
    2018-04-20T00:05:07+02:00 tars-master slurmctld[29480]: _job_complete: JobID=32070617 State=0x8003 NodeCnt=1 done
    2018-04-20T00:07:38+02:00 tars-master slurmctld[29480]: Resending TERMINATE_JOB request JobId=32070617 Nodelist=tars-464
    2018-04-20T00:08:35+02:00 tars-master slurmctld[29480]: cleanup_completing: job 32070617 completion process took 208 seconds

In both cases, we don't understand why they are runaway jobs:

* why the first type of job shows 2 different states depending on the command used to query the db,
* why the second type is not marked as completed in the db, whereas it seems that everything behaved as expected. Can it be due to the completion process time?

As a side effect, for the first type, the information displayed by sacct is correct, but when we fix the runaway jobs the end time is set to the start time and thus becomes wrong:

    $ sacct -j 30535967
    JobID        JobName    Partition  QOS        State      Start               Elapsed    End                 ExitCode
    ------------ ---------- ---------- ---------- ---------- ------------------- ---------- ------------------- --------
    30533734_19+ ez_50      dedicated  fast       COMPLETED  2018-04-08T16:11:25 00:00:00   2018-04-08T16:11:25
    30533734_19+ batch                            COMPLETED  2018-04-08T16:11:25 00:20:37   2018-04-08T16:32:02 0:0
    30533734_19+ mcell                            COMPLETED  2018-04-08T16:11:25 00:20:37   2018-04-08T16:32:02 0:0

What should we do to avoid these 2 situations? We would like to keep the accounting as accurate as possible. I guess runaway jobs should be rare and only appear in special circumstances, such as jobs terminating while the daemons are stopped because of a maintenance operation (such as patching Slurm).

Thanks in advance,
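Regarding the end time being rewritten when fixing runaways: per the sacctmgr warning quoted earlier in this thread, the fix sets the end time to the latest of the submit, eligible, and start times, which is why a job whose recorded times all fall at or before its start ends up with Elapsed 00:00:00. A minimal sketch of that rule follows; the submit and eligible times are hypothetical, only the start time is taken from the sacct output above:

```python
# Sketch of the rule sacctmgr applies when fixing a runaway job:
# end time := max(submit, eligible, start), state := COMPLETED.
# Field names and the submit/eligible values here are illustrative.
from datetime import datetime

submit = datetime(2018, 4, 8, 16, 10, 0)    # hypothetical
eligible = datetime(2018, 4, 8, 16, 10, 0)  # hypothetical
start = datetime(2018, 4, 8, 16, 11, 25)    # from the sacct output above

new_end = max(submit, eligible, start)
elapsed = new_end - start

print(new_end, elapsed)  # -> 2018-04-08 16:11:25 0:00:00
```

Since a job's start is normally the latest of those three timestamps, the corrected end equals the start and the elapsed time collapses to zero, matching the "ez_50" line in the fixed sacct output.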