Created attachment 2262 [details] clus152 log

Job 15647267_247 (15647864) was submitted and ran on clus275 for a while. Then something bad happened to the node and the job was requeued to another node, clus152, which was also in a horrible state. And then the job disappeared from the queue...

------------------------------------------

- output of zgrep 15647864 slurmctld_log.gz

[2015-09-27T15:09:25.906] sched: Allocate JobId=15647267_247 (15647864) NodeList=clus275 #CPUs=32
[2015-09-27T18:18:45.583] requeue job 15647864 due to failure of node clus275
[2015-09-27T18:18:45.583] Requeuing JobID=15647267_247(15647864) State=0x0 NodeCnt=0
[2015-09-27T20:25:47.650] sched: Allocate JobId=15647267_247 (15647864) NodeList=clus152 #CPUs=40
[2015-09-27T20:53:16.047] requeue job 15647864 due to failure of node clus152
[2015-09-27T20:53:16.048] Requeuing JobID=15647267_247(15647864) State=0x0 NodeCnt=0
[2015-09-27T20:53:36.321] sched: Allocate JobId=15647267_247 (15647864) NodeList=clus152 #CPUs=40
[2015-09-27T21:51:37.731] job_complete: JobID=15647267_247(15647864) State=0x1 NodeCnt=1 WIFEXITED 0 WEXITSTATUS 0
[2015-09-27T21:51:37.731] job_complete: JobID=15647267_247(15647864) State=0x8003 NodeCnt=1 done
[2015-09-27T21:54:09.809] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T21:56:49.429] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T21:59:20.031] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T21:59:56.985] job_complete: JobID=15647267_247(15647864) State=0x8003 NodeCnt=1 WIFEXITED 0 WEXITSTATUS 0
[2015-09-27T22:01:59.635] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:04:39.379] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:07:19.186] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:09:59.772] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:12:39.544] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:15:19.977] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:17:59.808] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:20:39.248] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:23:19.432] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:25:59.735] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:28:39.647] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:31:19.783] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:33:59.859] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:36:39.297] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:39:19.182] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:41:59.060] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:44:39.124] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:47:19.221] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:49:59.600] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:52:39.180] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:55:19.069] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T22:57:59.483] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:00:39.743] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:03:19.215] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:05:50.433] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:08:21.647] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:11:01.906] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:13:32.027] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:16:11.459] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:18:51.801] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:21:31.586] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:24:02.594] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:26:42.330] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:29:22.739] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:32:02.497] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:34:33.561] Resending TERMINATE_JOB request JobId=15647864 Nodelist=clus152
[2015-09-27T23:36:09.110] cleanup_completing: job 15647864 completion process took 6272 seconds

- A portion of the clus152 log is attached

- sacct output

[root@pud22 ~]# sacct -j 15647864
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
15647267_247 rt_m141_g+ teamrotto+                    40  COMPLETED      0:9
15647267_24+      batch                               40  CANCELLED      0:9

Long story short: once again, with a node that was seriously ill, Slurm failed to recover the task back into the queue, instead dropping it on the floor. This is why I (Phil) continue to insist that the default policy should be to put tasks back into the queue unless the task is explicitly cancelled by the user, or it runs absolutely to completion and exits with a 0 exit code.
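For reference, these are the knobs Slurm already exposes for requeue behaviour. This is only a generic sketch, not our actual configuration; the script name "job.sh" and "<jobid>" are placeholders:

# slurm.conf: let batch jobs be requeued after node failure (site-wide default)
JobRequeue=1

# per-job overrides at submission time
sbatch --requeue job.sh      # this job may be requeued
sbatch --no-requeue job.sh   # this job must never be requeued

# manual requeue by an operator
scontrol requeue <jobid>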
Here are some relevant bits of the slurmctld log file with my comments preceded by "***":

[2015-09-27T15:09:25.906] sched: Allocate JobId=15647267_247 (15647864) NodeList=clus275 #CPUs=32
[2015-09-27T18:18:45.583] requeue job 15647864 due to failure of node clus275
[2015-09-27T18:18:45.583] Requeuing JobID=15647267_247(15647864) State=0x0 NodeCnt=0
*** JOB ALLOCATED RESOURCES, STARTS, NODE FAILS, JOB REQUEUED
[2015-09-27T20:25:47.650] sched: Allocate JobId=15647267_247 (15647864) NodeList=clus152 #CPUs=40
[2015-09-27T20:53:16.047] requeue job 15647864 due to failure of node clus152
[2015-09-27T20:53:16.048] Requeuing JobID=15647267_247(15647864) State=0x0 NodeCnt=0
*** JOB ALLOCATED RESOURCES, STARTS, NODE FAILS, JOB REQUEUED
[2015-09-27T20:53:36.321] sched: Allocate JobId=15647267_247 (15647864) NodeList=clus152 #CPUs=40
*** JOB ALLOCATED BACK TO SAME NODE clus152
*** AT THIS POINT WE HAVE TWO COPIES OF THE SAME JOB RUNNING ON THE SAME BAD NODE
[2015-09-27T21:51:37.731] job_complete: JobID=15647267_247(15647864) State=0x1 NodeCnt=1 (RUNNING)
*** FIRST COPY OF THE JOB COMPLETES AND SLURM CONSIDERS THE JOB DONE BASED UPON MATCHING HOSTNAME AND JOB ID

=========================

Do you have something that configures down nodes back online immediately? If so, you could have all sorts of bad vestigial state, including jobs. Slurm uses the job ID and host names to uniquely identify jobs, so it isn't currently possible to distinguish between a vestigial job and a new one started on the same node.

Are you familiar with the available node health check tools? It may be valuable to run something like this either at the end of each job or on a periodic basis:
http://warewulf.lbl.gov/trac/wiki/Node%20Health%20Check
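To sketch what that could look like in slurm.conf (the nhc path, the interval, and the ReturnToService value below are illustrative assumptions, not taken from your configuration):

# run a node health check program on the compute nodes
HealthCheckProgram=/usr/sbin/nhc   # assumed install path for the NHC script
HealthCheckInterval=300            # seconds between checks (illustrative)
HealthCheckNodeState=ANY           # check nodes regardless of state

# related: how aggressively DOWN nodes are returned to service
ReturnToService=1                  # 0=admin only, 1=only if DOWN for being non-responsive, 2=any DOWN node that registers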
> Do you have something that configures down nodes back online immediately?

We don't. This happened while/after recovering from a power outage, so it could have been a flickering network.

> Are you familiar with available node health check tools?
> It may be valuable to run something like this either at the end of each job or
> on a periodic basis.

We don't use that tool; we run our monitoring (Nagios/Opsview) checks periodically, which monitor and sometimes fix issues on the nodes.

So the job vanished because it landed on the same node twice, one of the copies finished, and Slurm removed the job from the queue?
Correct.
I've got a pretty good idea of what happened then. It took some bad luck and bad timing, but I've implemented changes to prevent it.

1. Job launched
2. Node in some bad state, set DOWN and cancel job request sent
3. Node starts responding, set back to IDLE state
4. Job launched for a second time on the same node; at this point two copies are running (the problem only happens when that same node is used)
5. First copy of the job ends due to the cancel request in step #2, job marked as completed
6. Second copy of the job discovered and cancelled, as the job was already considered complete

What this patch does differently starts at step 4:

4. Job launch request sent to compute node. Slurmd discovers this is a duplicate job request and issues a requeue/hold request on the job.

Here is the change:
https://github.com/SchedMD/slurm/commit/c15139562ff19948c076d2076b05b3fb3f68287e
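With that change a duplicate launch should end up requeued and held rather than silently finishing the job. Roughly, using ordinary scontrol commands (<jobid> is a placeholder), an operator could then do something like:

# confirm the job was requeued and is now held
scontrol show job <jobid> | grep -E 'JobState|Reason|Restarts'

# release it once the node has been dealt with
scontrol release <jobid>

The same requeue-and-hold can also be done manually with "scontrol requeuehold <jobid>".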
Thanks, Moe!

So just to be clear, if we omit the second job launching from the sequence:

> 1. Job launched
> 2. Node in some bad state, set DOWN and cancel job request sent
> 3. Node starts responding, set back to IDLE state
> 4. First copy of job ends due to cancel request in step #2

...then the job would definitely not be marked as completed, right? Because the fact that the job was marked completed was the only truly troubling part of the story, for me.

Cheers,
-Phil
(In reply to Phil Schwan from comment #5)
> Thanks, Moe!
>
> So just to be clear, if we omit the second job launching from the sequence:
>
> > 1. Job launched
> > 2. Node in some bad state, set DOWN and cancel job request sent
> > 3. Node starts responding, set back to IDLE state
> > 4. First copy of job ends due to cancel request in step #2
>
> ...then the job would definitely not be marked as completed, right? Because
> the fact that the job was marked completed was the only truly troubling part
> of the story, for me.

Correct. The job only gets marked as completed when a message saying the job ended arrives for a job that was in RUNNING state on the same node that it was dispatched to. It needs to match:

1. Job ID
2. Job state of running
3. Same node as dispatched to