| Summary: | Jobs Disappear from Sequoia | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | Bluegene select plugin | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | Severity: | 3 - Medium Impact |
| Version: | 2.4.x | Hardware: | IBM BlueGene |
| OS: | Linux | Site: | LLNL |
**Description**

Don Lipari, 2012-10-19 11:57:51 MDT
**Comment 1**

A complete log would shed some light on what is happening. The period from 2012-10-19T15:57:36 to 2012-10-19T15:57:46 would be helpful.

**Comment 2**

Actually, I only need the 2 seconds of time: 2012-10-19T15:57:45 and 2012-10-19T15:57:46. It most likely has to do with this:

```
[2012-10-19T15:57:45] going to free block RMP19Oc131401677 there are no jobs running. This will only happen if the cnodes went into error after no jobs were running.
```

**Comment 3**

It would be interesting to see if something similar happened for the 52719 job as well.

**Comment 4**

(In reply to comment #2)
> Actually I only need the 2 seconds of time.
>
> 2012-10-19T15:57:45
> and
> 2012-10-19T15:57:46
>
> It most likely has to do with this...
> [2012-10-19T15:57:45] going to free block RMP19Oc131401677 there are no jobs
> running. This will only happen if the cnodes went into error after no jobs
> were running.

```
[2012-10-19T15:57:45] Warning: Note very large processing time from _slurm_rpc_job_ready: usec=1531036
[2012-10-19T15:57:45] Job 52715 completion process took 108 seconds
[2012-10-19T15:57:45] debug: sched: Running job scheduler
[2012-10-19T15:57:45] debug: bluegene:submit_job: 52718 mode=512 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=unassigned mps=144-144-144
[2012-10-19T15:57:45] number of blocks to check: 2 state 512 asking for 1179648-1179648 cpus
[2012-10-19T15:57:45] going to free block RMP19Oc131401677 there are no jobs running. This will only happen if the cnodes went into error after no jobs were running.
[2012-10-19T15:57:45] Don't need to look at myself RMP19Oc131401677 RMP19Oc131401677
[2012-10-19T15:57:45] debug: _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:57:45] we found one! RMP19Oc131401677
[2012-10-19T15:57:45] debug: _find_best_block_match RMP19Oc131401677 <seq[0000x2233]>
[2012-10-19T15:57:45] debug: 512(0) can start job 52718 at 1350687465 on RMP19Oc131401677(seq[0000x2233]) 1
[2012-10-19T15:57:45] debug: _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:57:45] Queue start of job 52718 in BG block RMP19Oc131401677
[2012-10-19T15:57:45] debug: waiting for block RMP19Oc131401677 to free for job 52718 1 other(s) trying to free it
[2012-10-19T15:57:45] sched: Allocate JobId=52718 MidplaneList=seq[0000x2233]
[2012-10-19T15:57:46] We are freeing a block (RMP19Oc131401677) that has job 52718(52718).
[2012-10-19T15:57:46] debug: waiting for block RMP19Oc131401677 to free for job 52718 2 other(s) trying to free it
[2012-10-19T15:57:46] We are freeing a block (RMP19Oc131401677) that has job 52718(52718).
[2012-10-19T15:57:46] error: Couldn't requeue job 52718, failing it: Requested operation is presently disabled
[2012-10-19T15:57:46] Queue termination of job 52718 in BG block RMP19Oc131401677
[2012-10-19T15:57:46] debug: _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:57:46] debug: _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:57:46] error: Couldn't requeue job 52718, failing it: Job/step already completing or completed
[2012-10-19T15:57:46] debug: _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:57:46] debug: _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:57:46] debug: _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:57:46] debug: _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:57:46] RMP19Oc131401677 has 1024 in error
```
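The two "Couldn't requeue job 52718" errors in the excerpt correspond to requeueing being unavailable for the job. A minimal way to check the relevant settings on a live system (a sketch using standard scontrol queries; exact output wording and fields vary by Slurm version):

```sh
# Cluster-wide default: JobRequeue=1 means batch jobs may be requeued
scontrol show config | grep -i JobRequeue

# Per-job flag: Requeue=1 in the job record means this job is requeueable
scontrol show job 52718 | grep -o "Requeue=[01]"
```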
**Comment 5**

(In reply to comment #3)
> It would be interesting to see if something similar happened for the 52719
> job as well.

```
[2012-10-19T15:58:41] debug: bluegene:submit_job: 52719 mode=513 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=(null) mps=144-144-144
[2012-10-19T15:58:41] debug: 513 can start unassigned job 52719 at 1350687521 on seq[0000x2233]
[2012-10-19T15:58:41] _slurm_rpc_submit_batch_job JobId=52719 usec=65663
[2012-10-19T15:58:41] debug: bluegene:submit_job: 52719 mode=512 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=unassigned mps=144-144-144
[2012-10-19T15:58:42] debug: 512(0) can start job 52719 at 1350687521 on RMP19Oc155841431(seq[0000x2233]) 1
[2012-10-19T15:58:42] sync: adding RMP19Oc155841431 0x401ab4072b8
[2012-10-19T15:58:42] Queue start of job 52719 in BG block RMP19Oc155841431
[2012-10-19T15:58:42] debug: _track_freeing_blocks: Going to free 1 for job 52719
[2012-10-19T15:58:42] sched: Allocate JobId=52719 MidplaneList=seq[0000x2233]
[2012-10-19T15:58:42] debug: _track_freeing_blocks: freed 0 of 1 for job 52719
[2012-10-19T15:58:42] Warning: Note very large processing time from schedule: usec=1354579
[2012-10-19T15:58:42] Warning: Note very large processing time from _slurmctld_background: usec=1028295
[2012-10-19T15:58:43] debug: waiting for block RMP19Oc131401677 to free for job 52718 5 other(s) trying to free it
[2012-10-19T15:58:43] debug: _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:58:43] debug: slurmdbd: DBD_RC is 7000 from DBD_STEP_COMPLETE(1441): (null)
[2012-10-19T15:58:44] debug: _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:58:44] debug: _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:58:45] debug: _track_freeing_blocks: freed 0 of 1 for job 52719
[2012-10-19T15:58:45] debug: waiting for block RMP19Oc131401677 to free for job 52718 5 other(s) trying to free it
[2012-10-19T15:58:45] debug: _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:58:45] error: Couldn't requeue job 52719, failing it: Requested operation is presently disabled
[2012-10-19T15:58:45] Queue termination of job 52719 in BG block RMP19Oc155841431
[2012-10-19T15:58:45] RMP19Oc155841431 has 268 in error
[2012-10-19T15:58:46] error: Couldn't requeue job 52719, failing it: Job/step already completing or completed
[2012-10-19T15:58:47] debug: _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:58:47] debug: _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:58:47] debug: _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:58:47] debug: _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:58:47] debug: _track_freeing_blocks: Freed them all for job 4294967294
[2012-10-19T15:58:47] others are modifing this block RMP19Oc155841431, don't clear it up
[2012-10-19T15:58:47] Warning: Note very large processing time from _slurm_rpc_job_ready: usec=1345887
[2012-10-19T15:58:47] debug: waiting for block RMP19Oc131401677 to free for job 52718 5 other(s) trying to free it
[2012-10-19T15:58:47] debug: _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:58:47] debug: _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:58:47] debug: _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:58:47] debug: _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:58:47] slurmd error running JobId=52719 from front_end=seqlac2: Job credential revoked
[2012-10-19T15:58:47] completing job 52719
[2012-10-19T15:58:47] _slurm_rpc_complete_batch_script JobId=52719: Job/step already completing or completed
[2012-10-19T15:58:47] debug: _track_freeing_blocks: Freed them all for job 4294967294
[2012-10-19T15:58:47] debug: _track_freeing_blocks: Freed them all for job 4294967294
[2012-10-19T15:58:47] debug: _track_freeing_blocks: Freed them all for job 4294967294
[2012-10-19T15:58:47] debug: _track_freeing_blocks: Freed them all for job 4294967294
[2012-10-19T15:58:47] others are modifing this block RMP19Oc155841431, don't clear it up
[2012-10-19T15:58:47] others are modifing this block RMP19Oc155841431, don't clear it up
[2012-10-19T15:58:47] others are modifing this block RMP19Oc155841431, don't clear it up
[2012-10-19T15:58:47] others are modifing this block RMP19Oc155841431, don't clear it up
[2012-10-19T15:58:47] debug: _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:58:48] debug: _track_freeing_blocks: Freed them all for job 4294967294
[2012-10-19T15:58:48] others are modifing this block RMP19Oc155841431, don't clear it up
[2012-10-19T15:58:48] debug: sched: Running job scheduler
[2012-10-19T15:58:48] debug: waiting for block RMP19Oc131401677 to free for job 52718 5 other(s) trying to free it
[2012-10-19T15:58:48] debug: _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:58:48] debug: _track_freeing_blocks: freed 0 of 1 for job 52719
[2012-10-19T15:58:49] debug: waiting for block RMP19Oc131401677 to free for job 52718 5 other(s) trying to free it
[...]
[2012-10-19T16:03:17] debug: _track_freeing_blocks: freed 0 of 1 for job 52719
[2012-10-19T16:03:20] debug: _track_freeing_blocks: Freed them all for job 52719
[2012-10-19T16:03:20] debug: waiting for block RMP19Oc155841431 to free for job 52719 2 other(s) trying to free it
[2012-10-19T16:03:22] debug: waiting for block RMP19Oc155841431 to free for job 52719 1 other(s) trying to free it
[2012-10-19T16:03:23] debug: job 52719 already finished before boot
```

**Comment 6**

Also, do you know if the DebugFlags changed at all during this time period? It really looks like the NoRealTime flag was set. It isn't now though.

**Comment 7**

(In reply to comment #6)
> Also, do you know if the DebugFlags changed at all during this time period?
> It really looks like the NoRealTime flag was set. It isn't now though.

No, I have never enabled NoRealTime, nor do I suspect anyone else has. I searched through the svn log for the slurm.conf file on seq and it has never been included.
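Whether NoRealTime (or any other debug flag) is active can be read straight off the running controller. A minimal sketch, assuming slurm.conf lives in the default /etc/slurm location (a hypothetical path; adjust to the site's install):

```sh
# Flags the running slurmctld is using right now
scontrol show config | grep -i DebugFlags

# Flags requested by the config file on the controller host
grep -i "^DebugFlags" /etc/slurm/slurm.conf
```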
**Comment 8**

Nevermind, I found a bug that was preventing messages from being printed. The fix for that is now checked in: patch 9054e4e0c08870623ceb8a28e2e77fec86e23630.

From what I can tell, this has been an extremely difficult one to track down. Something was happening to the hardware during the startup of these 2 jobs. If you notice, both jobs are looking to use the same hardware but different block IDs. It leads me to think something happened with the hardware to make these blocks not only fail but need to be destroyed, which is the reason the jobs are being removed. I am still looking into it, but that is what it appears to be right now.

**Comment 9**

Do you know if any cables went into anything but Available during this time?

**Comment 10**

(In reply to comment #9)
> Do you know if any cables went into anything but Available during this time?

Still researching the cable question. But Dave just reported:

Note we just had a similar condition (jobs 52442 and 52445). In this case job 52442 failed due to application software errors. The next job, #52445, failed with the message below. It seems in this case the bluegene system was not really ready for another job to start, but slurm thought it was and so released the job.

```
2012-10-22 11:01:05.301 (FATAL) [0x40000a18ab0] 16940:ibm.runjob.client.Job: could not start job: block is unavailable due to a previous failure
2012-10-22 11:01:05.303 (FATAL) [0x40000a18ab0] 16940:ibm.runjob.client.Job: node R30-M0-N00-J00 is not available: Software Failure
```

Is this something slurm can deal with better, or should I report this to IBM?

**Comment 11**

I'll check this out, but it would be helpful if you were able to grab hold of the patch I just checked in so I can get better debug.

Thanks

**Comment 12**

(In reply to comment #11)
> I'll check this out, but it would be helpful if you were able to grab hold
> of the patch I just checked in so I can get better debug.
>
> Thanks

Will do. Once I install the new build, do you want the NoRealTime included in the debug flags or omitted?

**Comment 13**

(In reply to comment #11)
> I'll check this out, but it would be helpful if you were able to grab hold
> of the patch I just checked in so I can get better debug.
>
> Thanks

OK. New build is installed and running on seq.
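Grabbing a single fix like the one referenced in comment #8 out of the Slurm source tree only takes ordinary git commands. A minimal sketch, assuming a local clone of the Slurm repository in which the commit is reachable:

```sh
# Inspect the referenced commit
git show 9054e4e0c08870623ceb8a28e2e77fec86e23630

# Export it as a standalone patch and apply it to a local build tree
git format-patch -1 --stdout 9054e4e0c08870623ceb8a28e2e77fec86e23630 > fix.patch
git apply fix.patch
```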
**Comment 14**

There might be some bad info from Dave for job 52445.

According to the log, 52445 didn't start until after the error message:

```
[2012-10-22T11:01:09] Queue start of job 52445 in BG block RMP22Oc103600061
```

Step 0 didn't start until 11:12, 11 minutes after the error message:

```
[2012-10-22T11:12:44] debug: _slurm_rpc_job_alloc_info_lite JobId=52445 NodeList=seq[1000x3233] usec=3
[2012-10-22T11:12:44] select_p_step_pick_nodes: new step for job 52445 will be running on RMP22Oc103600061(seq[1000x3233])
[2012-10-22T11:12:44] sched: _slurm_rpc_job_step_create: StepId=52445.0 seq[1000x3233] usec=46667
```

The log also states the step ran until its untimely demise (probably a software error; I didn't look more into it):

```
[2012-10-22T11:43:58] Signal 998 of StepId=52445.0 by UID=101: Job/step already completing or completed
```

My guess is that instead of 52445 he meant 52443 and 52444, which were both short runs, 52444 with the block failing. It is hard to see any cnodes that went into error from 52442, though. Do you have information about those 2 jobs? I don't think this is related to this bug, though.

**Comment 15**

(In reply to comment #12)
> Will do. Once I install the new build, do you want the NoRealTime included
> in the debug flags or omitted?

Don't use the NoRealTime option.

**Comment 16**

(In reply to comment #15)
> Don't use the NoRealTime option.

Ok. That is the way I left it when I restarted the new build.

**Comment 17**

(In reply to comment #14)
> Do you have information about those 2 jobs?

Dave replies: from the bluegene navigator I see

| Job | User | Code | Block | Size | Start | End |
|---|---|---|---|---|---|---|
| 52445 | morozov3 | ./hacc_tpm | RMP22Oc103600061 | 73,728 | 10/22/12 11:12 | 10/22/12 11:43 |
| 52442 | morozov3 | ./hacc_tpm | RMP22Oc103600061 | 73,728 | 10/22/12 10:43 | 10/22/12 11:00 |

Maybe job 52443 is the one that reported R30-M0-N00-J00 not available, which would seem to agree with what Danny saw, but the bluegene runjob never logged slurm job 52443. I will report this to IBM.

**Comment 18**

I got note of 52443 starting a step but not of 52444. If I was notified of bad cnodes (SoftwareError) on this block, neither of these jobs would have run. 52442 ran on the same block, and there was quite a bit of action going on with the realtime server right before that job ended, but if there were any cnodes in an error state the other 2 jobs wouldn't have started.

**Comment 19**

Future viewers of this bug, please disregard comments 10-18 as they do not appear to deal with this bug.

---

I don't think this had anything to do with the cables either. I just checked in more debug to get a better idea of where this is happening. Hopefully it happens again. If you could rebuild with the new debug, that would be helpful.

---

Anything further on this?

---

Please reopen if this occurs again. Closing. This is fixed in patch 8a49d19f4d41bdf98016c01e53c25fac14bbd8bf and will be in the next 2.5 release.
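To check whether a given build already contains the closing fix, the commit hash can be tested against the git history. A sketch, assuming a local Slurm clone with the project's release tags fetched (the exact tag naming convention is an assumption; inspect `git tag` output for the real names):

```sh
# List release tags whose history already includes the fix commit
git tag --contains 8a49d19f4d41bdf98016c01e53c25fac14bbd8bf

# Or confirm it is an ancestor of the branch you build from
git merge-base --is-ancestor 8a49d19f4d41bdf98016c01e53c25fac14bbd8bf HEAD && echo "fix present"
```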