On our sequoia machine, Adam B. reported that on multiple occasions he has observed jobs showing a RUNNING state while the block to which they have been assigned is still booting. He doesn't think this is a significant problem, only new behavior; in the past, jobs remained in CONFIGURING while a block was booting. Is this a bug? Does it suggest a deeper problem?
I haven't noticed this, but I can verify it tomorrow. If the block is booting, the job state should be CONFIGURING.
Don, I have tried this with the latest 14.03 as well as 14.11 and cannot reproduce it: the job stays in CONFIGURING while the block boots. Looking at the code, I am not sure how the CONFIGURING flag could get cleared beforehand either. The block would have to go into an inited (booted) state for the flag to be removed; the flag is added to the job right before Slurm tries to boot the block, and the block is checked for the inited state for every job. Do you happen to have any snips of the slurmctld log for one of these jobs, or details of how the state was noticed?
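To make the flag lifecycle above concrete, here is a minimal sketch (illustrative Python, not Slurm's actual C code; all names here are hypothetical) of the invariant described: the CONFIGURING flag is added right before the boot attempt and removed only once the block reaches an inited (booted) state.

```python
# Illustrative model of the CONFIGURING-flag lifecycle described above.
# Hypothetical names; the real logic lives in Slurm's BlueGene plugin (C).

BLOCK_FREE, BLOCK_BOOTING, BLOCK_INITED = range(3)

class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.configuring = False

def begin_block_boot(job):
    """The flag is added to the job right before Slurm tries to boot the block."""
    job.configuring = True
    # ... issue the boot request to the control system here ...

def poll_block_state(job, block_state):
    """Checked for every job: only an inited (booted) block clears the flag."""
    if block_state == BLOCK_INITED:
        job.configuring = False
```

Under this model the flag cannot be cleared while the block is still BOOTING, which is why the reported RUNNING-while-booting behavior was surprising.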
Don, do you have anything further on this?
(In reply to Danny Auble from comment #3)
> Don, do you have anything further on this?

I'm just relaying Adam's observation. He has yet to reply to me regarding your findings. Let's leave the ticket open for now.
OK, we are seeing the problem consistently on sequoia right now. db2 is a little sluggish, and that might be what reveals the problem. Here's a summary:

We start off by freeing the block (a reservation is in place), then submit the job at 11:17:42 as so:

sbatch -N 2K -p pbatch --reservation=systemtest hello.test

- mmcs_server.log shows the block booting at 11:17:45
- squeue shows the job in a RUNNING state a few seconds after this, while the block is still booting
- runjob_server.log: setAllocating() at 11:17:45
- runjob_server.log shows the block booting at 11:18:47
- runjob_server.log: doneAllocating() at 11:19:31
- runjob_server.log: block status changed to Initialized at 11:19:40
- mmcs_server.log shows the block INITIALIZED at 11:19:40

slurmctld.log shows:

11:17:42 _slurm_rpc_submit_batch_job
11:17:43 Queue start of job 2343504 in BG block RMP04Ja215428500
11:17:43 sched: Allocate JobId=2343504 in MidplaneList=seq[1230x1233]
11:18:04 Configuration for job 2343504 is complete
11:19:04 Configuration for job 2343504 is complete
11:19:42 sched: _slurm_rpc_job_step_create: StepId=2343504.0 seq[1230x1233] usec=1469
11:19:42 sched: _slurm_rpc_step_complete: StepId=2343504.0 usec=34580
11:19:42 completing job 2343504 on status 256
11:19:42 Queue termination of job 2343504 in BG block RMP04Ja215428500
11:19:42 sched: job_complete for JobId=2343504 successful, exit code=256

The job fails at 11:19:42. The job output file shows:

"could not start job: block definition is invalid"
"block 'RMP04Ja215428500' with status Initialized is not ready to run jobs"
The lines "Configuration for job 2343504 is complete" are very interesting. On a BGQ system they appear to happen only when the job's time limit is reached while the job isn't running, or when the nodes in the bitmap are involved with powering on; I am fairly sure the job has the RUNNING state here. You don't happen to have any nodes in a power save mode (most likely not)? This line tells me when the CONFIGURING flag is (erroneously) taken away. I have just reproduced this with 14.11. I'll try to figure out why this code path is happening and get you a fix. From the log the state change does not appear harmful, though: the step isn't created until after the block is booted.

The message in your job output I have seen, but only in the simulator...

2015-01-05 15:30:57.003 (FATAL) [0x7f6572044780] 19315:ibm.runjob.client.Job: could not start job: block definition is invalid
2015-01-05 15:30:57.003 (FATAL) [0x7f6572044780] 19315:ibm.runjob.client.Job: block 'RMP05Ja153053744' with status Initialized is not ready to run jobs

I was told by IBM (long ago) this would not happen on a real system. Perhaps bring this up with them. I am almost positive it has to do with your db2 being sluggish though.
(In reply to Danny Auble from comment #6)
[...]
> I was told by IBM (long ago) this would not happen on a real system.
> Perhaps bring this up with them. I am almost positive it has to do with
> your db2 being sluggish though.

Yes, that is the key. Once the db2 speeds up, we can't recreate the problem.
This is fixed with commit 7778799925 in 14.11. The patch is fairly simple and can easily be backported to 14.03. Your debug output was very handy, thanks. As said before, the state issue is only cosmetic; the bug doesn't pose any real threat.
This just in... I read Danny's response with his "fix" and I fear he may not have fixed the same bug we're seeing. He says what he fixed is purely cosmetic, but given that we are experiencing user jobs failing (with the message that Don already placed in the ticket) it seems like it is not purely cosmetic. Any opportunity to clarify with schedmd that the non-cosmetic version of this has been fixed? -Adam
Jobs failing because of db2 is something I can't fix; bring this up with IBM. The fix is related to the configuring flag being removed before it should be.
The problem with this ticket is that there are two different bugs here: first the title bug, then the bug where the step fails after the block boots. The fix refers to the title bug. As noted, I think the step dying isn't something Slurm can avoid unless we sleep a little before handing things off to runjob. IBM might have a way to handle this now, but they didn't years ago when I questioned them about it. Up to this point I was told this would only happen on my simulator.
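Danny's "sleep a little before handing things off to runjob" idea amounts to retrying the launch across the brief not-ready window. A minimal sketch of that pattern (illustrative Python only; `launch_fn` is a hypothetical stand-in for the runjob handoff, not a real Slurm or IBM API):

```python
import time

def launch_with_retry(launch_fn, attempts=5, delay=2.0):
    """Retry a launch that can fail transiently, e.g. when a block
    reports Initialized but is not yet ready to run jobs.

    launch_fn is a hypothetical callable standing in for the runjob
    handoff; it returns True on success and False on the transient
    "not ready to run jobs" failure.
    """
    for attempt in range(attempts):
        if launch_fn():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the control system time to settle
    return False
```

Per the quoted IBM exchange below this is a polling-window race, so a short, bounded retry like this is the usual mitigation when the failure resolves itself "a second later".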
(In reply to Danny Auble from comment #11)
> The problem with this bug is 2 different bugs are here. First the title
> bug, then the bug where the step fails after the block boots. The fix refers
> to the title bug. As noted I think the step dieing isn't something slurm
> can avoid unless we sleep a little before handing things off to runjob. IBM
> might have a way to handle this now, but they didn't years ago when I
> questioned them about it. Up to this point I was told this would only
> happen on my simulator.

So, as I understand you, there is no way for Slurm to tell when the block has finished booting and wait for that to happen before launching the first step. Correct?
My understanding is that the block is booted, but for some reason it won't allow jobs to run right away. I believe it is a race condition; Tom Budnik would know.
Here is the message thread from years ago when I talked with Sam and Tom about it. In this thread it appeared the realtime server wasn't running at the time (perhaps in your case as well?). Begin at the bottom and work your way up. Hopefully it isn't too confusing.

> On 12/05/2012 09:42 AM, Danny Auble wrote:
> Of course, the message just seemed strange. While we never expect things to die, we always plan for if they do.
>
> The message just seems contradictory from my point of view. But I understand the polling issue, and the odds of this happening are small, so it isn't that big of a deal. Thanks for explaining it.
>
> Danny
>
>> On 12/05/2012 09:35 AM, Sam Miller wrote:
>> If it dies there is likely a defect that needs to be fixed. The bgmaster_server will then restart it.
>>
>> Sam Miller
>> Blue Gene Control System Team Lead
>> IBM Rochester
>> samjmill@us.ibm.com
>> 507-253-0691
>>
>>> On 12/05/2012 10:22 AM, Danny Auble <da@schedmd.com> wrote:
>>> What if it just dies? Would this not happen?
>>>
>>>> Sam Miller <samjmill@us.ibm.com> wrote:
>>>> Don't kill the real-time server.
>>>>
>>>>> On 12/05/2012 09:29 AM, Danny Auble <da@schedmd.com> wrote:
>>>>> But this leads to failed jobs. Is there a way to avoid it?
>>>>>
>>>>>> Sam Miller <samjmill@us.ibm.com> wrote:
>>>>>> This is expected when the real-time server is not active. The runjob_server uses a polling interval in these scenarios, opening up a timing window like what you see here.
>>>>>>
>>>>>>> On 12/04/2012 06:24 PM, Danny Auble <da@schedmd.com> wrote:
>>>>>>> Subject: strange error from runjob
>>>>>>>
>>>>>>> I was testing killing the realtime server and such and noticed when I was booting a block (polling for state instead of the realtime server telling me things) I get this...
>>>>>>>
>>>>>>> 2012-12-04 18:22:22.064 (FATAL) [0xfffa8e68b90] 20030:ibm.runjob.client.Job: could not start job: block definition is invalid
>>>>>>> 2012-12-04 18:22:22.064 (FATAL) [0xfffa8e68b90] 20030:ibm.runjob.client.Job: block 'RMP04De142155117' with status Initialized is not ready to run jobs
>>>>>>>
>>>>>>> What does this mean? The block state is 'I' as it says and the user has been added, so what makes it not able to run jobs?
>>>>>>>
>>>>>>> If I run a job a second later things are just fine. Seems strange.
>>>>>>>
>>>>>>> Danny
Ok. Thanks for the history. We'll submit a PMR to IBM.