On our sequoia machine, Adam B. reported that on multiple occasions he has observed jobs showing a RUNNING state while the block to which they have been assigned is still booting. He doesn't think this is a significant problem, only new behavior; in the past, jobs remained in CONFIGURING while a block was booting. Is this a bug? Does it suggest a deeper problem?
I haven't noticed this, but I can verify it tomorrow. If the block is booting, the job state should be CONFIGURING.
Don, I have tried this with the latest 14.03 as well as 14.11 and cannot reproduce it: the job stays in CONFIGURING while the block boots. Looking at the code, I am not sure how the CONFIGURING flag could get cleared beforehand either. The block would have to go into an inited (booted) state for the flag to be removed; the flag is added to the job right before Slurm tries to boot the block, and the block is checked for the inited state for every job. Do you happen to have any snips of the slurmctld log for one of these jobs, or details of how the state was noticed?
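To make the flag lifecycle above concrete, here is a minimal sketch (illustrative Python, not Slurm's actual C code; all names here are hypothetical) of the invariant described: the CONFIGURING flag is added right before the boot attempt and removed only once the block reaches an inited (booted) state.

```python
# Illustrative model of the CONFIGURING-flag lifecycle described above.
# Hypothetical names; the real logic lives in Slurm's BlueGene plugin (C).

BLOCK_FREE, BLOCK_BOOTING, BLOCK_INITED = range(3)

class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.configuring = False

def begin_block_boot(job):
    """The flag is added to the job right before Slurm tries to boot the block."""
    job.configuring = True
    # ... issue the boot request to the control system here ...

def poll_block_state(job, block_state):
    """Checked for every job: only an inited (booted) block clears the flag."""
    if block_state == BLOCK_INITED:
        job.configuring = False
```

Under this model the flag cannot be cleared while the block is still BOOTING, which is why the reported RUNNING-while-booting behavior was surprising.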
Don, do you have anything further on this?
(In reply to Danny Auble from comment #3)
> Don, do you have anything further on this?

I'm just relaying Adam's observation. He has yet to reply to me regarding your findings. Let's leave the ticket open for now.
OK, we are seeing the problem consistently on sequoia right now. db2 is a little sluggish, and that might be what reveals the problem. Here's a summary:

We start off by freeing the block (a reservation is in place), then submit the job at 11:17:42 as so:

sbatch -N 2K -p pbatch --reservation=systemtest hello.test

- mmcs_server.log shows the block booting at 11:17:45
- squeue shows the job in a RUNNING state a few seconds after this, while the block is still booting
- runjob_server.log: setAllocating() at 11:17:45
- runjob_server.log shows the block booting at 11:18:47
- runjob_server.log: doneAllocating() at 11:19:31
- runjob_server.log: block status changed to Initialized at 11:19:40
- mmcs_server.log shows the block INITIALIZED at 11:19:40

slurmctld.log shows:

11:17:42 _slurm_rpc_submit_batch_job
11:17:43 Queue start of job 2343504 in BG block RMP04Ja215428500
11:17:43 sched: Allocate JobId=2343504 in MidplaneList=seq[1230x1233]
11:18:04 Configuration for job 2343504 is complete
11:19:04 Configuration for job 2343504 is complete
11:19:42 sched: _slurm_rpc_job_step_create: StepId=2343504.0 seq[1230x1233] usec=1469
11:19:42 sched: _slurm_rpc_step_complete: StepId=2343504.0 usec=34580
11:19:42 completing job 2343504 on status 256
11:19:42 Queue termination of job 2343504 in BG block RMP04Ja215428500
11:19:42 sched: job_complete for JobId=2343504 successful, exit code=256

The job fails at 11:19:42. The job output file shows:

"could not start job: block definition is invalid"
"block 'RMP04Ja215428500' with status Initialized is not ready to run jobs"
The lines "Configuration for job 2343504 is complete" are very interesting. On a BGQ system they appear to happen only when the job's time limit is reached while the job isn't running, or when the nodes in the bitmap are involved with powering on; I am fairly sure the job has the RUNNING state here. You don't happen to have any nodes in a power save mode (most likely not)? This line tells me when the CONFIGURING flag is (erroneously) taken away. I have just reproduced this with 14.11. I'll try to figure out why this code path is happening and get you a fix. From the log the state change does not appear harmful, though: the step isn't created until after the block is booted.

The message in your job output I have seen, but only in the simulator...

2015-01-05 15:30:57.003 (FATAL) [0x7f6572044780] 19315:ibm.runjob.client.Job: could not start job: block definition is invalid
2015-01-05 15:30:57.003 (FATAL) [0x7f6572044780] 19315:ibm.runjob.client.Job: block 'RMP05Ja153053744' with status Initialized is not ready to run jobs

I was told by IBM (long ago) this would not happen on a real system. Perhaps bring this up with them. I am almost positive it has to do with your db2 being sluggish though.
(In reply to Danny Auble from comment #6)
[...]
> I was told by IBM (long ago) this would not happen on a real system.
> Perhaps bring this up with them. I am almost positive it has to do with
> your db2 being sluggish though.

Yes, that is the key. Once the db2 speeds up, we can't recreate the problem.
This is fixed with commit 7778799925 in 14.11. The patch is fairly simple and can easily be backported to 14.03. Your debug output was very handy, thanks. As said before, the state issue is only cosmetic; the bug doesn't pose any real threat.
This just in... I read Danny's response with his "fix" and I fear he may not have fixed the same bug we're seeing. He says what he fixed is purely cosmetic, but given that we are experiencing user jobs failing (with the message that Don already placed in the ticket) it seems like it is not purely cosmetic. Any opportunity to clarify with schedmd that the non-cosmetic version of this has been fixed? -Adam
Jobs failing because of db2 is something I can't fix; bring this up with IBM. The fix is related to the configuring flag being removed before it should be.
The problem with this ticket is that there are two different bugs here: first the title bug, then the bug where the step fails after the block boots. The fix refers to the title bug. As noted, I think the step dying isn't something Slurm can avoid unless we sleep a little before handing things off to runjob. IBM might have a way to handle this now, but they didn't years ago when I questioned them about it. Up to this point I was told this would only happen on my simulator.
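Danny's "sleep a little before handing things off to runjob" idea amounts to retrying the launch across the brief not-ready window. A minimal sketch of that pattern (illustrative Python only; `launch_fn` is a hypothetical stand-in for the runjob handoff, not a real Slurm or IBM API):

```python
import time

def launch_with_retry(launch_fn, attempts=5, delay=2.0):
    """Retry a launch that can fail transiently, e.g. when a block
    reports Initialized but is not yet ready to run jobs.

    launch_fn is a hypothetical callable standing in for the runjob
    handoff; it returns True on success and False on the transient
    "not ready to run jobs" failure.
    """
    for attempt in range(attempts):
        if launch_fn():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # give the control system time to settle
    return False
```

Per the quoted IBM exchange below this is a polling-window race, so a short, bounded retry like this is the usual mitigation when the failure resolves itself "a second later".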
(In reply to Danny Auble from comment #11)
> The problem with this bug is 2 different bugs are here. First the title
> bug, then the bug where the step fails after the block boots. The fix refers
> to the title bug. As noted I think the step dieing isn't something slurm
> can avoid unless we sleep a little before handing things off to runjob. IBM
> might have a way to handle this now, but they didn't years ago when I
> questioned them about it. Up to this point I was told this would only
> happen on my simulator.

So, as I understand you, there is no way for Slurm to tell when the block has finished booting and wait for that to happen before launching the first step. Correct?
My understanding is that the block is booted, but for some reason it won't allow jobs to run right away. I believe it is a race condition; Tom Budnik would know.
Here is the message thread from years ago when I talked with Sam and Tom about it. In this thread it appeared the realtime server wasn't running at the time (perhaps in your case as well?). Begin at the bottom and work your way up. Hopefully it isn't too confusing.

> On 12/05/2012 09:42 AM, Danny Auble wrote:
> Of course, the message just seemed strange. While we never expect things to die, we always plan for if they do.
>
> The message just seems contradictory from my point of view. But I understand the polling issue, and the odds of this happening are small, so it isn't that big of a deal. Thanks for explaining it.
>
> Danny
>
>> On 12/05/2012 09:35 AM, Sam Miller wrote:
>> If it dies there is likely a defect that needs to be fixed. The bgmaster_server will then restart it.
>>
>> Sam Miller
>> Blue Gene Control System Team Lead
>> IBM Rochester
>> samjmill@us.ibm.com
>> 507-253-0691
>>
>>> On 12/05/2012 10:22 AM, Danny Auble <da@schedmd.com> wrote:
>>> What if it just dies? Would this not happen?
>>>
>>>> Sam Miller <samjmill@us.ibm.com> wrote:
>>>> Don't kill the real-time server.
>>>>
>>>>> On 12/05/2012 09:29 AM, Danny Auble <da@schedmd.com> wrote:
>>>>> But this leads to failed jobs. Is there a way to avoid it?
>>>>>
>>>>>> Sam Miller <samjmill@us.ibm.com> wrote:
>>>>>> This is expected when the real-time server is not active. The runjob_server uses a polling interval in these scenarios, opening up a timing window like what you see here.
>>>>>>
>>>>>>> On 12/04/2012 06:24 PM, Danny Auble <da@schedmd.com> wrote:
>>>>>>> Subject: strange error from runjob
>>>>>>>
>>>>>>> I was testing killing the realtime server and such and noticed when I was booting a block (polling for state instead of the realtime server telling me things) I get this...
>>>>>>>
>>>>>>> 2012-12-04 18:22:22.064 (FATAL) [0xfffa8e68b90] 20030:ibm.runjob.client.Job: could not start job: block definition is invalid
>>>>>>> 2012-12-04 18:22:22.064 (FATAL) [0xfffa8e68b90] 20030:ibm.runjob.client.Job: block 'RMP04De142155117' with status Initialized is not ready to run jobs
>>>>>>>
>>>>>>> What does this mean? The block state is 'I' as it says and the user has been added, so what makes it not able to run jobs?
>>>>>>>
>>>>>>> If I run a job a second later things are just fine. Seems strange.
>>>>>>>
>>>>>>> Danny
Ok. Thanks for the history. We'll submit a PMR to IBM.