Ticket 146

Summary: Jobs Disappear from Sequoia
Product: Slurm Reporter: Don Lipari <lipari1>
Component: Bluegene select plugin    Assignee: Danny Auble <da>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact    
Priority: ---    
Version: 2.4.x   
Hardware: IBM BlueGene   
OS: Linux   
Site: LLNL

Description Don Lipari 2012-10-19 11:57:51 MDT
This problem has been reported sporadically for some time now.  I could use some help diagnosing what is causing it.  From the user's perspective, the job just disappears from the queue without running.

A bit before 4 pm, a 72K job died on seq.  The next two jobs that tried to run "disappeared" from the queue very quickly; see the report below.

(03:59:12 PM) draeger-home: fox18:  I just had a job (52718) disappear from the queue with no output. 
(03:59:22 PM) draeger-home: And another one:  52719

The next 72K job failed to boot after multiple tries due to a link hardware error.

The question is: why do the jobs that try to start right after the failure disappear, and can that be fixed?

Here's an example of job 52718 from the logs:

[2012-10-19T15:57:36] debug:  bluegene:submit_job: 52718 mode=514 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=unassigned mps=144-144-144
[2012-10-19T15:57:36] debug:  514 can start unassigned job 52718 at 1350687461 on seq[0000x2233]
[2012-10-19T15:57:45] debug:  bluegene:submit_job: 52718 mode=512 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=unassigned mps=144-144-144
[2012-10-19T15:57:45] number of blocks to check: 2 state 512 asking for 1179648-1179648 cpus
[2012-10-19T15:57:45] going to free block RMP19Oc131401677 there are no jobs running.  This will only happen if the cnodes went into error after no jobs were running.
[2012-10-19T15:57:45] Don't need to look at myself RMP19Oc131401677 RMP19Oc131401677
[2012-10-19T15:57:45] debug:  _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:57:45] we found one! RMP19Oc131401677
[2012-10-19T15:57:45] debug:  _find_best_block_match RMP19Oc131401677 <seq[0000x2233]>
[2012-10-19T15:57:45] debug:  512(0) can start job 52718 at 1350687465 on RMP19Oc131401677(seq[0000x2233]) 1
[2012-10-19T15:57:45] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:57:45] Queue start of job 52718 in BG block RMP19Oc131401677
[2012-10-19T15:57:45] debug:  waiting for block RMP19Oc131401677 to free for job 52718 1 other(s) trying to free it
[2012-10-19T15:57:45] sched: Allocate JobId=52718 MidplaneList=seq[0000x2233]
[2012-10-19T15:57:46] We are freeing a block (RMP19Oc131401677) that has job 52718(52718).
[2012-10-19T15:57:46] debug:  waiting for block RMP19Oc131401677 to free for job 52718 2 other(s) trying to free it
[2012-10-19T15:57:46] We are freeing a block (RMP19Oc131401677) that has job 52718(52718).
[2012-10-19T15:57:46] error: Couldn't requeue job 52718, failing it: Requested operation is presently disabled
[2012-10-19T15:57:46] Queue termination of job 52718 in BG block RMP19Oc131401677
[2012-10-19T15:57:46] debug:  _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:57:46] debug:  _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:57:46] error: Couldn't requeue job 52718, failing it: Job/step already completing or completed
[2012-10-19T15:57:46] debug:  _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:57:46] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:57:46] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:57:46] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:57:46] RMP19Oc131401677 has 1024 in error
[2012-10-19T15:57:47] debug:  waiting for block RMP19Oc131401677 to free for job 52718 4 other(s) trying to free it
[2012-10-19T15:57:48] debug:  waiting for block RMP19Oc131401677 to free for job 52718 4 other(s) trying to free it
[2012-10-19T15:57:49] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:57:49] slurmd error running JobId=52718 from front_end=seqlac1: Job credential revoked
[2012-10-19T15:57:49] completing job 52718
[2012-10-19T15:57:49] _slurm_rpc_complete_batch_script JobId=52718: Job/step already completing or completed 
[2012-10-19T15:57:50] debug:  waiting for block RMP19Oc131401677 to free for job 52718 4 other(s) trying to free it
[2012-10-19T15:57:51] debug:  waiting for block RMP19Oc131401677 to free for job 52718 4 other(s) trying to free it
[...]
[2012-10-19T16:03:22] debug:  waiting for block RMP19Oc131401677 to free for job 52718 5 other(s) trying to free it
[2012-10-19T16:03:23] debug:  job 52718 already finished before boot
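The failure sequence in the excerpt — allocate, a concurrent block free, failed requeues, then credential revocation — can be pulled out of a slurmctld log mechanically. A minimal sketch in Python (the line format is assumed to match the excerpt above; the patterns and helper name are illustrative, not part of Slurm):

```python
import re

# Events worth flagging for a job that "disappears": allocation, a block
# free racing the allocation, requeue failures, and credential revocation.
PATTERNS = [
    ("allocated",      r"sched: Allocate JobId=(?P<job>\d+)"),
    ("freeing_block",  r"We are freeing a block \((?P<block>\S+)\) that has job (?P<job>\d+)"),
    ("requeue_failed", r"error: Couldn't requeue job (?P<job>\d+)"),
    ("cred_revoked",   r"slurmd error running JobId=(?P<job>\d+).*Job credential revoked"),
]

def trace_job(lines, job_id):
    """Return (timestamp, event) pairs for one job, in log order."""
    events = []
    for line in lines:
        for name, pat in PATTERNS:
            m = re.search(pat, line)
            if m and m.group("job") == str(job_id):
                ts = re.match(r"\[(?P<ts>[^\]]+)\]", line)
                events.append((ts.group("ts") if ts else "?", name))
    return events
```

Run over the excerpt above, `trace_job(lines, 52718)` yields allocated, then freeing_block, then the requeue failures, then cred_revoked — the allocation and the free racing within the same second.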
Comment 1 Danny Auble 2012-10-19 12:45:53 MDT
A complete log would shed some light on what is happening.

2012-10-19T15:57:36
to
2012-10-19T15:57:46

would be helpful.
Comment 2 Danny Auble 2012-10-19 12:48:08 MDT
Actually, I only need these two seconds:

2012-10-19T15:57:45
and
2012-10-19T15:57:46

It most likely has to do with this...
[2012-10-19T15:57:45] going to free block RMP19Oc131401677 there are no jobs running.  This will only happen if the cnodes went into error after no jobs were running.
Comment 3 Danny Auble 2012-10-19 12:52:03 MDT
It would be interesting to see if something similar happened for the 52719 job as well.
Comment 4 Don Lipari 2012-10-20 11:15:33 MDT
(In reply to comment #2)
> Actually I only need the 2 seconds of time.
> 
> 2012-10-19T15:57:45
> and
> 2012-10-19T15:57:46
> 
> It most likely has to do with this...
> [2012-10-19T15:57:45] going to free block RMP19Oc131401677 there are no jobs
> running.  This will only happen if the cnodes went into error after no jobs
> were running.

[2012-10-19T15:57:45] Warning: Note very large processing time from _slurm_rpc_job_ready: usec=1531036
[2012-10-19T15:57:45] Job 52715 completion process took 108 seconds
[2012-10-19T15:57:45] debug:  sched: Running job scheduler
[2012-10-19T15:57:45] debug:  bluegene:submit_job: 52718 mode=512 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=unassigned mps=144-144-144
[2012-10-19T15:57:45] number of blocks to check: 2 state 512 asking for 1179648-1179648 cpus
[2012-10-19T15:57:45] going to free block RMP19Oc131401677 there are no jobs running.  This will only happen if the cnodes went into error after no jobs were running.
[2012-10-19T15:57:45] Don't need to look at myself RMP19Oc131401677 RMP19Oc131401677
[2012-10-19T15:57:45] debug:  _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:57:45] we found one! RMP19Oc131401677
[2012-10-19T15:57:45] debug:  _find_best_block_match RMP19Oc131401677 <seq[0000x2233]>
[2012-10-19T15:57:45] debug:  512(0) can start job 52718 at 1350687465 on RMP19Oc131401677(seq[0000x2233]) 1
[2012-10-19T15:57:45] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:57:45] Queue start of job 52718 in BG block RMP19Oc131401677
[2012-10-19T15:57:45] debug:  waiting for block RMP19Oc131401677 to free for job 52718 1 other(s) trying to free it
[2012-10-19T15:57:45] sched: Allocate JobId=52718 MidplaneList=seq[0000x2233]
[2012-10-19T15:57:46] We are freeing a block (RMP19Oc131401677) that has job 52718(52718).
[2012-10-19T15:57:46] debug:  waiting for block RMP19Oc131401677 to free for job 52718 2 other(s) trying to free it
[2012-10-19T15:57:46] We are freeing a block (RMP19Oc131401677) that has job 52718(52718).
[2012-10-19T15:57:46] error: Couldn't requeue job 52718, failing it: Requested operation is presently disabled
[2012-10-19T15:57:46] Queue termination of job 52718 in BG block RMP19Oc131401677
[2012-10-19T15:57:46] debug:  _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:57:46] debug:  _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:57:46] error: Couldn't requeue job 52718, failing it: Job/step already completing or completed
[2012-10-19T15:57:46] debug:  _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:57:46] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:57:46] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:57:46] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:57:46] RMP19Oc131401677 has 1024 in error
Comment 5 Don Lipari 2012-10-20 11:22:05 MDT
(In reply to comment #3)
> It would be interesting to see if something similar happened for the 52719
> job as well.

[2012-10-19T15:58:41] debug:  bluegene:submit_job: 52719 mode=513 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=(null) mps=144-144-144
[2012-10-19T15:58:41] debug:  513 can start unassigned job 52719 at 1350687521 on seq[0000x2233]
[2012-10-19T15:58:41] _slurm_rpc_submit_batch_job JobId=52719 usec=65663
[2012-10-19T15:58:41] debug:  bluegene:submit_job: 52719 mode=512 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=unassigned mps=144-144-144
[2012-10-19T15:58:42] debug:  512(0) can start job 52719 at 1350687521 on RMP19Oc155841431(seq[0000x2233]) 1
[2012-10-19T15:58:42] sync: adding RMP19Oc155841431 0x401ab4072b8
[2012-10-19T15:58:42] Queue start of job 52719 in BG block RMP19Oc155841431
[2012-10-19T15:58:42] debug:  _track_freeing_blocks: Going to free 1 for job 52719
[2012-10-19T15:58:42] sched: Allocate JobId=52719 MidplaneList=seq[0000x2233]
[2012-10-19T15:58:42] debug:  _track_freeing_blocks: freed 0 of 1 for job 52719
[2012-10-19T15:58:42] Warning: Note very large processing time from schedule: usec=1354579
[2012-10-19T15:58:42] Warning: Note very large processing time from _slurmctld_background: usec=1028295
[2012-10-19T15:58:43] debug:  waiting for block RMP19Oc131401677 to free for job 52718 5 other(s) trying to free it
[2012-10-19T15:58:43] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:58:43] debug:  slurmdbd: DBD_RC is 7000 from DBD_STEP_COMPLETE(1441): (null)
[2012-10-19T15:58:44] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:58:44] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:58:45] debug:  _track_freeing_blocks: freed 0 of 1 for job 52719
[2012-10-19T15:58:45] debug:  waiting for block RMP19Oc131401677 to free for job 52718 5 other(s) trying to free it
[2012-10-19T15:58:45] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:58:45] error: Couldn't requeue job 52719, failing it: Requested operation is presently disabled
[2012-10-19T15:58:45] Queue termination of job 52719 in BG block RMP19Oc155841431
[2012-10-19T15:58:45] RMP19Oc155841431 has 268 in error
[2012-10-19T15:58:46] error: Couldn't requeue job 52719, failing it: Job/step already completing or completed
[2012-10-19T15:58:47] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:58:47] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:58:47] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:58:47] debug:  _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:58:47] debug:  _track_freeing_blocks: Freed them all for job 4294967294
[2012-10-19T15:58:47] others are modifing this block RMP19Oc155841431, don't clear it up
[2012-10-19T15:58:47] Warning: Note very large processing time from _slurm_rpc_job_ready: usec=1345887
[2012-10-19T15:58:47] debug:  waiting for block RMP19Oc131401677 to free for job 52718 5 other(s) trying to free it
[2012-10-19T15:58:47] debug:  _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:58:47] debug:  _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:58:47] debug:  _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:58:47] debug:  _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:58:47] slurmd error running JobId=52719 from front_end=seqlac2: Job credential revoked
[2012-10-19T15:58:47] completing job 52719
[2012-10-19T15:58:47] _slurm_rpc_complete_batch_script JobId=52719: Job/step already completing or completed 
[2012-10-19T15:58:47] debug:  _track_freeing_blocks: Freed them all for job 4294967294
[2012-10-19T15:58:47] debug:  _track_freeing_blocks: Freed them all for job 4294967294
[2012-10-19T15:58:47] debug:  _track_freeing_blocks: Freed them all for job 4294967294
[2012-10-19T15:58:47] debug:  _track_freeing_blocks: Freed them all for job 4294967294
[2012-10-19T15:58:47] others are modifing this block RMP19Oc155841431, don't clear it up
[2012-10-19T15:58:47] others are modifing this block RMP19Oc155841431, don't clear it up
[2012-10-19T15:58:47] others are modifing this block RMP19Oc155841431, don't clear it up
[2012-10-19T15:58:47] others are modifing this block RMP19Oc155841431, don't clear it up
[2012-10-19T15:58:47] debug:  _track_freeing_blocks: Going to free 1 for job 4294967294
[2012-10-19T15:58:48] debug:  _track_freeing_blocks: Freed them all for job 4294967294
[2012-10-19T15:58:48] others are modifing this block RMP19Oc155841431, don't clear it up
[2012-10-19T15:58:48] debug:  sched: Running job scheduler
[2012-10-19T15:58:48] debug:  waiting for block RMP19Oc131401677 to free for job 52718 5 other(s) trying to free it
[2012-10-19T15:58:48] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
[2012-10-19T15:58:48] debug:  _track_freeing_blocks: freed 0 of 1 for job 52719
[2012-10-19T15:58:49] debug:  waiting for block RMP19Oc131401677 to free for job 52718 5 other(s) trying to free it
[...]
[2012-10-19T16:03:17] debug:  _track_freeing_blocks: freed 0 of 1 for job 52719
[2012-10-19T16:03:20] debug:  _track_freeing_blocks: Freed them all for job 52719
[2012-10-19T16:03:20] debug:  waiting for block RMP19Oc155841431 to free for job 52719 2 other(s) trying to free it
[2012-10-19T16:03:22] debug:  waiting for block RMP19Oc155841431 to free for job 52719 1 other(s) trying to free it
[2012-10-19T16:03:23] debug:  job 52719 already finished before boot
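A side note on the recurring "job 4294967294" in these excerpts: that value is 0xfffffffe, which matches Slurm's NO_VAL sentinel (defined in slurm.h), so these lines describe block frees that are not tied to a real job ID rather than an actual job. A quick check:

```python
# "job 4294967294" in the _track_freeing_blocks lines is not a real job:
# 4294967294 == 0xfffffffe, Slurm's NO_VAL sentinel from slurm.h, used
# here for block frees that are not associated with a specific job ID.
NO_VAL = 0xfffffffe
print(NO_VAL)  # 4294967294
```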
Comment 6 Danny Auble 2012-10-22 08:45:35 MDT
Also, do you know if the DebugFlags changed at all during this time period?  It really looks like the NoRealTime flag was set.  It isn't now though.
Comment 7 Don Lipari 2012-10-22 08:58:56 MDT
(In reply to comment #6)
> Also, do you know if the DebugFlags changed at all during this time period? 
> It really looks like the NoRealTime flag was set.  It isn't now though.

No, I have never enabled the NoRealTime, nor do I suspect anyone else has.  I searched through the svn log for the slurm.conf file on seq and it has never been included.
Comment 8 Danny Auble 2012-10-22 09:04:24 MDT
Never mind, I found a bug that was preventing messages from being printed.  The fix for that is now checked in as patch 9054e4e0c08870623ceb8a28e2e77fec86e23630.

From what I can tell, this has been an extremely difficult one to track down.  Something was happening to the hardware during the startup of these two jobs.  Notice that both jobs were looking to use the same hardware but under different block IDs.  That leads me to think something happened to the hardware that made these blocks not only fail but need to be destroyed, which is why the jobs are being removed.  I am still looking into it, but that is what it appears to be right now.
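The same-hardware/different-blocks observation can be checked mechanically by grouping block IDs by their midplane list. A minimal sketch (the "can start job ... on BLOCK(midplanes)" line format is taken from the log excerpts above; the helper name is illustrative):

```python
import re
from collections import defaultdict

def blocks_by_midplanes(lines):
    """Map each midplane list to the set of block IDs scheduled on it."""
    groups = defaultdict(set)
    # e.g. "512(0) can start job 52718 at ... on RMP19Oc131401677(seq[0000x2233]) 1"
    pat = re.compile(r"can start job \d+ at \d+ on (?P<block>\S+)\((?P<mps>[^)]+)\)")
    for line in lines:
        m = pat.search(line)
        if m:
            groups[m.group("mps")].add(m.group("block"))
    return dict(groups)
```

Against the excerpts above, both RMP19Oc131401677 (job 52718) and RMP19Oc155841431 (job 52719) fall under seq[0000x2233]: two distinct blocks built on the same midplanes, consistent with the hardware-failure theory.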
Comment 9 Danny Auble 2012-10-22 09:09:43 MDT
Do you know if any cables went into any state other than Available during this time?
Comment 10 Don Lipari 2012-10-22 09:18:45 MDT
(In reply to comment #9)
> Do you know if any Cable's went into anything but Available during this time?

Still researching the cable question.  But Dave just reported:

Note we just had a similar condition (jobs 52442 and 52445).  In this case
job 52442 failed due to application software errors.  The next job, 52445,
failed with the message below.  It seems in this case the BlueGene system was
not really ready for another job to start, but Slurm thought it was and so
released the job.

2012-10-22 11:01:05.301 (FATAL) [0x40000a18ab0] 16940:ibm.runjob.client.Job:
could not start job: block is unavailable due to a previous failure
2012-10-22 11:01:05.303 (FATAL) [0x40000a18ab0] 16940:ibm.runjob.client.Job:
node R30-M0-N00-J00 is not available: Software Failure


Is this something slurm can deal with better or should I report this
to IBM?
Comment 11 Danny Auble 2012-10-22 09:23:54 MDT
I'll check this out, but it would be helpful if you could grab the patch I just checked in so I can get better debug output.

Thanks
Comment 12 Don Lipari 2012-10-22 09:27:10 MDT
(In reply to comment #11)
> I'll check this out, but it would be helpful if you could grab the patch
> I just checked in so I can get better debug output.
> 
> Thanks

Will do.  Once I install the new build, do you want the NoRealTime included in the debug flags or omitted?
Comment 13 Don Lipari 2012-10-22 09:36:28 MDT
(In reply to comment #11)
> I'll check this out, but it would be helpful if you could grab the patch
> I just checked in so I can get better debug output.
> 
> Thanks

OK.  New build is installed and running on seq.
Comment 14 Danny Auble 2012-10-22 09:48:15 MDT
There might be some bad info from Dave for job 52445.

According to the log, 52445 didn't start until after the error message:
[2012-10-22T11:01:09] Queue start of job 52445 in BG block RMP22Oc103600061

Step 0 didn't start until 11:12, 11 minutes after the error message.

[2012-10-22T11:12:44] debug:  _slurm_rpc_job_alloc_info_lite JobId=52445 NodeList=seq[1000x3233] usec=3
[2012-10-22T11:12:44] select_p_step_pick_nodes: new step for job 52445 will be running on RMP22Oc103600061(seq[1000x3233])
[2012-10-22T11:12:44] sched: _slurm_rpc_job_step_create: StepId=52445.0 seq[1000x3233] usec=46667

The log also states that the step ran until its untimely demise (probably a software error; I didn't look into it further).

[2012-10-22T11:43:58] Signal 998 of StepId=52445.0 by UID=101: Job/step already completing or completed

My guess is that instead of 52445 he meant 52443 and 52444, which were both short runs, 52444 ending with the block failing.  It is hard to see any cnodes that went into error from 52442, though.  Do you have information about those two jobs?  I don't think this is related to this bug, though.
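The 11-minute figure is plain timestamp arithmetic on the slurmctld log format; a quick sanity check, using the queue-start (11:01:09) and step-create (11:12:44) timestamps quoted above:

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"  # slurmctld log timestamp format

def log_gap_minutes(ts_a, ts_b):
    """Minutes elapsed between two slurmctld log timestamps."""
    a = datetime.strptime(ts_a, FMT)
    b = datetime.strptime(ts_b, FMT)
    return (b - a).total_seconds() / 60

# Queue start of 52445 at 11:01:09, step 0 created at 11:12:44
gap = log_gap_minutes("2012-10-22T11:01:09", "2012-10-22T11:12:44")
print(round(gap, 1))  # 11.6 — roughly 11 minutes after the error message
```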
Comment 15 Danny Auble 2012-10-22 09:49:04 MDT
(In reply to comment #12)
> (In reply to comment #11)
> > I'll check this out, but it would be helpful if you could grab the patch
> > I just checked in so I can get better debug output.
> > 
> > Thanks
> 
> Will do.  Once I install the new build, do you want the NoRealTime included
> in the debug flags or omitted?

Don't use the NoRealTime option.
Comment 16 Don Lipari 2012-10-22 09:52:31 MDT
(In reply to comment #15)
> Don't use the NoRealTime option.

OK.  That is the way I left it when I started the new build.
Comment 17 Don Lipari 2012-10-22 10:04:19 MDT
(In reply to comment #14)
> There might be some bad info from Dave for job 52445.
> 
> According to the log 52445 didn't start until after the error message
> [2012-10-22T11:01:09] Queue start of job 52445 in BG block RMP22Oc103600061
> 
> Step 0 didn't start until 11:12 11 minutes after the error message.
> 
> [2012-10-22T11:12:44] debug:  _slurm_rpc_job_alloc_info_lite JobId=52445
> NodeList=seq[1000x3233] usec=3
> [2012-10-22T11:12:44] select_p_step_pick_nodes: new step for job 52445 will
> be running on RMP22Oc103600061(seq[1000x3233])
> [2012-10-22T11:12:44] sched: _slurm_rpc_job_step_create: StepId=52445.0
> seq[1000x3233] usec=46667
> 
> The log also states the step ran until it's untimely demise (probably
> software error, I didn't look more into it)
> 
> [2012-10-22T11:43:58] Signal 998 of StepId=52445.0 by UID=101: Job/step
> already completing or completed
> 
> My guess is instead of 52445 he meant 52443 and 52444 which were both short
> run 52444 with the block failing.  It is hard to see any cnodes that went
> into error from 52442 though.  Do you have information about those 2 jobs? 
> I don't think this is related to this bug though.

Dave replies:
From the bluegene navigator I see

Job   User     code       block            Size   Start           End
52445 morozov3 ./hacc_tpm RMP22Oc103600061 73,728 10/22/12 11:12 10/22/12 11:43
52442 morozov3 ./hacc_tpm RMP22Oc103600061 73,728 10/22/12 10:43 10/22/12 11:00

Maybe job 52443 is the one that reported R30-M0-N00-J00 not available, which
would seem to agree with what Danny saw, but the BlueGene runjob never
logged Slurm job 52443.

I will report this to IBM.
Comment 18 Danny Auble 2012-10-22 10:12:27 MDT
I got note of a step starting for 52443, but not for 52444.

If I had been notified of bad cnodes (SoftwareError) on this block, neither of these jobs would have run.  52442 ran on the same block, and there was quite a bit of activity with the realtime server right before that job ended, but if there were any cnodes in an error state the other two jobs wouldn't have started.
Comment 19 Danny Auble 2012-10-22 11:09:21 MDT
Future viewers of this bug, please disregard comments 10-18 as they do not appear to deal with this bug.

I don't think this had anything to do with the cables either.

I just checked in more debugging to get a better idea of where this is happening.  Hopefully it happens again.  If you could rebuild with the new debug code, that would be helpful.
Comment 20 Danny Auble 2012-10-29 08:58:25 MDT
Anything further on this?
Comment 21 Moe Jette 2013-02-26 08:47:45 MST
Please reopen if this occurs again.  Closing.
Comment 22 Danny Auble 2013-04-01 12:10:48 MDT
This is fixed in patch 8a49d19f4d41bdf98016c01e53c25fac14bbd8bf and will be in the next 2.5 release.