Ticket 218

Summary: Inconsistent Block States on Sequoia
Product: Slurm Reporter: Don Lipari <lipari1>
Component: Bluegene select plugin    Assignee: Danny Auble <da>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact    
Priority: ---    
Version: 2.5.x   
Hardware: IBM BlueGene   
OS: Linux   
Site: LLNL

Description Don Lipari 2013-02-01 02:45:30 MST
Running the v2.5.1 version with the core fixes (i.e., almost the tagged v2.5.2 version) and seeing block-state discrepancies.  When the system takes an overly long time to free a block, SLURM marks the block in an error state.  Once mmcs actually frees the block, SLURM never updates the block state.  Furthermore, there are now two blocks with overlapping nodes: the original in an error state and a new one in a free state.

Here are the relevant reports and logs.  Note that the IBM log shows the block taking 22:05 to deallocate, and the SLURM timestamp is 12 minutes before the shutdown completed.

# seqsn3 /bgsys/logs/BGQ.sn > sinfo -o "%50E %H %N"
REASON                                             TIMESTAMP MIDPLANELIST
Block would not deallocate                         2013-01-31T23:40:03 seq[0000x1133]


# seqsn3 /bgsys/logs/BGQ.sn > smap -cDb -n seq[0000x1133]
Fri Feb  1 06:43:53 2013
PARTITION         BG_BLOCK STATE     JOBID      CONN NODES MIDPLANELIST
    pbatch RMP31Ja232959535 Free           -  T,T,T,T   16K seq[0000x0133]
    pbatch RMP31Ja230103286 Error(F        -  M,T,T,T   32K seq[0000x1133]

slurmctld.log
[2013-01-31T23:30:00-08:00] debug:  freeing block RMP31Ja230103286
[2013-01-31T23:30:00-08:00] bridge_destroy RMP31Ja230103286
[2013-01-31T23:30:00-08:00] debug:  freeing block RMP31Ja230103286
[2013-01-31T23:30:00-08:00] bridge_block_free(RMP31Ja230103286): BlockFreeError (Most likely the block has pending action, should clear up shortly, check bridgeapi.log for further info) State = Ready
[2013-01-31T23:30:01-08:00] debug:  state of Block RMP31Ja230103286 was Ready and now is Term
[2013-01-31T23:30:01-08:00] debug:  Block RMP31Ja230103286 was in a ready state but is being freed. No job running.
[2013-01-31T23:40:03-08:00] 1 others are trying to destroy this block RMP31Ja230103286
[2013-01-31T23:40:03-08:00] _post_block_free: block RMP31Ja230103286 is not in state free (Term), putting it in error state.
[2013-01-31T23:40:03-08:00] Setting Block RMP31Ja230103286 to ERROR state. (reason: 'Block would not deallocate')
[2013-01-31T23:52:19-08:00] debug:  state of Block RMP31Ja230103286 was Error(Term) and now is Free

bridgeapi.log
[2013-01-31T23:30:00] ERROR: ibm.bgsched.Block: Block free request failed because block RMP31Ja230103286 has pending action [0x4000d4ff1f0]

mc_server log:
2013-01-31 23:01:25.760 (INFO ) [slurm@seqsn3:156] boot - cookie=101381507, bootId=RMP31Ja230103286, MachineController::configureAndBootBlock starting
2013-01-31 23:03:58.369 (INFO ) [slurm@seqsn3:156] boot - cookie=101381507, bootId=RMP31Ja230103286, MachineController::configureAndBootBlock finished, duration 152.609297, ok=1
2013-01-31 23:30:06.583 (INFO ) [slurm@seqsn3:156] shutdown - MachineController::shutdown() - starting - cookie=101381507, bootId=RMP31Ja230103286
2013-01-31 23:52:11.987 (INFO ) [slurm@seqsn3:156] shutdown - MachineController::shutdown() - did not lower pgood on these computes because of the global flag setting - cookie=101381507, bootId=RMP31Ja230103286
2013-01-31 23:52:11.987 (INFO ) [slurm@seqsn3:156] shutdown - MachineController::shutdown() - finished - cookie=101381507, bootId=RMP31Ja230103286, duration 1325.404124, ok=1
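Taken together, the logs show the race: slurmctld's free deadline expires at 23:40, twelve minutes before mmcs finishes its 22-minute shutdown at 23:52, and the Error flag set at the deadline is never cleared by the later transition to Free.  A toy shell model of that sequence (illustrative timings and variable names only, not Slurm code):

```shell
#!/usr/bin/env bash
# Hypothetical, simplified model of the race in this ticket:
# slurmctld polls the block while mmcs shuts it down; if the poll
# deadline passes first, the block is flagged ERROR, and the later
# transition to Free never clears that flag automatically.

free_timeout=3     # polls before slurmctld gives up (illustrative value)
mmcs_duration=5    # polls mmcs actually needs (22 minutes in the logs)

state="Term"       # bridge state after bridge_block_free()
flag="none"        # slurmctld's own record for the block

for ((t = 1; t <= mmcs_duration; t++)); do
    if (( t == mmcs_duration )); then
        state="Free"                                  # mmcs finally frees it
    fi
    if (( t == free_timeout )) && [[ $state != Free ]]; then
        flag="ERROR: Block would not deallocate"      # the _post_block_free path
    fi
done

echo "bridge state: $state"   # Free
echo "slurm flag:   $flag"    # still ERROR: requires operator intervention
```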
Comment 1 Danny Auble 2013-02-01 03:32:48 MST
Slurm will never take a block out of an error state by itself.  It always requires operator intervention.

The overlapping block is the interesting one though.

The history of the RMP31Ja232959535 block would be interesting.  Could you send me the logs for it?
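(For reference, the "operator intervention" mentioned above is done with scontrol.  The block name below is the one from this ticket, and the available State values should be checked against your release's BlueGene documentation; this is a sketch of the usual commands, not a transcript from this system.)

```shell
# Clear the error flag once the block has actually freed:
scontrol update BlockName=RMP31Ja230103286 State=free

# Or remove the block entirely so SLURM recreates it as needed:
scontrol update BlockName=RMP31Ja230103286 State=remove
```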
Comment 2 Don Lipari 2013-02-01 04:10:57 MST
(In reply to comment #1)
> Slurm will never take a block out of an error state by itself.  It always
> requires operator intervention.
> 
> The overlapping block is the interesting one though.
> 
> The history of the RMP31Ja232959535 block would be interesting.  Could you
> send me the logs for it?

The whole log was too big to send.  I sent you just the grep of RMP31Ja232959535.  Let me know which time span you need to see and I'll scope the logs down to just that time frame.
Comment 3 Danny Auble 2013-02-01 04:17:11 MST
Could you send me the full log for this minute?

2013-01-31T23:40
Comment 4 Don Lipari 2013-02-06 01:56:27 MST
Found the machine messed up this morning.  I also see what looks like
a similar situation; note the two blocks that overlap.  This seems
to indicate that slurm tried to boot the 64K block while the 96K block
was in error, so I expect a 64K job was lost.

pscale RMP06Fe032206144 Free           -  M,T,T,T   64K seq[0000x1133,3000x4133]
pscale RMP05Fe225439754 Error(F        -  M,T,T,T   96K seq[0000x1233,3000x4233]

# seqsn3 / > sinfo -Rl
Wed Feb  6 06:57:25 2013
REASON               USER         TIMESTAMP           STATE  MIDPLANELIST
Block would not deal slurm(101)   2013-02-06T03:32:25 err    seq[0000x1233,3000x4233]

David Fox
Comment 5 Danny Auble 2013-02-06 04:40:51 MST
This appears related.  If there was a job scheduled to run on RMP06Fe032206144 before RMP05Fe225439754 was unable to free in a timely manner then yes it is the exact same problem.
Comment 6 Don Lipari 2013-02-06 10:16:01 MST
(In reply to comment #5)
> This appears related.  If there was a job scheduled to run on
> RMP06Fe032206144 before RMP05Fe225439754 was unable to free in a timely
> manner then yes it is the exact same problem.

It looks like job 55528 was scheduled onto the same block after step 55527.2 (IBM job 73870) failed:

sacct -j 55527,55528 -o JobID,block,State,ExitCode,start,end
       JobID          BlockID      State ExitCode               Start                 End 
------------ ---------------- ---------- -------- ------------------- ------------------- 
55527        RMP05Fe152628577     FAILED      1:0 2013-02-05T15:26:28 2013-02-05T16:01:46 
55527.0                        COMPLETED      0:0 2013-02-05T15:37:11 2013-02-05T15:44:24 
55527.1                           FAILED      1:0 2013-02-05T15:47:54 2013-02-05T16:01:31 
55527.2                           FAILED      1:0 2013-02-05T16:01:44 2013-02-05T16:01:44 
55528        RMP05Fe152628577     FAILED      0:1 2013-02-05T16:01:50 2013-02-05T16:21:57

2013-02-05 15:37:11.920 (INFO ) [0x4000190c7b0] RMP05Fe152628577:3693:morrone:ibm.runjob.server.Job: creating
2013-02-05 15:37:12.327 (INFO ) [0x4002f7ff0c0] RMP05Fe152628577:3693:ibm.database.job.Operations: Inserted job 73869 on 'RMP05Fe152628577' with qualifier '570620379' for 'morrone'
2013-02-05 15:37:12.328 (INFO ) [0x4002f7ff0c0] RMP05Fe152628577:3693:ibm.runjob.server.job.Container: assigned id 73869
2013-02-05 15:37:12.387 (INFO ) [0x4000190c7b0] RMP05Fe152628577:3693:ibm.runjob.server.job.Status: SETUP
2013-02-05 15:37:13.152 (INFO ) [0x4000190c7b0] RMP05Fe152628577:73869:ibm.runjob.server.job.Status: LOADING
2013-02-05 15:37:29.755 (INFO ) [0x4000d7ff0c0] RMP05Fe152628577:73869:ibm.runjob.server.job.Status: CLIENT STARTING
2013-02-05 15:37:30.550 (INFO ) [0x4000e3ff0c0] RMP05Fe152628577:73869:ibm.runjob.server.job.Status: CONTROL STARTING
2013-02-05 15:37:30.658 (INFO ) [0x4002e3ff0c0] RMP05Fe152628577:73869:ibm.runjob.server.job.Status: RUNNING
2013-02-05 15:44:14.319 (INFO ) [0x4000dfff0c0] RMP05Fe152628577:73869:ibm.runjob.server.job.Status: CLEANUP
2013-02-05 15:44:17.967 (INFO ) [0x4000c3ff0c0] RMP05Fe152628577:73869:ibm.database.job.Operations: Removing job 73869
2013-02-05 15:47:54.662 (INFO ) [0x4002ebff0c0] RMP05Fe152628577:3694:morrone:ibm.runjob.server.Job: creating
2013-02-05 15:47:54.780 (INFO ) [0x4000c3ff0c0] RMP05Fe152628577:3694:ibm.database.job.Operations: Inserted job 73870 on 'RMP05Fe152628577' with qualifier '570620379' for 'morrone'
2013-02-05 15:47:54.781 (INFO ) [0x4000c3ff0c0] RMP05Fe152628577:3694:ibm.runjob.server.job.Container: assigned id 73870
2013-02-05 15:47:54.835 (INFO ) [0x4000fbff0c0] RMP05Fe152628577:3694:ibm.runjob.server.job.Status: SETUP
2013-02-05 15:47:55.270 (INFO ) [0x4000c3ff0c0] RMP05Fe152628577:73870:ibm.runjob.server.job.Status: LOADING
2013-02-05 15:48:07.799 (INFO ) [0x4002cfff0c0] RMP05Fe152628577:73870:ibm.runjob.server.job.Status: CLIENT STARTING
2013-02-05 15:48:08.066 (INFO ) [0x4000ebff0c0] RMP05Fe152628577:73870:ibm.runjob.server.job.Status: CONTROL STARTING
2013-02-05 15:48:08.185 (INFO ) [0x4002cfff0c0] RMP05Fe152628577:73870:ibm.runjob.server.job.Status: RUNNING
2013-02-05 16:00:28.928 (INFO ) [0x4000dbff0c0] RMP05Fe152628577:73870:R03-ID-J05:ibm.runjob.server.job.ExitStatus: abnormal termination by signal 6 from rank 49808
2013-02-05 16:00:28.961 (INFO ) [0x4000dbff0c0] RMP05Fe152628577:73870:ibm.runjob.server.job.Signal: sent signal 9
2013-02-05 16:01:19.842 (INFO ) [0x4000d7ff0c0] RMP05Fe152628577:73870:ibm.runjob.server.job.Signal: sent signal 2
2013-02-05 16:01:19.868 (INFO ) [0x4000d7ff0c0] RMP05Fe152628577:73870:ibm.runjob.server.job.Signal: sent signal 2
2013-02-05 16:01:29.059 (WARN ) [0x4000d3ff0c0] RMP05Fe152628577:73870:ibm.runjob.server.block.Compute: 98304 nodes unavailable
2013-02-05 16:01:29.170 (INFO ) [0x4000f7ff0c0] RMP05Fe152628577:73870:ibm.database.job.Operations: Removing job 73870
2013-02-05 16:01:44.182 (INFO ) [0x4000d3ff0c0] RMP05Fe152628577:3695:morrone:ibm.runjob.server.Job: error code:    block is unavailable due to a previous failure
2013-02-05 16:01:44.182 (INFO ) [0x4000d3ff0c0] RMP05Fe152628577:3695:morrone:ibm.runjob.server.Job: error message: 'node R00-M0-N00-J00 is not available: Software Failure'

[2013-02-05T16:01:50-08:00] debug:  bluegene:submit_job: 55528 mode=512 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=(null) mps=192-192-192
[2013-02-05T16:01:50-08:00] number of blocks to check: 169 state 512 asking for 1572864-1572864 cpus
[2013-02-05T16:01:50-08:00] Don't need to look at myself RMP05Fe152628577 RMP05Fe152628577
[2013-02-05T16:01:50-08:00] we found one! RMP05Fe152628577
[2013-02-05T16:01:50-08:00] debug:  _find_best_block_match RMP05Fe152628577 <seq[0000x1233,3000x4233]>
[2013-02-05T16:01:50-08:00] debug:  512(0) can start job 55528 at 1360108910 on RMP05Fe152628577(seq[0000x1233,3000x4233]) 1
[2013-02-05T16:01:50-08:00] Queue start of job 55528 in BG block RMP05Fe152628577
[2013-02-05T16:01:50-08:00] debug:  Waiting for block RMP05Fe152628577 to free for job 55528.  1 thread(s) trying to free it
[2013-02-05T16:01:50-08:00] Warning: Note very large processing time from _slurm_rpc_allocate_resources: usec=2214650
[2013-02-05T16:01:50-08:00] sched: _slurm_rpc_allocate_resources JobId=55528 NodeList=seq[0000x1233,3000x4233] usec=2214650
[2013-02-05T16:01:50-08:00] Warning: Note very large processing time from dump_all_job_state: usec=2466824
[2013-02-05T16:01:51-08:00] debug:  Waiting for block RMP05Fe152628577 to free for job 55528.  1 thread(s) trying to free it
[2013-02-05T16:01:52-08:00] debug:  Waiting for block RMP05Fe152628577 to free for job 55528.  1 thread(s) trying to free it
[2013-02-05T16:01:53-08:00] debug:  _track_freeing_blocks: freed 0 of 1 for job 4294967294
Comment 7 Don Lipari 2013-02-08 10:12:40 MST
More info on this problem.  The theory is that SLURM does not handle the cases where blocks take an excessive amount of time to deallocate.  Here is a report of just such a case:

Thanks -- we will have to look into the NFS issue early this morning.

Paul we can close this PMR.

Don, we still need to dig into how Slurm deals with this case.  Slurm marked this block in error, but the system has no fatal events.  Maybe a follow-on job was also lost trying to use this block before it deallocated.

David Fox

On Fri, 8 Feb 2013, Pat McCarthy wrote:

> 
> The jobs on the IO node were having issues with NFS reads:
> 
> Time 2013-02-08 00:00:51.591497
> Message ID 000B0006
> Category Software_Error
> Component CIOS
> Severity WARN
> Block ID R5-llnl
> Location R50-ID-J01
> ECID 8196081140141596030BC670F589686700401040532A20726200000000000000
> Serial Number 74Y9670YL1CK2054061
> CPU 56
> Message The sysiod process seems to be stuck in a system call while running.
> Flight log entry= 16 timestamp= 11e655f6f5bd BG_STUCK_MSG: svc=4 type=4077 rank=8227 seq=6 msglen=72 jobid=74277 sysio::Pread64 syscall slow for seconds=300 _serviceId=140 length=1048576
> involvesfd:/proc/5211/fd/15->/nfs/tmp2/tangn/build_zeusDepot_rzuseq_polyhedral_rad_static/bin/kull
> 
> 
> ID 3816094
> Time 2013-02-08 09:18:14.163877
> Message ID 000B0006
> Category Software_Error
> Component CIOS
> Severity WARN
> Block ID R5-llnl
> Location R50-ID-J01
> ECID 8196081140141596030BC670F589686700401040532A20726200000000000000
> Serial Number 74Y9670YL1CK2054061
> CPU 0
> Message The sysiod process seems to be stuck in a system call while running.
> Flight log entry= 9f timestamp= 4194f48b784f BG_STUCK_MSG: svc=4 type=4081 rank=2051 seq=62 msglen=64 jobid=74367 sysio::Read syscall hung for seconds=900 _serviceId=654 length=6584
> involvesfd:/proc/10602/fd/15->/nfs/tmp2/crs/sequoia/hotqcd/DWF/2+1f/64nt8/IWASAKI+DSDR/b1.707/ls16/M1.8/ms0.05230/ml0.00112/evol0_llnl/rngs/ckpoint_rng.
> 
> Pat McCarthy
Comment 8 Danny Auble 2013-02-15 08:57:12 MST
*** Ticket 236 has been marked as a duplicate of this ticket. ***
Comment 9 Danny Auble 2013-02-25 09:37:57 MST
This is now fixed and will be in the next 2.5 release.  Commit 7af7d1082ee6525a1ea394417c01eff732108a3e