| Summary: | Inconsistent Block States on Sequoia | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | Bluegene select plugin | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 2.5.x | | |
| Hardware: | IBM BlueGene | | |
| OS: | Linux | | |
| Site: | LLNL | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
Slurm will never take a block out of an error state by itself. It always requires operator intervention.

The overlapping block is the interesting one though.

The happenings of the RMP31Ja232959535 block would be interesting. Could you send me the logs on it?

(In reply to comment #1)
> Slurm will never take a block out of an error state by itself. It always
> requires operator intervention.
>
> The overlapping block is the interesting one though.
>
> The happenings of the RMP31Ja232959535 block would be interesting. Could you
> send me the logs on it?

The whole log was too big to send, so I sent you just the grep of RMP31Ja232959535. Let me know which time span you need to see and I'll scope the logs down to just that time frame.

Could you send me the full log for this minute? 2013-01-31T23:40

Found the machine messed up this morning. I also see what looks like a similar situation; note the 2 blocks that overlap. This seems to indicate that Slurm tried to boot the 64K block while the 96K block was in error, so I expect a 64K job was lost.

pscale RMP06Fe032206144 Free    - M,T,T,T 64K seq[0000x1133,3000x4133]
pscale RMP05Fe225439754 Error(F - M,T,T,T 96K seq[0000x1233,3000x4233]

# seqsn3 / > sinfo -Rl
Wed Feb 6 06:57:25 2013
REASON               USER       TIMESTAMP           STATE MIDPLANELIST
Block would not deal slurm(101) 2013-02-06T03:32:25 err   seq[0000x1233,3000x4233]

David Fox

This appears related. If there was a job scheduled to run on RMP06Fe032206144 before RMP05Fe225439754 was unable to free in a timely manner, then yes, it is the exact same problem.

(In reply to comment #5)
> This appears related. If there was a job scheduled to run on
> RMP06Fe032206144 before RMP05Fe225439754 was unable to free in a timely
> manner then yes it is the exact same problem.
It looks like job 55528 was scheduled onto the same block after step 55527.2 (IBM job 73870) failed:

sacct -j 55527,55528 -o JobID,block,State,ExitCode,start,end
JobID        BlockID          State      ExitCode Start               End
------------ ---------------- ---------- -------- ------------------- -------------------
55527        RMP05Fe152628577 FAILED          1:0 2013-02-05T15:26:28 2013-02-05T16:01:46
55527.0                       COMPLETED       0:0 2013-02-05T15:37:11 2013-02-05T15:44:24
55527.1                       FAILED          1:0 2013-02-05T15:47:54 2013-02-05T16:01:31
55527.2                       FAILED          1:0 2013-02-05T16:01:44 2013-02-05T16:01:44
55528        RMP05Fe152628577 FAILED          0:1 2013-02-05T16:01:50 2013-02-05T16:21:57

2013-02-05 15:37:11.920 (INFO ) [0x4000190c7b0] RMP05Fe152628577:3693:morrone:ibm.runjob.server.Job: creating
2013-02-05 15:37:12.327 (INFO ) [0x4002f7ff0c0] RMP05Fe152628577:3693:ibm.database.job.Operations: Inserted job 73869 on 'RMP05Fe152628577' with qualifier '570620379' for 'morrone'
2013-02-05 15:37:12.328 (INFO ) [0x4002f7ff0c0] RMP05Fe152628577:3693:ibm.runjob.server.job.Container: assigned id 73869
2013-02-05 15:37:12.387 (INFO ) [0x4000190c7b0] RMP05Fe152628577:3693:ibm.runjob.server.job.Status: SETUP
2013-02-05 15:37:13.152 (INFO ) [0x4000190c7b0] RMP05Fe152628577:73869:ibm.runjob.server.job.Status: LOADING
2013-02-05 15:37:29.755 (INFO ) [0x4000d7ff0c0] RMP05Fe152628577:73869:ibm.runjob.server.job.Status: CLIENT STARTING
2013-02-05 15:37:30.550 (INFO ) [0x4000e3ff0c0] RMP05Fe152628577:73869:ibm.runjob.server.job.Status: CONTROL STARTING
2013-02-05 15:37:30.658 (INFO ) [0x4002e3ff0c0] RMP05Fe152628577:73869:ibm.runjob.server.job.Status: RUNNING
2013-02-05 15:44:14.319 (INFO ) [0x4000dfff0c0] RMP05Fe152628577:73869:ibm.runjob.server.job.Status: CLEANUP
2013-02-05 15:44:17.967 (INFO ) [0x4000c3ff0c0] RMP05Fe152628577:73869:ibm.database.job.Operations: Removing job 73869
2013-02-05 15:47:54.662 (INFO ) [0x4002ebff0c0] RMP05Fe152628577:3694:morrone:ibm.runjob.server.Job: creating
2013-02-05 15:47:54.780 (INFO ) [0x4000c3ff0c0] RMP05Fe152628577:3694:ibm.database.job.Operations: Inserted job 73870 on 'RMP05Fe152628577' with qualifier '570620379' for 'morrone'
2013-02-05 15:47:54.781 (INFO ) [0x4000c3ff0c0] RMP05Fe152628577:3694:ibm.runjob.server.job.Container: assigned id 73870
2013-02-05 15:47:54.835 (INFO ) [0x4000fbff0c0] RMP05Fe152628577:3694:ibm.runjob.server.job.Status: SETUP
2013-02-05 15:47:55.270 (INFO ) [0x4000c3ff0c0] RMP05Fe152628577:73870:ibm.runjob.server.job.Status: LOADING
2013-02-05 15:48:07.799 (INFO ) [0x4002cfff0c0] RMP05Fe152628577:73870:ibm.runjob.server.job.Status: CLIENT STARTING
2013-02-05 15:48:08.066 (INFO ) [0x4000ebff0c0] RMP05Fe152628577:73870:ibm.runjob.server.job.Status: CONTROL STARTING
2013-02-05 15:48:08.185 (INFO ) [0x4002cfff0c0] RMP05Fe152628577:73870:ibm.runjob.server.job.Status: RUNNING
2013-02-05 16:00:28.928 (INFO ) [0x4000dbff0c0] RMP05Fe152628577:73870:R03-ID-J05:ibm.runjob.server.job.ExitStatus: abnormal termination by signal 6 from rank 49808
2013-02-05 16:00:28.961 (INFO ) [0x4000dbff0c0] RMP05Fe152628577:73870:ibm.runjob.server.job.Signal: sent signal 9
2013-02-05 16:01:19.842 (INFO ) [0x4000d7ff0c0] RMP05Fe152628577:73870:ibm.runjob.server.job.Signal: sent signal 2
2013-02-05 16:01:19.868 (INFO ) [0x4000d7ff0c0] RMP05Fe152628577:73870:ibm.runjob.server.job.Signal: sent signal 2
2013-02-05 16:01:29.059 (WARN ) [0x4000d3ff0c0] RMP05Fe152628577:73870:ibm.runjob.server.block.Compute: 98304 nodes unavailable
2013-02-05 16:01:29.170 (INFO ) [0x4000f7ff0c0] RMP05Fe152628577:73870:ibm.database.job.Operations: Removing job 73870
2013-02-05 16:01:44.182 (INFO ) [0x4000d3ff0c0] RMP05Fe152628577:3695:morrone:ibm.runjob.server.Job: error code: block is unavailable due to a previous failure
2013-02-05 16:01:44.182 (INFO ) [0x4000d3ff0c0] RMP05Fe152628577:3695:morrone:ibm.runjob.server.Job: error message: 'node R00-M0-N00-J00 is not available: Software Failure'

[2013-02-05T16:01:50-08:00] debug: bluegene:submit_job: 55528 mode=512 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=(null) mps=192-192-192
[2013-02-05T16:01:50-08:00] number of blocks to check: 169 state 512 asking for 1572864-1572864 cpus
[2013-02-05T16:01:50-08:00] Don't need to look at myself RMP05Fe152628577 RMP05Fe152628577
[2013-02-05T16:01:50-08:00] we found one! RMP05Fe152628577
[2013-02-05T16:01:50-08:00] debug: _find_best_block_match RMP05Fe152628577 <seq[0000x1233,3000x4233]>
[2013-02-05T16:01:50-08:00] debug: 512(0) can start job 55528 at 1360108910 on RMP05Fe152628577(seq[0000x1233,3000x4233]) 1
[2013-02-05T16:01:50-08:00] Queue start of job 55528 in BG block RMP05Fe152628577
[2013-02-05T16:01:50-08:00] debug: Waiting for block RMP05Fe152628577 to free for job 55528. 1 thread(s) trying to free it
[2013-02-05T16:01:50-08:00] Warning: Note very large processing time from _slurm_rpc_allocate_resources: usec=2214650
[2013-02-05T16:01:50-08:00] sched: _slurm_rpc_allocate_resources JobId=55528 NodeList=seq[0000x1233,3000x4233] usec=2214650
[2013-02-05T16:01:50-08:00] Warning: Note very large processing time from dump_all_job_state: usec=2466824
[2013-02-05T16:01:51-08:00] debug: Waiting for block RMP05Fe152628577 to free for job 55528. 1 thread(s) trying to free it
[2013-02-05T16:01:52-08:00] debug: Waiting for block RMP05Fe152628577 to free for job 55528. 1 thread(s) trying to free it
[2013-02-05T16:01:53-08:00] debug: _track_freeing_blocks: freed 0 of 1 for job 4294967294

More info on this problem. The theory is that SLURM is not handling the cases where blocks take an excessive amount of time to de-allocate. Here is a report of just such a case:
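The theory above (Slurm polls a freeing block, gives up after a deadline, and marks the block in error without ever revisiting it) can be pictured as a simple timeout loop. This is a hypothetical sketch of the behavior described in the logs, not Slurm's actual `_post_block_free` code; the function and constant names are invented for illustration.

```python
import time

# Sketch of the "block would not deallocate" path described above:
# a freeing thread polls the block state and, if the block has not
# reached Free before a deadline, puts it into an error state.
# (Hypothetical simplification; not the real slurmctld logic.)

FREE_TIMEOUT = 3.0  # seconds; the real daemon waits far longer

def wait_for_free(get_state, timeout=FREE_TIMEOUT, poll=0.05):
    """Return the final slurm-side state: 'Free' on success,
    'Error' if the block is still deallocating at the deadline.

    get_state() models a bridge API query (e.g. it keeps returning
    'Term' while the mmcs shutdown is still running)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_state() == "Free":
            return "Free"
        time.sleep(poll)
    # Block is still in Term: mark it in error, as in the log above.
    return "Error"

# A block whose shutdown outlives the deadline ends up in Error,
# even though the control system may free it minutes later.
print(wait_for_free(lambda: "Term", timeout=0.3))  # → Error
```

In the reported failure the mmcs shutdown took roughly 22 minutes, so a loop of this shape hit its deadline at 23:40:03 and set the block to ERROR; the later Term-to-Free transition at 23:52:19 never cleared that error.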
Thanks -- we will have to look into the NFS issue early this morning.
Paul, we can close this PMR.
Don, we still need to dig into how Slurm deals with this case. Slurm marked this block in error, but the system shows no fatal events. A follow-on job may also have been lost trying to use this block before it de-allocated.
David Fox
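The overlapping-block hazard described earlier (a 64K block reported Free while a 96K block sharing its midplanes sat in Error) can be sketched as a simple scheduling check. This is a hypothetical illustration of the invariant at stake, not Slurm's actual select/bluegene plugin code; all names and the data model are invented.

```python
# Sketch: a candidate block should not be booted while any block
# sharing midplanes with it is stuck in an error state.
# (Hypothetical data model, not Slurm's internal block records.)

def overlaps(block_a, block_b):
    """Two blocks overlap if they share at least one midplane."""
    return bool(set(block_a["midplanes"]) & set(block_b["midplanes"]))

def usable_for_job(candidate, all_blocks):
    """Reject a candidate that is in error or overlaps an error block."""
    if candidate["state"] == "Error":
        return False
    return not any(b["state"] == "Error" and overlaps(candidate, b)
                   for b in all_blocks if b is not candidate)

# The situation from the smap output above: the 64K block reports
# Free, but it overlaps the 96K block in Error, so booting it would
# lose the job (midplane names abbreviated here).
blocks = [
    {"name": "RMP06Fe032206144", "state": "Free",
     "midplanes": ["seq0000", "seq1133"]},                    # 64K
    {"name": "RMP05Fe225439754", "state": "Error",
     "midplanes": ["seq0000", "seq1133", "seq1233"]},         # 96K
]
print(usable_for_job(blocks[0], blocks))  # → False
```

The point of the sketch is only that "Free" on the smaller block is not sufficient: the error state of any overlapping block has to be consulted before a boot is attempted.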
On Fri, 8 Feb 2013, Pat McCarthy wrote:
>
> The jobs on the IO node were having issues with NFS reads:
>
> Time 2013-02-08 00:00:51.591497
> Message ID 000B0006
> Category Software_Error
> Component CIOS
> Severity WARN
> Block ID R5-llnl
> Location R50-ID-J01
> ECID 8196081140141596030BC670F589686700401040532A20726200000000000000
> Serial Number 74Y9670YL1CK2054061
> CPU 56
> Message The sysiod process seems to be stuck in a system call while running.
> Flight log entry= 16 timestamp= 11e655f6f5bd BG_STUCK_MSG: svc=4 type=4077
> rank=8227 seq=6 msglen=72 jobid=74277 sysio::Pread64 syscall slow for
> seconds=300 _serviceId=140 length=1048576
> involvesfd:/proc/5211/fd/15->/nfs/tmp2/tangn/build_zeusDepot_rzuseq_polyhedral_rad_static/bin/kull
>
>
> ID 3816094
> Time 2013-02-08 09:18:14.163877
> Message ID 000B0006
> Category Software_Error
> Component CIOS
> Severity WARN
> Block ID R5-llnl
> Location R50-ID-J01
> ECID 8196081140141596030BC670F589686700401040532A20726200000000000000
> Serial Number 74Y9670YL1CK2054061
> CPU 0
> Message The sysiod process seems to be stuck in a system call while running.
> Flight log entry= 9f timestamp= 4194f48b784f BG_STUCK_MSG: svc=4 type=4081
> rank=2051 seq=62 msglen=64 jobid=74367 sysio::Read syscall hung for
> seconds=900 _serviceId=654 length=6584
> involvesfd:/proc/10602/fd/15->/nfs/tmp2/crs/sequoia/hotqcd/DWF/2+1f/64nt8/IWASAKI+DSDR/b1.707/ls16/M1.8/ms0.05230/ml0.00112/evol0_llnl/rngs/ckpoint_rng.
>
> Pat McCarthy
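The RAS messages quoted above embed the stuck job id and the stall duration inside a semi-structured BG_STUCK_MSG string. For triage it can help to pull those fields out mechanically; the parser below is an ad hoc sketch (not part of the BG/Q software stack, and the regex assumes only the `jobid=` and `seconds=` tokens seen in the quoted messages).

```python
import re

# Sketch: extract the job id and stall duration from a BG_STUCK_MSG
# line like the CIOS RAS entries quoted above. Ad hoc helper only.

STUCK_RE = re.compile(r"jobid=(?P<jobid>\d+).*?seconds=(?P<seconds>\d+)")

def parse_stuck(msg):
    """Return {'jobid': ..., 'seconds': ...} or None if no match."""
    m = STUCK_RE.search(msg)
    if not m:
        return None
    return {"jobid": int(m.group("jobid")),
            "seconds": int(m.group("seconds"))}

msg = ("rank=8227 seq=6 msglen=72 jobid=74277 sysio::Pread64 syscall "
       "slow for seconds=300 _serviceId=140 length=1048576")
print(parse_stuck(msg))  # → {'jobid': 74277, 'seconds': 300}
```

Run over both quoted entries, this would show IBM job 74277 stalled 300 s in Pread64 and job 74367 stalled 900 s in Read, matching the NFS-read explanation above.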
*** Ticket 236 has been marked as a duplicate of this ticket. ***

This is now fixed and will be in the next 2.5 release. Commit 7af7d1082ee6525a1ea394417c01eff732108a3e
Running the v2.5.1 version with the core fixes (i.e., almost the tagged v2.5.2 version) and seeing block discrepancies. It appears that when the system takes an overly long time to free a block, SLURM marks the block in an error state. Once the mmcs actually frees the block, SLURM never updates the block state. Furthermore, there are now two blocks with overlapping nodes: the original in an error state and a new one in a free state. Here are the relevant reports/logs. Note the IBM log indicates 22:05 for the block to de-allocate, and the slurm error timestamp is 12 minutes before the shutdown completed.

# seqsn3 /bgsys/logs/BGQ.sn > sinfo -o "%50E %H %N"
REASON                     TIMESTAMP           MIDPLANELIST
Block would not deallocate 2013-01-31T23:40:03 seq[0000x1133]

# seqsn3 /bgsys/logs/BGQ.sn > smap -cDb -n seq[0000x1133]
Fri Feb 1 06:43:53 2013
PARTITION BG_BLOCK         STATE   JOBID CONN    NODES MIDPLANELIST
pbatch    RMP31Ja232959535 Free    -     T,T,T,T 16K   seq[0000x0133]
pbatch    RMP31Ja230103286 Error(F -     M,T,T,T 32K   seq[0000x1133]

slurmctld.log:
[2013-01-31T23:30:00-08:00] debug: freeing block RMP31Ja230103286
[2013-01-31T23:30:00-08:00] bridge_destroy RMP31Ja230103286
[2013-01-31T23:30:00-08:00] debug: freeing block RMP31Ja230103286
[2013-01-31T23:30:00-08:00] bridge_block_free(RMP31Ja230103286): BlockFreeError (Most likely the block has pending action, should clear up shortly, check bridgeapi.log for further info) State = Ready
[2013-01-31T23:30:01-08:00] debug: state of Block RMP31Ja230103286 was Ready and now is Term
[2013-01-31T23:30:01-08:00] debug: Block RMP31Ja230103286 was in a ready state but is being freed. No job running.
[2013-01-31T23:40:03-08:00] 1 others are trying to destroy this block RMP31Ja230103286
[2013-01-31T23:40:03-08:00] _post_block_free: block RMP31Ja230103286 is not in state free (Term), putting it in error state.
[2013-01-31T23:40:03-08:00] Setting Block RMP31Ja230103286 to ERROR state. (reason: 'Block would not deallocate')
[2013-01-31T23:52:19-08:00] debug: state of Block RMP31Ja230103286 was Error(Term) and now is Free

bridgeapi.log:
[2013-01-31T23:30:00] ERROR: ibm.bgsched.Block: Block free request failed because block RMP31Ja230103286 has pending action [0x4000d4ff1f0]

mc_server log:
2013-01-31 23:01:25.760 (INFO ) [slurm@seqsn3:156] boot - cookie=101381507, bootId=RMP31Ja230103286, MachineController::configureAndBootBlock starting
2013-01-31 23:03:58.369 (INFO ) [slurm@seqsn3:156] boot - cookie=101381507, bootId=RMP31Ja230103286, MachineController::configureAndBootBlock finished, duration 152.609297, ok=1
2013-01-31 23:30:06.583 (INFO ) [slurm@seqsn3:156] shutdown - MachineController::shutdown() - starting - cookie=101381507, bootId=RMP31Ja230103286
2013-01-31 23:52:11.987 (INFO ) [slurm@seqsn3:156] shutdown - MachineController::shutdown() - did not lower pgood on these computes because of the global flag setting - cookie=101381507, bootId=RMP31Ja230103286
2013-01-31 23:52:11.987 (INFO ) [slurm@seqsn3:156] shutdown - MachineController::shutdown() - finished - cookie=101381507, bootId=RMP31Ja230103286, duration 1325.404124, ok=1
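The slurmctld log above shows the block going Error(Term) to Free at 23:52:19, yet the error reason was never cleared. The reconciliation step the reporter describes as missing can be sketched as follows; this is a hypothetical illustration of the expected behavior, with an invented data model, and does not claim to show the actual fix that was committed.

```python
# Sketch of the reconciliation the report says was missing: when a
# block that Slurm marked ERROR ("would not deallocate") is later
# observed Free by the control system, clear the stale error rather
# than leaving the block records inconsistent. Hypothetical model.

def reconcile(block, observed_state):
    """Update a slurm-side block record from a freshly polled state."""
    block["real_state"] = observed_state
    if block.get("slurm_state") == "Error" and observed_state == "Free":
        # The deallocate finally finished; drop the stale error.
        block["slurm_state"] = "Free"
        block.pop("reason", None)
    return block

blk = {"name": "RMP31Ja230103286", "slurm_state": "Error",
       "reason": "Block would not deallocate"}
reconcile(blk, "Free")   # models the 23:52:19 Term -> Free event
print(blk["slurm_state"])  # → Free
```

With a step of this shape, the 23:52:19 transition would have cleared the error automatically instead of requiring operator intervention, and the overlapping replacement block would not have been created against a stale record.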