Created attachment 115 [details]
gdb bt 20 highest threads

rzuseq's slurmctld became unresponsive last night. It had reached its thread limit. It is running the latest 2.4 branch as of Monday, Aug 27. Attached are the back traces of the 20 highest threads (until they started repeating).

[2012-08-28T17:59:52] debug: sched: Running job scheduler
[2012-08-28T18:00:01] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2012-08-28T18:00:01] debug: sched: begin reconfiguration
[2012-08-28T18:00:01] debug: No DownNodes
[2012-08-28T18:00:01] restoring original state of nodes
[2012-08-28T18:00:01] Reading the bluegene.conf file
[2012-08-28T18:00:01] Bridge api file set to /var/log/slurm/bridgeapi.log, verbose level 2
[2012-08-28T18:00:01] /etc/slurm/bluegene.conf unchanged
[2012-08-28T18:00:01] debug: bluegene: select_p_state_restore
[2012-08-28T18:00:01] debug: Updating partition uid access list
[2012-08-28T18:00:01] read_slurm_conf: backup_controller not specified.
[2012-08-28T18:00:01] debug: Spawning agent msg_type=1003
[2012-08-28T18:00:01] _slurm_rpc_reconfigure_controller: completed usec=2360
[2012-08-28T18:00:01] debug: power_save module disabled, SuspendTime < 0
[2012-08-28T18:00:01] debug: sched: Running job scheduler
[2012-08-28T18:00:01] debug: bluegene:submit_job: 27942 mode=256 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=(null) mps=1-1-1
[2012-08-28T18:00:01] number of blocks to check: 1 state 256 asking for 3456-3456 cpus
[2012-08-28T18:00:01] Don't need to look at myself RMP16Au110528152 RMP16Au110528152
[2012-08-28T18:00:01] ba_sub_block_in_bitmap: looking for 216 in a field of 253 (00011x33331,00101x33301,01001x33001,30001).
[2012-08-28T18:00:01] debug: _find_geo_table: requested sub-block larger than block
[2012-08-28T18:00:01] block RMP16Au110528152 does not have a placement for a sub-block of this size (3456)
[2012-08-28T18:00:01] going to create 1
[2012-08-28T18:00:01] trying with 1
[2012-08-28T18:00:01] adding RMP16Au110528152(rzuseq0000) Ready 0000 1111 512
[2012-08-28T18:00:01] allocate failure for 1 midplanes with free midplanes
[2012-08-28T18:00:01] error: This size 216 is unknown on this system
[2012-08-28T18:00:01] debug: _find_best_block_match none found
[2012-08-28T18:00:01] doing preemption
[2012-08-28T18:00:01] removing job 27892 running on RMP16Au110528152
[2012-08-28T18:00:01] number of blocks to check: 1 state 768 asking for 3456-3456 cpus
[2012-08-28T18:00:01] going to free block RMP16Au110528152 there are no jobs running. This will only happen if the cnodes went into error after no jobs were running.
[2012-08-28T18:00:01] We are freeing a block (RMP16Au110528152) that has at least 1 job.
[2012-08-28T21:01:06] debug: Processing RPC: REQUEST_STEP_COMPLETE for 27892.0 nodes 0-0 rc=0 uid=41557
[2012-08-28T22:34:51] server_thread_count over limit (256), waiting
Could you show me what block RMP16Au110528152 looked like?
Could you give me "print *bg_record" in thread 258?

#6  0x00000400003cfaa0 in _find_matching_block (block_list=0x13c38b88, job_ptr=0x4005c0025d8,
    slurm_block_bitmap=0x40090004560, request=0x4000f63d2c8, max_cpus=<value optimized out>,
    allow=0x4000f63d36c, check_image=<value optimized out>, overlap_check=<value optimized out>,
    overlapped_list=0x0, query_mode=768) at bg_job_place.c:402
        tmp_list = 0x13c38768
        found_record = <value optimized out>
        bg_record = 0x400900046a8
        itr = 0x40004000c60
        tmp_char = "\000\000\004\000\017c\371@\000\000\004\000\000A7x\000\000\004\000\000?\332X", '\000' <repeats 18 times>, "\004\000\017c\313\070\000\000\004\000\017c\313\200\000\000\004\000\017c\313X\000\000\004\000\017c\371\020", '\000' <repeats 18 times>, "\004\000\017c\324P\000\000\004\000\017c\313@\000\000\004\000\017c\313\220\000\000\004\000\017c\313l\000\000\004\000\017c\313`\000\000\000\000\020\036\223\260\000\000\004\000\017c\321@\000\000\004\000\017c\313\310\000\000\004\000\017c\313\060\000\000\000\000\020\036\223\260\000\000\004\000\017c\371 \000\000\000\000\020\035o\n\000\000\004\000\017c\323@\000\000\004\000\017c\313P\000\000\004\000\017c\313H\377\377\377\377\377\377\377\377\000\000\004\000\017c\313\020\000\000\000\000\020$(h\000\000\004\000\001\332\000\000\000\000\000\000\000\000\000E\000\000\004"...
        dim = <value optimized out>
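For anyone reproducing this on the still-hung controller, the requested value can be pulled from the live process with a gdb command file like the one below (the pid and thread/frame numbers are from this report; adjust as needed):

```
# gdb command file (run as: gdb -p <slurmctld-pid> -x cmds.gdb)
thread 258
frame 6
print *bg_record
detach
quit
```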
The slurmctld on rzuseq has since been restarted. The slurmctld on seq is still hung, so I can provide info only on seq.
Thanks for the great logs. A confirmed patch is here (241c9263f63f3f6a7081f25edc422c20676fd833) and will be in 2.4.3.