| Summary: | slurmctld deadlock | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | Bluegene select plugin | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | | |
| Priority: | --- | | |
| Version: | 2.4.x | | |
| Hardware: | IBM BlueGene | | |
| OS: | Linux | | |
| Site: | LLNL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | gdb bt 20 highest threads | | |
Could you show me what block RMP16Au110528152 looked like? Could you give me the output of
`print *bg_record`
in thread 258?
```
#6 0x00000400003cfaa0 in _find_matching_block (block_list=0x13c38b88, job_ptr=0x4005c0025d8, slurm_block_bitmap=0x40090004560,
request=0x4000f63d2c8, max_cpus=<value optimized out>, allow=0x4000f63d36c, check_image=<value optimized out>,
overlap_check=<value optimized out>, overlapped_list=0x0, query_mode=768) at bg_job_place.c:402
tmp_list = 0x13c38768
found_record = <value optimized out>
bg_record = 0x400900046a8
itr = 0x40004000c60
tmp_char = "\000\000\004\000\017c\371@\000\000\004\000\000A7x\000\000\004\000\000?\332X", '\000' <repeats 18 times>, "\004\000\017c\313\070\000\000\004\000\017c\313\200\000\000\004\000\017c\313X\000\000\004\000\017c\371\020", '\000' <repeats 18 times>, "\004\000\017c\324P\000\000\004\000\017c\313@\000\000\004\000\017c\313\220\000\000\004\000\017c\313l\000\000\004\000\017c\313`\000\000\000\000\020\036\223\260\000\000\004\000\017c\321@\000\000\004\000\017c\313\310\000\000\004\000\017c\313\060\000\000\000\000\020\036\223\260\000\000\004\000\017c\371 \000\000\000\000\020\035o\n\000\000\004\000\017c\323@\000\000\004\000\017c\313P\000\000\004\000\017c\313H\377\377\377\377\377\377\377\377\000\000\004\000\017c\313\020\000\000\000\000\020$(h\000\000\004\000\001\332\000\000\000\000\000\000\000\000\000E\000\000\004"...
dim = <value optimized out>
```
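For reference, the requested value could be pulled from a live gdb session along these lines (a sketch; the thread and frame numbers are the ones from the trace above, and the PID is a placeholder):

```
$ gdb -p <slurmctld_pid>
(gdb) thread 258
(gdb) frame 6
(gdb) print *bg_record
```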
The slurmctld on rzuseq has since been restarted. The slurmctld on seq is still hung, so I can provide info only on seq.

Thanks for the great logs. A confirmed patch is here (241c9263f63f3f6a7081f25edc422c20676fd833) and will be in 2.4.3.
Created attachment 115 [details] gdb bt 20 highest threads

rzuseq's slurmctld became unresponsive last night. It had reached its thread limit. It is running the latest 2.4 branch as of Monday, Aug 27. Attached are the back traces of the 20 highest threads (until they started repeating).

```
[2012-08-28T17:59:52] debug: sched: Running job scheduler
[2012-08-28T18:00:01] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2012-08-28T18:00:01] debug: sched: begin reconfiguration
[2012-08-28T18:00:01] debug: No DownNodes
[2012-08-28T18:00:01] restoring original state of nodes
[2012-08-28T18:00:01] Reading the bluegene.conf file
[2012-08-28T18:00:01] Bridge api file set to /var/log/slurm/bridgeapi.log, verbose level 2
[2012-08-28T18:00:01] /etc/slurm/bluegene.conf unchanged
[2012-08-28T18:00:01] debug: bluegene: select_p_state_restore
[2012-08-28T18:00:01] debug: Updating partition uid access list
[2012-08-28T18:00:01] read_slurm_conf: backup_controller not specified.
[2012-08-28T18:00:01] debug: Spawning agent msg_type=1003
[2012-08-28T18:00:01] _slurm_rpc_reconfigure_controller: completed usec=2360
[2012-08-28T18:00:01] debug: power_save module disabled, SuspendTime < 0
[2012-08-28T18:00:01] debug: sched: Running job scheduler
[2012-08-28T18:00:01] debug: bluegene:submit_job: 27942 mode=256 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=(null) mps=1-1-1
[2012-08-28T18:00:01] number of blocks to check: 1 state 256 asking for 3456-3456 cpus
[2012-08-28T18:00:01] Don't need to look at myself RMP16Au110528152 RMP16Au110528152
[2012-08-28T18:00:01] ba_sub_block_in_bitmap: looking for 216 in a field of 253 (00011x33331,00101x33301,01001x33001,30001).
[2012-08-28T18:00:01] debug: _find_geo_table: requested sub-block larger than block
[2012-08-28T18:00:01] block RMP16Au110528152 does not have a placement for a sub-block of this size (3456)
[2012-08-28T18:00:01] going to create 1
[2012-08-28T18:00:01] trying with 1
[2012-08-28T18:00:01] adding RMP16Au110528152(rzuseq0000) Ready 0000 1111 512
[2012-08-28T18:00:01] allocate failure for 1 midplanes with free midplanes
[2012-08-28T18:00:01] error: This size 216 is unknown on this system
[2012-08-28T18:00:01] debug: _find_best_block_match none found
[2012-08-28T18:00:01] doing preemption
[2012-08-28T18:00:01] removing job 27892 running on RMP16Au110528152
[2012-08-28T18:00:01] number of blocks to check: 1 state 768 asking for 3456-3456 cpus
[2012-08-28T18:00:01] going to free block RMP16Au110528152 there are no jobs running. This will only happen if the cnodes went into error after no jobs were running.
[2012-08-28T18:00:01] We are freeing a block (RMP16Au110528152) that has at least 1 job.
[2012-08-28T21:01:06] debug: Processing RPC: REQUEST_STEP_COMPLETE for 27892.0 nodes 0-0 rc=0 uid=41557
[2012-08-28T22:34:51] server_thread_count over limit (256), waiting
```