Created attachment 115 [details]
gdb bt 20 highest threads

rzuseq's slurmctld became unresponsive last night. It had reached its thread limit. It is running the latest 2.4 branch as of Monday, Aug 27. Attached are the back traces of the 20 highest threads (until they started repeating).

[2012-08-28T17:59:52] debug: sched: Running job scheduler
[2012-08-28T18:00:01] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2012-08-28T18:00:01] debug: sched: begin reconfiguration
[2012-08-28T18:00:01] debug: No DownNodes
[2012-08-28T18:00:01] restoring original state of nodes
[2012-08-28T18:00:01] Reading the bluegene.conf file
[2012-08-28T18:00:01] Bridge api file set to /var/log/slurm/bridgeapi.log, verbose level 2
[2012-08-28T18:00:01] /etc/slurm/bluegene.conf unchanged
[2012-08-28T18:00:01] debug: bluegene: select_p_state_restore
[2012-08-28T18:00:01] debug: Updating partition uid access list
[2012-08-28T18:00:01] read_slurm_conf: backup_controller not specified.
[2012-08-28T18:00:01] debug: Spawning agent msg_type=1003
[2012-08-28T18:00:01] _slurm_rpc_reconfigure_controller: completed usec=2360
[2012-08-28T18:00:01] debug: power_save module disabled, SuspendTime < 0
[2012-08-28T18:00:01] debug: sched: Running job scheduler
[2012-08-28T18:00:01] debug: bluegene:submit_job: 27942 mode=256 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=(null) mps=1-1-1
[2012-08-28T18:00:01] number of blocks to check: 1 state 256 asking for 3456-3456 cpus
[2012-08-28T18:00:01] Don't need to look at myself RMP16Au110528152 RMP16Au110528152
[2012-08-28T18:00:01] ba_sub_block_in_bitmap: looking for 216 in a field of 253 (00011x33331,00101x33301,01001x33001,30001).
[2012-08-28T18:00:01] debug: _find_geo_table: requested sub-block larger than block
[2012-08-28T18:00:01] block RMP16Au110528152 does not have a placement for a sub-block of this size (3456)
[2012-08-28T18:00:01] going to create 1
[2012-08-28T18:00:01] trying with 1
[2012-08-28T18:00:01] adding RMP16Au110528152(rzuseq0000) Ready 0000 1111 512
[2012-08-28T18:00:01] allocate failure for 1 midplanes with free midplanes
[2012-08-28T18:00:01] error: This size 216 is unknown on this system
[2012-08-28T18:00:01] debug: _find_best_block_match none found
[2012-08-28T18:00:01] doing preemption
[2012-08-28T18:00:01] removing job 27892 running on RMP16Au110528152
[2012-08-28T18:00:01] number of blocks to check: 1 state 768 asking for 3456-3456 cpus
[2012-08-28T18:00:01] going to free block RMP16Au110528152 there are no jobs running. This will only happen if the cnodes went into error after no jobs were running.
[2012-08-28T18:00:01] We are freeing a block (RMP16Au110528152) that has at least 1 job.
[2012-08-28T21:01:06] debug: Processing RPC: REQUEST_STEP_COMPLETE for 27892.0 nodes 0-0 rc=0 uid=41557
[2012-08-28T22:34:51] server_thread_count over limit (256), waiting
Could you show me what block RMP16Au110528152 looked like?
Could you give me "print *bg_record" in thread 258?

#6  0x00000400003cfaa0 in _find_matching_block (block_list=0x13c38b88, job_ptr=0x4005c0025d8,
    slurm_block_bitmap=0x40090004560, request=0x4000f63d2c8, max_cpus=<value optimized out>,
    allow=0x4000f63d36c, check_image=<value optimized out>, overlap_check=<value optimized out>,
    overlapped_list=0x0, query_mode=768) at bg_job_place.c:402
        tmp_list = 0x13c38768
        found_record = <value optimized out>
        bg_record = 0x400900046a8
        itr = 0x40004000c60
        tmp_char = "\000\000\004\000\017c\371@\000\000\004\000\000A7x\000\000\004\000\000?\332X", '\000' <repeats 18 times>, "\004\000\017c\313\070\000\000\004\000\017c\313\200\000\000\004\000\017c\313X\000\000\004\000\017c\371\020", '\000' <repeats 18 times>, "\004\000\017c\324P\000\000\004\000\017c\313@\000\000\004\000\017c\313\220\000\000\004\000\017c\313l\000\000\004\000\017c\313`\000\000\000\000\020\036\223\260\000\000\004\000\017c\321@\000\000\004\000\017c\313\310\000\000\004\000\017c\313\060\000\000\000\000\020\036\223\260\000\000\004\000\017c\371 \000\000\000\000\020\035o\n\000\000\004\000\017c\323@\000\000\004\000\017c\313P\000\000\004\000\017c\313H\377\377\377\377\377\377\377\377\000\000\004\000\017c\313\020\000\000\000\000\020$(h\000\000\004\000\001\332\000\000\000\000\000\000\000\000\000E\000\000\004"...
        dim = <value optimized out>
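For anyone reproducing this on the still-hung controller, the requested value can be pulled from the live process with a gdb command file like the one below (the pid and thread/frame numbers are from this report; adjust as needed):

```
# gdb command file (run as: gdb -p <slurmctld-pid> -x cmds.gdb)
thread 258
frame 6
print *bg_record
detach
quit
```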
The slurmctld on rzuseq has since been restarted. The slurmctld on seq is still hung, so I can provide info only on seq.
Thanks for the great logs. A confirmed patch is here (241c9263f63f3f6a7081f25edc422c20676fd833) and will be in 2.4.3.