In December, we pulled a rack of nodes out of our small BGQ cluster (rzuseq) to use as spares since our support contract with IBM was expiring. Since we did so, the slurmctld process has been crashing pretty regularly. From the core files, it looks like it's crashing while trying to get a block to run the job on:

(gdb) where
#0  0x0000040000174dc0 in .raise () from /lib64/libc.so.6
#1  0x0000040000176d54 in .abort () from /lib64/libc.so.6
#2  0x000004000016b8ec in .__assert_fail_base () from /lib64/libc.so.6
#3  0x000004000016ba04 in .__assert_fail () from /lib64/libc.so.6
#4  0x000000001017eac4 in bit_test (b=0x40064092700, bit=3) at bitstring.c:229
#5  0x00000400003f8480 in _internal_removable_set_mps (level=4, bitmap=0x40064092700, coords=0x400291fcf98, mark=true, except=true) at ba_common.c:514
#6  0x00000400003f8308 in _internal_removable_set_mps (level=3, bitmap=0x40064092700, coords=0x400291fcf98, mark=true, except=true) at ba_common.c:502
#7  0x00000400003f8308 in _internal_removable_set_mps (level=2, bitmap=0x40064092700, coords=0x400291fcf98, mark=true, except=true) at ba_common.c:502
#8  0x00000400003f8308 in _internal_removable_set_mps (level=1, bitmap=0x40064092700, coords=0x400291fcf98, mark=true, except=true) at ba_common.c:502
#9  0x00000400003f8308 in _internal_removable_set_mps (level=0, bitmap=0x40064092700, coords=0x400291fcf98, mark=true, except=true) at ba_common.c:502
#10 0x00000400003fc9f4 in ba_set_removable_mps (bitmap=0x40064092700, except=true) at ba_common.c:1539
#11 0x00000400003df25c in create_dynamic_block (block_list=0x4007005f5e0, request=0x400291fd318, my_block_list=0x4007005f5e0, track_down_nodes=true) at bg_dynamic_block.c:210
#12 0x00000400003e4518 in _dynamically_request (block_list=0x4007005f5e0, blocks_added=0x400291fd6bc, request=0x400291fd318, user_req_nodes=0x0, query_mode=512) at bg_job_place.c:1105
#13 0x00000400003e526c in _find_best_block_match (block_list=0x4007005f5e0, blocks_added=0x400291fd6bc, job_ptr=0x400640919f0, slurm_block_bitmap=0x40064092700, min_nodes=1, max_nodes=1, req_nodes=1, found_bg_record=0x400291fd690, query_mode=512, avail_cpus=24672, exc_core_bitmap=0x0) at bg_job_place.c:1420
#14 0x00000400003e6a50 in submit_job (job_ptr=0x400640919f0, slurm_block_bitmap=0x40064092700, min_nodes=1, max_nodes=1, req_nodes=1, mode=0, preemptee_candidates=0x0, preemptee_job_list=0x400291fdd60, exc_core_bitmap=0x0) at bg_job_place.c:1915
#15 0x00000400003cee24 in select_p_job_test (job_ptr=0x400640919f0, bitmap=0x40064092700, min_nodes=1, max_nodes=1, req_nodes=1, mode=0, preemptee_candidates=0x0, preemptee_job_list=0x400291fdd60, exc_core_bitmap=0x0) at select_bluegene.c:1635
#16 0x00000000101a83f4 in select_g_job_test (job_ptr=0x400640919f0, bitmap=0x40064092700, min_nodes=1, max_nodes=1, req_nodes=1, mode=0, preemptee_candidates=0x0, preemptee_job_list=0x400291fdd60, exc_core_bitmap=0x0) at node_select.c:576
#17 0x00000000100cece8 in _pick_best_nodes (node_set_ptr=0x40064004ad0, node_set_size=1, select_bitmap=0x400291fdd80, job_ptr=0x400640919f0, part_ptr=0x4005c018520, min_nodes=1, max_nodes=1, req_nodes=1, test_only=false, preemptee_candidates=0x0, preemptee_job_list=0x400291fdd60, has_xand=false, exc_core_bitmap=0x0, resv_overlap=false) at node_scheduler.c:1854
#18 0x00000000100cd308 in _get_req_features (node_set_ptr=0x40064004ad0, node_set_size=1, select_bitmap=0x400291fdd80, job_ptr=0x400640919f0, part_ptr=0x4005c018520, min_nodes=1, max_nodes=1, req_nodes=1, test_only=false, preemptee_job_list=0x400291fdd60, can_reboot=true) at node_scheduler.c:1301
#19 0x00000000100d0730 in select_nodes (job_ptr=0x400640919f0, test_only=false, select_node_bitmap=0x0, unavail_node_str=0x0, err_msg=0x400291fe3f8) at node_scheduler.c:2361
#20 0x00000000100773fc in _select_nodes_parts (job_ptr=0x400640919f0, test_only=false, select_node_bitmap=0x0, err_msg=0x400291fe3f8) at job_mgr.c:4197
#21 0x0000000010077e64 in job_allocate (job_specs=0x40064001a90, immediate=0, will_run=0, resp=0x0, allocate=1, submit_uid=3495, job_pptr=0x400291fe400, err_msg=0x400291fe3f8, protocol_version=7936) at job_mgr.c:4401
#22 0x00000000100ebb18 in _slurm_rpc_allocate_resources (msg=0x400291fe6d8) at proc_req.c:1134
#23 0x00000000100e8b74 in slurmctld_req (msg=0x400291fe6d8, arg=0x4002c000e50) at proc_req.c:308
#24 0x000000001004b59c in _service_connection (arg=0x4002c000e50) at controller.c:1133
#25 0x00000400000dc6fc in .start_thread () from /lib64/libpthread.so.0
#26 0x000004000023ae1c in .__clone () from /lib64/libc.so.6
(gdb)

We updated slurm.conf and ran 'scontrol reconfigure' after taking the nodes out, but I don't think we restarted any of the IBM runjob, etc. daemons. I was wondering if you had any suggestions about specific things that we might look at restarting, or if we should just restart everything.
Ryan, could you send me the slurmctld log from when this happened?
Which midplanes you actually removed would be of interest as well, along with how you removed them. If you remove nodes/midplanes from slurm.conf, 'scontrol reconfigure' will not suffice; you need to restart all the daemons. Though I wouldn't expect a segfault: I would expect the IBM database to report to Slurm that the midplanes disappeared. Since you say the segfault is happening all the time, I am wondering if there is a disconnect somewhere. The bridgeapi log would be handy as well.
Created attachment 5992 [details]
slurmctld.log from time of crash

Here's the slurmctld.log that includes the time of the crash. There were actually two crashes in the time frame covered by this log, with core files generated at 1/22 22:16 and 1/22 23:09. There haven't been any messages in the bridgeapi.log since 1/10, so I don't think there's anything useful there.
The change in slurm.conf to remove the nodes was:

tyche@day36:svn diff -r 26830:26735 slurm.conf
Index: slurm.conf
===================================================================
--- slurm.conf (revision 26830)
+++ slurm.conf (revision 26735)
@@ -74,10 +74,10 @@
 # COMPUTE NODES
 FrontendName=rzuseqlac[3-4]
 FrontendAddr=erzuseqlac[3-4]
 NodeName=DEFAULT Procs=8192 RealMemory=2097152 State=UNKNOWN
-NodeName=rzuseq[0000,0010]
+NodeName=rzuseq[0000,0010x0011]
 NodeName=rzuseq0001 State=DOWN
 Include /etc/slurm/slurm.conf.updates
 PartitionName=pbatch Nodes=rzuseq0010 Default=No State=UP Shared=FORCE DefaultTime=60 MaxTime=24:00:00 MaxNodes=256 AllowGroups=langer1,leon,bgldev
-PartitionName=pall Nodes=rzuseq[0000,0010] Default=No State=Down Shared=FORCE DefaultTime=60 MaxTime=4:00:00 MaxNodes=1536 AllowGroups=bgldev
+PartitionName=pall Nodes=rzuseq[0000,0010,0011] Default=No State=Down Shared=FORCE DefaultTime=60 MaxTime=4:00:00 MaxNodes=1536 AllowGroups=bgldev
tyche@day36:
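In case the bracketed notation above is unclear (my summary of it, so treat the details as approximate): in Slurm's BlueGene support, midplane names carry their torus coordinates as digits, and an "AxB" range inside brackets expands to the full rectangular prism of coordinates between the two corners. Since 0010 and 0011 differ only in the last coordinate, the two spellings below describe the same pair of midplanes:

```
# Illustrative slurm.conf fragment, not taken from this site's config:
# the prism range 0010x0011 covers exactly the midplanes 0010 and 0011.
NodeName=rzuseq[0010x0011]
NodeName=rzuseq[0010,0011]
```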
Thanks Ryan. From bg_console, could you send me the output of 'list bgqnode'?
The two crashes seem to have died in different places. Could you send the backtrace for the other one? Could you send your bluegene.conf as well?

It is clear from the logs that rzuseq0011 is not found and cannot be drained:

[2018-01-22T23:30:29.589] error: find_node_record: lookup failure for rzuseq0011
[2018-01-22T23:30:29.590] error: find_node_record: lookup failure for rzuseq0011
[2018-01-22T23:30:29.590] error: drain_nodes: node rzuseq0011 does not exist
[2018-01-22T23:30:29.651] error: find_node_record: lookup failure for rzuseq0011
[2018-01-22T23:30:29.651] error: find_node_record: lookup failure for rzuseq0011
[2018-01-22T23:30:29.651] error: drain_nodes: node rzuseq0011 does not exist

I notice you also have:

NodeName=rzuseq0001 State=DOWN

Is this correct?
Created attachment 5996 [details]
output of bg_console list bgqnodes
Created attachment 5997 [details]
bluegene.conf
The backtrace in the first comment was from the 22:16 core file. The backtrace from the 23:09 core file looks like:

(gdb) where
#0  0x0000040000174dc0 in .raise () from /lib64/libc.so.6
#1  0x0000040000176d54 in .abort () from /lib64/libc.so.6
#2  0x000004000016b8ec in .__assert_fail_base () from /lib64/libc.so.6
#3  0x000004000016ba04 in .__assert_fail () from /lib64/libc.so.6
#4  0x000000001017eac4 in bit_test (b=0x400780a2da0, bit=3) at bitstring.c:229
#5  0x00000400003f8480 in _internal_removable_set_mps (level=4, bitmap=0x400780a2da0, coords=0x400290fcf98, mark=true, except=true) at ba_common.c:514
#6  0x00000400003f8308 in _internal_removable_set_mps (level=3, bitmap=0x400780a2da0, coords=0x400290fcf98, mark=true, except=true) at ba_common.c:502
#7  0x00000400003f8308 in _internal_removable_set_mps (level=2, bitmap=0x400780a2da0, coords=0x400290fcf98, mark=true, except=true) at ba_common.c:502
#8  0x00000400003f8308 in _internal_removable_set_mps (level=1, bitmap=0x400780a2da0, coords=0x400290fcf98, mark=true, except=true) at ba_common.c:502
#9  0x00000400003f8308 in _internal_removable_set_mps (level=0, bitmap=0x400780a2da0, coords=0x400290fcf98, mark=true, except=true) at ba_common.c:502
#10 0x00000400003fc9f4 in ba_set_removable_mps (bitmap=0x400780a2da0, except=true) at ba_common.c:1539
#11 0x00000400003df25c in create_dynamic_block (block_list=0x1080bd70, request=0x400290fd318, my_block_list=0x1080bd70, track_down_nodes=true) at bg_dynamic_block.c:210
#12 0x00000400003e4518 in _dynamically_request (block_list=0x1080bd70, blocks_added=0x400290fd6bc, request=0x400290fd318, user_req_nodes=0x0, query_mode=512) at bg_job_place.c:1105
#13 0x00000400003e526c in _find_best_block_match (block_list=0x1080bd70, blocks_added=0x400290fd6bc, job_ptr=0x400780a2530, slurm_block_bitmap=0x400780a2da0, min_nodes=1, max_nodes=1, req_nodes=1, found_bg_record=0x400290fd690, query_mode=512, avail_cpus=26848, exc_core_bitmap=0x0) at bg_job_place.c:1420
#14 0x00000400003e6a50 in submit_job (job_ptr=0x400780a2530, slurm_block_bitmap=0x400780a2da0, min_nodes=1, max_nodes=1, req_nodes=1, mode=0, preemptee_candidates=0x0, preemptee_job_list=0x400290fdd60, exc_core_bitmap=0x0) at bg_job_place.c:1915
#15 0x00000400003cee24 in select_p_job_test (job_ptr=0x400780a2530, bitmap=0x400780a2da0, min_nodes=1, max_nodes=1, req_nodes=1, mode=0, preemptee_candidates=0x0, preemptee_job_list=0x400290fdd60, exc_core_bitmap=0x0) at select_bluegene.c:1635
#16 0x00000000101a83f4 in select_g_job_test (job_ptr=0x400780a2530, bitmap=0x400780a2da0, min_nodes=1, max_nodes=1, req_nodes=1, mode=0, preemptee_candidates=0x0, preemptee_job_list=0x400290fdd60, exc_core_bitmap=0x0) at node_select.c:576
#17 0x00000000100cece8 in _pick_best_nodes (node_set_ptr=0x40078075ba0, node_set_size=1, select_bitmap=0x400290fdd80, job_ptr=0x400780a2530, part_ptr=0x10821ea0, min_nodes=1, max_nodes=1, req_nodes=1, test_only=false, preemptee_candidates=0x0, preemptee_job_list=0x400290fdd60, has_xand=false, exc_core_bitmap=0x0, resv_overlap=false) at node_scheduler.c:1854
#18 0x00000000100cd308 in _get_req_features (node_set_ptr=0x40078075ba0, node_set_size=1, select_bitmap=0x400290fdd80, job_ptr=0x400780a2530, part_ptr=0x10821ea0, min_nodes=1, max_nodes=1, req_nodes=1, test_only=false, preemptee_job_list=0x400290fdd60, can_reboot=true) at node_scheduler.c:1301
#19 0x00000000100d0730 in select_nodes (job_ptr=0x400780a2530, test_only=false, select_node_bitmap=0x0, unavail_node_str=0x0, err_msg=0x400290fe3f8) at node_scheduler.c:2361
#20 0x00000000100773fc in _select_nodes_parts (job_ptr=0x400780a2530, test_only=false, select_node_bitmap=0x0, err_msg=0x400290fe3f8) at job_mgr.c:4197
#21 0x0000000010077e64 in job_allocate (job_specs=0x400780a2100, immediate=0, will_run=0, resp=0x0, allocate=1, submit_uid=53591, job_pptr=0x400290fe400, err_msg=0x400290fe3f8, protocol_version=7936) at job_mgr.c:4401
#22 0x00000000100ebb18 in _slurm_rpc_allocate_resources (msg=0x400290fe6d8) at proc_req.c:1134
#23 0x00000000100e8b74 in slurmctld_req (msg=0x400290fe6d8, arg=0x4002c000c60) at proc_req.c:308
#24 0x000000001004b59c in _service_connection (arg=0x4002c000c60) at controller.c:1133
#25 0x00000400000dc6fc in .start_thread () from /lib64/libpthread.so.0
#26 0x000004000023ae1c in .__clone () from /lib64/libc.so.6
(gdb)
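Both traces abort in the same place: frame #4 is an assertion failure inside bit_test(), presumably because the recursive coordinate walk in _internal_removable_set_mps() asks about a midplane coordinate (bit=3) that the shrunken node bitmap no longer covers. A minimal sketch of that style of bounds-asserted bitstring (hypothetical names, not Slurm's actual implementation):

```c
/* Toy bitstring with its size recorded at allocation time, to illustrate
 * why testing a bit beyond the allocated size calls abort() via assert().
 * my_bitstr_t, my_bit_alloc, my_bit_set and my_bit_test are illustrative
 * names, not Slurm's real API. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    size_t   nbits;   /* size recorded when the bitmap was created */
    uint64_t bits[4]; /* fixed storage, enough for this sketch */
} my_bitstr_t;

static my_bitstr_t *my_bit_alloc(size_t nbits)
{
    assert(nbits <= 256);            /* keep within the fixed storage */
    my_bitstr_t *b = calloc(1, sizeof(*b));
    b->nbits = nbits;
    return b;
}

static void my_bit_set(my_bitstr_t *b, size_t bit)
{
    assert(bit < b->nbits);          /* same bounds check as my_bit_test */
    b->bits[bit / 64] |= (uint64_t)1 << (bit % 64);
}

static int my_bit_test(const my_bitstr_t *b, size_t bit)
{
    /* If a caller walks coordinates for a midplane the bitmap no longer
     * covers, this assert fails and the process aborts, matching
     * frames #0-#4 (assert -> abort -> raise) in the traces above. */
    assert(bit < b->nbits);
    return (int)((b->bits[bit / 64] >> (bit % 64)) & 1);
}
```

In-range lookups behave normally; the abort only appears when the coordinate space the code walks and the bitmap's recorded size disagree, which is the kind of mismatch that removing midplanes without restarting every daemon could leave behind.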
Also, it is correct that we have 'NodeName=rzuseq0001 State=DOWN' in slurm.conf. I'm not sure why that's there, though; possibly it should be rzuseq0011...
Actually, the NodeName=rzuseq0001 State=DOWN line has been there since at least 2013, so I don't think it's the issue.
Interesting.

Since 0011 is non-existent, why was it added?

-NodeName=rzuseq[0000,0010]
+NodeName=rzuseq[0000,0010x0011]

I am guessing you only have 0000,0010 then? 2 midplanes in 2 racks?
(In reply to Danny Auble from comment #12)
> Interesting.
>
> Since 0011 is non-existent why was it added?
>
> -NodeName=rzuseq[0000,0010]
> +NodeName=rzuseq[0000,0010x0011]
>
> I am guessing you only have 0000,0010 then? 2 midplanes in 2 racks?

That's just the order things came out in my svn diff: I diffed -r 26830:26735 (newer revision first), so the "+" lines are the older config. The current slurm.conf is the one with NodeName=rzuseq[0000,0010]. I'll attach it too.
Created attachment 5998 [details]
rzuseq slurm.conf
Ah, well that is different then :). I read the + as the new line. Thanks for the actual .conf.
I am wondering if things would be better if you put it back in as DOWN. It is clear something still thinks it is there. I am guessing the code really wants geometric prisms.
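If that is the route you take, a guess at the resulting slurm.conf lines (these exact lines are my assumption, not taken from the attached configs): restore the removed midplane so the coordinate space is a full prism again, while keeping it out of service:

```
# Hypothetical fragment: re-add rzuseq0011 so the machine is a complete
# prism again, but mark it DOWN so nothing is scheduled on it.
NodeName=rzuseq[0000,0010x0011]
NodeName=rzuseq0011 State=DOWN
```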
(In reply to Danny Auble from comment #16)
> I am wondering if you put it back in as DOWN things would be better. It is
> clear something thinks it is there. I am guessing the code really wants
> geometric prisms.

That sounds reasonable. I'll do that, restart the slurmctld and slurmd daemons, and cross my fingers.
I'll ping you next week and see if this fixes the issue.
Ryan, any failures since the configuration change?
(In reply to Danny Auble from comment #19)
> Ryan any failures since the configuration change?

No crashes since last Tuesday. Looks pretty good.
That is great. It makes sense that the perfect prism config would make a difference. If you are OK to close this, I am as well. I don't think I really want to chase down why the non-prism config wasn't happy if I don't have to, as I doubt that will be a common standing config.

Let me know what you think. We can always reopen if need be.
(In reply to Danny Auble from comment #21)
> That is great. It makes sense the perfect prism config would make a
> difference. If you are ok to close this I am as well. I don't think I
> really want to chase down why the non-prism config wasn't happy if I don't
> have to as I doubt that will be a common standing config.
>
> Let me know what you think. We can always reopen if need be.

Definitely feel free to close this. It was, hopefully, a one-off situation, and I'm planning to avoid updating the BGQs as much as possible.
I agree.