Ticket 4668

Summary: slurmctld crashing while trying to get block for job
Product: Slurm
Reporter: Ryan Day <day36>
Component: slurmctld
Assignee: Danny Auble <da>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Version: 17.02.5
Hardware: IBM BlueGene
OS: Linux
Site: LLNL
Attachments:
  - slurmctld.log from time of crash
  - output of bg_console list bgqnodes
  - bluegene.conf
  - rzuseq slurm.conf

Description Ryan Day 2018-01-23 11:17:02 MST
In December, we pulled a rack of nodes out of our small BGQ cluster (rzuseq) to use as spares since our support contract with IBM was expiring. Since we did so, the slurmctld process has been crashing pretty regularly. From the core files, it looks like it's crashing while trying to get a block to run the job on:

(gdb) where
#0  0x0000040000174dc0 in .raise () from /lib64/libc.so.6
#1  0x0000040000176d54 in .abort () from /lib64/libc.so.6
#2  0x000004000016b8ec in .__assert_fail_base () from /lib64/libc.so.6
#3  0x000004000016ba04 in .__assert_fail () from /lib64/libc.so.6
#4  0x000000001017eac4 in bit_test (b=0x40064092700, bit=3) at bitstring.c:229
#5  0x00000400003f8480 in _internal_removable_set_mps (level=4, bitmap=0x40064092700, coords=0x400291fcf98, mark=true, except=true) at ba_common.c:514
#6  0x00000400003f8308 in _internal_removable_set_mps (level=3, bitmap=0x40064092700, coords=0x400291fcf98, mark=true, except=true) at ba_common.c:502
#7  0x00000400003f8308 in _internal_removable_set_mps (level=2, bitmap=0x40064092700, coords=0x400291fcf98, mark=true, except=true) at ba_common.c:502
#8  0x00000400003f8308 in _internal_removable_set_mps (level=1, bitmap=0x40064092700, coords=0x400291fcf98, mark=true, except=true) at ba_common.c:502
#9  0x00000400003f8308 in _internal_removable_set_mps (level=0, bitmap=0x40064092700, coords=0x400291fcf98, mark=true, except=true) at ba_common.c:502
#10 0x00000400003fc9f4 in ba_set_removable_mps (bitmap=0x40064092700, except=true) at ba_common.c:1539
#11 0x00000400003df25c in create_dynamic_block (block_list=0x4007005f5e0, request=0x400291fd318, my_block_list=0x4007005f5e0, track_down_nodes=true)
    at bg_dynamic_block.c:210
#12 0x00000400003e4518 in _dynamically_request (block_list=0x4007005f5e0, blocks_added=0x400291fd6bc, request=0x400291fd318, user_req_nodes=0x0, query_mode=512)
    at bg_job_place.c:1105
#13 0x00000400003e526c in _find_best_block_match (block_list=0x4007005f5e0, blocks_added=0x400291fd6bc, job_ptr=0x400640919f0, slurm_block_bitmap=0x40064092700, 
    min_nodes=1, max_nodes=1, req_nodes=1, found_bg_record=0x400291fd690, query_mode=512, avail_cpus=24672, exc_core_bitmap=0x0) at bg_job_place.c:1420
#14 0x00000400003e6a50 in submit_job (job_ptr=0x400640919f0, slurm_block_bitmap=0x40064092700, min_nodes=1, max_nodes=1, req_nodes=1, mode=0, 
    preemptee_candidates=0x0, preemptee_job_list=0x400291fdd60, exc_core_bitmap=0x0) at bg_job_place.c:1915
#15 0x00000400003cee24 in select_p_job_test (job_ptr=0x400640919f0, bitmap=0x40064092700, min_nodes=1, max_nodes=1, req_nodes=1, mode=0, preemptee_candidates=0x0, 
    preemptee_job_list=0x400291fdd60, exc_core_bitmap=0x0) at select_bluegene.c:1635
#16 0x00000000101a83f4 in select_g_job_test (job_ptr=0x400640919f0, bitmap=0x40064092700, min_nodes=1, max_nodes=1, req_nodes=1, mode=0, preemptee_candidates=0x0, 
    preemptee_job_list=0x400291fdd60, exc_core_bitmap=0x0) at node_select.c:576
#17 0x00000000100cece8 in _pick_best_nodes (node_set_ptr=0x40064004ad0, node_set_size=1, select_bitmap=0x400291fdd80, job_ptr=0x400640919f0, part_ptr=0x4005c018520, 
    min_nodes=1, max_nodes=1, req_nodes=1, test_only=false, preemptee_candidates=0x0, preemptee_job_list=0x400291fdd60, has_xand=false, exc_core_bitmap=0x0, 
    resv_overlap=false) at node_scheduler.c:1854
#18 0x00000000100cd308 in _get_req_features (node_set_ptr=0x40064004ad0, node_set_size=1, select_bitmap=0x400291fdd80, job_ptr=0x400640919f0, part_ptr=0x4005c018520, 
    min_nodes=1, max_nodes=1, req_nodes=1, test_only=false, preemptee_job_list=0x400291fdd60, can_reboot=true) at node_scheduler.c:1301
#19 0x00000000100d0730 in select_nodes (job_ptr=0x400640919f0, test_only=false, select_node_bitmap=0x0, unavail_node_str=0x0, err_msg=0x400291fe3f8)
    at node_scheduler.c:2361
#20 0x00000000100773fc in _select_nodes_parts (job_ptr=0x400640919f0, test_only=false, select_node_bitmap=0x0, err_msg=0x400291fe3f8) at job_mgr.c:4197
#21 0x0000000010077e64 in job_allocate (job_specs=0x40064001a90, immediate=0, will_run=0, resp=0x0, allocate=1, submit_uid=3495, job_pptr=0x400291fe400, 
    err_msg=0x400291fe3f8, protocol_version=7936) at job_mgr.c:4401
#22 0x00000000100ebb18 in _slurm_rpc_allocate_resources (msg=0x400291fe6d8) at proc_req.c:1134
#23 0x00000000100e8b74 in slurmctld_req (msg=0x400291fe6d8, arg=0x4002c000e50) at proc_req.c:308
#24 0x000000001004b59c in _service_connection (arg=0x4002c000e50) at controller.c:1133
#25 0x00000400000dc6fc in .start_thread () from /lib64/libpthread.so.0
#26 0x000004000023ae1c in .__clone () from /lib64/libc.so.6
(gdb)

We updated slurm.conf and ran 'scontrol reconfigure' after taking the nodes out, but I don't think we restarted any of the IBM runjob, etc. daemons. I was wondering if you had any suggestions about specific things that we might look at restarting or if we should just restart everything.
Comment 1 Danny Auble 2018-01-23 11:30:46 MST
Ryan, could you give me the slurmctld log from when this happened?
Comment 2 Danny Auble 2018-01-23 11:36:20 MST
Which midplanes you actually removed would be of interest as well.

Meaning, how did you do it?

If you remove nodes/midplanes from the slurm.conf, 'scontrol reconfigure' will not suffice.  You need to restart all the daemons.  Though I wouldn't expect a segfault.

I would expect the IBM database to report to Slurm that the midplanes disappeared.  You say the segfault is happening all the time, though, so I am wondering if there is a disconnect.

Perhaps the bridgeapi log would be handy as well.
Comment 3 Ryan Day 2018-01-23 11:46:14 MST
Created attachment 5992 [details]
slurmctld.log from time of crash

Here's the slurmctld.log that includes the time of the crash. There were actually two crashes in the time frame covered by this log, with core files generated at 1/22-22:16 and 1/22-23:09.

There haven't been any messages in the bridgeapi.log since 1/10, so I don't think there's anything useful there.
Comment 4 Ryan Day 2018-01-23 11:50:58 MST
The change in slurm.conf to remove the nodes was:

tyche@day36:svn diff -r 26830:26735 slurm.conf
Index: slurm.conf
===================================================================
--- slurm.conf	(revision 26830)
+++ slurm.conf	(revision 26735)
@@ -74,10 +74,10 @@
 # COMPUTE NODES
 FrontendName=rzuseqlac[3-4] FrontendAddr=erzuseqlac[3-4]
 NodeName=DEFAULT Procs=8192 RealMemory=2097152 State=UNKNOWN
-NodeName=rzuseq[0000,0010]
+NodeName=rzuseq[0000,0010x0011]
 NodeName=rzuseq0001 State=DOWN
 
 Include /etc/slurm/slurm.conf.updates
 
 PartitionName=pbatch Nodes=rzuseq0010 Default=No State=UP Shared=FORCE DefaultTime=60 MaxTime=24:00:00 MaxNodes=256 AllowGroups=langer1,leon,bgldev
-PartitionName=pall Nodes=rzuseq[0000,0010] Default=No State=Down Shared=FORCE DefaultTime=60 MaxTime=4:00:00 MaxNodes=1536 AllowGroups=bgldev
+PartitionName=pall Nodes=rzuseq[0000,0010,0011] Default=No State=Down Shared=FORCE DefaultTime=60 MaxTime=4:00:00 MaxNodes=1536 AllowGroups=bgldev
tyche@day36:
Comment 5 Danny Auble 2018-01-23 12:03:59 MST
Thanks Ryan, from bg_console could you send me the output of 

list bgqnode
Comment 6 Danny Auble 2018-01-23 12:09:51 MST
The two crashes seem to have died in different places.  Could you send the backtrace from the other one?

Could you send your bluegene.conf as well?

It is clear from the logs that rzuseq0011 is not found when Slurm tries to drain it:

[2018-01-22T23:30:29.589] error: find_node_record: lookup failure for rzuseq0011
[2018-01-22T23:30:29.590] error: find_node_record: lookup failure for rzuseq0011
[2018-01-22T23:30:29.590] error: drain_nodes: node rzuseq0011 does not exist
[2018-01-22T23:30:29.651] error: find_node_record: lookup failure for rzuseq0011
[2018-01-22T23:30:29.651] error: find_node_record: lookup failure for rzuseq0011
[2018-01-22T23:30:29.651] error: drain_nodes: node rzuseq0011 does not exist

I notice you also have

NodeName=rzuseq0001 State=DOWN

Is this correct?
Comment 7 Ryan Day 2018-01-23 12:23:12 MST
Created attachment 5996 [details]
output of bg_console list bgqnodes
Comment 8 Ryan Day 2018-01-23 12:23:47 MST
Created attachment 5997 [details]
bluegene.conf
Comment 9 Ryan Day 2018-01-23 12:27:51 MST
The backtrace in the first comment was from the 22:16 core file. The backtrace from the 23:09 core file looks like:

(gdb) where
#0  0x0000040000174dc0 in .raise () from /lib64/libc.so.6
#1  0x0000040000176d54 in .abort () from /lib64/libc.so.6
#2  0x000004000016b8ec in .__assert_fail_base () from /lib64/libc.so.6
#3  0x000004000016ba04 in .__assert_fail () from /lib64/libc.so.6
#4  0x000000001017eac4 in bit_test (b=0x400780a2da0, bit=3) at bitstring.c:229
#5  0x00000400003f8480 in _internal_removable_set_mps (level=4, bitmap=0x400780a2da0, coords=0x400290fcf98, mark=true, except=true) at ba_common.c:514
#6  0x00000400003f8308 in _internal_removable_set_mps (level=3, bitmap=0x400780a2da0, coords=0x400290fcf98, mark=true, except=true) at ba_common.c:502
#7  0x00000400003f8308 in _internal_removable_set_mps (level=2, bitmap=0x400780a2da0, coords=0x400290fcf98, mark=true, except=true) at ba_common.c:502
#8  0x00000400003f8308 in _internal_removable_set_mps (level=1, bitmap=0x400780a2da0, coords=0x400290fcf98, mark=true, except=true) at ba_common.c:502
#9  0x00000400003f8308 in _internal_removable_set_mps (level=0, bitmap=0x400780a2da0, coords=0x400290fcf98, mark=true, except=true) at ba_common.c:502
#10 0x00000400003fc9f4 in ba_set_removable_mps (bitmap=0x400780a2da0, except=true) at ba_common.c:1539
#11 0x00000400003df25c in create_dynamic_block (block_list=0x1080bd70, request=0x400290fd318, my_block_list=0x1080bd70, track_down_nodes=true)
    at bg_dynamic_block.c:210
#12 0x00000400003e4518 in _dynamically_request (block_list=0x1080bd70, blocks_added=0x400290fd6bc, request=0x400290fd318, user_req_nodes=0x0, query_mode=512)
    at bg_job_place.c:1105
#13 0x00000400003e526c in _find_best_block_match (block_list=0x1080bd70, blocks_added=0x400290fd6bc, job_ptr=0x400780a2530, slurm_block_bitmap=0x400780a2da0, 
    min_nodes=1, max_nodes=1, req_nodes=1, found_bg_record=0x400290fd690, query_mode=512, avail_cpus=26848, exc_core_bitmap=0x0) at bg_job_place.c:1420
#14 0x00000400003e6a50 in submit_job (job_ptr=0x400780a2530, slurm_block_bitmap=0x400780a2da0, min_nodes=1, max_nodes=1, req_nodes=1, mode=0, 
    preemptee_candidates=0x0, preemptee_job_list=0x400290fdd60, exc_core_bitmap=0x0) at bg_job_place.c:1915
#15 0x00000400003cee24 in select_p_job_test (job_ptr=0x400780a2530, bitmap=0x400780a2da0, min_nodes=1, max_nodes=1, req_nodes=1, mode=0, preemptee_candidates=0x0, 
    preemptee_job_list=0x400290fdd60, exc_core_bitmap=0x0) at select_bluegene.c:1635
#16 0x00000000101a83f4 in select_g_job_test (job_ptr=0x400780a2530, bitmap=0x400780a2da0, min_nodes=1, max_nodes=1, req_nodes=1, mode=0, preemptee_candidates=0x0, 
    preemptee_job_list=0x400290fdd60, exc_core_bitmap=0x0) at node_select.c:576
#17 0x00000000100cece8 in _pick_best_nodes (node_set_ptr=0x40078075ba0, node_set_size=1, select_bitmap=0x400290fdd80, job_ptr=0x400780a2530, part_ptr=0x10821ea0, 
    min_nodes=1, max_nodes=1, req_nodes=1, test_only=false, preemptee_candidates=0x0, preemptee_job_list=0x400290fdd60, has_xand=false, exc_core_bitmap=0x0, 
    resv_overlap=false) at node_scheduler.c:1854
#18 0x00000000100cd308 in _get_req_features (node_set_ptr=0x40078075ba0, node_set_size=1, select_bitmap=0x400290fdd80, job_ptr=0x400780a2530, part_ptr=0x10821ea0, 
    min_nodes=1, max_nodes=1, req_nodes=1, test_only=false, preemptee_job_list=0x400290fdd60, can_reboot=true) at node_scheduler.c:1301
#19 0x00000000100d0730 in select_nodes (job_ptr=0x400780a2530, test_only=false, select_node_bitmap=0x0, unavail_node_str=0x0, err_msg=0x400290fe3f8)
    at node_scheduler.c:2361
#20 0x00000000100773fc in _select_nodes_parts (job_ptr=0x400780a2530, test_only=false, select_node_bitmap=0x0, err_msg=0x400290fe3f8) at job_mgr.c:4197
#21 0x0000000010077e64 in job_allocate (job_specs=0x400780a2100, immediate=0, will_run=0, resp=0x0, allocate=1, submit_uid=53591, job_pptr=0x400290fe400, 
    err_msg=0x400290fe3f8, protocol_version=7936) at job_mgr.c:4401
#22 0x00000000100ebb18 in _slurm_rpc_allocate_resources (msg=0x400290fe6d8) at proc_req.c:1134
#23 0x00000000100e8b74 in slurmctld_req (msg=0x400290fe6d8, arg=0x4002c000c60) at proc_req.c:308
#24 0x000000001004b59c in _service_connection (arg=0x4002c000c60) at controller.c:1133
#25 0x00000400000dc6fc in .start_thread () from /lib64/libpthread.so.0
#26 0x000004000023ae1c in .__clone () from /lib64/libc.so.6
(gdb)
Comment 10 Ryan Day 2018-01-23 12:30:21 MST
Also, it is correct that we have 'NodeName=rzuseq0001 State=DOWN' in slurm.conf. I'm not sure why that's there though. Possibly should be rzuseq0011...
Comment 11 Ryan Day 2018-01-23 14:48:27 MST
Actually, the NodeName=rzuseq0001 State=DOWN line has been there since at least 2013, so I don't think it's the issue.
Comment 12 Danny Auble 2018-01-23 14:59:22 MST
Interesting.

Since 0011 is non-existent, why was it added?

-NodeName=rzuseq[0000,0010]
+NodeName=rzuseq[0000,0010x0011]

I am guessing you only have 0000,0010 then?  2 midplanes in 2 racks?
Comment 13 Ryan Day 2018-01-23 15:16:42 MST
That's just the order that I put the revisions in my svn diff (newer first), so the '+' lines are the older config. The current slurm.conf is the one with NodeName=rzuseq[0000,0010]. I'll attach it too.

Comment 14 Ryan Day 2018-01-23 15:18:27 MST
Created attachment 5998 [details]
rzuseq slurm.conf
Comment 15 Danny Auble 2018-01-23 15:19:30 MST
Ah, well that is different then :). I read the + as the new line.

Thanks for the actual .conf.
Comment 16 Danny Auble 2018-01-23 15:23:17 MST
I am wondering whether, if you put it back in as DOWN, things would be better.  It is clear something thinks it is there.  I am guessing the code really wants geometric prisms.
Comment 17 Ryan Day 2018-01-23 15:45:18 MST
That sounds reasonable. I'll do that, restart the slurmctld and slurmds, and cross my fingers.

Comment 18 Danny Auble 2018-01-24 13:56:54 MST
I'll ping you next week and see if this fixes the issue.
Comment 19 Danny Auble 2018-01-30 09:30:55 MST
Ryan any failures since the configuration change?
Comment 20 Ryan Day 2018-01-30 14:32:36 MST
No crashes since last Tuesday. Looks pretty good. 

Comment 21 Danny Auble 2018-01-30 15:08:33 MST
That is great.  It makes sense that the perfect prism config would make a difference.  If you are ok to close this, I am as well.  I don't think I really want to chase down why the non-prism config wasn't happy if I don't have to, as I doubt that will be a common standing config.

Let me know what you think.  We can always reopen if need be.
Comment 22 Ryan Day 2018-01-30 15:27:19 MST
Definitely feel free to close this. It was, hopefully, a one-off situation, and I'm planning to avoid updating the BGQs as much as possible.

Comment 23 Danny Auble 2018-01-30 15:30:03 MST
I agree.