Ticket 15007

Summary: slurmctld segfault (_build_sock_gres_by_topo)
Product: Slurm
Reporter: Kilian Cavalotti <kilian>
Component: slurmctld
Assignee: Carlos Tripiana Montes <tripiana>
Status: RESOLVED DUPLICATE
Severity: 3 - Medium Impact
Priority: ---
CC: tripiana
Version: 22.05.3
Hardware: Linux
OS: Linux
Site: Stanford
Attachments: gdb 't a a bt'

Description Kilian Cavalotti 2022-09-20 14:25:34 MDT
Created attachment 26892 [details]
gdb 't a a bt'

Hi SchedMD,

Yet another slurmctld segfault to report... 
22.05 doesn't seem like a great vintage so far, I'm afraid :(

-- 8< -------------------------------------------------------------------------
(gdb) bt
#0  bit_size (b=b@entry=0x0) at bitstring.c:286
#1  0x00007f4b95bb3a20 in bit_copy (b=0x0) at bitstring.c:795
#2  0x00007f4b942e43f2 in _build_sock_gres_by_topo (node_inx=1387, user_id=339761, req_sock_map=0x7f4b7fbeb5a8, s_p_n=4294967294, enforce_binding=false, node_name=0x7f49b4b38c80 "sh03-14n02", job_id=62836033, cores_per_sock=16, sockets=2, core_bitmap=0x7f4b8298b820,
    use_total_gres=false, gres_state_node=0x2816b30, gres_state_job=0x5f68330) at gres_sched.c:179
#3  gres_sched_create_sock_gres_list (job_gres_list=<optimized out>, node_gres_list=node_gres_list@entry=0x28169c0, use_total_gres=use_total_gres@entry=false, core_bitmap=0x7f4b8298b820, sockets=2, cores_per_sock=16, job_id=62836033,
    node_name=0x7f49b4b38c80 "sh03-14n02", enforce_binding=enforce_binding@entry=false, s_p_n=s_p_n@entry=1, req_sock_map=req_sock_map@entry=0x7f4b7fbeb5a8, user_id=339761, node_inx=node_inx@entry=1387) at gres_sched.c:782
#4  0x00007f4b942ed3cb in can_job_run_on_node (job_ptr=0x5f67c50, core_map=0x7f4b82576340, node_i=1387, s_p_n=1, node_usage=<optimized out>, cr_type=<optimized out>, test_only=false, will_run=true, part_core_map=0x0) at job_test.c:3342
#5  0x00007f4b942fd460 in _get_res_avail (part_core_map=0x0, will_run=<optimized out>, test_only=<optimized out>, cr_type=20, node_usage=0x7f49b4d1e4d0, core_map=0x7f4b82576340, node_map=0x7f4b82af59b0, job_ptr=0x5f67c50) at job_test.c:347
#6  _select_nodes (job_ptr=job_ptr@entry=0x5f67c50, min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=500000, req_nodes=req_nodes@entry=1, node_bitmap=node_bitmap@entry=0x7f4b82af59b0, avail_core=0x7f4b82576340, node_usage=node_usage@entry=0x7f49b4d1e4d0,
    cr_type=cr_type@entry=20, test_only=test_only@entry=false, will_run=will_run@entry=true, part_core_map=0x0, prefer_alloc_nodes=prefer_alloc_nodes@entry=false, tres_mc_ptr=0x7f4b8285b720) at job_test.c:505
#7  0x00007f4b942fe52b in _job_test (job_ptr=job_ptr@entry=0x5f67c50, node_bitmap=node_bitmap@entry=0x7f4b82af59b0, min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=500000, req_nodes=req_nodes@entry=1, mode=mode@entry=2, cr_type=cr_type@entry=20,
    job_node_req=job_node_req@entry=NODE_CR_ONE_ROW, cr_part_ptr=0x7f49b44d8d10, node_usage=0x7f49b4d1e4d0, exc_cores=exc_cores@entry=0x0, prefer_alloc_nodes=prefer_alloc_nodes@entry=false, qos_preemptor=qos_preemptor@entry=false, preempt_mode=preempt_mode@entry=false)
    at job_test.c:925
#8  0x00007f4b943009d1 in _will_run_test (exc_core_bitmap=0x0, preemptee_job_list=0x0, preemptee_candidates=0x0, job_node_req=1, req_nodes=1, max_nodes=500000, min_nodes=1, node_bitmap=0x7f4b82af59b0, job_ptr=0x5f67c50) at job_test.c:1886
#9  common_job_test (job_ptr=job_ptr@entry=0x5f67c50, node_bitmap=node_bitmap@entry=0x7f4b82af59b0, min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=500000, req_nodes=req_nodes@entry=1, mode=mode@entry=2, preemptee_candidates=preemptee_candidates@entry=0x0,
    preemptee_job_list=preemptee_job_list@entry=0x0, exc_cores=0x0) at job_test.c:2344
#10 0x00007f4b942ee855 in select_p_job_test (job_ptr=0x5f67c50, node_bitmap=0x7f4b82af59b0, min_nodes=1, max_nodes=500000, req_nodes=1, mode=<optimized out>, preemptee_candidates=0x0, preemptee_job_list=0x0, exc_core_bitmap=0x0) at select_cons_tres.c:501
#11 0x00007f4b95c22498 in select_g_job_test (job_ptr=job_ptr@entry=0x5f67c50, bitmap=0x7f4b82af59b0, min_nodes=min_nodes@entry=1, max_nodes=max_nodes@entry=500000, req_nodes=req_nodes@entry=1, mode=mode@entry=2, preemptee_candidates=preemptee_candidates@entry=0x0,
    preemptee_job_list=preemptee_job_list@entry=0x0, exc_core_bitmap=exc_core_bitmap@entry=0x0) at select.c:499
#12 0x00007f4b7fbf372b in _try_sched (job_ptr=0x5f67c50, avail_bitmap=avail_bitmap@entry=0x7f4b7fbebd50, min_nodes=1, max_nodes=500000, req_nodes=1, exc_core_bitmap=0x0) at backfill.c:613
#13 0x00007f4b7fbf7eb1 in _attempt_backfill () at backfill.c:2517
#14 0x00007f4b7fbfa1f8 in backfill_agent (args=<optimized out>) at backfill.c:1082
#15 0x00007f4b95723ea5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f4b94f30b0d in clone () from /lib64/libc.so.6
-- 8< -------------------------------------------------------------------------


The usual gdb 't a a bt' output is attached.

Thanks!
--
Kilian
Comment 1 Carlos Tripiana Montes 2022-09-21 08:45:06 MDT
Hi Kilian,

Whatever happened, this call:

bit_copy (b=0x0)

is bad. The code at gres_sched.c:179 in _build_sock_gres_by_topo reads:

if (!sock_gres->bits_any_sock) {
	sock_gres->bits_any_sock =
		bit_copy(gres_ns->topo_gres_bitmap[i]);
}

So gres_ns->topo_gres_bitmap[i] is NULL (gres_ns = gres_state_node->gres_data, and i loops up to gres_ns->topo_cnt).
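Just to illustrate the failure mode, a defensive guard at that spot would look something like this (a sketch only, not the actual fix, which I still need to confirm):

/* Sketch only: skip the copy when the per-topo bitmap is missing,
 * instead of handing bit_copy() a NULL pointer. */
if (!sock_gres->bits_any_sock &&
    gres_ns->topo_gres_bitmap[i]) {
	sock_gres->bits_any_sock =
		bit_copy(gres_ns->topo_gres_bitmap[i]);
}

The real question is why topo_gres_bitmap[i] is NULL for this node in the first place, which could point to a configuration issue.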

My first request: please attach your slurm.conf and your gres.conf :).
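In particular, I want to see how the GPU topology is declared. For reference, per-node gres.conf entries usually look something like the following (node names, device paths, and core ranges here are made up purely for illustration):

# Illustrative example only -- not your actual configuration.
NodeName=sh03-14n[01-02] Name=gpu File=/dev/nvidia[0-1] Cores=0-15
NodeName=sh03-14n[01-02] Name=gpu File=/dev/nvidia[2-3] Cores=16-31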

Cheers,
Carlos.
Comment 2 Carlos Tripiana Montes 2022-09-21 09:12:17 MDT
As commented in bug 15009 comment 0, this might already be fixed by bug 14885, but I need to check.

I'll keep you posted.
Comment 3 Kilian Cavalotti 2022-09-21 09:26:56 MDT
Hi Carlos, 

(In reply to Carlos Tripiana Montes from comment #2)
> As commented in Bug 15009 Comment 0, it might be already fixed in Bug 14885
> but I need to check.

Thanks for taking a look!

I believe those three segfaults (bugs 14885, 15007, 15009) may be related, but since the backtraces differed slightly each time, I wanted to report them separately, so that each could be double-checked and the relationship confirmed.

We had not applied the patch Dominik provided in bug 14885 until last night, so 15007 and 15009 occurred without the patch.

But we're running with https://github.com/SchedMD/slurm/commit/4be0358ce9 now, so hopefully we won't see any more occurrences.

Thanks!
--
Kilian
Comment 4 Carlos Tripiana Montes 2022-09-22 02:18:33 MDT
So far, I have not been able to reproduce the issue on the current 22.05 branch HEAD.

If you hit the issue again, please describe, as best you can, what you did to reproduce it.

I have been experimenting, submitting jobs and changing things in gres.conf to get a job's _try_sched to fire this issue, following a similar approach to bug 14885, but so far it's working fine.
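For reference, the kind of submission I have been using to exercise the per-socket GRES path looks roughly like this (an illustrative command, not an exact reproducer):

# Illustrative: a GPU job with per-socket constraints, so the scheduler
# goes through _build_sock_gres_by_topo() when evaluating nodes.
sbatch --gres=gpu:1 --sockets-per-node=2 --ntasks=2 --wrap="sleep 300"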
Comment 5 Kilian Cavalotti 2022-09-22 09:16:06 MDT
Hi Carlos, 

(In reply to Carlos Tripiana Montes from comment #4)
> So far, I have not been able to reproduce the issue on the current
> 22.05 branch HEAD.
> 
> If you hit the issue again, please describe, as best you can, what you
> did to reproduce it.
> 
> I have been experimenting, submitting jobs and changing things in
> gres.conf to get a job's _try_sched to fire this issue, following a
> similar approach to bug 14885, but so far it's working fine.

Great to hear! No new segfaults to report on our side since we applied the patch either, so that looks very promising.

I'll be sure to report any new issue we see related to this, but in the meantime, I guess bug 15007 and bug 15009 could both be closed, or marked as duplicates of bug 14885.

Thanks!
--
Kilian
Comment 6 Carlos Tripiana Montes 2022-09-23 04:55:02 MDT
Marking as duplicate now.

Cheers,
Carlos.

*** This ticket has been marked as a duplicate of ticket 14885 ***