Ticket 15009

Summary: slurmctld segfault (_job_alloc)
Product: Slurm Reporter: Kilian Cavalotti <kilian>
Component: slurmctld Assignee: Carlos Tripiana Montes <tripiana>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: tripiana
Version: 22.05.3   
Hardware: Linux   
OS: Linux   
Site: Stanford Slinky Site: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Attachments: gdb 't a a bt'

Description Kilian Cavalotti 2022-09-20 16:25:18 MDT
Created attachment 26893
gdb 't a a bt'

Sorry the bug summaries are not very creative, but we just hit another segfault, with yet another different backtrace (although I suspect 15007 and this one may both be related to 14885?)


From core.24193
-- 8< --------------------------------------------------------------------------
(gdb) bt
#0  bit_size (b=0x0) at bitstring.c:286
#1  0x000000000044345c in _job_alloc (gres_state_job=gres_state_job@entry=0x49b04f0, job_gres_list_alloc=0x49b0580, gres_state_node=<optimized out>, node_cnt=node_cnt@entry=1, node_index=node_index@entry=719, node_offset=node_offset@entry=0,
    job_id=job_id@entry=62849174, node_name=node_name@entry=0x7f6a98084c10 "sh02-12n04", core_bitmap=core_bitmap@entry=0x7f6a99266310, new_alloc=new_alloc@entry=false) at gres_ctld.c:460
#2  0x0000000000444c72 in gres_ctld_job_alloc (job_gres_list=<optimized out>, job_gres_list_alloc=job_gres_list_alloc@entry=0x49afdc8, node_gres_list=node_gres_list@entry=0x2418130, node_cnt=1, node_index=node_index@entry=719, node_offset=node_offset@entry=0,
    job_id=62849174, node_name=0x7f6a98084c10 "sh02-12n04", core_bitmap=0x7f6a99266310, new_alloc=new_alloc@entry=false) at gres_ctld.c:951
#3  0x00007f6d2e573339 in job_res_add_job (job_ptr=job_ptr@entry=0x49afcb0, action=action@entry=JOB_RES_ACTION_NORMAL) at job_resources.c:328
#4  0x00007f6d2e568b5f in select_p_select_nodeinfo_set (job_ptr=0x49afcb0) at cons_common.c:1892
#5  0x00007f6d2fe99bc9 in select_g_select_nodeinfo_set (job_ptr=job_ptr@entry=0x49afcb0) at select.c:812
#6  0x00000000004aa6f8 in _sync_jobs_to_conf () at read_config.c:1382
#7  0x00000000004ad0de in read_slurm_conf (recover=recover@entry=1, reconfig=reconfig@entry=true) at read_config.c:1694
#8  0x00000000004a66f1 in _slurm_rpc_reconfigure_controller (msg=0x7f6a981c8190) at proc_req.c:3324
#9  0x00000000004a8230 in slurmctld_req (msg=msg@entry=0x7f6a981c8190) at proc_req.c:6676
#10 0x000000000042de35 in _service_connection (arg=0x0) at controller.c:1380
#11 0x00007f6d2f99aea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f6d2f1a7b0d in clone () from /lib64/libc.so.6
-- 8< --------------------------------------------------------------------------
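
For readers following the backtrace: frame #0 shows bit_size() being entered with b=0x0, called from _job_alloc() at gres_ctld.c:460, so the crash is a NULL bitmap pointer being dereferenced. The sketch below only illustrates that failure mode and a defensive check. It is not Slurm code and not the fix tracked in ticket 14885; demo_bitmap_t and every demo_* function are hypothetical names made up for this example, and the struct layout is an assumption unrelated to Slurm's bitstr_t.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a bitmap; NOT Slurm's bitstr_t layout. */
typedef struct {
	int64_t nbits;  /* number of valid bits */
	uint8_t *bits;  /* backing storage, one bit per entry */
} demo_bitmap_t;

/* Allocate a zeroed bitmap of nbits bits; returns NULL if allocation fails. */
demo_bitmap_t *demo_bitmap_alloc(int64_t nbits)
{
	assert(nbits > 0);
	demo_bitmap_t *b = malloc(sizeof(*b));
	if (!b)
		return NULL;
	b->bits = calloc((size_t)((nbits + 7) / 8), 1);
	if (!b->bits) {
		free(b);
		return NULL;
	}
	b->nbits = nbits;
	return b;
}

/* Release a bitmap; accepts NULL, like free(). */
void demo_bitmap_free(demo_bitmap_t *b)
{
	if (b) {
		free(b->bits);
		free(b);
	}
}

/*
 * Unguarded size lookup: dereferences b unconditionally, so passing NULL
 * is undefined behaviour (typically SIGSEGV), analogous to frame #0.
 */
int64_t demo_bit_size(const demo_bitmap_t *b)
{
	return b->nbits;
}

/* Guarded variant: reports -1 for a missing bitmap instead of crashing. */
int64_t demo_bit_size_safe(const demo_bitmap_t *b)
{
	return b ? demo_bit_size(b) : -1;
}
```

A caller would test for -1 from demo_bit_size_safe() and skip or log the node whose bitmap is missing. Whether a guard like this or repairing the state that left the pointer NULL is the appropriate fix is a question for the analysis in ticket 14885.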

"t a a bt" attached

Cheers,
--
Kilian
Comment 1 Carlos Tripiana Montes 2022-09-21 09:33:58 MDT
The same applies here. Since we have a good reproducer for the issue in Bug 14885, I can step through the code in a debugger and check how the code paths behind the other 2 segfaults behave with the patched version.

I guess if it doesn't segfault it's because it's already fixed :).
Comment 2 Carlos Tripiana Montes 2022-09-23 04:55:33 MDT
Marking as duplicate now.

Cheers,
Carlos.

*** This ticket has been marked as a duplicate of ticket 14885 ***