Ticket 7940 - slurmctld segfault if /etc/slurm/acct_gather.conf doesn't exist
Summary: slurmctld segfault if /etc/slurm/acct_gather.conf doesn't exist
Status: RESOLVED DUPLICATE of ticket 7893
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 19.05.3
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Gavin D. Howard
Reported: 2019-10-16 11:02 MDT by Kilian Cavalotti
Modified: 2019-10-16 12:22 MDT (History)
Site: Stanford
Machine Name: Sherlock

Description Kilian Cavalotti 2019-10-16 11:02:45 MDT
Hi SchedMD!

What would be an upgrade to a new major version without a segfault? :) 

So here it is: slurmctld in 18.08 (and all earlier versions) ran fine with AcctGather*Type options defined in slurm.conf and no /etc/slurm/acct_gather.conf file on the controller. 19.05 does not: it segfaults at startup with this:

# slurmctld -Dvvv
slurmctld: debug:  Log file re-opened
slurmctld: slurmctld version 19.05.3-2 started on cluster sherlock
slurmctld: Munge credential signature plugin loaded
slurmctld: debug:  Munge authentication plugin loaded
slurmctld: Linear node selection plugin loaded with argument 20
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 20
slurmctld: select/cons_tres loaded with argument 20
slurmctld: Cray/Aries node selection plugin loaded
slurmctld: debug:  init: Gres GPU plugin loaded
slurmctld: preempt/partition_prio loaded
slurmctld: debug:  Checkpoint plugin loaded: checkpoint/none
slurmctld: debug:  AcctGatherProfile NONE plugin loaded
slurmctld: debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
Segmentation fault


From the core file:

Program terminated with signal 11, Segmentation fault.
#0  0x00007f0d7e5c0402 in s_p_pack_hashtbl (hashtbl=hashtbl@entry=0x0, options=0x2041730, cnt=2) at parse_config.c:2192
2192    parse_config.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-19.05.3-2.el7.x86_64
(gdb) bt
#0  0x00007f0d7e5c0402 in s_p_pack_hashtbl (hashtbl=hashtbl@entry=0x0, options=0x2041730, cnt=2) at parse_config.c:2192
#1  0x00007f0d7e5d8a4f in acct_gather_conf_init () at slurm_acct_gather.c:142
#2  0x000000000042d250 in main (argc=1, argv=<optimized out>) at controller.c:532


Maybe an extra check could make slurmctld exit gracefully if the file is missing, instead of throwing a segfault?
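
As an illustration of the kind of check suggested above, here is a minimal sketch in C. It is not Slurm code: the helper name conf_file_readable is hypothetical, and it simply assumes that "the file is missing" can be detected by trying to open it for reading before any parsing happens.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of Slurm): report whether a config file
 * can be opened for reading. Returns 0 on success, -1 otherwise.
 */
int conf_file_readable(const char *path)
{
    FILE *fp;

    if (path == NULL)
        return -1;              /* no path given at all */

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;              /* missing or unreadable file */

    fclose(fp);
    return 0;
}
```

A caller could test the return value at startup and print an error naming the path before exiting, rather than continuing with state that was never initialized.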

Also, our /etc/slurm/acct_gather.conf file is empty, because we don't have any particular option to specify there. So a note in the release notes saying that the file is now required to exist, even if empty, would be nice.

Other than that, smooth upgrade so far, but we're not done yet, so stay tuned! :)

Cheers,
--
Kilian
Comment 2 Gavin D. Howard 2019-10-16 12:21:02 MDT
Kilian,

After testing, I have found that this bug is a duplicate of bug 7893. In that bug, Slurm was built with debug info, so it aborted on an assert instead of segfaulting. The assert that was triggered is there to make sure a pointer is not NULL, because the pointer is dereferenced right away. Without the assert, that dereference of a NULL pointer is what causes the segfault.
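
The difference between the two failure modes can be sketched in C as follows. This is only an illustration under stated assumptions: struct conf_table, count_asserted and count_checked are hypothetical names, not Slurm's types or functions, and the sketch assumes the assertion behaves like the standard C assert macro, which expands to nothing when NDEBUG is defined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a parsed config table (not Slurm's type). */
struct conf_table {
    size_t count;
};

/*
 * Assert-then-dereference pattern. With assertions enabled, a NULL tbl
 * aborts at the assert. With NDEBUG defined, the assert is compiled out
 * and a NULL tbl is dereferenced on the next line (undefined behavior,
 * typically a segfault).
 */
size_t count_asserted(const struct conf_table *tbl)
{
    assert(tbl != NULL);
    return tbl->count;
}

/*
 * Defensive variant: an explicit check that behaves the same in both
 * build types. Returns 0 and stores the count in *out, or -1 on NULL.
 */
int count_checked(const struct conf_table *tbl, size_t *out)
{
    if (tbl == NULL || out == NULL)
        return -1;
    *out = tbl->count;
    return 0;
}
```

Under these assumptions, the same NULL table produces an abort in one build and a segfault in the other, which is consistent with the two tickets describing a single underlying bug.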

We have a patch under review for the fix. Please see bug 7893 for progress because I am going to close this one as a duplicate.

*** This ticket has been marked as a duplicate of ticket 7893 ***
Comment 3 Kilian Cavalotti 2019-10-16 12:22:22 MDT
Thanks!

Cheers,