Hi SchedMD!

What would an upgrade to a new major version be without a segfault? :)

So here it is: while slurmctld in 18.08 (and all earlier versions) was fine with AcctGather*Type options defined in slurm.conf without any /etc/slurm/acct_gather.conf file on the controller, 19.05 is not, and segfaults at startup with this:

# slurmctld -Dvvv
slurmctld: debug: Log file re-opened
slurmctld: slurmctld version 19.05.3-2 started on cluster sherlock
slurmctld: Munge credential signature plugin loaded
slurmctld: debug: Munge authentication plugin loaded
slurmctld: Linear node selection plugin loaded with argument 20
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 20
slurmctld: select/cons_tres loaded with argument 20
slurmctld: Cray/Aries node selection plugin loaded
slurmctld: debug: init: Gres GPU plugin loaded
slurmctld: preempt/partition_prio loaded
slurmctld: debug: Checkpoint plugin loaded: checkpoint/none
slurmctld: debug: AcctGatherProfile NONE plugin loaded
slurmctld: debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
Segmentation fault

From the core file:

Program terminated with signal 11, Segmentation fault.
#0 0x00007f0d7e5c0402 in s_p_pack_hashtbl (hashtbl=hashtbl@entry=0x0, options=0x2041730, cnt=2) at parse_config.c:2192
2192 parse_config.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-19.05.3-2.el7.x86_64
(gdb) bt
#0 0x00007f0d7e5c0402 in s_p_pack_hashtbl (hashtbl=hashtbl@entry=0x0, options=0x2041730, cnt=2) at parse_config.c:2192
#1 0x00007f0d7e5d8a4f in acct_gather_conf_init () at slurm_acct_gather.c:142
#2 0x000000000042d250 in main (argc=1, argv=<optimized out>) at controller.c:532

Maybe an extra check could make slurmctld exit gracefully if the file is missing, instead of segfaulting?

Also, our /etc/slurm/acct_gather.conf file is empty, because we don't have any particular option to specify here. So a note in the release notes stating that the file is now required to exist, even if empty, would be nice.

Other than that, smooth upgrade so far, but we're not done yet, so stay tuned! :)

Cheers,
--
Kilian
Kilian,

After testing, I have found that this bug is a duplicate of bug 7893. In that bug, Slurm was built with debug info, so it aborted instead of segfaulting. The assert that was triggered there exists to make sure a pointer is not NULL, because the pointer is dereferenced right away. In your build that assert is not active, so the NULL pointer is dereferenced, and that is what causes the segfault.

We have a patch under review for the fix. Please see bug 7893 for progress, because I am going to close this one as a duplicate.

*** This ticket has been marked as a duplicate of ticket 7893 ***
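For readers following along, the difference between the abort in bug 7893 and the segfault reported here can be illustrated with a minimal C sketch. This is not Slurm's code and not the patch under review: config_table_t, table_count_unguarded, and table_count_checked are hypothetical names invented for illustration, and the guarded variant is only one possible way to handle a missing table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a parsed config table; not Slurm's real type. */
typedef struct {
    size_t count;   /* number of parsed key/value pairs */
} config_table_t;

/*
 * Unguarded variant, mirroring the failure mode described in the ticket.
 * With assertions enabled, a NULL table aborts at the assert.
 * With NDEBUG defined, the assert compiles away and tbl->count
 * dereferences NULL (undefined behavior, typically SIGSEGV).
 */
size_t table_count_unguarded(const config_table_t *tbl)
{
    assert(tbl != NULL);
    return tbl->count;
}

/*
 * Guarded variant (an assumption, not the actual fix): treat a missing
 * table, e.g. because no config file was parsed, as a reportable error
 * instead of dereferencing it.
 * Returns 0 on success, -1 if tbl or out is NULL.
 */
int table_count_checked(const config_table_t *tbl, size_t *out)
{
    if (tbl == NULL || out == NULL) {
        fprintf(stderr, "error: config table is missing\n");
        return -1;
    }
    *out = tbl->count;
    return 0;
}
```

With this sketch, calling table_count_checked(NULL, &n) returns -1 and leaves the process running, whereas table_count_unguarded(NULL) either aborts or crashes depending on whether assertions are compiled in.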
Thanks! Cheers,