Ticket 7893

Summary: slurmctld core dump with acct_gather_infiniband/ofed
Product: Slurm
Reporter: Mathis Clayer <mathis.clayer>
Component: Accounting
Assignee: Gavin D. Howard <gavin>
Status: RESOLVED FIXED
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: kilian
Version: 19.05.2
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=7940
Site: Atos/Eviden Sites
Alineos Sites: ---
Atos/Eviden Sites: Grenoble
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 19.05.4
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---
Attachments: slurm.conf

Description Mathis Clayer 2019-10-08 11:02:03 MDT
Hello,
With the option AcctGatherInfinibandType=acct_gather_infiniband/ofed set and no acct_gather.conf file present, I got a slurmctld core dump:

slurmctld: debug2: No acct_gather.conf file (/usr/local/mathis/19.05/etc/acct_gather.conf)
slurmctld: debug2: before rc
slurmctld: debug2: after rc
slurmctld: debug2:  in s_p_pack_hashtbl
slurmctld: debug2:  in s_p_pack_hashtbl
slurmctld: error: parse_config.c:2190: s_p_pack_hashtbl(): Assertion (p) failed.
Aborted (core dumped)


If I create the file acct_gather.conf, I don't have a core dump anymore.
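
A minimal sketch of that workaround, assuming the configuration directory shown in the log above. The ticket does not say what was put in the file or whether an empty file is enough; the single setting below is an assumption, chosen because port 1 is the documented default for InfinibandOFEDPort.

```conf
# slurm.conf (relevant line, as reported in this ticket)
AcctGatherInfinibandType=acct_gather_infiniband/ofed

# /usr/local/mathis/19.05/etc/acct_gather.conf
# Assumed content: monitor port 1 of the local InfiniBand card.
InfinibandOFEDPort=1
```

According to the ticket's Version Fixed field, the fix ships in 19.05.4, so this workaround should only matter on releases before that.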
Comment 1 Gavin D. Howard 2019-10-08 13:48:00 MDT
Thank you for the report. I am looking into it.
Comment 2 Gavin D. Howard 2019-10-09 11:20:20 MDT
I am making progress, but I have a request: can you send me your slurm.conf?
Comment 3 Mathis Clayer 2019-10-10 07:02:34 MDT
Created attachment 11900 [details]
slurm.conf
Comment 5 Gavin D. Howard 2019-10-10 11:40:17 MDT
Good news! I have found the bug and have started the review process for a patch.
Comment 9 Gavin D. Howard 2019-10-16 12:21:02 MDT
*** Ticket 7940 has been marked as a duplicate of this ticket. ***
Comment 16 Gavin D. Howard 2019-10-25 16:23:20 MDT
This bug has been fixed, and the fix will be in 19.05.4.

I am closing this bug, but feel free to reopen if 19.05.4 has the same problem.