Ticket 17323

Summary: slurmctld SEGFAULT - related to removing "CpuSpecList=" from slurm configuration
Product: Slurm
Reporter: Greg Wickham <greg.wickham>
Component: slurmctld
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED
QA Contact:
Severity: 2 - High Impact
Priority: ---
CC: bart
Version: 23.02.4
Hardware: Linux
OS: Linux
Site: KAUST
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 23.02.5
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---
Attachments: SEGFAULT information from coredump of slurmctld

Description Greg Wickham 2023-08-01 05:34:14 MDT
Created attachment 31540
SEGFAULT information from coredump of slurmctld

With this node definition in nodes.conf:

NodeName=gpu202-16-l CoresPerSocket=64 CpuSpecList=1,2 Features=4gpus,a100,amd,cpu_amd_epyc_7702,el7,gpu,gpu_a100,ibex2019,local_200G,local_400G,local_500G,local_950G,milan,nolmem Gres=gpu:a100:4 RealMemory=508928 Sockets=1 ThreadsPerCore=1 Weight=12160

If "CpuSpecList=1,2" is removed from this definition and slurmctld is restarted, slurmctld segfaults when node "gpu202-16-l" tries to register with it.

See the attached file.

   -greg
Comment 3 Dominik Bartkiewicz 2023-08-01 06:04:28 MDT
Hi

I can reproduce this issue.
I have a patch that fixes this segfault and should be safe, but it still needs to be fully tested.
If you need it, I can share it with you.

Dominik
Comment 4 Greg Wickham 2023-08-01 07:45:32 MDT
Hi Dominik.

Thanks.

We can wait until the patch has passed regression testing.

For now we know what causes the issue and will actively avoid it.

   -greg
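
Until the fix is deployed, one way to avoid the trigger is to check a pending configuration change before restarting slurmctld. The following is a minimal sketch, not part of Slurm: the function names are hypothetical, and it assumes each node definition sits on a single "NodeName=..." line with no backslash continuations and no bracketed node ranges. It reports nodes that had CpuSpecList in the old configuration but not in the new one.

```python
def parse_node_line(line):
    """Split a 'NodeName=... Key=Value ...' line into a dict with lowercase keys."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            # Compare keys case-insensitively.
            fields[key.lower()] = value
    return fields


def index_nodes(lines):
    """Map node name -> parsed fields for every NodeName= line (comments stripped)."""
    nodes = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("nodename="):
            fields = parse_node_line(line)
            nodes[fields["nodename"]] = fields
    return nodes


def nodes_losing_cpuspeclist(old_lines, new_lines):
    """Return sorted names of nodes whose CpuSpecList was removed (hypothetical helper)."""
    old, new = index_nodes(old_lines), index_nodes(new_lines)
    return sorted(
        name
        for name, fields in old.items()
        if "cpuspeclist" in fields
        and name in new
        and "cpuspeclist" not in new[name]
    )


old_conf = ["NodeName=gpu202-16-l CoresPerSocket=64 CpuSpecList=1,2 Sockets=1 ThreadsPerCore=1"]
new_conf = ["NodeName=gpu202-16-l CoresPerSocket=64 Sockets=1 ThreadsPerCore=1"]
print(nodes_losing_cpuspeclist(old_conf, new_conf))
```

Any node reported here matches the change described in this ticket, so on 23.02.4 the restart is best postponed or the CpuSpecList entry kept until the fixed release is installed.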
Comment 7 Dominik Bartkiewicz 2023-08-10 02:25:54 MDT
Hi

This commit fixes the bug:
https://github.com/SchedMD/slurm/commit/f0635ad40a7
It will be available in 23.02.5 and above.

Let me know if we can close this issue or if you have any additional questions.

Dominik
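
To decide whether a given installation already contains the fix, the version can be compared against 23.02.5. This is a small sketch under assumptions: the helper names are hypothetical, and the version string is assumed to contain a plain "major.minor.micro" triple (pre-release or packaging suffixes are ignored).

```python
import re

# First release containing the fix, per this ticket.
FIXED_IN = (23, 2, 5)


def parse_version(text):
    """Extract the first 'major.minor.micro' triple from text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized Slurm version: {text!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(version):
    """Return True if version is 23.02.5 or later (hypothetical helper)."""
    return parse_version(version) >= FIXED_IN


print(has_fix("23.02.4"))
print(has_fix("23.02.5"))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as "23.02.10" sorting before "23.02.5".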
Comment 8 Greg Wickham 2023-08-10 02:27:35 MDT
Dominik,

You may close this case.

  -greg