| Summary: | slurmctld SEGFAULT - related to removing "CpuSpecList=" from slurm configuration | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Greg Wickham <greg.wickham> |
| Component: | slurmctld | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | bart |
| Version: | 23.02.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | KAUST | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 23.02.5 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | SEGFAULT information from coredump of slurmctld | ||

Hi,

I can reproduce this issue. I have a patch that fixes this segfault and should be safe, but it still needs to be fully tested. If you need it, I can share it with you.

Dominik

Hi Dominik,

Thanks. We can wait until the patch has passed regression testing. For now we know what causes the issue and will actively avoid it.

-greg

Hi,

This commit fixes this bug: https://github.com/SchedMD/slurm/commit/f0635ad40a7

It will be available in 23.02.5 and above. Let me know if we can close this issue or if you have any additional questions.

Dominik

Dominik,

You may close this case.

-greg

Created attachment 31540
SEGFAULT information from coredump of slurmctld

With this node definition in nodes.conf:

NodeName=gpu202-16-l CoresPerSocket=64 CpuSpecList=1,2 Features=4gpus,a100,amd,cpu_amd_epyc_7702,el7,gpu,gpu_a100,ibex2019,local_200G,local_400G,local_500G,local_950G,milan,nolmem Gres=gpu:a100:4 RealMemory=508928 Sockets=1 ThreadsPerCore=1 Weight=12160

If "CpuSpecList=1,2" is removed and slurmctld is restarted, it will SEGFAULT when node "gpu202-16-l" tries to register with slurmctld.

See the attached file.

-greg
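
For sites still on a release before 23.02.5 that want to avoid the trigger described above, a pre-restart check along the following lines might help. This is only a rough sketch, not part of Slurm: the function names `parse_node_lines` and `nodes_losing_cpuspeclist` are hypothetical, and it assumes each node is defined on a single `NodeName=` line with a plain node name (no hostlist ranges such as `gpu[01-04]`, no `Include` files, no line continuations, no quoted values containing spaces). It compares the old and new configuration text and lists nodes whose `CpuSpecList=` option would disappear.

```python
def parse_node_lines(conf_text):
    """Map each node name to its Key=Value options (keys lowercased).

    Assumption: one plain node name per NodeName= line; hostlist
    expressions and Include directives are NOT expanded.
    """
    nodes = {}
    for raw in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line.lower().startswith("nodename="):
            continue
        opts = {}
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep:
                opts[key.lower()] = value
        nodes[opts["nodename"]] = opts
    return nodes


def nodes_losing_cpuspeclist(old_text, new_text):
    """Return sorted names of nodes that had CpuSpecList before but not after."""
    old = parse_node_lines(old_text)
    new = parse_node_lines(new_text)
    return sorted(
        name
        for name, opts in old.items()
        if "cpuspeclist" in opts
        and name in new
        and "cpuspeclist" not in new[name]
    )


if __name__ == "__main__":
    old_conf = "NodeName=gpu202-16-l CoresPerSocket=64 CpuSpecList=1,2 Sockets=1\n"
    new_conf = "NodeName=gpu202-16-l CoresPerSocket=64 Sockets=1\n"
    for name in nodes_losing_cpuspeclist(old_conf, new_conf):
        print(f"warning: CpuSpecList removed from {name}")
```

Any node reported by such a check matches the configuration change that led to the crash in this report; what to do about it (keep the option, or upgrade first) is left to the site.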