| Summary: | Got segfault on Slurmstepd | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Will Dennis <wdennis> |
| Component: | slurmstepd | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | bart |
| Version: | 21.08.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NEC Labs | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | ml-gpu11_slurmd.log, ml-gpu10_slurmd.log | ||
|
Description
Will Dennis
2022-04-05 17:38:48 MDT
From sacct output, believe it was this job:

JobID|Start|JobName|User|Partition|NodeList|AllocCPUS|ReqMem|CPUTime|QOS|End|State|ExitCode|AllocTRES|
225|2022-04-05T09:54:05|frame-train|dpatel|gpu_scav|None assigned|96|480G|00:00:00|normal|2022-04-05T09:54:05|CANCELLED by 8701|0:0||

Hi

Do you have core dump files from these crashes? If yes, could you load them into gdb and generate backtraces, e.g.:

gdb <slurmstepd_path> <core_file> -batch -ex 't a a bt'

Could you send me slurmd.log from the affected nodes?

Dominik

Created attachment 24265 [details]
ml-gpu11_slurmd.log

Sorry, we can find no "core" file resulting from the crash... Attached are the slurmd.log files from the affected nodes.

(In reply to comment 2 from Dominik Bartkiewicz, sent Wednesday, April 6, 2022 5:02 AM)

Created attachment 24266 [details]
ml-gpu10_slurmd.log
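
The backtrace request earlier in this thread can be scripted if core files do turn up later. This is a minimal sketch, assuming `gdb` is on `PATH`; the helper names `gdb_backtrace_cmd` and `collect_backtrace` are hypothetical and not part of Slurm, and `t a a bt` is gdb shorthand for `thread apply all bt`.

```python
import subprocess
from typing import List


def gdb_backtrace_cmd(binary: str, core_file: str) -> List[str]:
    """Build the batch-mode gdb command quoted in this ticket."""
    # 't a a bt' is gdb shorthand for 'thread apply all bt'.
    return ["gdb", binary, core_file, "-batch", "-ex", "t a a bt"]


def collect_backtrace(binary: str, core_file: str) -> str:
    """Run gdb and return everything it printed (stdout and stderr merged)."""
    result = subprocess.run(
        gdb_backtrace_cmd(binary, core_file),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,  # gdb may exit non-zero on a truncated core; keep its output anyway
    )
    return result.stdout
```

Calling `collect_backtrace()` with the slurmstepd path and a core file returns text that can be pasted into the ticket as-is.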
Hi

Could you send me output from:

addr2line -e <path to libslurmfull.so> 179ED8 -f

Dominik

root@ml-gpu10:~# addr2line -e /opt/slurm/21.08.5/lib/slurm/libslurmfull.so 179ED8 -f
slurm_xfree
/usr/src/slurm/slurm-21.08.5/src/common/xmalloc.c:212

(In reply to comment 7 from Dominik Bartkiewicz, dated Friday, April 8, 2022 at 12:32 PM)

Hi

I am afraid that without new data I am unable to track this issue. I don't see anything specific in the logs, and xfree is used too often to give any clue.

Dominik

OK, we’ll see if it happens again, go ahead and close I guess… I’ll reference this ticket on a new ticket if we see this happen again.

Hi

I will close this ticket for now. I suspect that this is related to bug 12969 and this commit should fix it:
https://github.com/SchedMD/slurm/commit/7b1cf827526c
Please re-open if this issue occurs again.

Dominik |
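
The `addr2line -f` lookup used in this ticket prints the function name on one line and `file:line` on the next. The following is a minimal sketch of turning that output into a structured record, for example when collecting several crash addresses; `SourceLocation` and `parse_addr2line` are hypothetical names introduced here for illustration.

```python
from typing import NamedTuple, Optional


class SourceLocation(NamedTuple):
    function: str
    path: str
    line: Optional[int]


def parse_addr2line(output: str) -> SourceLocation:
    """Parse the two-line output of `addr2line -f` for a single address."""
    lines = output.strip().splitlines()
    if len(lines) != 2:
        raise ValueError("expected a function line and a file:line line")
    function = lines[0].strip()
    # Split on the last ':' so that colons inside the path are preserved.
    path, _, tail = lines[1].strip().rpartition(":")
    # The line field may be followed by extra text such as "(discriminator 1)",
    # or be "?" when the location is unknown.
    token = tail.split()[0] if tail.split() else ""
    line = int(token) if token.isdigit() else None
    return SourceLocation(function, path, line)
```

Applied to the output pasted in this ticket, the function name is `slurm_xfree` and the location is line 212 of `xmalloc.c`.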