User reports: he scancel'd a job, and then noticed the nodes went into "DRAIN" state. `sinfo -R` showed the reason "Kill job failed", and when I looked at dmesg output on the nodes, both had segfault errors like these:

(node ml-gpu10)
[Apr 5 09:40] slurmstepd[45784]: segfault at 60 ip 0000149df7a87ed8 sp 00007ffcaaa4c920 error 6 in libslurmfull.so[149df790e000+1d4000]

(node ml-gpu11)
[Apr 5 09:40] slurmstepd[3663]: segfault at 60 ip 0000151dc28e7ed8 sp 00007ffc709caeb0 error 6 in libslurmfull.so[151dc276e000+1d4000]

I restarted slurmd on both nodes and set them to "resume"; the nodes seem fine now.
From sacct output, I believe it was this job:

JobID|Start|JobName|User|Partition|NodeList|AllocCPUS|ReqMem|CPUTime|QOS|End|State|ExitCode|AllocTRES
225|2022-04-05T09:54:05|frame-train|dpatel|gpu_scav|None assigned|96|480G|00:00:00|normal|2022-04-05T09:54:05|CANCELLED by 8701|0:0|
Hi,

Do you have core dump files from these crashes? If so, could you load them into gdb and generate backtraces, e.g.:

gdb <slurmstepd_path> <core_file> -batch -ex 't a a bt'

Could you also send me slurmd.log from the affected nodes?

Dominik
Created attachment 24265 [details]
ml-gpu11_slurmd.log

Sorry, we can find no "core" file resulting from the crash... Attached are the slurmd.log files from the affected nodes.
Created attachment 24266 [details]
ml-gpu10_slurmd.log
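Since no core file was found, a common follow-up is to make sure slurmd/slurmstepd are allowed to write core dumps at all. A minimal sketch (the override path and unit name below are the usual systemd defaults and may differ on a given system; this is general Linux practice, not advice from this ticket):

```shell
# Raise the core-size limit for the slurmd service on systemd systems,
# e.g. in /etc/systemd/system/slurmd.service.d/override.conf:
#   [Service]
#   LimitCORE=infinity
# then apply it with:
#   systemctl daemon-reload && systemctl restart slurmd

# Check where the kernel writes cores. A plain pattern like "core" lands
# in the crashing process's working directory; a pipe pattern ("|...")
# hands the dump to a helper such as systemd-coredump, whose captures can
# be inspected later with coredumpctl:
sysctl kernel.core_pattern

# For daemons started manually from a shell, the per-shell limit applies:
ulimit -c unlimited
```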
Hi,

Could you send me the output from:

addr2line -e <path to libslurmfull.so> 179ED8 -f

Dominik
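For context, the 179ED8 offset comes from the dmesg lines earlier in the thread: subtract the library's load base (the first number in `libslurmfull.so[base+size]`) from the faulting instruction pointer. This is standard address arithmetic, not anything Slurm-specific; shell arithmetic is enough to check it:

```shell
# ml-gpu10: ip 0x149df7a87ed8, libslurmfull.so mapped at 0x149df790e000
printf '%X\n' $((0x149df7a87ed8 - 0x149df790e000))   # prints 179ED8

# ml-gpu11: a different load address, but the same offset,
# i.e. both nodes crashed at the same instruction in the library
printf '%X\n' $((0x151dc28e7ed8 - 0x151dc276e000))   # prints 179ED8
```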
root@ml-gpu10:~# addr2line -e /opt/slurm/21.08.5/lib/slurm/libslurmfull.so 179ED8 -f
slurm_xfree
/usr/src/slurm/slurm-21.08.5/src/common/xmalloc.c:212
Hi,

I am afraid that without new data I am unable to track down this issue. I don't see anything specific in the logs, and xfree is called too often to give any clue.

Dominik
OK, we’ll see if it happens again; go ahead and close, I guess… I’ll reference this ticket in a new ticket if we see this happen again.
Hi,

I will close this ticket for now. I suspect that this is related to bug 12969, and this commit should fix it:

https://github.com/SchedMD/slurm/commit/7b1cf827526c

Please re-open if this issue occurs again.

Dominik