Ticket 13774 - Got segfault on Slurmstepd
Summary: Got segfault on Slurmstepd
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 21.08.5
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-04-05 17:38 MDT by Will Dennis
Modified: 2022-04-12 05:51 MDT
CC: 1 user

See Also:
Site: NEC Labs
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
ml-gpu11_slurmd.log (131.42 KB, application/octet-stream)
2022-04-06 12:04 MDT, Will Dennis
Details
ml-gpu10_slurmd.log (272.54 KB, application/octet-stream)
2022-04-06 12:04 MDT, Will Dennis
Details

Description Will Dennis 2022-04-05 17:38:48 MDT
User reports he scancel'd a job, and then noticed nodes went into "DRAIN" state. `sinfo -R` indicated "Kill job failed", and when I looked at the dmesg output on the nodes, both had errors like this:

(node ml-gpu10)
[Apr 5 09:40] slurmstepd[45784]: segfault at 60 ip 0000149df7a87ed8 sp 00007ffcaaa4c920 error 6 in libslurmfull.so[149df790e000+1d4000]

(node ml-gpu11)
[Apr 5 09:40] slurmstepd[3663]: segfault at 60 ip 0000151dc28e7ed8 sp 00007ffc709caeb0 error 6 in libslurmfull.so[151dc276e000+1d4000]

I restarted slurmd on the nodes, and set to "resume", nodes seem fine now.
Comment 1 Will Dennis 2022-04-05 17:41:30 MDT
From the sacct output, I believe it was this job:

JobID|Start|JobName|User|Partition|NodeList|AllocCPUS|ReqMem|CPUTime|QOS|End|State|ExitCode|AllocTRES|
225|2022-04-05T09:54:05|frame-train|dpatel|gpu_scav|None assigned|96|480G|00:00:00|normal|2022-04-05T09:54:05|CANCELLED by 8701|0:0||
Comment 2 Dominik Bartkiewicz 2022-04-06 03:01:45 MDT
Hi

Do you have core dump files from these crashes?
If yes, could you load them into gdb and generate backtraces?
e.g.: gdb <slurmstepd_path> <core_file> -batch -ex 't a a bt'

Could you send me slurmd.log from the affected nodes? 

Dominik
Comment 3 Will Dennis 2022-04-06 12:04:27 MDT
Created attachment 24265 [details]
ml-gpu11_slurmd.log

Sorry, we can find no "core" file resulting from the crash...

Attached are the slurmd.log files from the affected nodes.
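
[Editor's note: since no core files were found, core capture would need to be enabled before a recurrence. A minimal sketch follows; the dump directory, core_pattern, and systemd drop-in layout are assumptions to adapt to the site's setup, not part of this ticket.]

```shell
# Choose a writable location for core files and point the kernel at it.
# %e = executable name, %p = PID (see core(5)).
mkdir -p /var/tmp/cores
echo '/var/tmp/cores/core.%e.%p' > /proc/sys/kernel/core_pattern

# slurmstepd inherits limits from slurmd; for a systemd-managed slurmd,
# raise the core limit via a drop-in, e.g.
# /etc/systemd/system/slurmd.service.d/core.conf:
#   [Service]
#   LimitCORE=infinity
# then: systemctl daemon-reload && systemctl restart slurmd
```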

Comment 4 Will Dennis 2022-04-06 12:04:27 MDT
Created attachment 24266 [details]
ml-gpu10_slurmd.log
Comment 7 Dominik Bartkiewicz 2022-04-08 10:30:42 MDT
Hi

Could you send me output from:
addr2line -e <path to libslurmfull.so> 179ED8 -f 

Dominik
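
[Editor's note: the 0x179ED8 offset Dominik asks about is derivable from the dmesg lines in the description: it is the faulting instruction pointer minus the library's load base, and it is identical on both nodes. A quick check, with the libslurmfull.so path being an assumption to adjust per install:]

```shell
# ml-gpu10: segfault ... ip 0000149df7a87ed8 ... in libslurmfull.so[149df790e000+1d4000]
printf '0x%x\n' $(( 0x149df7a87ed8 - 0x149df790e000 ))   # -> 0x179ed8

# ml-gpu11: segfault ... ip 0000151dc28e7ed8 ... in libslurmfull.so[151dc276e000+1d4000]
printf '0x%x\n' $(( 0x151dc28e7ed8 - 0x151dc276e000 ))   # -> 0x179ed8

# Resolve the offset to a function and source line (path is an assumption):
# addr2line -e /usr/lib64/slurm/libslurmfull.so -f 0x179ED8
```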
Comment 9 Dominik Bartkiewicz 2022-04-08 11:21:34 MDT
Hi

I am afraid that without new data I am unable to track this issue down. I don't see anything specific in the logs, and xfree is called too often to give any clue.

Dominik
Comment 10 Will Dennis 2022-04-08 11:43:43 MDT
OK, we’ll see if it happens again, go ahead and close I guess…

I’ll reference this ticket on a new ticket if we see this happen again.
Comment 11 Dominik Bartkiewicz 2022-04-12 05:51:26 MDT
Hi

I will close this ticket for now.
I suspect that this is related to bug 12969 and this commit should fix it: https://github.com/SchedMD/slurm/commit/7b1cf827526c
Please re-open if this issue occurs again.

Dominik