Ticket 10496

Summary: Bus error (core dumped)
Product: Slurm Reporter: Alex Mamach <alex.mamach>
Component: slurmd Assignee: Tim McMullan <mcmullan>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: bart
Version: 20.02.6   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=10492
Site: Northwestern Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: syslog
dmesg
slurmd.log

Description Alex Mamach 2020-12-21 13:21:20 MST
Hi,

One of our users is seeing an error similar to the following for many of their jobs. Am I correct in interpreting this as an out-of-memory error, or is something else happening? Usually we see OOM errors show up in dmesg, but I'm not seeing anything in dmesg on the affected nodes. For reference, we have cgroups enabled for memory, cores, and GPUs.

I also checked and don't see any core dumps in /var/log/slurm/ or /var/spool/slurmd on the affected nodes.

/var/spool/slurmd/job6006161/slurm_script: line 17: 21119 Bus error               (core dumped)

Thanks!

Alex
Comment 1 Jason Booth 2020-12-21 13:38:14 MST
Alex - can you send us the slurmd.log, syslog, and the output of dmesg from that system?
Comment 2 Alex Mamach 2020-12-21 13:53:22 MST
Created attachment 17239 [details]
syslog
Comment 3 Alex Mamach 2020-12-21 13:53:34 MST
Created attachment 17240 [details]
dmesg
Comment 4 Alex Mamach 2020-12-21 13:54:31 MST
Created attachment 17241 [details]
slurmd.log
Comment 5 Alex Mamach 2020-12-21 13:56:30 MST
I've uploaded the requested files. For reference, the job in question ran from 2020-12-20T14:25:34 to 2020-12-20T14:27:30.
Comment 8 Tim McMullan 2020-12-22 11:00:50 MST
Hi Alex,

I'm not seeing anything too telling in these logs as far as the bus error goes. It may be helpful to raise the log level to at least "debug" to see a little more info about what is going on here from the Slurm perspective.

Where is this bus error showing up? Is it in the output of the user's job? Bus errors usually point to an invalid memory access rather than a failure to allocate memory.
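
To illustrate the distinction, here is a minimal Python sketch of how a signal-terminated process can be told apart from an ordinary failure. It assumes the job's exit status follows the usual shell convention of 128 plus the signal number; the helper name describe_exit_status is hypothetical and not part of Slurm.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper, not a Slurm API)."""
    if status > 128:
        # Shells report "killed by signal N" as exit status 128 + N.
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"unknown signal {signum}"
        return f"killed by {name}"
    return f"exited with code {status}"


# A bus error is delivered as SIGBUS, which is a different signal from the
# SIGKILL that the kernel OOM killer sends when a memory limit is exceeded.
print(describe_exit_status(128 + signal.SIGBUS))
print(describe_exit_status(128 + signal.SIGKILL))
```

A status that decodes to SIGBUS suggests an invalid access (for example, touching a memory-mapped page that has no backing storage), whereas an OOM kill would normally decode to SIGKILL and leave a message in dmesg.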

Thanks!
--Tim
Comment 9 Tim McMullan 2021-01-06 06:57:14 MST
Hi Alex!

I just wanted to check in and see if you were able to get some of the additional logs!

Thanks!
--Tim
Comment 10 Alex Mamach 2021-01-06 11:50:30 MST
Hi Tim,

The error was showing up in the user's job output, but we haven't been able to replicate it since asking them to allocate more memory to their jobs. If that changes I can open another ticket, but for now I think we're good to close this one. Thanks for your help!

Thanks,

Alex