Hi,

One of our users is seeing an error similar to the following for many of their jobs. Am I correct in interpreting this as an out-of-memory error, or is something else happening? Usually we see OOM errors show up in dmesg, but I'm not seeing anything in dmesg on the affected nodes. For reference, we have cgroups enabled for memory, cores, and GPUs. I also checked and don't see any core dumps in /var/log/slurm/ or /var/spool/slurmd on the affected nodes.

/var/spool/slurmd/job6006161/slurm_script: line 17: 21119 Bus error (core dumped)

Thanks!
Alex
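The checks above can be sketched as a short shell snippet. The cgroup path is an assumption based on Slurm's cgroup v1 layout (/sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>) and may differ on other systems:

```shell
# Look for kernel OOM-killer activity the way we normally would:
OOM_PATTERN='out of memory|oom-kill|killed process'
dmesg | grep -iE "$OOM_PATTERN" || echo "no OOM lines in dmesg"

# With memory cgroups enabled, the per-job cgroup also records OOM
# events even when nothing obvious reaches dmesg (path is an
# assumption; adjust for your cgroup layout):
for f in /sys/fs/cgroup/memory/slurm/uid_*/job_6006161/memory.oom_control; do
    [ -r "$f" ] && { echo "$f:"; cat "$f"; } || :
done
```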
Alex - can you send us the slurmd.log, syslog, and the output of dmesg from that system?
Created attachment 17239 [details] syslog
Created attachment 17240 [details] dmesg
Created attachment 17241 [details] slurmd.log
I've uploaded the requested files. For reference, the job in question ran from 2020-12-20T14:25:34 to 2020-12-20T14:27:30.
Hi Alex,

I'm not seeing anything too telling in these logs as far as the bus error goes. It may be helpful to run the logs at least at the "debug" level to see a little more about what is going on here from the Slurm perspective.

Where is this bus error showing up? Is it appearing in the output of the user's job? Bus errors usually indicate a bad memory access rather than a failure to allocate memory.

Thanks!
--Tim
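As a quick illustration of that distinction (a sketch, assuming a Linux node where SIGBUS is signal 7): bash's "Bus error (core dumped)" line means the child process died from SIGBUS, which a shell reports as exit status 128 + 7 = 135, whereas a process killed by the kernel OOM killer dies from SIGKILL (signal 9, exit status 137):

```shell
# Signal 7 on Linux is SIGBUS (bad access); signal 9 is SIGKILL,
# which is what the kernel OOM killer sends.
kill -l 7 | grep -qi BUS && echo "signal 7 is SIGBUS"

# A shell reports a child killed by signal N as exit status 128+N,
# so a bus error surfaces as 135:
status=0
sh -c 'kill -s BUS $$' || status=$?
echo "exit status: $status"    # 135 = 128 + 7
```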
Hi Alex! I just wanted to check in and see if you were able to get some of the additional logs! Thanks! --Tim
Hi Tim,

The error was showing up in the user's job output, but we haven't been able to reproduce it since asking them to allocate more memory to their jobs. If that changes I can open another ticket, but for now I think we're good to close this one. Thanks for your help!

Thanks,
Alex