Created attachment 9274 [details] all errors in slurmctld on 23 Feb

For the last two months, our slurmctld has kept crashing, almost every few days, and sometimes many times per day. On 23 Feb it crashed 6 times. To avoid an outage, we set up a script to automatically restart slurmctld when this happens. No core dump was created. Here is a record of the crashes on that day, and attached is a file of the errors logged in slurmctld.log:

Feb 23 00:43:42 transom1 slurmctld: restarted
Feb 23 04:02:12 transom1 slurmctld: restarted
Feb 23 06:31:17 transom1 slurmctld: restarted
Feb 23 12:22:56 transom1 slurmctld: restarted
Feb 23 19:23:20 transom1 slurmctld: restarted
Feb 23 20:19:23 transom1 slurmctld: restarted

I drained the non-responding nodes this morning and hope that helps. I am new to Slurm, so please include detailed troubleshooting steps in your response. I appreciate it.

Thanks,
Wei
Created attachment 9275 [details] core dump of slurmctld on 5 Jan
Created attachment 9276 [details] core dump of slurmctld on 2 Jan
Created attachment 9277 [details] another core dump of slurmctld on 2 Jan
Created attachment 9278 [details] core dump of slurmctld on 1 Jan
Hi

Without the binaries, the core files are useless. Could you generate a backtrace on the slurmctld host? e.g.:

gdb -batch -ex "thread apply all bt full" <slurmctld path> <coredump>

Dominik
Hi Dominik,

I did attach 4 core dump files, though they are not from the latest crashes. Did you get a chance to check them out?

Thanks,
Wei
Hi Wei,

Dominik meant that the core files on their own are useless to us since, in most cases, we cannot get a good backtrace from them. This is in part due to the OS and library differences between our systems. Please generate a backtrace on the slurmctld host as mentioned in comment #5, e.g.:

gdb -batch -ex "thread apply all bt full" <slurmctld path> <coredump>

It would also be nice to see backtraces from a few of your core files so that we can determine if they are all crashing in the same place.

-Jason
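For reference, generating backtraces for several core files can be scripted in one pass. This is only a sketch: the SLURMCTLD and COREDIR paths are assumptions and need adjusting to the actual installation.

```shell
#!/bin/sh
# Sketch: produce a full backtrace file for every core dump found.
# SLURMCTLD and COREDIR are assumed paths -- adjust to your site.
SLURMCTLD=/apps/slurm/latest/sbin/slurmctld
COREDIR=/var/spool/slurm

for core in "$COREDIR"/core.*; do
    [ -e "$core" ] || continue          # no core files present; skip
    gdb -batch -ex "thread apply all bt full" \
        "$SLURMCTLD" "$core" > "$core.bt.txt" 2>&1
done
```

Each resulting `*.bt.txt` file can then be attached to the ticket directly, which avoids shipping the (binary-dependent) core files themselves.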
Created attachment 9295 [details] core dump on 1 Jan
Created attachment 9296 [details] core dump on 5 Jan
Created attachment 9297 [details] core dump on 2 Jan
Created attachment 9298 [details] another core dump on 2 Jan
Hi Dominik/Jason,

Thanks for your replies. Sorry, I misunderstood earlier. Please see the attached core dump files.

Thanks,
Wei
Hi Wei, Would you mind also attaching your slurm.conf? Thanks, Michael
Created attachment 9299 [details] slurm.conf
Also be aware that the core dump files from 1 and 2 Jan are from Slurm v18.08.3, while the 5 Jan one is from v18.08.4. Currently we are running v18.08.5.
Hi Wei,

We'll take a look at those old core dump stack traces. In the meantime, could you get a stack trace from a recent core dump file produced on 18.08.5, and give us the slurmctld.log, syslog, and output of dmesg as well? That would help us a lot.

One thing to note with your four old stack traces is that one of them crashed in a different way than the other three. So there could be more than one issue.

Thanks,
Michael
Hi Michael,

No core dump files have been created since 5 Jan, which is why I uploaded those files from Jan. I assume the cause is the same, since we haven't done anything to fix it apart from 2 upgrades. I noticed that one of them is due to signal 11 instead of 6. I will try to collect the other logs for you.

Thanks,
Wei
Created attachment 9306 [details] system log and slurmctld log for the crash days (1/2/5 Jan and 23 Feb), plus current dmesg output

Please see the attached system log and slurmctld log covering the crash days (1/2/5 Jan and 23 Feb), as well as the current dmesg output.

Thanks,
Wei
Wei,

I'm confused why you said "on 23 Feb it crashed 6 times." slurmctld-20190223.log shows that slurmctld is being sent a SIGTERM and then starting again. It's not aborting or segfaulting here. Maybe your script is restarting slurmctld prematurely.

As for the core dumps: can you run these GDB commands for me?

gdb -batch -ex "frame 4" -ex "print item" -ex "frame 5" -ex "print job_entry" <slurmctld path> <gdb.173981 path>
gdb -batch -ex "frame 4" -ex "print item" -ex "frame 5" -ex "print job_entry" <slurmctld path> <gdb.258485 path>
gdb -batch -ex "frame 4" -ex "print item" -ex "frame 5" -ex "print job_entry" <slurmctld path> <gdb.221758 path>

Of course, you can interactively run the "frame" and "print" commands while in gdb, if you prefer. I just need to know what item and job_entry look like for each of the aborts.

As for the segfault in gdb.282452... this is still a mystery. It's segfaulting inside libc's malloc_consolidate() function. Most of the things I've read about this kind of error indicate that there was probably a buffer overrun in memory somewhere, corrupting memory management structures used by libc. It's possible that Slurm's to blame, or maybe even your lua job submit script or something else. It's hard to say without duplicating it. Have you changed your job submit plugin recently?

See
* https://stackoverflow.com/questions/6725164/segmentation-fault-in-malloc-consolidate-malloc-c-that-valgrind-doesnt-detect
* https://stackoverflow.com/questions/14820533/malloc-causing-segmentation-fault-by-int-malloc
* https://stackoverflow.com/questions/15915920/segfault-of-malloc-consolidate-from-gethostbyname
* https://stackoverflow.com/questions/3100193/segfaults-in-malloc-and-malloc-consolidate

Thanks,
Michael
Created attachment 9346 [details] gdb result for three core dump files
Here is the script that restarts our slurmctld. Basically it just starts the daemon once it finds it is gone.

====================
#!/bin/sh

while true; do
    p=`pgrep slurmctld`
    if [ "$p" = "" ]; then
        /apps/slurm/latest/sbin/slurmctld < /dev/null
        echo restarted
        echo restarted | logger -t slurmctld
    fi
    sleep 30
done
======================
Sorry, I told you the wrong commands, so the gdb output that you sent is useless. Could you do this instead?

gdb <slurmctld path> <gdb.173918 path>
frame 4
print *item
print **item
frame 5
print *job_entry

The * and ** should properly dereference the pointers and print the contents. Then type `exit` to close out gdb. Could you repeat this for gdb.258485 and gdb.221758 as well? Thanks.

Feel free to paste the text here in Bugzilla; it's easier for me than trying to unpack an attached file.
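The same session can also be run non-interactively with gdb's batch mode. A sketch: the slurmctld path is taken from the restart script quoted earlier in this report, and the core file names are the ones mentioned in this comment; both may need adjusting.

```shell
#!/bin/sh
# Non-interactive equivalent of the interactive gdb session above.
# SLURMCTLD is an assumed path; the loop skips any core file that
# is not present in the current directory.
SLURMCTLD=/apps/slurm/latest/sbin/slurmctld
for core in gdb.173918 gdb.258485 gdb.221758; do
    [ -e "$core" ] || continue
    gdb -batch \
        -ex "frame 4" -ex "print *item" -ex "print **item" \
        -ex "frame 5" -ex "print *job_entry" \
        "$SLURMCTLD" "$core"
done
```

The output of each run can then be pasted straight into the ticket.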
(In reply to whong from comment #22)
> Here is the script to restart our slurmctld. basically just start the daemon
> once find it is gone.
>
> ====================
> #!/bin/sh
>
> while true; do
> p=`pgrep slurmctld`
> if [ "$p" = "" ]; then
> /apps/slurm/latest/sbin/slurmctld < /dev/null
> echo restarted
> echo restarted | logger -t slurmctld
> sleep 30
> echo
> ======================

Looking at messages-20190223, I see this:

Feb 23 06:31:16 transom1 systemd: Stopping Slurm controller daemon...
Feb 23 06:31:17 transom1 slurmctld: restarted
Feb 23 06:31:17 transom1 systemd: Starting Slurm controller daemon...

I'm assuming systemd is the one stopping the slurmctld. Then your script senses this and tries to restart slurmctld, when I think systemd is already trying to do that. That's probably what's causing this line:

Feb 23 06:31:17 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.

At any rate, isn't the whole point of using systemd to help monitor processes in case they crash? I would recommend getting rid of your bash script and instead use systemd to monitor slurmctld. You probably want to add `Restart=always` to the slurmctld systemd service file that you apparently already have somewhere. See https://www.digitalocean.com/community/tutorials/how-to-configure-a-linux-service-to-start-automatically-after-a-crash-or-reboot-part-1-practical-examples#auto-starting-services-with-systemd.
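A minimal sketch of that change as a systemd drop-in, assuming the unit is named slurmctld.service (the drop-in path and the RestartSec value are illustrative, not taken from this site's config):

```ini
# /etc/systemd/system/slurmctld.service.d/override.conf  (assumed path)
[Service]
Restart=always
RestartSec=5
```

After creating the file, `systemctl daemon-reload` followed by `systemctl restart slurmctld` should apply it, and systemd will then restart the daemon itself whenever it exits.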
Created attachment 9350 [details] new gdb for those 3 core dump files as requested
(In reply to whong from comment #25)
> Created attachment 9350 [details]
> new gdb for those 3 core dump files as requested

Thanks! So, looking at these three stack traces, the aborts are happening within libc, during some kind of free (_int_free()). This is actually a very similar situation to the segfault stack trace, except that libc is able to abort before segfaulting. This issue is also likely caused by memory corruption happening somewhere else. See https://stackoverflow.com/questions/2334352/why-do-i-get-a-sigabrt-here

You could run the slurmctld under valgrind and hope that it detects when memory initially gets corrupted, so you can find out what code is responsible. I'm not really sure what else can be done, especially if you haven't seen this problem in a month. For all we know, the issue could have been fixed in 18.08.5.

As for the recent (Feb 23) slurmctld outages, do you agree that it is being restarted by systemd and isn't really crashing?
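For what it's worth, the valgrind run might look something like this sketch. The slurmctld path is taken from the restart script quoted earlier in this report; the valgrind options are standard memcheck flags, and slurmctld's -D option keeps the daemon in the foreground so valgrind can supervise it.

```shell
#!/bin/sh
# Sketch: run slurmctld in the foreground under valgrind so the first
# invalid write is reported where it happens. Guarded so the script is
# a no-op when valgrind or slurmctld is not installed.
SLURMCTLD=/apps/slurm/latest/sbin/slurmctld
if command -v valgrind >/dev/null 2>&1 && [ -x "$SLURMCTLD" ]; then
    valgrind --tool=memcheck --leak-check=full --track-origins=yes \
        --log-file=/tmp/slurmctld-valgrind.%p.log \
        "$SLURMCTLD" -D
fi
```

Note that slurmctld runs much slower under valgrind, so this is best tried during a quiet period; the log file (one per PID) is what would be worth attaching here.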
Hi Michael,

Yes, we use a script to manage a crashed slurmctld because all our partitions are set not to start by default, so we need to manually enable each partition after slurmctld is up. I agree that the restart events mentioned in the log were triggered by systemd, though I am not aware of anyone doing this at 6:31 am, which is quite early. The script is not perfect, but my colleague doesn't feel safe setting those partitions to be up by default.

Also, I don't think all of the restart events logged by the script were caused by systemd.

Thanks,
Wei
(In reply to whong from comment #27)
> I agree that the mentioned restarted event in the log is triggered by
> systemd. Though I am not aware of anyone did this at 6:31am, which is quite
> early. The script is not perfect, but my colleague don't feel safe to set
> those partitions to be on by default.
>
> Plus I don't think all those script triggered restarted event is caused by
> systemd.

I think systemd is causing *ALL* the restarts, not just the one I posted. messages-20190223 shows that every restart event had something to do with systemd. The point is, Slurm isn't crashing, as far as I can tell. If it crashed (segfault or abort), then there would have been a core dump file and maybe some errors in the logs.

My advice would be to turn off the systemd slurmctld service file and your custom restart script to prove that Slurm is really crashing. It's possible that your restart script and systemd are negatively interfering with each other, since they are both apparently trying to restart Slurm at the same time. Until you figure that out, I don't think there is much else I can do to help.

Thanks,
-Michael

messages-20190223
=======================================
Feb 22 17:02:26 transom1 systemd: Stopping Slurm controller daemon...
Feb 22 17:02:26 transom1 slurmctld: restarted
Feb 22 17:02:27 transom1 systemd: Starting Slurm controller daemon...
Feb 22 17:02:27 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 22 17:02:27 transom1 systemd: Started Slurm controller daemon.
Feb 22 18:43:39 transom1 systemd: Stopping Slurm controller daemon...
Feb 22 18:43:39 transom1 slurmctld: restarted
Feb 22 18:43:39 transom1 systemd: Starting Slurm controller daemon...
Feb 22 18:43:39 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 22 18:43:39 transom1 systemd: Started Slurm controller daemon.
Feb 22 22:30:29 transom1 slurmctld: restarted
Feb 22 22:30:29 transom1 systemd: Stopping Slurm controller daemon...
Feb 22 22:30:29 transom1 systemd: Starting Slurm controller daemon...
Feb 22 22:30:29 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 22 22:30:29 transom1 systemd: Started Slurm controller daemon.
Feb 23 00:43:42 transom1 slurmctld: restarted
Feb 23 00:43:42 transom1 systemd: Stopping Slurm controller daemon...
Feb 23 00:43:43 transom1 systemd: Starting Slurm controller daemon...
Feb 23 00:43:43 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 23 00:43:43 transom1 systemd: Started Slurm controller daemon.
Feb 23 04:02:12 transom1 systemd: Stopping Slurm controller daemon...
Feb 23 04:02:12 transom1 slurmctld: restarted
Feb 23 04:02:12 transom1 systemd: Starting Slurm controller daemon...
Feb 23 04:02:12 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 23 04:02:12 transom1 systemd: Started Slurm controller daemon.
Feb 23 06:31:16 transom1 systemd: Stopping Slurm controller daemon...
Feb 23 06:31:17 transom1 slurmctld: restarted
Feb 23 06:31:17 transom1 systemd: Starting Slurm controller daemon...
Feb 23 06:31:17 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 23 06:31:17 transom1 systemd: Started Slurm controller daemon.
Feb 23 12:22:56 transom1 systemd: Stopping Slurm controller daemon...
Feb 23 12:22:56 transom1 slurmctld: restarted
Feb 23 12:22:56 transom1 systemd: Starting Slurm controller daemon...
Feb 23 12:22:56 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 23 12:22:56 transom1 systemd: Started Slurm controller daemon.
Feb 23 19:23:20 transom1 systemd: Stopping Slurm controller daemon...
Feb 23 19:23:20 transom1 slurmctld: restarted
Feb 23 19:23:20 transom1 systemd: Starting Slurm controller daemon...
Feb 23 19:23:20 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 23 19:23:20 transom1 systemd: Started Slurm controller daemon.
Feb 23 20:19:23 transom1 systemd: Stopping Slurm controller daemon...
Feb 23 20:19:23 transom1 slurmctld: restarted
Feb 23 20:19:23 transom1 systemd: Starting Slurm controller daemon...
Feb 23 20:19:23 transom1 systemd: PID file /var/run/slurm/slurmctld.pid not readable (yet?) after start.
Feb 23 20:19:23 transom1 systemd: Started Slurm controller daemon.
Hi Michael,

Understood. You may close this ticket. We will modify our script and continue monitoring the system.

Thanks for all the help.

Kind regards,
Wei
(In reply to whong from comment #29) > Hi Michael, > > Understood. You may close this job. We will modify our script & continue > monitor the system. > > Thanks for all the help. > > Kind regards, > Wei No problem. Feel free to reopen this if anything changes or you find new information. Thanks, Michael