Created attachment 14192 [details] slurmdbd service run in foreground mssgs - 1 We had a power-related system crash of the Slurm controller (and nodes) at one of our sites, and upon restarting the systems, Slurm will not stay up and running due to slurmdbd segfaulting. I can restart the service, but it goes down again within minutes with segfaults. Please assist me with getting our Slurm system functional again.
Created attachment 14193 [details] slurmdbd service run in foreground mssgs - 2
(In reply to Will Dennis from comment #0) > Slurm will not stay up and > running, due to slurmdbd segfaulting Is slurmctld generating a core dump? If so, please load it into gdb and call the following: > gdb $(which slurmctld) $PATH_TO_CORE > set pagination off > set print pretty on > t a a bt full If it is not generating a core, please start slurmctld using gdb to catch the crash: > gdb --args slurmctld -Dvvvvvvvvvv > b fatal > r wait for it to stop and then: > set pagination off > set print pretty on > t a a bt full Please also attach latest slurm.conf (& friends) to this ticket. Please also call the following: > slurmctld -V
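If the daemon is not leaving a core file behind, it is worth confirming the system will actually write one before reproducing the crash. A quick sanity check (generic Linux commands, not Slurm-specific; defaults vary by distro):

```shell
# Confirm a core file can actually be written before reproducing the crash.
ulimit -c unlimited                 # raise the soft core-size limit for this shell
ulimit -c                           # should now print "unlimited"
cat /proc/sys/kernel/core_pattern   # where the kernel will write the core
```

If `core_pattern` starts with a `|` the core is piped to a helper (e.g. apport or systemd-coredump) instead of landing in the working directory, which may be why no file appears.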
Created attachment 14194 [details] slurmdbd_core_gdb_output.txt.txt It is slurmdbd that is segfaulting, not slurmctld – did you mean to have me work with slurmctld? I attached the output of gdb analysis of slurmdbd, if indeed that’s what you wanted… Also attached slurm.conf and slurmdbd.conf from the controller node.
Created attachment 14195 [details] slurm.conf
Created attachment 14196 [details] slurmdbd.conf
Oh, also: root@skycaptain1:~# slurmctld -V slurm-wlm 17.11.7
(In reply to Will Dennis from comment #3) > I attached the output of gdb analysis of slurmdbd, if indeed that’s what you > wanted… Yes, I wanted what is crashing. Please provide: > slurmdbd -V
> Reading symbols from /usr/sbin/slurmdbd...(no debugging symbols found)...done. Is this an Ubuntu universe installed deb? If so please install slurm-wlm-basic-plugins-dbg and any other debug debs related and pull the back trace again. > /usr/lib/x86_64-linux-gnu/slurm-wlm/accounting_storage_mysql.so The backtrace is missing most of the info.
We use an Ubuntu PPA of Slurm 17.11 published here: https://launchpad.net/~jonathonf/+archive/ubuntu/slurm Unfortunately, there are no “dbg” packages present in the repo to install… https://launchpad.net/~jonathonf/+archive/ubuntu/slurm/+packages I could provide you the core file if that would be helpful.
(In reply to Will Dennis from comment #11) > We use an Ubuntu PPA of Slurm 17.11 published here: > https://launchpad.net/~jonathonf/+archive/ubuntu/slurm > > Unfortunately, there are no “dbg” packages present in the repo to install… > https://launchpad.net/~jonathonf/+archive/ubuntu/slurm/+packages Looks like this packager has chosen to call them *dev_*. Please try installing the dev packages and pulling the backtrace again.
Created attachment 14197 [details] slurmdbd core gdb output 2.txt Did the following: root@skycaptain1:~# apt install slurm-wlm-basic-plugins-dev libslurmdb-dev Reading package lists... Done Building dependency tree Reading state information... Done libslurmdb-dev is already the newest version (17.11.7-1~16.04.york0). The following NEW packages will be installed: slurm-wlm-basic-plugins-dev 0 upgraded, 1 newly installed, 0 to remove and 102 not upgraded. Need to get 1,282 kB of archives. After this operation, 7,671 kB of additional disk space will be used. Do you want to continue? [Y/n] y Get:1 http://ppa.launchpad.net/jonathonf/slurm/ubuntu xenial/main amd64 slurm-wlm-basic-plugins-dev amd64 17.11.7-1~16.04.york0 [1,282 kB] Fetched 1,282 kB in 1s (1,114 kB/s) Selecting previously unselected package slurm-wlm-basic-plugins-dev. (Reading database ... 330488 files and directories currently installed.) Preparing to unpack .../slurm-wlm-basic-plugins-dev_17.11.7-1~16.04.york0_amd64.deb ... Unpacking slurm-wlm-basic-plugins-dev (17.11.7-1~16.04.york0) ... Setting up slurm-wlm-basic-plugins-dev (17.11.7-1~16.04.york0) ... 
See attached for the new trace, which I unfortunately do not think is very different from the first one I sent…
(In reply to Will Dennis from comment #13) > See attached for the new trace, which I unfortunately do not think is very > different from the first one I sent… Correct, we will have to change tactics now. We shall need to compile and install the latest Slurm on your controller. We generally advise against sites using RPMs/Debs due to issues like this. Full install procedure is here: https://slurm.schedmd.com/quickstart_admin.html We are going to install Slurm to /usr/local/slurm/17.11.13-2/ and then only run the slurmdbd and slurmctld daemons from it to allow you to upgrade your whole cluster later. Basic steps: 1. cd /usr/local/src/ 2. wget 'https://download.schedmd.com/slurm/slurm-17.11.13-2.tar.bz2' 3. tar -xjvf slurm-17.11.13-2.tar.bz2 4. cd slurm-17.11.13-2 5. ./configure --sysconfdir=/etc/slurm/ 6. make -j 7. sudo make install 8. killall slurmctld slurmdbd #stop all slurm daemons on the controller 9. /usr/local/slurm/17.11.13-2/sbin/slurmdbd -Dvvv #watch to see if it SEGFAULTs 10. /usr/local/slurm/17.11.13-2/sbin/slurmctld -Dvvv #if slurmdbd does not crash, call this in a new window 11. sinfo 12. srun uptime
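For reference, the numbered build steps can be collected into a single script. This is only a sketch of the steps above: `set -e` (an addition here) aborts on the first failed step, and a `--prefix` flag is added so the install actually lands under /usr/local/slurm/17.11.13-2/ as described.

```shell
# The build steps above, collected into one script (sketch only).
# "set -e" stops at the first failure; --prefix directs the install
# to /usr/local/slurm/17.11.13-2/ as stated in the instructions.
cat > build-slurm.sh <<'EOF'
#!/bin/sh
set -e
cd /usr/local/src/
wget 'https://download.schedmd.com/slurm/slurm-17.11.13-2.tar.bz2'
tar -xjvf slurm-17.11.13-2.tar.bz2
cd slurm-17.11.13-2
./configure --sysconfdir=/etc/slurm/ --prefix=/usr/local/slurm/17.11.13-2/
make -j
make install
EOF
sh -n build-slurm.sh && echo "script parses cleanly"
```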
Captured the output of ./configure, and grepped for some strings that may affect the success of the compiled binaries, this is the output: $ grep -i -e warning -e "not found" -e unable slurm-configure-output.txt configure: WARNING: *** mysql_config not found. Evidently no MySQL development libs installed on system. configure: WARNING: unable to locate NUMA memory affinity functions configure: WARNING: unable to locate PAM libraries configure: WARNING: unable to locate json parser library configure: WARNING: unable to locate ofed installation configure: WARNING: Unable to locate HDF5 compilation helper scripts 'h5cc' or 'h5pcc'. configure: WARNING: LZ4 test program build failed. configure: WARNING: unable to locate hwloc installation configure: WARNING: unable to locate pmix installation configure: WARNING: unable to locate freeipmi installation configure: WARNING: unable to locate rrdtool installation configure: WARNING: unable to locate libssh2 installation configure: WARNING: Slurm internal X11 support disabled configure: WARNING: cannot build smap without curses or ncurses library checking for GLIB - version >= 2.7.1... Package glib-2.0 was not found in the pkg-config search path. Package gthread-2.0 was not found in the pkg-config search path. checking for GTK+ - version >= 2.7.1... Package gtk+-2.0 was not found in the pkg-config search path. Package gthread-2.0 was not found in the pkg-config search path. configure: WARNING: cannot build sview without gtk library checking whether this is a Cray XT or XE system... 
Unable to locate Cray APIs (usually in /opt/cray/alpscomm and /opt/cray/job) configure: WARNING: unable to locate DataWarp installation configure: WARNING: unable to locate netloc installation configure: WARNING: unable to locate lua package configure: WARNING: unable to build man page html files without man2html configure: WARNING: configured for readline support, but couldn't find libraries configure: WARNING: could not find working OpenSSL library configure: WARNING: unable to locate/link against libcurl-devel installation Can you advise whether I need to add any packages for successful binary compilation? (again, compiling on Ubuntu 16.04, .deb pkgs)
Will - Most of these can be installed through the repo, and not every one of them is needed. You must determine what your site needs: for example, do you need X11 support in jobs, and are you going to use the accounting database? You could build without these additional packages, but I do advise you to look them over and see which ones you may need to "apt install" to fit your site's needs.
I ended up installing the following packages: freeipmi gcc git hdf5-helpers libcurl4-openssl-dev libfreeipmi-dev libhwloc-dev libjson-c-dev liblua5.3-dev libmariadb-client-lgpl-dev libmysqlclient-dev libncurses5-dev libncursesw5-dev libpam0g-dev libreadline-dev libssh2-1 lua5.3 make ruby ruby-dev Now the grepped output of ./configure stands at: configure: WARNING: unable to locate ofed installation configure: WARNING: Unable to compile HDF5 test program configure: WARNING: LZ4 test program build failed. configure: WARNING: unable to locate pmix installation configure: WARNING: unable to locate freeipmi installation configure: WARNING: unable to locate rrdtool installation configure: WARNING: unable to locate libssh2 installation configure: WARNING: Slurm internal X11 support disabled checking for GLIB - version >= 2.7.1... Package glib-2.0 was not found in the pkg-config search path. Package gthread-2.0 was not found in the pkg-config search path. checking for GTK+ - version >= 2.7.1... Package gtk+-2.0 was not found in the pkg-config search path. Package gthread-2.0 was not found in the pkg-config search path. configure: WARNING: cannot build sview without gtk library checking whether this is a Cray XT or XE system... Unable to locate Cray APIs (usually in /opt/cray/alpscomm and /opt/cray/job) configure: WARNING: unable to locate DataWarp installation configure: WARNING: unable to locate netloc installation configure: WARNING: unable to build man page html files without man2html I don’t see any obvious problems with the above (except maybe HDF5 test program fail? Can you advise if/when this is needed?) 
(In reply to Will Dennis from comment #20) > configure: WARNING: unable to locate ofed installation > configure: WARNING: Unable to compile HDF5 test program > configure: WARNING: LZ4 test program build failed. > configure: WARNING: unable to locate pmix installation > configure: WARNING: unable to locate freeipmi installation > configure: WARNING: unable to locate rrdtool installation > configure: WARNING: unable to locate libssh2 installation > configure: WARNING: Slurm internal X11 support disabled > checking for GLIB - version >= 2.7.1... Package glib-2.0 was not found in > the pkg-config search path. > Package gthread-2.0 was not found in the pkg-config search path. > checking for GTK+ - version >= 2.7.1... Package gtk+-2.0 was not found in > the pkg-config search path. > Package gthread-2.0 was not found in the pkg-config search path. > configure: WARNING: cannot build sview without gtk library > checking whether this is a Cray XT or XE system... Unable to locate Cray > APIs (usually in /opt/cray/alpscomm and /opt/cray/job) > configure: WARNING: unable to locate DataWarp installation > configure: WARNING: unable to locate netloc installation > configure: WARNING: unable to build man page html files without man2html > > I don’t see any obvious problems with the above (except maybe HDF5 test > program fail? Can you advise if/when this is needed?) HDF5 is only needed if you are doing job profiling, which your config does not have enabled. Is 'make -j' working?
No, it unfortunately isn’t…. In file included from sh5util.c:66:0: ../hdf5_api.h:49:18: fatal error: hdf5.h: No such file or directory compilation terminated. Makefile:602: recipe for target 'sh5util.o' failed make[6]: *** [sh5util.o] Error 1 make[6]: *** Waiting for unfinished jobs.... libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../../../.. -I../../../../slurm -I../../../.. -I/usr/include/hdf5/serial -I/include -DNUMA_VERSION1_COMPATIBILITY -g -O2 -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -MT hdf5_api.lo -MD -MP -MF .deps/hdf5_api.Tpo -c hdf5_api.c -fPIC -DPIC -o .libs/hdf5_api.o In file included from hdf5_api.c:53:0: hdf5_api.h:49:18: fatal error: hdf5.h: No such file or directory compilation terminated. Makefile:715: recipe for target 'hdf5_api.lo' failed make[7]: *** [hdf5_api.lo] Error 1 make[7]: Leaving directory '/usr/local/src/slurm-17.11.13-2/src/plugins/acct_gather_profile/hdf5' Makefile:838: recipe for target '../libhdf5_api.la' failed make[6]: *** [../libhdf5_api.la] Error 2 make[6]: Leaving directory '/usr/local/src/slurm-17.11.13-2/src/plugins/acct_gather_profile/hdf5/sh5util' Makefile:734: recipe for target 'all-recursive' failed make[5]: *** [all-recursive] Error 1 make[5]: Leaving directory '/usr/local/src/slurm-17.11.13-2/src/plugins/acct_gather_profile/hdf5' Makefile:540: recipe for target 'all-recursive' failed make[4]: *** [all-recursive] Error 1 make[4]: Leaving directory '/usr/local/src/slurm-17.11.13-2/src/plugins/acct_gather_profile' Makefile:569: recipe for target 'all-recursive' failed make[3]: *** [all-recursive] Error 1 make[3]: Leaving directory '/usr/local/src/slurm-17.11.13-2/src/plugins' Makefile:569: recipe for target 'all-recursive' failed make[2]: *** [all-recursive] Error 1 make[2]: Leaving directory '/usr/local/src/slurm-17.11.13-2/src' Makefile:696: recipe for target 'all-recursive' failed make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory '/usr/local/src/slurm-17.11.13-2' Makefile:595: recipe for target 
'all' failed make: *** [all] Error 2
(In reply to Will Dennis from comment #22) > hdf5_api.h:49:18: fatal error: hdf5.h: No such file or directory > compilation terminated. Please try with this instead: > ./configure --sysconfdir=/etc/slurm/ --prefix=/usr/local/slurm/17.11.13-2/ --with-hdf5=no Please also provide your MySQL/MariaDB version.
OK, the “make -j” worked with those flags set. Running MySQL 5.7.30 on this cluster controller.
Created attachment 14205 [details] macluster-slurmdbd-17.11.13-2-foreground-segfault.txt OK, went ahead and installed, then ran the new 17.11.13-2 slurmdbd service, which came up and did not crash for a few minutes; then when I started the 17.11.13-2 slurmctld, it started and then I saw: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused Going back to the slurmdbd service, I saw it had segfaulted. It did not produce a core file, however… Attached are the foreground messages from the compiled slurmdbd.
(In reply to Will Dennis from comment #25) > Going back to the slurmdbd service, I saw it had segfaulted. It did not > produce a core file, however… Please follow the procedure from comment #2 to use gdb to get a backtrace. > Running MySQL 5.7.30 on this cluster controller. MySQL 5.7.30 has a known bug (per bug#9006) that causes SEGFAULTs. Was the MySQL version recently changed?
Yes, it was at 5.7.29 before today; when I added in the MySQL devel libs, it upgraded all the MySQL pkgs to 5.7.30… Before the upgrade today, it was last upgraded on 1/28/2020 from 5.7.28 to 5.7.29; the slurmdbd segfault issue only started happening yesterday (after the controller machine went down hard with the power fault.)
(In reply to Will Dennis from comment #27) > Yes, it was at 5.7.29 before today; when I added in the mysql devel lib’s, > it upgraded all the MySQL pkgs to 5.7.30… > Before the upgrade today, it was last upgraded on 1/28/2020 from 5.7.28 to > 5.7.29; the slurmdbd segfault issue only started happening yesterday (after > the controller machine went down hard with the power fault.) Please downgrade back to .29 or apply the fix from this bug: https://bugs.mysql.com/bug.php?id=99485
Created attachment 14206 [details] slurmdbd_gdb_output_3.txt See attached for gdb output from latest segfault (seems to happen pretty much as soon as I start slurmctld)
(In reply to Will Dennis from comment #29) > Created attachment 14206 [details] > slurmdbd_gdb_output_3.txt > > See attached for gdb output from latest segfault (seems to happen pretty > much as soon as I start slurmctld) Please call this in gdb: > t 7 > f 0 > p row2[ASSOC2_REQ_MTPJ][0] > p row2[ASSOC2_REQ_MTPJ]
(gdb) t 7 [Switching to thread 7 (Thread 0x7ffff4c5e700 (LWP 12162))] #0 0x00007ffff69c7a0b in _cluster_get_assocs (mysql_conn=mysql_conn@entry=0x7fffe0009820, user=user@entry=0x7ffff4c5daf0, assoc_cond=assoc_cond@entry=0x7fffe00009a0, cluster_name=0x7fffe00008d0 "macluster", fields=<optimized out>, sent_extra=<optimized out>, is_admin=true, sent_list=0x7fffe00101e0) at as_mysql_assoc.c:2073 2073 if (row2[ASSOC2_REQ_MTPJ][0]) (gdb) f 0 #0 0x00007ffff69c7a0b in _cluster_get_assocs (mysql_conn=mysql_conn@entry=0x7fffe0009820, user=user@entry=0x7ffff4c5daf0, assoc_cond=assoc_cond@entry=0x7fffe00009a0, cluster_name=0x7fffe00008d0 "macluster", fields=<optimized out>, sent_extra=<optimized out>, is_admin=true, sent_list=0x7fffe00101e0) at as_mysql_assoc.c:2073 2073 if (row2[ASSOC2_REQ_MTPJ][0]) (gdb) p row2[ASSOC2_REQ_MTPJ][0] Cannot access memory at address 0x0 (gdb) p row2[ASSOC2_REQ_MTPJ] $1 = 0x0 (gdb)
Created attachment 14207 [details] work around patch Please apply this patch: 1. git am bug9033.NEC.workaround.1711.patch 2. make -j install 3. slurmdbd -Dvvvv
Never have used “git am” before… Where do I run “git am bug9033.NEC.workaround.1711.patch”, in /usr/local/src/slurm-17.11.13-2?
And, before “make -j install”, do a “make clean” or not?
(In reply to Will Dennis from comment #33) > Never have used “git am” before… Try this instead if you don't have git: > 1. patch -p1 < bug9033.NEC.workaround.1711.patch (In reply to Will Dennis from comment #34) > And, before “make –j install”, do a “make clean” or not? It is not required for this patch.
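For anyone unfamiliar with `patch -p1`, here is a tiny self-contained demonstration (the file names are stand-ins, not the real Slurm sources). `-p1` strips the leading `a/` and `b/` from the paths in the diff, which is why it is run from the top of the extracted source tree, and `--dry-run` verifies the patch applies before touching anything:

```shell
# Self-contained demo of applying a unified diff with "patch -p1"
# (stand-in file names; in the real case you would run this from
# the top of the slurm-17.11.13-2/ tree).
mkdir -p demo/src && cd demo
printf 'old line\n' > src/file.c
cat > fix.patch <<'EOF'
--- a/src/file.c
+++ b/src/file.c
@@ -1 +1 @@
-old line
+new line
EOF
patch -p1 --dry-run < fix.patch   # check it applies; changes nothing
patch -p1 < fix.patch             # apply for real
cat src/file.c                    # now prints "new line"
```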
I do have git installed, but ran this and it seems to have worked… root@skycaptain1:/usr/local/src/slurm-17.11.13-2# patch -p1 < slurm_bug9033_workaround_patch.txt patching file src/plugins/accounting_storage/mysql/as_mysql_assoc.c Have not gotten to downgrading MySQL as of yet; looks like I could go down to v5.7.11, but hopefully 5.7.30 didn’t modify the mysql db schema or anything… Must I do this before trying out the patch?
(In reply to Will Dennis from comment #36) > I do have git installed, but ran this and it seems to have worked… > > root@skycaptain1:/usr/local/src/slurm-17.11.13-2# patch -p1 < > slurm_bug9033_workaround_patch.txt > patching file src/plugins/accounting_storage/mysql/as_mysql_assoc.c > > Have not gotten to downgrading the MySQL as of yet, looks like I could go > down to v 5.7.11, but hopefully 5.7.30 didn’t modify the mysql db schema or > anything… Must I do this before trying out the patch? Both fixes may be required. Please try this patch first to see if we can get the system online.
OK, started the recompiled slurmdbd, then started the slurmctld, no immediate crash.

Then did:

root@skycaptain1:~# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
allgpus* up 60-00:00:0 11 mix skyserver18k,skyserver19k,skyserver23k,skyserver24k,skyserver27k,skyserver28k,skyserver29k,skyserver30k,skyserver31k,skyserver33k,skyserver34k
allgpus* up 60-00:00:0 1 alloc skyserver22k
allgpus* up 60-00:00:0 16 idle skyserver10k,skyserver11k,skyserver12k,skyserver13k,skyserver15k,skyserver16k,skyserver17k,skyserver20k,skyserver21k,skyserver25k,skyserver26k,skyserver32k,skyserver36k,skyserver37k,skyserver38k,skyserver39k
desktops up 60-00:00:0 5 resv ma17-pc4,ma18-pc[1,5],ma19-pc[2-3]
desktops up 60-00:00:0 8 idle ma17-pc[1-3,5],ma18-pc[3-4,6],ma19-pc1

root@skycaptain1:~# srun uptime
12:19:37 up 1 day, 1:23, 0 users, load average: 0.17, 0.10, 0.11

root@skycaptain1:~# sacctmgr show runawayjobs
Runaway Jobs: No runaway jobs found

Happy to say, slurmdbd and slurmctld are still up and running…

Anything else you want me to look at while the daemons are running in foreground?
(In reply to Will Dennis from comment #38) > OK, started the recompiled slurmdbd, then started the slurmctld, no > immediate crash. > > Then did: > > root@skycaptain1:~# sinfo > PARTITION AVAIL TIMELIMIT NODES STATE NODELIST > allgpus* up 60-00:00:0 11 mix > skyserver18k,skyserver19k,skyserver23k,skyserver24k,skyserver27k, > skyserver28k,skyserver29k,skyserver30k,skyserver31k,skyserver33k,skyserver34k > allgpus* up 60-00:00:0 1 alloc skyserver22k > allgpus* up 60-00:00:0 16 idle > skyserver10k,skyserver11k,skyserver12k,skyserver13k,skyserver15k, > skyserver16k,skyserver17k,skyserver20k,skyserver21k,skyserver25k, > skyserver26k,skyserver32k,skyserver36k,skyserver37k,skyserver38k,skyserver39k > desktops up 60-00:00:0 5 resv ma17-pc4,ma18-pc[1,5],ma19-pc[2-3] > desktops up 60-00:00:0 8 idle > ma17-pc[1-3,5],ma18-pc[3-4,6],ma19-pc1 > root@skycaptain1:~# > root@skycaptain1:~# srun uptime > 12:19:37 up 1 day, 1:23, 0 users, load average: 0.17, 0.10, 0.11 > root@skycaptain1:~# > root@skycaptain1:~# > root@skycaptain1:~# sacctmgr show runawayjobs > Runaway Jobs: No runaway jobs found > > Happy to say, slurmdbd and slurmctld are still up and running… > > Anything else you want me to look at while the daemons are running in > foreground? 
Please downgrade mysql before returning to production. The bug in the mysql code could cause another crash at any time.
> anything… Must I do this before trying out the patch? Please downgrade mysql before returning to production. The bug in mysql code could cause another crash at anytime.
OK, will try to do that, and hopefully it works…

Moving forward, I assume that I’ll now have to use the compiled version of slurmdbd and slurmctld on this server, instead of the packaged version (which clearly suffers from the bug that was patched.) That means that I will have to modify the systemd unit files that start slurmdbd and slurmctld at a minimum; anything else that I need to check?

And, could someone explain the bug that was patched? Why did it seem to only be triggered by this system down event?
Seeing stuff like this in the slurmdbd message output:

slurmdbd: error: We have more allocated time than is possible (15019200 > 4118400) for cluster macluster(1144) from 2020-05-12T11:00:00 - 2020-05-12T12:00:00 tres 1
slurmdbd: error: We have more allocated time than is possible (96753196800 > 32092592400) for cluster macluster(8914609) from 2020-05-12T11:00:00 - 2020-05-12T12:00:00 tres 2
slurmdbd: error: We have more allocated time than is possible (14572800 > 4118400) for cluster macluster(1144) from 2020-05-12T11:00:00 - 2020-05-12T12:00:00 tres 5
slurmdbd: error: id_assoc 12 doesn't have any tres
slurmdbd: debug3: WARNING: Unused wall is less than zero; this should never happen outside a Flex reservation. Setting it to zero for resv id = 6, start = 1558033580.
slurmdbd: debug3: WARNING: Unused wall is less than zero; this should never happen outside a Flex reservation. Setting it to zero for resv id = 6, start = 1558033580.
slurmdbd: debug3: WARNING: Unused wall is less than zero; this should never happen outside a Flex reservation. Setting it to zero for resv id = 6, start = 1558033580.
slurmdbd: debug3: WARNING: Unused wall is less than zero; this should never happen outside a Flex reservation. Setting it to zero for resv id = 6, start = 1558033580.
slurmdbd: debug3: WARNING: Unused wall is less than zero; this should never happen outside a Flex reservation. Setting it to zero for resv id = 6, start = 1558033580.
slurmdbd: debug3: WARNING: Unused wall is less than zero; this should never happen outside a Flex reservation. Setting it to zero for resv id = 6, start = 1558033580.
slurmdbd: debug3: WARNING: Unused wall is less than zero; this should never happen outside a Flex reservation. Setting it to zero for resv id = 6, start = 1558033580.
slurmdbd: debug3: WARNING: Unused wall is less than zero; this should never happen outside a Flex reservation. Setting it to zero for resv id = 6, start = 1558033580.
slurmdbd: error: We have more allocated time than is possible (14994470 > 4118400) for cluster macluster(1144) from 2020-05-12T12:00:00 - 2020-05-12T13:00:00 tres 1
slurmdbd: error: We have more allocated time than is possible (96409043136 > 32092592400) for cluster macluster(8914609) from 2020-05-12T12:00:00 - 2020-05-12T13:00:00 tres 2
slurmdbd: error: We have more allocated time than is possible (14548070 > 4118400) for cluster macluster(1144) from 2020-05-12T12:00:00 - 2020-05-12T13:00:00 tres 5
slurmdbd: error: id_assoc 12 doesn't have any tres

Indicative of db problems?
(In reply to Will Dennis from comment #41)
> Moving forward, I assume that I’ll now have to use the compiled version of
> slurmdbd and slurmctld on this server, instead of the packaged version
> (which clearly suffer from the bug that was patched.) That means that I will
> have to modify the system unit files that start the slurmdbd and slurmctld
> at a minimum; anything else that I need to check?

I don't have any details about your cluster, but in general we suggest running the compiled version on all of your nodes. The setup in comment #15 was *only* to get your system back online. Assuming all of your nodes have the same OS installed, it should be possible either to clone out Slurm to all of the nodes or to use a shared file system (less preferred). I suggest uninstalling the debs to avoid them overwriting any changes you're about to make to the systemd unit files. Take thorough backups before doing anything.

We gave a patch for 17.11, but that release has been deprecated. Please upgrade to 19.05 as soon as possible. Note that the install prefix deliberately includes the installed version to make it easy to swap versions. Slurm is capable of talking to the 2 previous versions, which should allow a migration of compute nodes instead of needing a full outage. (See https://slurm.schedmd.com/quickstart_admin.html#upgrade)

> And, could someone explain the bug that was patched? Why did it seem to only
> be triggered by this system down event?

Based on the gdb output, it looks like the association table was partially corrupted, as a value in one of the rows was missing. I suggest calling the following to rebuild your association tables:
> slurmdbd -R

I also suggest manually verifying that your association table matches your internal expectations, as any lost data cannot be recovered, only repaired. This can be done while slurmctld is online and serving jobs. We will be creating a real patch that will be mainlined to fix the problem permanently.
Can we lower the severity of this ticket?
(In reply to Will Dennis from comment #42)
> Seeing stuff like this in the slurmdbd message output:
>
> Indicative of db problems?

That is normal when Slurm is taken down hard. Once you have upgraded to 19.05 (or 20.02), please call the following to clean up the jobs:
> sacctmgr show runaways

The self-repair code in 17.11 is lacking, so I suggest waiting until after the upgrade.
Created attachment 14209 [details] skycaptain1_mysql_downgrade.txt

Did the MySQL downgrade to v5.7.11, and held it to prevent upgrades (at least until they release 5.7.31+ for Ubuntu 16.04) – see attached for downgrade output.
(In reply to Will Dennis from comment #45) > Did the MySQL downgrade to v5.7.11, and held it to prevent upgrades (at > least until they release 5.7.31+ for Ubuntu 16.04) – see attached for > downgrade output > > mysql_upgrade: (non fatal) [ERROR] 1728: Cannot load from mysql.proc. The table is probably corrupted > mysql_upgrade: (non fatal) [ERROR] 1545: Failed to open mysql.event Looks like mysql_upgrade eventually figured it out. I take it that the system is now back to running jobs?
Not quite yet…

* I have to figure out how to remove the old Slurm 17.11.7 packages without removing any needed conf files etc.; want to get a system backup first, which I have to work with another IT team member to get done
* Need to modify the existing systemd unit files (or create them, in case the above deletes them) to call the new patched slurmdbd and slurmctld binaries
* Need to test out the system, and inspect the Slurm DB tables, especially the association table

Tangential issue: need to speak with someone from SchedMD to figure out how to orchestrate a conversion from packaged Slurm binaries to compiled ones, and how to do this over a fleet of servers that comprise the cluster, especially doing upgrades in the future (perhaps this is all documented in detail somewhere?)
(In reply to Will Dennis from comment #47)
> Not quite yet…

Understood; going to reduce this to SEV3, as there is a workaround for the issue.

> * I have to figure out how to remove the old Slurm 17.11.7 packages
> without removing any needed conf files etc.; want to get a system backup
> first, which I have to work with another IT team member to get done

Thorough backups are always suggested.

> * Need to test out the system, and inspect the Slurm DB tables,
> especially the association table

This can and should be done with 'sacctmgr' commands. Something as simple as a check of 'sacctmgr show assoc' output should be sufficient. The assoc tables can be updated while the system is running, using the sacctmgr commands.

> Tangential issue: need to speak with someone from SchedMD to figure out how
> to orchestrate a conversion from packaged Slurm binaries to compiled ones,

We are happy to help with compiling and installing Slurm from official source tarballs.

> and how to do this over a fleet of servers that comprise the cluster,
> especially doing upgrades in the future (perhaps this is all documented in
> detail somewhere?)

How this is done usually depends heavily on how the systems integrator set up the cluster and what suite is managing the cluster.
All the ‘sacctmgr’ commands I used seem to work fine, as do ‘sacct’ and ‘sdiag’, when using the new binaries.

We are an R&D shop, so way less formal with procedures (such as backup!) than we should be, but we are learning lessons as we go…

We do not use any sort of systems integrator or cluster management suite – it’s just all done by us (central IT + some researchers.) So pointers on how to roll out Slurm binaries across a fleet of servers, without either using the OS package management system or compiling everything on each server by hand, would be gratefully accepted :)
(In reply to Will Dennis from comment #49) > We do not use any sort of systems integrator, or cluster management suite – > it’s just all done by us (central IT + some researchers.) So pointers on how > to roll out Slurm binaries across a fleet of servers without either using > the OS package management system, or compiling everything on each server by > hand, would be gratefully accepted :) How is the OS image managed on all of the nodes?
We deploy the OS on the servers via a basic PXE-based answer-file install, then configure the nodes to production standard config via Ansible. Currently in the Ansible playbook, we are adding the Slurm 17.11.7 PPA repo, and then doing package installation for the requisite Slurm packages on the nodes. Would need to change this to have Ansible do "the right thing" to deploy compiled binaries to the different cluster nodes, and/or compile Slurm on each node...
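One way to have Ansible do "the right thing" would be to build Slurm once on the controller and push the versioned /usr/local/slurm tree out to every node. A rough sketch of what such tasks might look like — the module choices, build-host name, and version path are illustrative assumptions, not taken from this thread:

```yaml
# Hypothetical Ansible tasks: distribute a locally built Slurm tree.
- name: Push the versioned Slurm install tree from the build host to each node
  ansible.posix.synchronize:
    src: /usr/local/slurm/17.11.13-2/
    dest: /usr/local/slurm/17.11.13-2/
  delegate_to: skycaptain1   # run rsync from the controller that built Slurm

- name: Point the "current" symlink at the active version
  ansible.builtin.file:
    src: /usr/local/slurm/17.11.13-2
    dest: /usr/local/slurm/current
    state: link
```

Upgrades then become: sync a new versioned tree, repoint the symlink, restart the daemons.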
I have uninstalled the Slurm packages from the controller, and now need to fix up the systemd unit files for starting the new /usr/local/slurm-based daemon binaries. Currently, I have these entries in the following directories:

In /etc/systemd/system:
lrwxrwxrwx 1 root root 9 May 13 08:29 slurmctld.service -> /dev/null
lrwxrwxrwx 1 root root 9 May 13 08:29 slurmdbd.service -> /dev/null

In /etc/systemd/system/multi-user.target.wants/:
lrwxrwxrwx 1 root root 37 Feb 25 2019 slurmctld.service -> /lib/systemd/system/slurmctld.service
lrwxrwxrwx 1 root root 36 Feb 23 2019 slurmdbd.service -> /lib/systemd/system/slurmdbd.service

(There are no longer any slurm*.service files in /lib/systemd/system)

Is it "the right thing" to change the /etc/systemd/system symlinks to point at the new .service files in /usr/local/src/slurm-17.11.13-2/etc and then do a "systemctl enable slurm[dbd,ctld]"?
Put two comments in via the ticket’s web interface, not sure if you get notified about those… I’d like to get some guidance on the system question I asked there, if you could get to that fairly soon… Thanks!
(In reply to Will Dennis from comment #53) > Put two comments in via the ticket’s web interface, not sure if you get > notified about those… > I’d like to get some guidance on the system question I asked there, if you > could get to that fairly soon… Thanks! Working on a response right now.
(In reply to Will Dennis from comment #51)
> We deploy the OS on the servers via a basic PXE-based answer-file install,
> then configure the nodes to production standard config via Ansible.
>
> Currently in the Ansible playbook, we are adding the Slurm 17.11.7 PPA repo,
> and then doing package installation for the requisite Slurm packages on the
> nodes. Would need to change this to have Ansible do "the right thing" to
> deploy compiled binaries to the different cluster nodes, and/or compile
> Slurm on each node...

I assume these are nodes with physical disks. It should be possible to tell Ansible to clone/rsync out /usr/local/slurm from your controller to all of the nodes. I believe that would be the easiest solution here.

(In reply to Will Dennis from comment #52)
> Is it "the right thing" to change the /etc/systemd/system symlinks to point
> at the new .service files in /usr/local/src/slurm-17.11.13-2/etc and then do
> a "systemctl enable slurm[dbd,ctld]"?

This is distro specific (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/sect-managing_services_with_systemd-unit_files), but /etc/systemd/system should be fine. Note, it is common for sites to put a symlink at /usr/local/slurm/current -> /usr/local/src/slurm-17.11.13-2 and then point the system symlinks at the "current" link instead, for easy upgrades. Sites use 'latest' or 'current' or another name as wanted.
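The "current" symlink pattern can be exercised with plain `ln`; a quick sketch in a scratch directory (the second version number is only an example of a future upgrade target):

```shell
# Sketch of the "current" symlink upgrade pattern, in a scratch directory.
set -e
tmp=$(mktemp -d)
cd "$tmp"
mkdir slurm-17.11.13-2 slurm-19.05.5   # hypothetical install trees

ln -s slurm-17.11.13-2 current         # initial deployment
readlink current                       # → slurm-17.11.13-2

# Upgrading is just repointing the link. -f replaces the existing link and
# -n keeps ln from dereferencing "current" into the old directory.
ln -sfn slurm-19.05.5 current
readlink current                       # → slurm-19.05.5
```

Because the unit files and PATH entries reference only "current", nothing else needs editing when a new version is rolled out.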
OK, did the following:

root@skycaptain1:/usr/local/slurm# ln -s 17.11.13-2 current
root@skycaptain1:/usr/local/slurm# cd /etc/systemd/system
root@skycaptain1:/etc/systemd/system# mkdir /usr/local/slurm/current/etc
root@skycaptain1:/etc/systemd/system# cp /usr/local/src/slurm-17.11.13-2/etc/slurmctld.service /usr/local/slurm/current/etc
root@skycaptain1:/etc/systemd/system# cp /usr/local/src/slurm-17.11.13-2/etc/slurmdbd.service /usr/local/slurm/current/etc
root@skycaptain1:/etc/systemd/system# rm slurmctld.service
root@skycaptain1:/etc/systemd/system# rm slurmdbd.service
root@skycaptain1:/etc/systemd/system# ln -s /usr/local/slurm/current/etc/slurmctld.service slurmctld.service
root@skycaptain1:/etc/systemd/system# ln -s /usr/local/slurm/current/etc/slurmdbd.service slurmdbd.service
root@skycaptain1:/etc/systemd/system# systemctl daemon-reload
root@skycaptain1:/etc/systemd/system# systemctl start slurmdbd

The service startup hung, and finally terminated with a timeout error. When I got service status thereafter, I see:

root@skycaptain1:/etc/systemd/system# systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/local/slurm/current/etc/slurmdbd.service; enabled; vendor preset: enabled)
   Active: failed (Result: timeout) since Wed 2020-05-13 11:30:53 PDT; 3min 43s ago
  Process: 6780 ExecStart=/usr/local/slurm/17.11.13-2/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 19372 (code=killed, signal=SEGV)

May 13 11:29:23 skycaptain1 systemd[1]: Starting Slurm DBD accounting daemon...
May 13 11:29:23 skycaptain1 systemd[1]: slurmdbd.service: Can't open PID file /var/run/slurmdbd.pid (yet?) after start: No such file or directory
May 13 11:30:53 skycaptain1 systemd[1]: slurmdbd.service: Start operation timed out. Terminating.
May 13 11:30:53 skycaptain1 systemd[1]: Failed to start Slurm DBD accounting daemon.
May 13 11:30:53 skycaptain1 systemd[1]: slurmdbd.service: Unit entered failed state.
May 13 11:30:53 skycaptain1 systemd[1]: slurmdbd.service: Failed with result 'timeout'.

The end of the slurmdbd.log has:

[2020-05-13T11:29:23.235] error: id_assoc 12 doesn't have any tres
[2020-05-13T11:29:23.241] error: We have more allocated time than is possible (14616000 > 4118400) for cluster macluster(1144) from 2020-05-13T09:00:00 - 2020-05-13T10:00:00 tres 1
[2020-05-13T11:29:23.241] error: We have more allocated time than is possible (91033516800 > 32092592400) for cluster macluster(8914609) from 2020-05-13T09:00:00 - 2020-05-13T10:00:00 tres 2
[2020-05-13T11:29:23.241] error: We have more allocated time than is possible (14212800 > 4118400) for cluster macluster(1144) from 2020-05-13T09:00:00 - 2020-05-13T10:00:00 tres 5
[2020-05-13T11:29:23.241] error: id_assoc 12 doesn't have any tres
[2020-05-13T11:29:23.247] error: We have more allocated time than is possible (14616000 > 4118400) for cluster macluster(1144) from 2020-05-13T10:00:00 - 2020-05-13T11:00:00 tres 1
[2020-05-13T11:29:23.247] error: We have more allocated time than is possible (91033516800 > 32092592400) for cluster macluster(8914609) from 2020-05-13T10:00:00 - 2020-05-13T11:00:00 tres 2
[2020-05-13T11:29:23.247] error: We have more allocated time than is possible (14212800 > 4118400) for cluster macluster(1144) from 2020-05-13T10:00:00 - 2020-05-13T11:00:00 tres 5
[2020-05-13T11:29:23.248] error: id_assoc 12 doesn't have any tres
[2020-05-13T11:29:23.575] debug2: No need to roll cluster macluster this month 1588316400 <= 1588316400
[2020-05-13T11:29:23.598] debug2: Got 1 of 1 rolled up
[2020-05-13T11:29:23.598] debug2: Everything rolled up
[2020-05-13T11:30:53.165] Terminate signal (SIGINT or SIGTERM) received
[2020-05-13T11:30:53.165] debug: rpc_mgr shutting down

So, it looks to me like the slurmdbd service did start OK and was running, but systemd thought it wasn’t responding (?)
and terminated it… The current .service unit file is:

root@skycaptain1:/etc/systemd/system# cat /etc/systemd/system/slurmdbd.service
[Unit]
Description=Slurm DBD accounting daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurmdbd.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmdbd
ExecStart=/usr/local/slurm/17.11.13-2/sbin/slurmdbd $SLURMDBD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmdbd.pid
TasksMax=infinity

[Install]
WantedBy=multi-user.target

I know systemd isn’t a part of Slurm, but any ideas on what may be going on here?
(In reply to Will Dennis from comment #56) > I know systemd isn’t a part of Slurm, but any ideas on what may be going on > here? We provide example unit files (https://github.com/SchedMD/slurm/blob/master/etc/slurmdbd.service.in) which we generally suggest sites to follow. > root@skycaptain1:/etc/systemd/system# cat > /etc/systemd/system/slurmdbd.service > [Unit] > Description=Slurm DBD accounting daemon > After=network.target munge.service > ConditionPathExists=/etc/slurm/slurmdbd.conf > > [Service] > Type=forking Change this to "simple" > EnvironmentFile=-/etc/sysconfig/slurmdbd > ExecStart=/usr/local/slurm/17.11.13-2/sbin/slurmdbd $SLURMDBD_OPTIONS Add "-D" here to keep the daemon in foreground mode. > ExecReload=/bin/kill -HUP $MAINPID > PIDFile=/var/run/slurmdbd.pid This may need to be changed to /usr/local/slurm/.../var/run/slurmdbd.pid
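Folding those suggestions together, the unit might look roughly like this — a sketch based on the advice above plus this site's /usr/local/slurm layout, not an official SchedMD file. (With Type=simple and -D, the PIDFile line can simply be dropped, since systemd tracks the foreground process directly.)

```ini
[Unit]
Description=Slurm DBD accounting daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurmdbd.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmdbd
# -D keeps slurmdbd in the foreground so Type=simple can supervise it.
ExecStart=/usr/local/slurm/17.11.13-2/sbin/slurmdbd -D $SLURMDBD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
TasksMax=infinity

[Install]
WantedBy=multi-user.target
```

Remember to run `systemctl daemon-reload` after editing the unit.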
It was the PIDFile value; the “slurm” user couldn’t write to the path that was defined there… Fixed that, and now all is well; the slurmdbd service is up and running.

Now I’ll do the slurmctld.service file; do you think slurmctld should also have a service dependency on slurmdbd (in our case, since we use database accounting)? Or is this not really necessary?
(In reply to Will Dennis from comment #58) > do you think slurmctld should also > have a service dependency on slurmdbd (in our case, since we use database > accounting?) Or is this not really necessary? You can, but it is not necessary as slurmctld will retry a few times to talk to slurmdbd during startup.
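If you did want the explicit ordering anyway, a small systemd drop-in is enough; the drop-in path below is just an example:

```ini
# Hypothetical drop-in: /etc/systemd/system/slurmctld.service.d/after-slurmdbd.conf
[Unit]
After=slurmdbd.service
Wants=slurmdbd.service
```

After adding it, run `systemctl daemon-reload` so systemd picks up the new ordering.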
We are OK now; I made the needed changes to systemd (which now includes “systemd-tmpfiles” control of /var/run; I had to write a config to create the needed /var/run/slurm directory at boot time so the PID file could be written out to it...) On testing, the new custom version starts and runs correctly. Thanks for all your help; OK to close out the ticket now.
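For reference, the systemd-tmpfiles piece mentioned above is typically a one-line config; the path and ownership below assume the slurm user and the /var/run/slurm directory from this thread:

```text
# e.g. /etc/tmpfiles.d/slurm.conf — recreate the runtime directory at boot
d /var/run/slurm 0755 slurm slurm -
```

It can be applied immediately, without a reboot, via `systemd-tmpfiles --create`.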
(In reply to Will Dennis from comment #60) > Thanks for all your help, OK to close out ticket now. I'm going to reduce this ticket to SEV4 as this is now QA ticket for the patch in comment #32. We will update and close the ticket once the fix is mainlined.
Will,

After internal discussion, we are going to leave Slurm as it is currently functioning. Attempting to recover from a corrupted database can potentially leave a system in an unknown state, and the best course of action is to load a backup or forcibly clear the state save directory. Please respond if you have any questions.

Thanks,
--Nate