Ticket 9033 - slurmdbd service segfaulting after restarting from power-induced system crash
Summary: slurmdbd service segfaulting after restarting from power-induced system crash
Status: RESOLVED WONTFIX
Alias: None
Product: Slurm
Classification: Unclassified
Component: Database
Version: - Unsupported Older Versions
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Nate Rini
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-05-11 19:11 MDT by Will Dennis
Modified: 2020-06-25 19:49 MDT (History)
3 users

See Also:
Site: NEC Labs
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: Ubuntu
Machine Name: skycaptain1
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurmdbd_core_gdb_output.txt.txt (8.41 KB, text/plain)
2020-05-11 19:36 MDT, Will Dennis
Details
slurm.conf (6.36 KB, application/octet-stream)
2020-05-11 19:36 MDT, Will Dennis
Details
slurmdbd_gdb_output_3.txt (15.11 KB, text/plain)
2020-05-12 11:38 MDT, Will Dennis
Details
work around patch (922 bytes, patch)
2020-05-12 12:22 MDT, Nate Rini
Details | Diff

Description Will Dennis 2020-05-11 19:11:08 MDT
Created attachment 14192 [details]
slurmdbd service run in foreground messages - 1

We had a power-related system crash of the Slurm controller (and nodes) at our one site, and upon restarting the systems, Slurm will not stay up and running, due to slurmdbd segfaulting... I can restart the service, but it goes down within a matter of minutes with segfaults. Please assist me with getting our Slurm system functional again.
Comment 1 Will Dennis 2020-05-11 19:12:11 MDT
Created attachment 14193 [details]
slurmdbd service run in foreground messages - 2
Comment 2 Nate Rini 2020-05-11 19:21:56 MDT
(In reply to Will Dennis from comment #0)
> Slurm will not stay up and
> running, due to slurmdbd segfaulting

Is slurmctld generating a core dump? If so, please load it into gdb and call the following:
> gdb $(which slurmctld) $PATH_TO_CORE
> set pagination off
> set print pretty on 
> t a a bt full

If it is not generating a core, please start slurmctld using gdb to catch the crash:
> gdb --args slurmctld -Dvvvvvvvvvv
> b fatal
> r
wait for it to stop and then:
> set pagination off
> set print pretty on 
> t a a bt full

Please also attach latest slurm.conf (& friends) to this ticket. Please also call the following:
> slurmctld -V
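If no core file is being written at all, core dumps may simply be disabled on the controller. A minimal sketch for checking and raising the limits on a typical Linux box (the core_pattern value is only an example; on Ubuntu, apport or systemd-coredump may route cores elsewhere):

```shell
# Raise the per-process core size limit for this shell before reproducing.
ulimit -c unlimited
ulimit -c                      # prints the limit now in effect

# See where the kernel writes cores; apport may intercept them on Ubuntu.
cat /proc/sys/kernel/core_pattern

# As root, an example pattern that drops cores in /tmp (illustrative only):
# echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern
```

Run this in the same shell that then starts the daemon, since ulimit only affects child processes of that shell.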
Comment 3 Will Dennis 2020-05-11 19:36:09 MDT
Created attachment 14194 [details]
slurmdbd_core_gdb_output.txt.txt

It is slurmdbd that is segfaulting, not slurmctld – did you mean to have me work with slurmctld?

I attached the output of gdb analysis of slurmdbd, if indeed that’s what you wanted…

Also attached slurm.conf and slurmdbd.conf from the controller node.


Comment 4 Will Dennis 2020-05-11 19:36:09 MDT
Created attachment 14195 [details]
slurm.conf
Comment 5 Will Dennis 2020-05-11 19:36:09 MDT
Created attachment 14196 [details]
slurmdbd.conf
Comment 6 Will Dennis 2020-05-11 19:36:59 MDT
Oh, also:

root@skycaptain1:~# slurmctld -V
slurm-wlm 17.11.7


Comment 7 Nate Rini 2020-05-11 19:38:47 MDT
(In reply to Will Dennis from comment #3)
> I attached the output of gdb analysis of slurmdbd, if indeed that’s what you
> wanted…
Yes, I wanted what is crashing.

Please provide:
> slurmdbd -V
Comment 8 Nate Rini 2020-05-11 19:48:54 MDT
> Reading symbols from /usr/sbin/slurmdbd...(no debugging symbols found)...done. 

Is this an Ubuntu universe installed deb? If so, please install slurm-wlm-basic-plugins-dbg and any other related debug debs, then pull the backtrace again.

> /usr/lib/x86_64-linux-gnu/slurm-wlm/accounting_storage_mysql.so

The backtrace is missing most of the info.
Comment 11 Will Dennis 2020-05-11 20:13:52 MDT
We use an Ubuntu PPA of Slurm 17.11 published here: https://launchpad.net/~jonathonf/+archive/ubuntu/slurm

Unfortunately, there are no “dbg” packages present in the repo to install… https://launchpad.net/~jonathonf/+archive/ubuntu/slurm/+packages

I could provide you the core file if that would be helpful.


Comment 12 Nate Rini 2020-05-11 20:29:03 MDT
(In reply to Will Dennis from comment #11)
> We use an Ubuntu PPA of Slurm 17.11 published here:
> https://launchpad.net/~jonathonf/+archive/ubuntu/slurm
> 
> Unfortunately, there are no “dbg” packages present in the repo to install…
> https://launchpad.net/~jonathonf/+archive/ubuntu/slurm/+packages

Looks like this packager has chosen to call them *dev_*. Please try installing the dev packages and pulling the backtrace again.
Comment 13 Will Dennis 2020-05-11 20:56:30 MDT
Created attachment 14197 [details]
slurmdbd core gdb output 2.txt

Did the following:

root@skycaptain1:~# apt install slurm-wlm-basic-plugins-dev libslurmdb-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
libslurmdb-dev is already the newest version (17.11.7-1~16.04.york0).
The following NEW packages will be installed:
  slurm-wlm-basic-plugins-dev
0 upgraded, 1 newly installed, 0 to remove and 102 not upgraded.
Need to get 1,282 kB of archives.
After this operation, 7,671 kB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 http://ppa.launchpad.net/jonathonf/slurm/ubuntu xenial/main amd64 slurm-wlm-basic-plugins-dev amd64 17.11.7-1~16.04.york0 [1,282 kB]
Fetched 1,282 kB in 1s (1,114 kB/s)
Selecting previously unselected package slurm-wlm-basic-plugins-dev.
(Reading database ... 330488 files and directories currently installed.)
Preparing to unpack .../slurm-wlm-basic-plugins-dev_17.11.7-1~16.04.york0_amd64.deb ...
Unpacking slurm-wlm-basic-plugins-dev (17.11.7-1~16.04.york0) ...
Setting up slurm-wlm-basic-plugins-dev (17.11.7-1~16.04.york0) ...

See attached for the new trace, which I unfortunately do not think is very different from the first one I sent…


Comment 15 Nate Rini 2020-05-11 21:19:29 MDT
(In reply to Will Dennis from comment #13)
> See attached for the new trace, which I unfortunately do not think is very
> different from the first one I sent…
Correct, we will have to change our tactic now.

We shall need to compile and install the latest Slurm on your controller. We generally advise against sites using RPMs/Debs due to issues like this.

Full install procedure is here: https://slurm.schedmd.com/quickstart_admin.html

We are going to install Slurm to /usr/local/slurm/17.11.13-2/ and then only run the slurmdbd and slurmctld daemons from it to allow you to upgrade your whole cluster later.

Basic steps:
1. cd /usr/local/src/
2. wget 'https://download.schedmd.com/slurm/slurm-17.11.13-2.tar.bz2'
3. tar -xjvf slurm-17.11.13-2.tar.bz2
4. cd slurm-17.11.13-2
5. ./configure --sysconfdir=/etc/slurm/
6. make -j
7. sudo make install
8. killall slurmctld slurmdbd #stop all slurm daemons on the controller
9. /usr/local/slurm/17.11.13-2/sbin/slurmdbd -Dvvv #watch to see if it SEGFAULTs
10. /usr/local/slurm/17.11.13-2/sbin/slurmctld -Dvvv #if slurmdbd does not crash, call this in a new window
11. sinfo
12. srun uptime
Comment 18 Will Dennis 2020-05-12 08:31:46 MDT
Captured the output of ./configure, and grepped for some strings that may affect the success of the compiled binaries, this is the output:

$ grep -i -e warning -e "not found" -e unable slurm-configure-output.txt
configure: WARNING: *** mysql_config not found. Evidently no MySQL development libs installed on system.
configure: WARNING: unable to locate NUMA memory affinity functions
configure: WARNING: unable to locate PAM libraries
configure: WARNING: unable to locate json parser library
configure: WARNING: unable to locate ofed installation
configure: WARNING:
Unable to locate HDF5 compilation helper scripts 'h5cc' or 'h5pcc'.
configure: WARNING: LZ4 test program build failed.
configure: WARNING: unable to locate hwloc installation
configure: WARNING: unable to locate pmix installation
configure: WARNING: unable to locate freeipmi installation
configure: WARNING: unable to locate rrdtool installation
configure: WARNING: unable to locate libssh2 installation
configure: WARNING: Slurm internal X11 support disabled
configure: WARNING: cannot build smap without curses or ncurses library
checking for GLIB - version >= 2.7.1... Package glib-2.0 was not found in the pkg-config search path.
Package gthread-2.0 was not found in the pkg-config search path.
checking for GTK+ - version >= 2.7.1... Package gtk+-2.0 was not found in the pkg-config search path.
Package gthread-2.0 was not found in the pkg-config search path.
configure: WARNING: cannot build sview without gtk library
checking whether this is a Cray XT or XE system... Unable to locate Cray APIs (usually in /opt/cray/alpscomm and /opt/cray/job)
configure: WARNING: unable to locate DataWarp installation
configure: WARNING: unable to locate netloc installation
configure: WARNING: unable to locate lua package
configure: WARNING: unable to build man page html files without man2html
configure: WARNING: configured for readline support, but couldn't find libraries
configure: WARNING: could not find working OpenSSL library
configure: WARNING: unable to locate/link against libcurl-devel installation

Can you advise whether I need to add any packages for successful binary compilation? (again, compiling on Ubuntu 16.04, .deb pkgs)


Comment 19 Jason Booth 2020-05-12 09:23:01 MDT
Will -

Most of these can be installed through the repo, and not every one of them is needed. You must determine what your site needs, such as whether you need X11 support in jobs and whether you are going to use the accounting database. You could build without these additional packages, but I do advise you to look them over and see which ones you may need to "apt install" to fit your site's needs.
Comment 20 Will Dennis 2020-05-12 09:28:13 MDT
I ended up installing the following packages:

freeipmi
gcc
git
hdf5-helpers
libcurl4-openssl-dev
libfreeipmi-dev
libhwloc-dev
libjson-c-dev
liblua5.3-dev
libmariadb-client-lgpl-dev
libmysqlclient-dev
libncurses5-dev
libncursesw5-dev
libpam0g-dev
libreadline-dev
libssh2-1
lua5.3
make
ruby
ruby-dev

Now the grepped output of ./configure stands at:

configure: WARNING: unable to locate ofed installation
configure: WARNING: Unable to compile HDF5 test program
configure: WARNING: LZ4 test program build failed.
configure: WARNING: unable to locate pmix installation
configure: WARNING: unable to locate freeipmi installation
configure: WARNING: unable to locate rrdtool installation
configure: WARNING: unable to locate libssh2 installation
configure: WARNING: Slurm internal X11 support disabled
checking for GLIB - version >= 2.7.1... Package glib-2.0 was not found in the pkg-config search path.
Package gthread-2.0 was not found in the pkg-config search path.
checking for GTK+ - version >= 2.7.1... Package gtk+-2.0 was not found in the pkg-config search path.
Package gthread-2.0 was not found in the pkg-config search path.
configure: WARNING: cannot build sview without gtk library
checking whether this is a Cray XT or XE system... Unable to locate Cray APIs (usually in /opt/cray/alpscomm and /opt/cray/job)
configure: WARNING: unable to locate DataWarp installation
configure: WARNING: unable to locate netloc installation
configure: WARNING: unable to build man page html files without man2html

I don’t see any obvious problems with the above (except maybe HDF5 test program fail? Can you advise if/when this is needed?)


Comment 21 Nate Rini 2020-05-12 09:45:22 MDT
(In reply to Will Dennis from comment #20)
> configure: WARNING: unable to locate ofed installation
> configure: WARNING: Unable to compile HDF5 test program
> configure: WARNING: LZ4 test program build failed.
> configure: WARNING: unable to locate pmix installation
> configure: WARNING: unable to locate freeipmi installation
> configure: WARNING: unable to locate rrdtool installation
> configure: WARNING: unable to locate libssh2 installation
> configure: WARNING: Slurm internal X11 support disabled
> checking for GLIB - version >= 2.7.1... Package glib-2.0 was not found in
> the pkg-config search path.
> Package gthread-2.0 was not found in the pkg-config search path.
> checking for GTK+ - version >= 2.7.1... Package gtk+-2.0 was not found in
> the pkg-config search path.
> Package gthread-2.0 was not found in the pkg-config search path.
> configure: WARNING: cannot build sview without gtk library
> checking whether this is a Cray XT or XE system... Unable to locate Cray
> APIs (usually in /opt/cray/alpscomm and /opt/cray/job)
> configure: WARNING: unable to locate DataWarp installation
> configure: WARNING: unable to locate netloc installation
> configure: WARNING: unable to build man page html files without man2html
> 
> I don’t see any obvious problems with the above (except maybe HDF5 test
> program fail? Can you advise if/when this is needed?)

HDF5 is only needed if you are doing job profiling, which your config does not have enabled. Is 'make -j' working?
Comment 22 Will Dennis 2020-05-12 09:51:40 MDT
No, it unfortunately isn’t….


In file included from sh5util.c:66:0:
../hdf5_api.h:49:18: fatal error: hdf5.h: No such file or directory
compilation terminated.
Makefile:602: recipe for target 'sh5util.o' failed
make[6]: *** [sh5util.o] Error 1
make[6]: *** Waiting for unfinished jobs....
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../../../.. -I../../../../slurm -I../../../.. -I/usr/include/hdf5/serial -I/include -DNUMA_VERSION1_COMPATIBILITY -g -O2 -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -MT hdf5_api.lo -MD -MP -MF .deps/hdf5_api.Tpo -c hdf5_api.c  -fPIC -DPIC -o .libs/hdf5_api.o
In file included from hdf5_api.c:53:0:
hdf5_api.h:49:18: fatal error: hdf5.h: No such file or directory
compilation terminated.
Makefile:715: recipe for target 'hdf5_api.lo' failed
make[7]: *** [hdf5_api.lo] Error 1
make[7]: Leaving directory '/usr/local/src/slurm-17.11.13-2/src/plugins/acct_gather_profile/hdf5'
Makefile:838: recipe for target '../libhdf5_api.la' failed
make[6]: *** [../libhdf5_api.la] Error 2
make[6]: Leaving directory '/usr/local/src/slurm-17.11.13-2/src/plugins/acct_gather_profile/hdf5/sh5util'
Makefile:734: recipe for target 'all-recursive' failed
make[5]: *** [all-recursive] Error 1
make[5]: Leaving directory '/usr/local/src/slurm-17.11.13-2/src/plugins/acct_gather_profile/hdf5'
Makefile:540: recipe for target 'all-recursive' failed
make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory '/usr/local/src/slurm-17.11.13-2/src/plugins/acct_gather_profile'
Makefile:569: recipe for target 'all-recursive' failed
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory '/usr/local/src/slurm-17.11.13-2/src/plugins'
Makefile:569: recipe for target 'all-recursive' failed
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory '/usr/local/src/slurm-17.11.13-2/src'
Makefile:696: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/usr/local/src/slurm-17.11.13-2'
Makefile:595: recipe for target 'all' failed
make: *** [all] Error 2

Comment 23 Nate Rini 2020-05-12 09:54:19 MDT
(In reply to Will Dennis from comment #22)
> hdf5_api.h:49:18: fatal error: hdf5.h: No such file or directory
> compilation terminated.

Please try with this instead:
> ./configure --sysconfdir=/etc/slurm/ --prefix=/usr/local/slurm/17.11.13-2/ --with-hdf5=no

Please also provide your mysql/mariadbd version.
Comment 24 Will Dennis 2020-05-12 10:56:06 MDT
OK, the "make -j" worked with those flags set.

Running MySQL 5.7.30 on this cluster controller.


Comment 25 Will Dennis 2020-05-12 11:17:38 MDT
Created attachment 14205 [details]
macluster-slurmdbd-17.11.13-2-foreground-segfault.txt

OK, went ahead and installed, then ran the new 17.11.13-2 slurmdbd service, which came up and did not crash for a few minutes; then when I started the 17.11.13-2 slurmctld, it started and then I saw:

error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused

Going back to the slurmdbd service, I saw it had segfaulted. It did not produce a core file, however…

Attached is the foreground messages from the compiled slurmdbd.


Comment 26 Nate Rini 2020-05-12 11:20:48 MDT
(In reply to Will Dennis from comment #25)
> Going back to the slurmdbd service, I saw it had segfaulted. It did not
> produce a core file, however…
Please follow the procedure from comment #2 to use gdb to get a backtrace.
 
> Running MySQL 5.7.30 on this cluster controller.
MySQL 5.7.30 has a known bug (per bug#9006) that causes SEGFAULTs.  Was the MySQL version recently changed?
Comment 27 Will Dennis 2020-05-12 11:28:30 MDT
Yes, it was at 5.7.29 before today; when I added the MySQL devel libs, it upgraded all the MySQL packages to 5.7.30.
Before today's upgrade, it was last upgraded on 1/28/2020 from 5.7.28 to 5.7.29; the slurmdbd segfault issue only started happening yesterday (after the controller machine went down hard with the power fault).


Comment 28 Nate Rini 2020-05-12 11:33:18 MDT
(In reply to Will Dennis from comment #27)
> Yes, it was at 5.7.29 before today; when I added in the mysql devel lib’s,
> it upgraded all the MySQL pkgs to 5.7.30…
> Before the upgrade today, it was last upgraded on 1/28/2020 from 5.7.28 to
> 5.7.29; the slurmdbd segfault issue only started happening yesterday (after
> the controller machine went down hard with the power fault.)

Please downgrade back to .29 or apply the fix from this bug: https://bugs.mysql.com/bug.php?id=99485
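A sketch of the downgrade on Ubuntu 16.04 follows. The exact 5.7.29 package version string below is an assumption; check apt-cache policy first for what the archive actually offers:

```shell
# List candidate/installed versions to find the exact 5.7.29 string.
apt-cache policy mysql-server-5.7

# Downgrade (version strings below are illustrative) and pin so apt
# does not immediately re-upgrade back to the broken 5.7.30.
sudo apt-get install --allow-downgrades \
    mysql-server-5.7=5.7.29-0ubuntu0.16.04.1 \
    mysql-client-5.7=5.7.29-0ubuntu0.16.04.1
sudo apt-mark hold mysql-server-5.7 mysql-client-5.7
```

Remember to "apt-mark unhold" once a fixed MySQL release is available.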
Comment 29 Will Dennis 2020-05-12 11:38:42 MDT
Created attachment 14206 [details]
slurmdbd_gdb_output_3.txt

See attached for gdb output from latest segfault (seems to happen pretty much as soon as I start slurmctld)

Comment 30 Nate Rini 2020-05-12 11:42:18 MDT
(In reply to Will Dennis from comment #29)
> Created attachment 14206 [details]
> slurmdbd_gdb_output_3.txt
> 
> See attached for gdb output from latest segfault (seems to happen pretty
> much as soon as I start slurmctld)

Please call this in gdb:
> t 7
> f 0
> p row2[ASSOC2_REQ_MTPJ][0]
> p row2[ASSOC2_REQ_MTPJ]
Comment 31 Will Dennis 2020-05-12 11:48:37 MDT
(gdb) t 7
[Switching to thread 7 (Thread 0x7ffff4c5e700 (LWP 12162))]
#0  0x00007ffff69c7a0b in _cluster_get_assocs (mysql_conn=mysql_conn@entry=0x7fffe0009820, user=user@entry=0x7ffff4c5daf0, assoc_cond=assoc_cond@entry=0x7fffe00009a0, cluster_name=0x7fffe00008d0 "macluster", fields=<optimized out>, sent_extra=<optimized out>, is_admin=true, sent_list=0x7fffe00101e0) at as_mysql_assoc.c:2073
2073                                                                      if (row2[ASSOC2_REQ_MTPJ][0])
(gdb) f 0
#0  0x00007ffff69c7a0b in _cluster_get_assocs (mysql_conn=mysql_conn@entry=0x7fffe0009820, user=user@entry=0x7ffff4c5daf0, assoc_cond=assoc_cond@entry=0x7fffe00009a0, cluster_name=0x7fffe00008d0 "macluster", fields=<optimized out>, sent_extra=<optimized out>, is_admin=true, sent_list=0x7fffe00101e0) at as_mysql_assoc.c:2073
2073                                                                      if (row2[ASSOC2_REQ_MTPJ][0])
(gdb) p row2[ASSOC2_REQ_MTPJ][0]
Cannot access memory at address 0x0
(gdb) p row2[ASSOC2_REQ_MTPJ]
$1 = 0x0
(gdb)


Comment 32 Nate Rini 2020-05-12 12:22:11 MDT
Created attachment 14207 [details]
work around patch

Please apply this patch:

1. git am bug9033.NEC.workaround.1711.patch
2. make -j install
3. slurmdbd -Dvvvv
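The `git am` step above can be exercised end to end in a throwaway repo before touching a real tree. A sketch, with the real patch file stood in for by a generated one (the file name and contents below are placeholders, not the actual bug9033.NEC.workaround.1711.patch):

```shell
# Sketch of the `git am` flow: build a mailbox-format patch from a sample
# change, roll back, dry-run it with `git apply --check`, then apply it.
set -e
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git config user.email demo@example.com
git config user.name demo
echo "original line" > as_mysql_assoc.c
git add as_mysql_assoc.c
git commit -qm "base"
# Simulate the fix and export it as the kind of patch `git am` consumes
echo "patched line" > as_mysql_assoc.c
git commit -aqm "bug9033 workaround"
patchfile=$(git format-patch -1 -o "$work" HEAD)
git reset -q --hard HEAD^        # back to the unpatched tree
git apply --check "$patchfile"   # dry run: exits nonzero if it will not apply
git am -q "$patchfile"           # applies and commits with original authorship
grep "patched line" as_mysql_assoc.c
```

The `git apply --check` dry run is the useful habit here: it fails loudly before `git am` records anything.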
Comment 33 Will Dennis 2020-05-12 12:59:41 MDT
Never have used “git am” before…

Where do I run `git am bug9033.NEC.workaround.1711.patch`? In /usr/local/src/slurm-17.11.13-2?


Comment 34 Will Dennis 2020-05-12 13:08:22 MDT
And, before “make -j install”, do a “make clean” or not?

Comment 35 Nate Rini 2020-05-12 13:09:24 MDT
(In reply to Will Dennis from comment #33)
> Never have used “git am” before…
Try this instead if you don't have git:

> 1. patch -p1 < bug9033.NEC.workaround.1711.patch

(In reply to Will Dennis from comment #34)
> And, before “make –j install”, do a “make clean” or not?
It is not required for this patch.
Comment 36 Will Dennis 2020-05-12 13:15:20 MDT
I do have git installed, but ran this and it seems to have worked…

root@skycaptain1:/usr/local/src/slurm-17.11.13-2# patch -p1 < slurm_bug9033_workaround_patch.txt
patching file src/plugins/accounting_storage/mysql/as_mysql_assoc.c

Have not gotten to downgrading the MySQL as of yet, looks like I could go down to v 5.7.11, but hopefully 5.7.30 didn’t modify the mysql db schema or anything… Must I do this before trying out the patch?

Comment 37 Nate Rini 2020-05-12 13:23:02 MDT
(In reply to Will Dennis from comment #36)
> I do have git installed, but ran this and it seems to have worked…
> 
> root@skycaptain1:/usr/local/src/slurm-17.11.13-2# patch -p1 <
> slurm_bug9033_workaround_patch.txt
> patching file src/plugins/accounting_storage/mysql/as_mysql_assoc.c
> 
> Have not gotten to downgrading the MySQL as of yet, looks like I could go
> down to v 5.7.11, but hopefully 5.7.30 didn’t modify the mysql db schema or
> anything… Must I do this before trying out the patch?

Both fixes may be required. Please try this patch first to see if we can get the system online.
Comment 38 Will Dennis 2020-05-12 13:26:17 MDT
OK, started the recompiled slurmdbd, then started the slurmctld, no immediate crash.

Then did:

root@skycaptain1:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
allgpus*     up 60-00:00:0     11    mix skyserver18k,skyserver19k,skyserver23k,skyserver24k,skyserver27k,skyserver28k,skyserver29k,skyserver30k,skyserver31k,skyserver33k,skyserver34k
allgpus*     up 60-00:00:0      1  alloc skyserver22k
allgpus*     up 60-00:00:0     16   idle skyserver10k,skyserver11k,skyserver12k,skyserver13k,skyserver15k,skyserver16k,skyserver17k,skyserver20k,skyserver21k,skyserver25k,skyserver26k,skyserver32k,skyserver36k,skyserver37k,skyserver38k,skyserver39k
desktops     up 60-00:00:0      5   resv ma17-pc4,ma18-pc[1,5],ma19-pc[2-3]
desktops     up 60-00:00:0      8   idle ma17-pc[1-3,5],ma18-pc[3-4,6],ma19-pc1
root@skycaptain1:~#
root@skycaptain1:~# srun uptime
12:19:37 up 1 day,  1:23,  0 users,  load average: 0.17, 0.10, 0.11
root@skycaptain1:~#
root@skycaptain1:~#
root@skycaptain1:~# sacctmgr show runawayjobs
Runaway Jobs: No runaway jobs found

Happy to say, slurmdbd and slurmctld are still up and running…

Anything else you want me to look at while the daemons are running in foreground?


Comment 39 Nate Rini 2020-05-12 13:34:24 MDT
(In reply to Will Dennis from comment #38)
> OK, started the recompiled slurmdbd, then started the slurmctld, no
> immediate crash.
>
> Have not gotten to downgrading the MySQL as of yet, looks like I could go
> down to v 5.7.11, but hopefully 5.7.30 didn’t modify the mysql db schema or
> anything… Must I do this before trying out the patch?
Please downgrade mysql before returning to production. The bug in the mysql code could cause another crash at any time.
Comment 40 Nate Rini 2020-05-12 13:35:48 MDT
> anything… Must I do this before trying out the patch?

Please downgrade mysql before returning to production. The bug in the mysql code could cause another crash at any time.
Comment 41 Will Dennis 2020-05-12 13:41:26 MDT
OK, will try to do that, and hopefully it works…

Moving forward, I assume that I’ll now have to use the compiled version of slurmdbd and slurmctld on this server, instead of the packaged version (which clearly suffer from the bug that was patched.) That means that I will have to modify the system unit files that start the slurmdbd and slurmctld at a minimum; anything else that I need to check?

And, could someone explain the bug that was patched? Why did it seem to only be triggered by this system down event?


Comment 42 Will Dennis 2020-05-12 14:10:18 MDT
Seeing stuff like this in the slurmdbd message output:

slurmdbd: error: We have more allocated time than is possible (15019200 > 4118400) for cluster macluster(1144) from 2020-05-12T11:00:00 - 2020-05-12T12:00:00 tres 1
slurmdbd: error: We have more allocated time than is possible (96753196800 > 32092592400) for cluster macluster(8914609) from 2020-05-12T11:00:00 - 2020-05-12T12:00:00 tres 2
slurmdbd: error: We have more allocated time than is possible (14572800 > 4118400) for cluster macluster(1144) from 2020-05-12T11:00:00 - 2020-05-12T12:00:00 tres 5
slurmdbd: error: id_assoc 12 doesn't have any tres
slurmdbd: debug3: WARNING: Unused wall is less than zero; this should never happen outside a Flex reservation. Setting it to zero for resv id = 6, start = 1558033580.
slurmdbd: debug3: WARNING: Unused wall is less than zero; this should never happen outside a Flex reservation. Setting it to zero for resv id = 6, start = 1558033580.
slurmdbd: debug3: WARNING: Unused wall is less than zero; this should never happen outside a Flex reservation. Setting it to zero for resv id = 6, start = 1558033580.
slurmdbd: debug3: WARNING: Unused wall is less than zero; this should never happen outside a Flex reservation. Setting it to zero for resv id = 6, start = 1558033580.
slurmdbd: debug3: WARNING: Unused wall is less than zero; this should never happen outside a Flex reservation. Setting it to zero for resv id = 6, start = 1558033580.
slurmdbd: debug3: WARNING: Unused wall is less than zero; this should never happen outside a Flex reservation. Setting it to zero for resv id = 6, start = 1558033580.
slurmdbd: debug3: WARNING: Unused wall is less than zero; this should never happen outside a Flex reservation. Setting it to zero for resv id = 6, start = 1558033580.
slurmdbd: debug3: WARNING: Unused wall is less than zero; this should never happen outside a Flex reservation. Setting it to zero for resv id = 6, start = 1558033580.
slurmdbd: error: We have more allocated time than is possible (14994470 > 4118400) for cluster macluster(1144) from 2020-05-12T12:00:00 - 2020-05-12T13:00:00 tres 1
slurmdbd: error: We have more allocated time than is possible (96409043136 > 32092592400) for cluster macluster(8914609) from 2020-05-12T12:00:00 - 2020-05-12T13:00:00 tres 2
slurmdbd: error: We have more allocated time than is possible (14548070 > 4118400) for cluster macluster(1144) from 2020-05-12T12:00:00 - 2020-05-12T13:00:00 tres 5
slurmdbd: error: id_assoc 12 doesn't have any tres

Indicative of db problems?


Comment 43 Nate Rini 2020-05-12 14:10:34 MDT
(In reply to Will Dennis from comment #41)
> Moving forward, I assume that I’ll now have to use the compiled version of
> slurmdbd and slurmctld on this server, instead of the packaged version
> (which clearly suffer from the bug that was patched.) That means that I will
> have to modify the system unit files that start the slurmdbd and slurmctld
> at a minimum; anything else that I need to check?
I don't have any details about your cluster, but in general we suggest running the compiled version on all of your nodes. The setup in comment #15 was *only* to get your system back online. Assuming all of your nodes have the same OS installed, it should be possible either to clone Slurm out to all of the nodes or to use a shared file system (less preferred).

I suggest uninstalling the debs to avoid them overwriting any changes you're about to make to the systemd unit files. Take thorough backups before doing anything.

We gave a patch for 17.11, but that release has been deprecated. Please upgrade to 19.05 as soon as possible. Note that the install prefix explicitly includes the installed version to make it easy to swap versions. Slurm is capable of talking to the 2 previous versions, which should allow a rolling migration of compute nodes instead of needing a full outage. (See https://slurm.schedmd.com/quickstart_admin.html#upgrade)

> And, could someone explain the bug that was patched? Why did it seem to only
> be triggered by this system down event?
Based on the gdb output, looks like the association table was partially corrupted as a value in one of the rows was missing.

I suggest calling the following to rebuild your association tables:
> slurmdbd -R

I also suggest manually verifying that your association table matches your internal expectations, as any lost data cannot be recovered, only repaired.

This can be done while slurmctld is online and serving jobs.

We will be creating a real patch that will be mainlined to fix the problem permanently.

Can we lower the severity of this ticket?
Comment 44 Nate Rini 2020-05-12 14:12:37 MDT
(In reply to Will Dennis from comment #42)
> Seeing stuff like this in the slurmdbd message output:
> 
> Indicative of db problems?

That is normal when Slurm is taken down hard. Once you have upgraded to 19.05 (or 20.02) please call the following to cleanup the jobs:
> sacctmgr show runaways

The self repair code in 17.11 is lacking, so I suggest waiting until after the upgrade.
Comment 45 Will Dennis 2020-05-12 16:26:04 MDT
Created attachment 14209 [details]
skycaptain1_mysql_downgrade.txt

Did the MySQL downgrade to v5.7.11, and held it to prevent upgrades (at least until they release 5.7.31+ for Ubuntu 16.04) – see attached for downgrade output

Comment 46 Nate Rini 2020-05-12 16:38:03 MDT
(In reply to Will Dennis from comment #45)
> Did the MySQL downgrade to v5.7.11, and held it to prevent upgrades (at
> least until they release 5.7.31+ for Ubuntu 16.04) – see attached for
> downgrade output
>
> mysql_upgrade: (non fatal) [ERROR] 1728: Cannot load from mysql.proc. The table is probably corrupted
> mysql_upgrade: (non fatal) [ERROR] 1545: Failed to open mysql.event
Looks like mysql_upgrade eventually figured it out.

I take it that the system is now back to running jobs?
Comment 47 Will Dennis 2020-05-12 16:45:29 MDT
Not quite yet…


  *   I have to figure out how to remove the old Slurm 17.11.7 packages without removing any needed conf files etc.; want to get a system backup first, which I have to work with another IT team member to get done
  *   Need to modify the existing systemd unit files (or create them, in case the above deletes them) to call the new patched slurmdbd and slurmctld binaries
  *   Need to test out the system, and inspect the Slurm DB tables, especially the association table

Tangential issue: need to speak with someone from SchedMD to figure out how to orchestrate a conversion from packaged Slurm binaries to compiled ones, and how to do this over a fleet of servers that comprise the cluster, especially doing upgrades in the future (perhaps this is all documented in detail somewhere?)


Comment 48 Nate Rini 2020-05-12 16:56:47 MDT
(In reply to Will Dennis from comment #47)
> Not quite yet…
Understood, going to reduce this to SEV3 as there is a workaround for the issue.
 
>   *   I have to figure out how to remove the old Slurm 17.11.7 packages
> without removing any needed conf files etc.; want to get a system backup
> first, which I have to work with another IT team member to get done
Thorough backups are always suggested.

>   *   Need to test out the system, and inspect the Slurm DB tables,
> especially the association table
This can and should be done from 'sacctmgr' commands. Something as simple as a check of 'sacctmgr show assoc' output should be sufficient. The assoc tables can be updated while the system is running using the sacctmgr commands.
 
> Tangential issue: need to speak with someone from SchedMD to figure out how
> to orchestrate a conversion from packaged Slurm binaries to compiled ones,
We are happy to help with compiling and installing Slurm from the official source tarballs.

> and how to do this over a fleet of servers that comprise the cluster,
> especially doing upgrades in the future (perhaps this is all documented in
> detail somewhere?)
How this is done usually depends heavily on how the systems integrator set up the cluster and what suite is managing it.
Comment 49 Will Dennis 2020-05-12 18:24:40 MDT
All the ‘sacctmgr’ commands I used seem to work fine, as do ‘sacct’ and ‘sdiag’, when using the new binaries.

We are a R&D shop, so way less formal with procedures (such as backup!) than we should be, but we are learning lessons as we go…

We do not use any sort of systems integrator or cluster management suite – it’s all done by us (central IT + some researchers). So pointers on how to roll out Slurm binaries across a fleet of servers, without either using the OS package management system or compiling everything on each server by hand, would be gratefully accepted :)

Comment 50 Nate Rini 2020-05-12 19:54:20 MDT
(In reply to Will Dennis from comment #49)
> We do not use any sort of systems integrator, or cluster management suite –
> it’s just all done by us (central IT + some researchers.) So pointers on how
> to roll out Slurm binaries across a fleet of servers without either using
> the OS package management system, or compiling everything on each server by
> hand, would be gratefully accepted :)

How is the OS image managed on all of the nodes?
Comment 51 Will Dennis 2020-05-13 09:53:08 MDT
We deploy the OS on the servers via a basic PXE-based answer-file install, then configure the nodes to production standard config via Ansible. 

Currently in the Ansible playbook, we are adding the Slurm 17.11.7 PPA repo, and then doing package installation for the requisite Slurm packages on the nodes. Would need to change this to have Ansible do "the right thing" to deploy compiled binaries to the different cluster nodes, and/or compile Slurm on each node...
Comment 52 Will Dennis 2020-05-13 10:02:10 MDT
I have uninstalled the Slurm packages from the controller, and now need to fix up the systemd unit files for starting the new /usr/local/slurm-based daemon binaries.

Currently, I have these entries in the following directories:

In /etc/systemd/system:
lrwxrwxrwx 1 root root    9 May 13 08:29 slurmctld.service -> /dev/null
lrwxrwxrwx 1 root root    9 May 13 08:29 slurmdbd.service -> /dev/null

In /etc/systemd/system/multi-user.target.wants/ :
lrwxrwxrwx 1 root root 37 Feb 25  2019 slurmctld.service -> /lib/systemd/system/slurmctld.service
lrwxrwxrwx 1 root root 36 Feb 23  2019 slurmdbd.service -> /lib/systemd/system/slurmdbd.service

(There are no longer any slurm*.service files in /lib/systemd/system)

Is it "the right thing" to change the /etc/systemd/system symlinks to point at the new .service files in /usr/local/src/slurm-17.11.13-2/etc and then do a "systemctl enable slurm[dbd,ctld]"?
Comment 53 Will Dennis 2020-05-13 11:52:27 MDT
Put two comments in via the ticket’s web interface, not sure if you get notified about those…
I’d like to get some guidance on the systemd question I asked there, if you could get to it fairly soon… Thanks!
Comment 54 Nate Rini 2020-05-13 11:57:42 MDT
(In reply to Will Dennis from comment #53)
> Put two comments in via the ticket’s web interface, not sure if you get
> notified about those…
> I’d like to get some guidance on the system question I asked there, if you
> could get to that fairly soon… Thanks!

Working on a response right now.
Comment 55 Nate Rini 2020-05-13 12:09:33 MDT
(In reply to Will Dennis from comment #51)
> We deploy the OS on the servers via a basic PXE-based answer-file install,
> then configure the nodes to production standard config via Ansible. 
> 
> Currently in the Ansible playbook, we are adding the Slurm 17.11.7 PPA repo,
> and then doing package installation for the requisite Slurm packages on the
> nodes. Would need to change this to have Ansible do "the right thing" to
> deploy compiled binaries to the different cluster nodes, and/or compile
> Slurm on each node...
I assume these are nodes with physical disks. It should be possible to tell Ansible to clone/rsync /usr/local/slurm out from your controller to all of the nodes. I believe that would be the easiest solution here.

(In reply to Will Dennis from comment #52)
> Is it "the right thing" to change the /etc/systemd/system symlinks to point
> at the new .service files in /usr/local/src/slurm-17.11.13-2/etc and then do
> a "systemctl enable slurm[dbd,ctld]"?

This is distro specific (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/sect-managing_services_with_systemd-unit_files), but /etc/systemd/system should be fine. Note that it is common for sites to put a symlink at /usr/local/slurm/current -> /usr/local/src/slurm-17.11.13-2 and then point the system symlinks at that link instead, for easy upgrades. Sites use 'latest' or 'current' or another name as desired.
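The 'current' symlink convention described above can be sketched in a scratch prefix (standing in for /usr/local/slurm; the version numbers are illustrative):

```shell
# Versioned install dirs plus one 'current' link; everything else references
# the link, so an upgrade is a single atomic re-point.
set -e
prefix=$(mktemp -d)
mkdir -p "$prefix/17.11.13-2/sbin" "$prefix/19.05.7/sbin"
ln -s "$prefix/17.11.13-2" "$prefix/current"
# Unit files and PATH entries reference $prefix/current/... only
readlink "$prefix/current"
# Upgrade: swap the link to the new tree, no unit-file edits needed
ln -sfn "$prefix/19.05.7" "$prefix/current"
readlink "$prefix/current"
```

`ln -sfn` replaces the link itself rather than following it into the old target directory, which is what makes the swap safe to repeat.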
Comment 56 Will Dennis 2020-05-13 12:48:20 MDT
OK, did the following:

root@skycaptain1:/usr/local/slurm# ln -s 17.11.13-2 current
root@skycaptain1:/usr/local/slurm# cd /etc/systemd/system
root@skycaptain1:/etc/systemd/system# mkdir /usr/local/slurm/current/etc
root@skycaptain1:/etc/systemd/system# cp /usr/local/src/slurm-17.11.13-2/etc/slurmctld.service /usr/local/slurm/current/etc
root@skycaptain1:/etc/systemd/system# cp /usr/local/src/slurm-17.11.13-2/etc/slurmdbd.service /usr/local/slurm/current/etc
root@skycaptain1:/etc/systemd/system# rm slurmctld.service
root@skycaptain1:/etc/systemd/system# rm slurmdbd.service
root@skycaptain1:/etc/systemd/system# ln -s /usr/local/slurm/current/etc/slurmctld.service slurmctld.service
root@skycaptain1:/etc/systemd/system# ln -s /usr/local/slurm/current/etc/slurmdbd.service slurmdbd.service
root@skycaptain1:/etc/systemd/system# systemctl daemon-reload
root@skycaptain1:/etc/systemd/system# systemctl start slurmdbd

The service startup hung, and finally terminated with a timeout error. When I got service status thereafter, I see:

root@skycaptain1:/etc/systemd/system# systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/local/slurm/current/etc/slurmdbd.service; enabled; vendor preset: enabled)
   Active: failed (Result: timeout) since Wed 2020-05-13 11:30:53 PDT; 3min 43s ago
  Process: 6780 ExecStart=/usr/local/slurm/17.11.13-2/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 19372 (code=killed, signal=SEGV)

May 13 11:29:23 skycaptain1 systemd[1]: Starting Slurm DBD accounting daemon...
May 13 11:29:23 skycaptain1 systemd[1]: slurmdbd.service: Can't open PID file /var/run/slurmdbd.pid (yet?) after start: No such file or directory
May 13 11:30:53 skycaptain1 systemd[1]: slurmdbd.service: Start operation timed out. Terminating.
May 13 11:30:53 skycaptain1 systemd[1]: Failed to start Slurm DBD accounting daemon.
May 13 11:30:53 skycaptain1 systemd[1]: slurmdbd.service: Unit entered failed state.
May 13 11:30:53 skycaptain1 systemd[1]: slurmdbd.service: Failed with result 'timeout'.

The end of the slurmdbd.log has:

[2020-05-13T11:29:23.235] error: id_assoc 12 doesn't have any tres
[2020-05-13T11:29:23.241] error: We have more allocated time than is possible (14616000 > 4118400) for cluster macluster(1144) from 2020-05-13T09:00:00 - 2020-05-13T10:00:00 tres 1
[2020-05-13T11:29:23.241] error: We have more allocated time than is possible (91033516800 > 32092592400) for cluster macluster(8914609) from 2020-05-13T09:00:00 - 2020-05-13T10:00:00 tres 2
[2020-05-13T11:29:23.241] error: We have more allocated time than is possible (14212800 > 4118400) for cluster macluster(1144) from 2020-05-13T09:00:00 - 2020-05-13T10:00:00 tres 5
[2020-05-13T11:29:23.241] error: id_assoc 12 doesn't have any tres
[2020-05-13T11:29:23.247] error: We have more allocated time than is possible (14616000 > 4118400) for cluster macluster(1144) from 2020-05-13T10:00:00 - 2020-05-13T11:00:00 tres 1
[2020-05-13T11:29:23.247] error: We have more allocated time than is possible (91033516800 > 32092592400) for cluster macluster(8914609) from 2020-05-13T10:00:00 - 2020-05-13T11:00:00 tres 2
[2020-05-13T11:29:23.247] error: We have more allocated time than is possible (14212800 > 4118400) for cluster macluster(1144) from 2020-05-13T10:00:00 - 2020-05-13T11:00:00 tres 5
[2020-05-13T11:29:23.248] error: id_assoc 12 doesn't have any tres
[2020-05-13T11:29:23.575] debug2: No need to roll cluster macluster this month 1588316400 <= 1588316400
[2020-05-13T11:29:23.598] debug2: Got 1 of 1 rolled up
[2020-05-13T11:29:23.598] debug2: Everything rolled up
[2020-05-13T11:30:53.165] Terminate signal (SIGINT or SIGTERM) received
[2020-05-13T11:30:53.165] debug:  rpc_mgr shutting down

So, it looks to me like the slurmdbd service did start OK and was running, but systemd thought it wasn’t responding (?) and terminated it…

The current .service unit file is:

root@skycaptain1:/etc/systemd/system# cat /etc/systemd/system/slurmdbd.service
[Unit]
Description=Slurm DBD accounting daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurmdbd.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmdbd
ExecStart=/usr/local/slurm/17.11.13-2/sbin/slurmdbd $SLURMDBD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmdbd.pid
TasksMax=infinity

[Install]
WantedBy=multi-user.target


I know systemd isn’t a part of Slurm, but any ideas on what may be going on here?
Comment 57 Nate Rini 2020-05-13 12:56:09 MDT
(In reply to Will Dennis from comment #56)
> I know systemd isn’t a part of Slurm, but any ideas on what may be going on
> here?

We provide example unit files (https://github.com/SchedMD/slurm/blob/master/etc/slurmdbd.service.in) which we generally suggest sites follow.

> root@skycaptain1:/etc/systemd/system# cat
> /etc/systemd/system/slurmdbd.service
> [Unit]
> Description=Slurm DBD accounting daemon
> After=network.target munge.service
> ConditionPathExists=/etc/slurm/slurmdbd.conf
> 
> [Service]
> Type=forking
Change this to "simple"
> EnvironmentFile=-/etc/sysconfig/slurmdbd
> ExecStart=/usr/local/slurm/17.11.13-2/sbin/slurmdbd $SLURMDBD_OPTIONS
Add "-D" here to keep the daemon in foreground mode.
> ExecReload=/bin/kill -HUP $MAINPID
> PIDFile=/var/run/slurmdbd.pid
This may need to be changed to /usr/local/slurm/.../var/run/slurmdbd.pid
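Putting the suggestions above together, a revised unit file might look like the following. This is only a sketch: the ExecStart path is the site's own install prefix, and dropping PIDFile assumes a foreground (`-D`) daemon under `Type=simple`, where systemd tracks the main process directly and no PID file is needed.

```
[Unit]
Description=Slurm DBD accounting daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurmdbd.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmdbd
# -D keeps slurmdbd in the foreground, which Type=simple expects
ExecStart=/usr/local/slurm/17.11.13-2/sbin/slurmdbd -D $SLURMDBD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
TasksMax=infinity

[Install]
WantedBy=multi-user.target
```

With `Type=simple`, systemd considers the unit started as soon as the process is forked, so the start-operation timeout seen earlier cannot recur from a missing or unwritable PID file.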
Comment 58 Will Dennis 2020-05-13 13:06:57 MDT
It was the PIDFile value; the "slurm" user couldn't write to the path defined there. Fixed that, and now all is well; the slurmdbd service is up and running.

Now I'll do the slurmctld.service file; do you think slurmctld should also have a service dependency on slurmdbd (in our case, since we use database accounting)? Or is this not really necessary?


Comment 59 Nate Rini 2020-05-13 13:23:33 MDT
(In reply to Will Dennis from comment #58)
> do you think slurmctld should also
> have a service dependency on slurmdbd (in our case, since we use database
> accounting?) Or is this not really necessary?

You can, but it is not necessary as slurmctld will retry a few times to talk to slurmdbd during startup.
Comment 60 Will Dennis 2020-05-14 16:09:40 MDT
We are OK now; I made the needed changes to systemd (which now includes "systemd-tmpfiles" control of /var/run: I had to write a config to create the needed /var/run/slurm directory at boot time so the PID file could be written out to it...)
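The boot-time directory creation described above can be handled with a tmpfiles.d snippet. The filename, mode, and ownership below are assumptions for illustration; the actual values depend on the site's SlurmUser and PidFile settings.

```
# /etc/tmpfiles.d/slurm.conf (hypothetical filename)
# Create /var/run/slurm at boot, owned by the slurm user,
# so the daemon can write its PID file there.
d /var/run/slurm 0755 slurm slurm -
```

Running `systemd-tmpfiles --create` applies the snippet immediately without a reboot.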

On testing, the new custom version starts and runs correctly.

Thanks for all your help, OK to close out ticket now.
Comment 61 Nate Rini 2020-05-14 16:18:49 MDT
(In reply to Will Dennis from comment #60)
> Thanks for all your help, OK to close out ticket now.

I'm going to reduce this ticket to SEV4, as this is now the QA ticket for the patch in comment #32. We will update and close the ticket once the fix is mainlined.
Comment 67 Nate Rini 2020-06-25 19:49:24 MDT
Will,

After internal discussion, we are going to leave Slurm functioning as it currently does. Attempting to recover from a corrupted database can potentially leave a system in an unknown state, and the best course of action is to load a backup or forcibly clear the state save directory.

Please respond if you have any questions.

Thanks,
--Nate