Hi SLURM Support We are planning to upgrade SLURM to 17.11.4 from 16.05.4 in our live production cluster, but are concerned about the very long upgrade time for its DB updates, as mentioned in the SLURM forum. Our DB currently has 1152045 records. What should we watch out for? Or should we just do it and try to resolve any problems later? Kindly advise. Thanks. Cheers Damien
Hi Damien, As a reference point, one of our databases with 3M jobs on an NVMe drive took 12 minutes to finish the upgrade. The exact same database on different hardware took around 45 minutes, so hardware matters a lot. As another reference, a customer with 5M records took about 2 hours. Upgrading shouldn't impact any running jobs, and the queue should work just fine; records won't be written to the database during the update, but they'll be queued up and written once the update is complete. What kind of hardware do you have for your database? If you are highly concerned about this, you could back up your database and test the upgrade on another test server with similar or slower hardware; then you will have an estimate of how much time it will take. Another point to take into consideration: if you're running slurmdbd under systemd and it takes a long time to convert or roll up, it could be detected as a timeout and the process killed. One option here is to run slurmdbd in the foreground (using the slurmdbd -D flag) so it runs separately from systemd. Run it at least at the info log level so you can see this message: "Conversion done: success!" Finally, one more question: which operating system are you running Slurm on? If you are using rpms, slurm.spec has been rewritten and will not work on RHEL versions older than 7, in which case you would have to use slurm-legacy.spec. Hope it helps.
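[Editor's note] Running slurmdbd in the foreground as described above is the simplest route. As a hedged alternative sketch (the unit name and drop-in path are assumptions about your setup), a systemd drop-in can raise the start timeout so the unit is not killed mid-conversion:

```ini
# /etc/systemd/system/slurmdbd.service.d/override.conf
# Give the unit an effectively unlimited start timeout for the upgrade;
# revert or remove this drop-in afterwards.
[Service]
TimeoutStartSec=infinity
```

After creating the drop-in, run 'systemctl daemon-reload' before starting the service.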
Hi Felip Thanks for the replies. We are running our MySQL and slurmdbd on separate VMs of 2 CPU cores each. Yes, it is very lightweight. Do you think it will be a serious concern? We are planning to implement a downtime for this upgrade to SLURM 17.11.4, but are still concerned about the conversion time. Cheers Damien
(In reply to Damien from comment #2) > Hi Felip > > Thanks for the replies. > > We are running our mySQL and slurmDB on separate VMs of 2 CPU-cores each. > Yes, it is very light weight. > > Do you think it will be a serious concern ? > > We are planning to implement a downtime for this upgrade to SLURM 17.11.4, > but still concern of the conversion time. > > > > Cheers > > > Damien 2 CPU cores is not a lot; does it have enough memory? We generally do not recommend VMs for slurmctld or slurmdbd, and I cannot estimate a precise time, but given your record count (1M) it shouldn't be a huge problem. If you want to be 100% sure, here is what I would do: Try to clone your VMs, isolate the machine, and try the conversion. That may be the quickest method. Or, with a little more work, you can do it on your workstation; just install a Slurm instance: 1. mysqldump your database and restore it 2. git clone the slurm code 3. ./configure --prefix=/your/install/basedir && make -j install 4. cd /your/install/basedir and configure a minimal slurmdbd.conf and slurmctld.conf 5. Just start /your/install/basedir/sbin/slurmdbd -Dvvv manually; no need to start slurmctld or the slurmds. The advantage here is that if it works on the testbed and something then goes wrong in production, you already have a converted database to import into the new Slurm. Of course... consider also increasing the power of these VMs or, even better, getting a bare-metal machine for it.
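[Editor's note] The five steps above can be sketched as a script. This is a hedged sketch: the repository URL, credentials, database names, and paths are assumptions and must be adapted. It is written to a file and syntax-checked rather than executed, so it can be reviewed first:

```shell
# Hedged sketch of the sandbox conversion test; all paths, the repo URL,
# and DB credentials are assumptions. Saved to a file for review rather
# than executed directly.
cat > /tmp/slurm_convert_test.sh <<'EOF'
#!/bin/bash
set -euo pipefail
# 1. Dump the production DB and restore it into the test MySQL instance
mysqldump -u slurm -p slurm_acct_db > /tmp/slurm_acct_db.sql
mysql -u slurm -p slurm_acct_db < /tmp/slurm_acct_db.sql
# 2-3. Build the target Slurm release from source
git clone https://github.com/SchedMD/slurm.git /tmp/slurm-src
cd /tmp/slurm-src
./configure --prefix=/your/install/basedir && make -j install
# 4-5. With a minimal slurmdbd.conf in place, run and time the conversion
cd /your/install/basedir
time ./sbin/slurmdbd -Dvvv
EOF
bash -n /tmp/slurm_convert_test.sh && echo "syntax OK"
```

Timing the foreground run gives the conversion estimate the thread is after.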
Hi Damien, Do you have any other questions about this matter? If not, is it OK for you to close the bug? Thank you
Hi Felip In regard to this: Can we do a 'sacctmgr archive dump' without defining a purge? If this works, how should we load these events/jobs back? 'sacctmgr archive load'? We have not done this before, and searching the web for this, there aren't many examples or success stories. Kindly comment. Thanks. Cheers Damien
(In reply to Damien from comment #5) > Hi Felip > > In regard to this: > > Can we do a 'sacctmgr archive dump' ? without defining a purge ? > > If this works, How should we load back this events/jobs ? sacctmgr archive > load ? > > > We have not done this before, and searching for this on the web, there isn't > much example or success stories of this in the web. > > > Kindly comment. Thanks. > > > Cheers > > Damien Are you doing this to try to start a Slurm instance on your laptop, as I suggested in comment 3? If so, I was referring to a raw mysqldump command, not a sacctmgr dump: https://dev.mysql.com/doc/refman/5.7/en/mysqldump.html If what you are trying to do is reduce the size of your database while archiving data, you must define purge and archive values in your conf file, and then let a rollup happen automatically. Using sacctmgr for archiving is not recommended. But first, tell me why you want to archive/purge data.
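[Editor's note] For reference, a sketch of the slurmdbd.conf parameters that drive the internal purge/archive process mentioned above. The values shown are illustrative assumptions only; see the slurmdbd.conf man page for the full set:

```
# slurmdbd.conf fragment (illustrative values, adapt to your retention policy)
ArchiveDir=/var/spool/slurm/archive   # where archive files are written
ArchiveJobs=yes
ArchiveSteps=yes
ArchiveEvents=yes
PurgeJobAfter=12month                 # job rows older than this are purged
PurgeStepAfter=12month
PurgeEventAfter=6month
PurgeSuspendAfter=6month
```

With these set, the nightly rollup archives and purges old records without any sacctmgr involvement.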
Hi Felip Following your advice, I am constructing a minimal slurmdbd with MariaDB on the same host, to load, view and convert the existing database. I am hitting this error. I have seen it before. [root@slurmdb-dl slurm]# [root@slurmdb-dl slurm]# /usr/local/slurm/sbin/slurmdbd -Dvvv slurmdbd: debug: Log file re-opened slurmdbd: debug: Null authentication plugin loaded slurmdbd: error: Couldn't find the specified plugin name for accounting_storage/mysql looking at all files slurmdbd: error: cannot find accounting_storage plugin for accounting_storage/mysql slurmdbd: error: cannot create accounting_storage context for accounting_storage/mysql slurmdbd: fatal: Unable to initialize accounting_storage/mysql accounting storage plugin [root@slurmdb-dl slurm]# I believed mariadb-devel was not installed, but it is installed: [root@slurmdb-dl slurm]# yum install mariadb-server mariadb-devel Loaded plugins: fastestmirror Loading mirror speeds from cached hostfile * epel: mirror.as24220.net Package 1:mariadb-server-5.5.56-2.el7.x86_64 is obsoleted by MariaDB-server-10.2.13-1.el7.centos.x86_64 which is already installed Package 1:mariadb-devel-5.5.56-2.el7.x86_64 is obsoleted by MariaDB-devel-10.2.13-1.el7.centos.x86_64 which is already installed Nothing to do Any clues to refresh my memory? Kindly advise. Thanks. Cheers Damien > > 2-cpu cores is not a lot, does it have enough memory? We use not to > recommend VMs for slurmctld or dbd and I cannot estimate a precise time, but > given your records (1M) it shouldn't be a huge problem. > > > If you want to be 100% sure, what I would do: > > Try to clone your VMs, isolate the machine, and try the conversion. May be > the quickest method. > > > Or with little more work you can do it in your workstation, just install a > Slurm instance: > > 1. mysqldump your database and restore it > > 2. git clone slurm code > > 3. ./configure --prefix=/your/install/basedir && make -j install > > 4. 
cd /your/install/basedir and configure a minimal slurmdbd.conf and > slurmctld.conf > > 5. Just start manually /your/install/basedir/sbin/slurmdbd -Dvvv , no need > to start slurmctld or slurmds. > > > The advantadge here is if that it works on the testbed and something goes > wrong, you have an already converted database to import on the new slurm. > > > Of course... consider also increasing the power of these VMs or even better, > get a bare metal for it.
Damien, Maybe you need mysql-devel instead of mariadb-devel? You should look at the ./configure output to see if it detects the library or not. This is my *minimal* list of development packages for CentOS 7; pick what you need from here: munge hwloc libssh2 ncurses expect numactl libcurl psmisc kernel-devel gcc cpp autoconf automake gdb mariadb-server libevent-devel hwloc-devel openssl-devel libssh2-devel mysql-devel pam-devel gtk2-devel ncurses-devel lua-devel rrdtool-devel rpm-build bzip2-devel munge-devel libcurl-devel numactl-devel readline-devel perl-ExtUtils-MakeMaker (In reply to Damien from comment #7) > Hi Felip > > As taken from your advice. > > I am constructing a minimal slurmdb with mariaDB in the same host, to load, > view and convert the exist database. > > > I am hit with this error. I have seen this before. > > [root@slurmdb-dl slurm]# > [root@slurmdb-dl slurm]# /usr/local/slurm/sbin/slurmdbd -Dvvv > slurmdbd: debug: Log file re-opened > slurmdbd: debug: Null authentication plugin loaded > slurmdbd: error: Couldn't find the specified plugin name for > accounting_storage/mysql looking at all files > slurmdbd: error: cannot find accounting_storage plugin for > accounting_storage/mysql > slurmdbd: error: cannot create accounting_storage context for > accounting_storage/mysql > slurmdbd: fatal: Unable to initialize accounting_storage/mysql accounting > storage plugin > [root@slurmdb-dl slurm]# > > > > I believe mariadb-devel is not install, but it is been installed. 
> > > [root@slurmdb-dl slurm]# yum install mariadb-server mariadb-devel > Loaded plugins: fastestmirror > Loading mirror speeds from cached hostfile > * epel: mirror.as24220.net > Package 1:mariadb-server-5.5.56-2.el7.x86_64 is obsoleted by > MariaDB-server-10.2.13-1.el7.centos.x86_64 which is already installed > Package 1:mariadb-devel-5.5.56-2.el7.x86_64 is obsoleted by > MariaDB-devel-10.2.13-1.el7.centos.x86_64 which is already installed > Nothing to do > > > > Any clues to refresh my memory? > > > Kindly advise. Thanks. > > > Cheers > > Damien > > > > > > > > > > > > 2-cpu cores is not a lot, does it have enough memory? We use not to > > recommend VMs for slurmctld or dbd and I cannot estimate a precise time, but > > given your records (1M) it shouldn't be a huge problem. > > > > > > If you want to be 100% sure, what I would do: > > > > Try to clone your VMs, isolate the machine, and try the conversion. May be > > the quickest method. > > > > > > Or with little more work you can do it in your workstation, just install a > > Slurm instance: > > > > 1. mysqldump your database and restore it > > > > 2. git clone slurm code > > > > 3. ./configure --prefix=/your/install/basedir && make -j install > > > > 4. cd /your/install/basedir and configure a minimal slurmdbd.conf and > > slurmctld.conf > > > > 5. Just start manually /your/install/basedir/sbin/slurmdbd -Dvvv , no need > > to start slurmctld or slurmds. > > > > > > The advantadge here is if that it works on the testbed and something goes > > wrong, you have an already converted database to import on the new slurm. > > > > > > Of course... consider also increasing the power of these VMs or even better, > > get a bare metal for it.
Sorry, on my Fedora I have the mariadb packages, not the mysql ones: ]$ rpm -qa|grep mariad mariadb-common-10.1.26-2.fc25.x86_64 mariadb-config-10.1.26-2.fc25.x86_64 mariadb-server-10.1.26-2.fc25.x86_64 mariadb-libs-10.1.26-2.fc25.x86_64 mariadb-10.1.26-2.fc25.x86_64 mariadb-server-utils-10.1.26-2.fc25.x86_64 mariadb-devel-10.1.26-2.fc25.x86_64 mariadb-errmsg-10.1.26-2.fc25.x86_64 Take a look at the configure output. I think you may have some conflicts in your system: > Loading mirror speeds from cached hostfile > * epel: mirror.as24220.net > Package 1:mariadb-server-5.5.56-2.el7.x86_64 is obsoleted by > MariaDB-server-10.2.13-1.el7.centos.x86_64 which is already installed
Hi Felip Spending more time troubleshooting resolved my issue. We can now mysqldump the Slurm database and review its records from another location/server. Can you share with us why using sacctmgr for archiving is not recommended? We were planning to use 'sacctmgr archive dump' and 'sacctmgr archive load' to dynamically load and unload old records when necessary. Do you have the syntax or an example for 'sacctmgr archive dump' and 'sacctmgr archive load'? Kindly advise. Thanks. Cheers Damien
(In reply to Damien from comment #10) > Hi Felip > > Spending more time troubleshooting resolved my issue. We can now mysql dump > the Slurm database and review it's records from another location/server. > > Can you share with us why using sacctmgr for archiving is not recommended ? > > We were planning to use sacctmgr archive dump and sacctmgr archive load to > dynamic load and unload old records while necessary. > > > Do you have my syntax or example for sacctmgr archive dump and sacctmgr > archive load ? > > > > > Kindly advise. Thanks. > > Cheers > > Damien We don't recommend using 'sacctmgr archive' since there can be issues if your archive takes too long to complete. If that happens, your 'sacctmgr' command, depending on MessageTimeout and other internal timers, can time out and cut the communication with the archiving process. This can lead to a dangerous situation where the archive is not completed correctly. There are also problems with the locking mechanism used by sacctmgr, which is not the same as the one used by the internal archiving process. Finally, this command can block the database, affecting production. This command must be considered a 'development' tool not intended for production usage. We have opened a bug to clarify this situation. So, I recommend you always use the internal archiving process. On your testing server, in order to trigger a rollup and therefore an archive, modify your last_ran_table like this: 1. Stop slurmdbd 2. Modify last_ran_table update <yourcluster>_last_ran_table SET hourly_rollup = UNIX_TIMESTAMP('2018-03-11 00:00:00'), daily_rollup = UNIX_TIMESTAMP('2018-03-11 00:00:00'), monthly_rollup = UNIX_TIMESTAMP('2018-03-11 00:00:00'); 3. Start slurmdbd Of course you have to set the appropriate parameters in the slurmdbd conf file. One more thing: we recently discovered an issue with MySQL 5.1 that causes the upgrade process to be slow or to never end. 
This is probably the cause of the 'very long upgrading time' you are concerned about. Although we have not yet identified the exact cause, we have verified that with 5.1 the upgrade consistently either fails to complete or takes far too long, so I strongly recommend you check your MySQL version before upgrading. If you still want to use sacctmgr, there is some explanation here: https://slurm.schedmd.com/sacctmgr.html#lbBL
Hi Slurm Support We encountered problems with slurmdbd during the 17.11.04 update on our production cluster. This production cluster was marked down for a 12-hour maintenance window for this upgrade. At the start of the upgrade, we let slurmdbd run the conversion from 16.05.04; after almost 40 minutes, we observed these messages: --- slurmdbd: debug: Table qos_table has changed. Updating... slurmdbd: debug: Table user_table has changed. Updating... slurmdbd: converting QOS table slurmdbd: Conversion done: success! slurmdbd: Accounting storage MYSQL plugin loaded slurmdbd: debug3: Success. --- We believed the database had been updated successfully, so we moved on to slurmctld and the slurmd nodes. Two hours later we were happy with the whole 17.11.04 upgrade; the Slurm cluster was put back online and we began our initial testing. Within the next hour, we released the whole Slurm cluster back to our users. About 16 hours later, slurmdbd crashed. slurmctld and the respective slurmd daemons were still doing their tasks properly. 
A closer examination of the slurmdbd logs shows lots of: -- [2018-03-15T04:03:24.389] error: It looks like the storage has gone away trying to reconnect [2018-03-15T04:03:29.441] error: It looks like the storage has gone away trying to reconnect [2018-03-15T04:03:34.481] error: It looks like the storage has gone away trying to reconnect [2018-03-15T04:03:39.521] error: It looks like the storage has gone away trying to reconnect (... the same error repeated roughly every 5 seconds, through at least [2018-03-15T04:06:41.218] ...) -- sreport is not working either (this was tested successfully right after the update): -- sreport cluster Utilization cluster=m3 start=2018-03-14T00:00:00 end=2018-03-14T23:59:59 sreport: error: slurmdbd: Unknown error 1054 
Problem with cluster query. -- Double-checking the MySQL (MariaDB) instance, it was still running. We also discovered more updating of tables more than 12 hours after the upgrade, see: --- [2018-03-15T18:52:32.063] debug2: DBD_JOB_START: START CALL ID:1736077 NAME:de_e5_1303 INX:0 [2018-03-15T18:52:32.064] debug2: as_mysql_slurmdb_job_start() called [2018-03-15T18:52:32.064] error: mysql_query failed: 1054 Unknown column 'pack_job_id' in 'field list' insert into "m3_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, node_inx, gres_req, gres_alloc, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (1736077, UNIX_TIMESTAMP(), 0, 4294967294, 0, 4294967294, 1166, 1, 10540, 10025, 'm3c005', 0, 4320, 1521048532, 1521048532, 1521048533, 'de_e5_1303', 0, 5, 18240, 1, 1, 9223372036854779904, 'qe26', 'm3c', '27', 'gpu:2', 'gpu:2', NULL, 0, '1=2,2=8192,3=18446744073709551614,4=1,5=2,1001=2', '1=1,2=4096,4=1,1001=2', '/projects/qe26/autoqg') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=1166, id_user=10540, id_group=10025, nodelist='m3c005', id_resv=0, timelimit=4320, time_submit=1521048532, time_eligible=1521048532, time_start=1521048533, mod_time=UNIX_TIMESTAMP(), job_name='de_e5_1303', track_steps=0, id_qos=1, state=greatest(state, 5), priority=18240, cpus_req=1, nodes_alloc=1, mem_req=9223372036854779904, id_array_job=0, id_array_task=4294967294, pack_job_id=0, pack_job_offset=4294967294, account='qe26', `partition`='m3c', node_inx='27', gres_req='gpu:2', gres_alloc='gpu:2', array_task_str=NULL, array_task_pending=0, tres_alloc='1=2,2=8192,3=18446744073709551614,4=1,5=2,1001=2', tres_req='1=1,2=4096,4=1,1001=2', work_dir='/projects/qe26/autoqg' [2018-03-15T18:52:32.064] error: It looks like the storage has gone away trying to reconnect [2018-03-15T18:52:32.064] error: mysql_query failed: 1054 Unknown column 'pack_job_id' in 'field list' (... the exact same insert query repeated verbatim on the retry ...) and lots of: [2018-03-15T22:36:36.117] debug: Table "m3_job_table" has changed. Updating... 
[2018-03-15T22:37:22.538] error: mysql_query failed: 1060 Duplicate column name 'admin_comment' alter table "m3_job_table" modify `job_db_inx` bigint unsigned not null auto_increment, modify `mod_time` bigint unsigned default 0 not null, modify `deleted` tinyint default 0 not null, modify `account` tinytext, add `admin_comment` text after account, modify `array_task_str` text, modify `array_max_tasks` int unsigned default 0 not null, modify `array_task_pending` int unsigned default 0 not null, modify `cpus_req` int unsigned not null, modify `derived_ec` int unsigned default 0 not null, modify `derived_es` text, modify `exit_code` int unsigned default 0 not null, modify `job_name` tinytext not null, modify `id_assoc` int unsigned not null, modify `id_array_job` int unsigned default 0 not null, modify `id_array_task` int unsigned default 0xfffffffe not null, modify `id_block` tinytext, modify `id_job` int unsigned not null, modify `id_qos` int unsigned default 0 not null, modify `id_resv` int unsigned not null, modify `id_wckey` int unsigned not null, modify `id_user` int unsigned not null, modify `id_group` int unsigned not null, add `pack_job_id` int unsigned not null after id_group, add `pack_job_offset` int unsigned not null after pack_job_id, modify `kill_requid` int default -1 not null, add `mcs_label` tinytext default '' after kill_requid, modify `mem_req` bigint unsigned default 0 not null, modify `nodelist` text, modify `nodes_alloc` int unsigned not null, modify `node_inx` text, modify `partition` tinytext not null, modify `priority` int unsigned not null, modify `state` int unsigned not null, modify `timelimit` int unsigned default 0 not null, modify `time_submit` bigint unsigned default 0 not null, modify `time_eligible` bigint unsigned default 0 not null, modify `time_start` bigint unsigned default 0 not null, modify `time_end` bigint unsigned default 0 not null, modify `time_suspended` bigint unsigned default 0 not null, modify `gres_req` text not null 
default '', modify `gres_alloc` text not null default '', modify `gres_used` text not null default '', modify `wckey` tinytext not null default '', add `work_dir` text not null default '' after wckey, modify `track_steps` tinyint not null, modify `tres_alloc` text not null default '', modify `tres_req` text not null default '', drop primary key, add primary key (job_db_inx), drop index id_job, add unique index (id_job, time_submit), add key old_tuple (id_job, id_assoc, time_submit), drop key rollup, add key rollup (time_eligible, time_end), drop key rollup2, add key rollup2 (time_end, time_eligible), drop key nodes_alloc, add key nodes_alloc (nodes_alloc), drop key wckey, add key wckey (id_wckey), drop key qos, add key qos (id_qos), drop key association, add key association (id_assoc), drop key array_job, add key array_job (id_array_job), add key pack_job (pack_job_id), drop key reserv, add key reserv (id_resv), drop key sacct_def, add key sacct_def (id_user, time_start, time_end), drop key sacct_def2, add key sacct_def2 (id_user, time_end, time_eligible); [2018-03-15T22:37:22.538] Accounting storage MYSQL plugin failed [2018-03-15T22:37:22.539] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed [2018-03-15T22:37:22.539] error: cannot create accounting_storage context for accounting_storage/mysql [2018-03-15T22:37:22.539] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin --- We had no choice but to restart slurmdbd a couple of times before it stabilised. It has been alright so far (1.5 days). What could have gone wrong? Did we miss any tasks or steps? Kindly advise. Thanks. Cheers Damien
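[Editor's note] A quick way to quantify a reconnect flood like the one above is to bucket the errors per minute, which shows when the database became unreachable. A sketch, demonstrated on an inline sample; point it at the real slurmdbd log instead:

```shell
# Count "storage has gone away" reconnect errors per minute in a
# slurmdbd log. The bracketed timestamp is field 2 when splitting on
# '[' and ']'; its first 16 chars are the minute bucket.
count_reconnects() {
    awk -F'[][]' '/storage has gone away/ {print substr($2, 1, 16)}' "$1" \
        | sort | uniq -c
}

# Inline sample standing in for /var/log/slurmdbd.log (path assumed)
cat > /tmp/slurmdbd.sample.log <<'EOF'
[2018-03-15T04:03:24.389] error: It looks like the storage has gone away trying to reconnect
[2018-03-15T04:03:29.441] error: It looks like the storage has gone away trying to reconnect
[2018-03-15T04:04:04.769] error: It looks like the storage has gone away trying to reconnect
EOF
count_reconnects /tmp/slurmdbd.sample.log
```

On the sample this yields two buckets, 04:03 with two hits and 04:04 with one; correlating the busiest buckets with MySQL and system logs is what the follow-up comments suggest.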
(In reply to Damien from comment #12) > slurmdbd: debug: Table qos_table has changed. Updating... > slurmdbd: debug: Table user_table has changed. Updating... > slurmdbd: converting QOS table > slurmdbd: Conversion done: success! > slurmdbd: Accounting storage MYSQL plugin loaded > slurmdbd: debug3: Success. > --- I assume you also saw other tables' messages, like: debug: Table "m3_job_table" has changed. Updating... > We believe the database has been updated successfully, so we move on to > slurmctl and the other slurmd. This was correct, as this was shown: > slurmdbd: Conversion done: success! > In about the next16 hours, the slurmdb crashes. slurmctl and the respective > slurmd are still doing their tasks properly. > A closer examinations on slurmdb logs shows: > --lots > 2018-03-15T04:03:24.389] error: It looks like the storage has gone away > trying to reconnect This is very strange. The messages you show mean that slurmdbd cannot reach the MySQL instance... this can be a MySQL problem or a communication problem. The latter doesn't seem likely since it is on the same host, unless you changed your DNS, added a wrong iptables rule, etc. Other possibilities are that the Slurm user was blocked for some reason in SQL, e.g. for exceeding the number of connections, an error in the underlying storage, a mistake in the password in slurmdbd.conf, or simply that MySQL was so loaded that it couldn't serve the requests. I recommend looking at the system logs and MySQL logs around that time; tell me if you see something strange. It is also possible that systemd has some influence here; I have seen cases where it sent a SIGTERM to slurmdbd processes because they were not daemonized. See bug 3532. > We also discovered more updating of tables after 12 hours of the upgrades, > see: > --- Maybe because some rollback happened at that time due to the failures discussed above. Send me back the full slurmdbd log. 
> We have no choice but to restart slurmDB a couple of times before this > stabilise. It is alright so far (1 1/2 days). What could have went wrong ? If it is stabilized now, everything should be fine. I recommend dumping MySQL now and keeping a backup, just in case. > Did we miss any tasks or steps ? I don't think so; it seems more like an uncontrolled failure at some point. Awaiting your response.
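[Editor's note] While reviewing the MySQL side, it may also be worth checking the InnoDB settings. A hedged my.cnf fragment along the lines of the values recommended in the Slurm accounting documentation (sizes are illustrative; tune to your RAM and workload):

```ini
# my.cnf fragment for the slurmdbd database (illustrative values)
[mysqld]
innodb_buffer_pool_size = 1024M   # large enough to hold the working set
innodb_log_file_size = 64M
innodb_lock_wait_timeout = 900    # generous timeout for long conversions
```

An undersized buffer pool or a short lock wait timeout can make large schema conversions and rollups far slower.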
The slurmdbd log is huge. I will send it when I'm in the office (7 hours from now).
Created attachment 6424 [details] slurmdb logs The full slurmdbd log of that period. Please gunzip the file first.
Hi Felip I have uploaded the slurmdbd logs; it is a huge log file. The upgrade started on the morning of 14 Mar 2018. Kindly investigate. Thanks. Cheers Damien
(In reply to Damien from comment #16) > Hi Felip > > I have uploaded the slurmdb logs, it is a huge log file. > > The upgrading start at the Morning 14th Mar 2018. > > > Kindly investigate. Thanks. > > > Cheers > > Damien Hi Damien, Warning, long explanation coming: ------------------------------------ The log starts after the conversion @ [2018-03-14T13:11:14.337], so I cannot see how the first conversion went. What I see is that, for whatever reason, the conversion was not fully completed or was reverted at some point by MySQL. In your logs, problems start at the nightly rollup at 2018-03-15 04:00:00. The server tries to roll up the tables but finds that the database is not fully converted. It tries to roll up every hour until you restart slurmdbd one day later at 2018-03-15T22:32:26.423, when the conversion starts again. [2018-03-15T22:32:26.423] debug: Log file re-opened [2018-03-15T22:32:26.423] debug3: Trying to load plugin /opt/slurm-17.11.4/lib/slurm/auth_munge.so [2018-03-15T22:32:26.424] debug: Munge authentication plugin loaded [2018-03-15T22:32:26.424] debug3: Success. [2018-03-15T22:32:26.424] debug3: Trying to load plugin /opt/slurm-17.11.4/lib/slurm/accounting_storage_mysql.so [2018-03-15T22:32:26.425] debug2: mysql_connect() called for db slurm_acct_db [2018-03-15T22:32:26.439] debug4: 0(as_mysql_convert.c:816) query select version from convert_version_table [2018-03-15T22:32:26.439] debug4: as_mysql_convert_get_bad_tres: No conversion needed, Horray! [2018-03-15T22:32:26.439] debug4: as_mysql_convert_tables_pre_create: No conversion needed, Horray! [2018-03-15T22:32:26.443] debug: Table "m3_assoc_table" has changed. Updating... [2018-03-15T22:32:26.522] debug: Table "m3_assoc_usage_day_table" has changed. Updating... [2018-03-15T22:32:27.104] debug: Table "m3_assoc_usage_hour_table" has changed. Updating... 
[2018-03-15T22:32:35.804] Warning: Note very large processing time from make table current "m3_assoc_usage_hour_table": usec=8699682 began=22:32:27.104 [2018-03-15T22:32:35.811] debug: Table "m3_assoc_usage_month_table" has changed. Updating... [2018-03-15T22:32:35.883] debug: Table "m3_usage_day_table" has changed. Updating... [2018-03-15T22:32:35.949] debug: Table "m3_usage_hour_table" has changed. Updating... [2018-03-15T22:32:36.859] debug: Table "m3_usage_month_table" has changed. Updating... [2018-03-15T22:32:37.033] debug: Table "m3_event_table" has changed. Updating... [2018-03-15T22:32:37.140] adding column admin_comment after account in table "m3_job_table" [2018-03-15T22:32:37.140] adding column pack_job_id after id_group in table "m3_job_table" [2018-03-15T22:32:37.140] adding column pack_job_offset after pack_job_id in table "m3_job_table" [2018-03-15T22:32:37.140] adding column mcs_label after kill_requid in table "m3_job_table" [2018-03-15T22:32:37.140] adding column work_dir after wckey in table "m3_job_table" [2018-03-15T22:32:37.140] adding key old_tuple (id_job, id_assoc, time_submit) to table "m3_job_table" [2018-03-15T22:32:37.140] adding key pack_job (pack_job_id) to table "m3_job_table" Then I see that while the conversion was running, you restarted slurmdbd again: [2018-03-15T22:32:37.140] debug: Table "m3_job_table" has changed. Updating... [2018-03-15T22:35:00.635] debug: Log file re-opened And again later: [2018-03-15T22:35:00.686] debug: Table "m3_job_table" has changed. Updating... [2018-03-15T22:36:36.066] debug: Log file re-opened Then we get an error since the job table was partially converted: [2018-03-15T22:36:36.117] debug: Table "m3_job_table" has changed. Updating... 
[2018-03-15T22:37:22.538] error: mysql_query failed: 1060 Duplicate column name 'admin_comment'

Therefore it fails to start, and you start slurmdbd again:

[2018-03-15T22:37:22.538] Accounting storage MYSQL plugin failed
[2018-03-15T22:37:22.539] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
[2018-03-15T22:37:22.539] error: cannot create accounting_storage context for accounting_storage/mysql
[2018-03-15T22:37:22.539] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
[2018-03-15T22:41:29.313] debug: Log file re-opened

This time you let it run long enough to finish correctly; it takes about 10 minutes:

[2018-03-15T22:51:29.028] Accounting storage MYSQL plugin loaded

Then you get some normal errors; these correspond to jobs not being updated while slurmdbd was unresponsive, and should be fixed automatically:

[2018-03-15T22:51:30.261] error: We have more allocated time than is possible (6341464 > 5990400) for cluster m3(1664) from 2018-03-15T15:00:00 - 2018-03-15T16:00:00 tres 1
..
[2018-03-15T23:00:00.597] debug3: WARNING: Unused wall is less than zero; this should never happen. Setting it to zero for resv id = 94, start = 1519780166.

Then you restart again:

[2018-03-15T23:01:42.528] debug3: starting mysql cleaning up
[2018-03-15T23:01:42.528] debug3: finished mysql cleaning up
[2018-03-15T23:01:58.488] debug: Log file re-opened
...
[2018-03-15T23:01:58.538] debug4: as_mysql_convert_get_bad_tres: No conversion needed, Horray!
[2018-03-15T23:01:58.538] debug4: as_mysql_convert_tables_pre_create: No conversion needed, Horray!
[2018-03-15T23:01:58.608] debug4: as_mysql_convert_tables_post_create: No conversion needed, Horray!
[2018-03-15T23:01:58.632] debug4: as_mysql_convert_non_cluster_tables_post_create: No conversion needed, Horray!
[2018-03-15T23:01:58.633] Accounting storage MYSQL plugin loaded

And everything seems fine from there on.
-------------------------------------------------

To summarize:

1. Thanks for the huge log, it clarifies the situation.

2. My guess is that a MySQL rollback happened between 2018-03-14T13:11:14.337 and 2018-03-15 04:00:00, which undid the conversion done by slurmdbd. To be sure of this we would need to inspect the logs of the first conversion, prior to 2018-03-14, and the MySQL logs for the first conversion and for that time period, if there are any.

3. It seems everything is fine now, but, to be sure, please run:

sacctmgr show runaway

Don't fix anything yet; just paste me the output.

4. Urgent/very important: you have 'DebugLevel=9' set, which corresponds to 'debug4'. You should urgently lower it to at least 'DebugLevel=debug'. If this is a production server, slurmdbd performance will be affected, and that can lead to problems.
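For reference, a minimal sketch of the change in slurmdbd.conf (the DebugLevel parameter is real; the file's location varies per install, commonly alongside slurm.conf):

```
# slurmdbd.conf
# Was: DebugLevel=9 (equivalent to debug4 -- logs every query and hurts throughput)
DebugLevel=debug
```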
Hi Felip

Thanks for your investigation. Here goes:

sacctmgr show runaway
NOTE: Runaway jobs are jobs that don't exist in the controller but are still considered running in the database

          ID       Name  Partition    Cluster      State           TimeStart             TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- -------------------
     1734720    desktop        m3c         m3    PENDING             Unknown             Unknown
     1735765  Rmmu_chn2        m3a         m3    PENDING             Unknown             Unknown
     1731790    desktop        m3f         m3    RUNNING 2018-03-10T10:05:01             Unknown
     1709218   md.slurm        m3h         m3    RUNNING 2018-03-11T18:14:37             Unknown
     1734160      BEASt        m3h         m3    RUNNING 2018-03-13T10:56:14             Unknown
     1735736    desktop        m3c         m3    RUNNING 2018-03-13T23:59:42             Unknown
     1735744  Rmmu_chn2        m3a         m3    RUNNING 2018-03-14T01:57:48             Unknown
     1735760  Dmmu_chn2        com         m3    RUNNING 2018-03-14T02:02:49             Unknown
     1735761   mmu_chn2        com         m3    RUNNING 2018-03-14T02:03:16             Unknown
     1735762  Dmmu_chn2        com         m3    RUNNING 2018-03-14T02:03:57             Unknown
     1735763   mmu_chn2        com         m3    RUNNING 2018-03-14T02:09:00             Unknown
     1735764  Dmmu_chn2        com         m3    RUNNING 2018-03-14T02:10:30             Unknown
     1735746  Rmmu_chn2        m3a         m3    RUNNING 2018-03-14T02:15:26             Unknown
     1735755  Rmmu_chn2        m3a         m3    RUNNING 2018-03-14T02:37:27             Unknown
     1735767  Dmmu_chn2        com         m3    RUNNING 2018-03-14T03:07:31             Unknown
     1735768  Dmmu_chn2        com         m3    RUNNING 2018-03-14T03:07:31             Unknown
     1735769   mmu_chn2        com         m3    RUNNING 2018-03-14T03:07:31             Unknown
     1735770  Dmmu_chn2        com         m3    RUNNING 2018-03-14T03:07:31             Unknown
     1735771  Dmmu_chn2        com         m3    RUNNING 2018-03-14T03:07:31             Unknown
     1735772  Dmmu_chn2        com         m3    RUNNING 2018-03-14T03:08:31             Unknown
     1735773  Dmmu_chn2        com         m3    RUNNING 2018-03-14T03:09:31             Unknown
     1735774  Dmmu_chn2        com         m3    RUNNING 2018-03-14T03:27:02             Unknown
     1735759  Rmmu_chn2        m3a         m3    RUNNING 2018-03-14T03:27:36             Unknown
     1735776   mmu_chn2        com         m3    RUNNING 2018-03-14T03:27:36             Unknown
     1735777  Dmmu_chn2        com         m3    RUNNING 2018-03-14T03:40:36             Unknown
     1735778   mmu_chn2        com         m3    RUNNING 2018-03-14T03:40:36             Unknown

Would you like to fix these runaway jobs?
(This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed. Once corrected, this will trigger the rollup to reroll usage from before the oldest runaway job.)
(You have 30 seconds to decide)
(N/y):

--

You mentioned you suspected MySQL did a rollback the night before. Have you seen these scenarios before? Is this common?

Kindly advise. Thanks.

Cheers
Damien
Created attachment 6426 [details]
Old slurmdb log

Hi Felip

This is the earlier slurmdb log (newly attached).

The 'Conversion done' message is at '2018-03-14T11:06:10'.

I hope this is helpful.

Cheers
Damien
(In reply to Damien from comment #19)
> Created attachment 6426 [details]
> Old slurmdb log
>
> Hi Felip
>
> This is the earlier slurmdb log (new attached)
>
> The 'conversion done' message is at '2018-03-14T11:06:10'
>
> I hope that this is helpful.
>
> Cheers
> Damien

Hi Damien,

There's a failure in the middle of the conversion. Everything is working fine until somebody or something stops slurmdbd at:

[2018-03-14T09:54:31.377] Terminate signal (SIGINT or SIGTERM) received

* Then slurmdbd is started again, I guess with the new version, so the conversion starts:

[2018-03-14T09:57:15.629] debug4: Table convert_version_table doesn't exist, adding
..
[2018-03-14T09:57:15.825] debug4: 0(as_mysql_convert.c:816) query select version from convert_version_table
[2018-03-14T09:57:15.825] debug4: (as_mysql_convert.c:838) query insert into convert_version_table (version) values (0);
[2018-03-14T09:57:15.827] pre-converting job table for m3
[2018-03-14T09:57:15.864] adding column admin_comment after account in table "m3_job_table"
[2018-03-14T09:57:15.864] debug: Table "m3_job_table" has changed. Updating...

* At this point something kills slurmdbd, and the m3_job_table is left not fully converted. After some seconds, slurmdbd is started again:

[2018-03-14T10:00:09.508] debug: Log file re-opened
[2018-03-14T10:00:09.508] debug3: Trying to load plugin /opt/slurm-17.11.4/lib/slurm/auth_munge.so
[2018-03-14T10:00:09.508] debug: Munge authentication plugin loaded
[2018-03-14T10:00:09.508] debug3: Success.
[2018-03-14T10:00:09.508] debug3: Trying to load plugin /opt/slurm-17.11.4/lib/slurm/accounting_storage_mysql.so
[2018-03-14T10:00:09.510] debug2: mysql_connect() called for db slurm_acct_db
[2018-03-14T10:00:09.653] debug4: 0(as_mysql_convert.c:816) query select version from convert_version_table
[2018-03-14T10:00:09.654] pre-converting job table for m3
[2018-03-14T10:00:09.907] adding column admin_comment after account in table "m3_job_table"
[2018-03-14T10:00:09.907] debug: Table "m3_job_table" has changed. Updating...
[2018-03-14T10:01:38.972] error: mysql_query failed: 1060 Duplicate column name 'admin_comment'

* This second time, since the conversion was only partially done, it fails and stops:

[2018-03-14T10:01:38.972] error: issue converting tables before create
[2018-03-14T10:01:38.972] Accounting storage MYSQL plugin failed
[2018-03-14T10:01:38.973] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
[2018-03-14T10:01:38.973] error: cannot create accounting_storage context for accounting_storage/mysql
[2018-03-14T10:01:38.973] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin

* After ~50 minutes, slurmdbd is started again, the conversion runs again, and this time it seems to work for all tables:

[2018-03-14T10:53:38.995] debug: Log file re-opened
...
[2018-03-14T11:06:10.762] Conversion done: success!

* Some time later, slurmdbd is started again and everything seems fine, but the old process was not killed first, which means a new slurmdbd started while the old one was still running:

[2018-03-14T11:22:22.815] debug4: as_mysql_convert_get_bad_tres: No conversion needed, Horray!
[2018-03-14T11:22:22.815] debug4: as_mysql_convert_tables_pre_create: No conversion needed, Horray!
[2018-03-14T11:22:22.889] debug4: as_mysql_convert_tables_post_create: No conversion needed, Horray!
[2018-03-14T11:22:22.904] debug4: as_mysql_convert_non_cluster_tables_post_create: No conversion needed, Horray!
[2018-03-14T11:22:22.905] Accounting storage MYSQL plugin loaded
...
[2018-03-14T11:22:22.905] Killing old slurmdbd[11144]
[2018-03-14T11:22:22.905] Terminate signal (SIGINT or SIGTERM) received

I then see more slurmdbd restarts up to this point without problems, but it is possible that these starts/stops in the middle of the inserts caused a rollback in MySQL at some point that reverted some tables without reverting the convert version table, which caused the errors when all the columns were queried in the rollup at 04:00.

So first, I would ask whether you are aware that these restarts could have happened, or whether they could have been caused by something like systemd. And second, if you can find the MySQL logs, they could confirm this scenario.

For now I would say that everything is fine, and it is fairly clear this happened because of the starts/stops during conversion. Re-run 'sacctmgr show runaway' and fix the wrong jobs, decrease the debug level, and keep going.

Does it make sense to you?
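If you do locate the MySQL error log (the path varies by distro; /var/log/mysql/error.log and /var/log/mysqld.log are common), a quick scan for InnoDB crash-recovery and rollback messages around the conversion window would confirm or rule this out. A sketch of that check, run here against fabricated sample lines rather than a real log:

```shell
# The log lines below are fabricated for illustration; point the grep at
# your real MySQL error log instead of this temp file.
cat > /tmp/sample_mysql_error.log <<'EOF'
2018-03-14T13:40:02 [Note] InnoDB: Starting crash recovery.
2018-03-14T13:40:05 [Note] InnoDB: Rolling back trx with id 123456
2018-03-14T13:41:09 [Note] InnoDB: Rollback of non-prepared transactions completed
2018-03-14T13:41:10 [Note] mysqld: ready for connections.
EOF

# Count lines that look like a recovery/rollback event (case-insensitive).
grep -icE 'rollback|rolling back|crash recovery' /tmp/sample_mysql_error.log
```

Any hits timestamped between the first conversion and the failing 04:00 rollup would support the rollback theory.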
Hi Felip

Thanks for your explanation. It does make sense.

1) Is there any danger in removing these runaway jobs?

2) How can I lower the slurmdbd debug level dynamically, like the debug level for slurmd, where I can use 'scontrol update' or 'scontrol reconfig'? Is there a 'sacctmgr update' or 'sacctmgr reconfig'?

Cheers
Damien
(In reply to Damien from comment #21)
> Hi Felip
>
> Thanks for your explanation.
>
> It does make sense.
>
> 1) Is there any danger to remove there runaway jobs ?

No, there's not. Fixing runaway jobs just sets the end_time and the state to completed. It does not remove anything.

> 2) To lower the slurmDBdebug level, How can I do this dynamically ? like
> the debug level for slurmd, which I can use the 'scontrol update' or
> scontrol reconfig. Is there a sacctmgr update or sacctmgr reconfig ?

Yes. To reconfigure, just modify slurmdbd.conf and send a SIGHUP signal:

#] kill -HUP <slurmdbd pid>

Note: the SIGHUP signal reconfigures the daemon; SIGUSR2 is the log rotate signal.

Tell me when you are fine with closing the bug, or if you have more questions.
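As a generic illustration of this SIGHUP-reconfigure pattern, here is a small stand-in script (not slurmdbd itself): the process traps SIGHUP and runs a handler instead of exiting, which is exactly what lets a daemon re-read its config on the fly.

```shell
#!/bin/sh
# Stand-in for a daemon's reconfigure handler: on SIGHUP, set a flag.
# A real daemon (like slurmdbd) would re-read its config file here.
got_hup=0
trap 'got_hup=1' HUP

# Simulate an operator running: kill -HUP <pid>
( sleep 0.2; kill -HUP $$ ) &
wait            # returns early when the trapped signal arrives
echo "got_hup=$got_hup"
```

For slurmdbd itself the equivalent is simply `kill -HUP <slurmdbd pid>` (or `kill -HUP $(pidof slurmdbd)` if pidof is available on your system).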
Hi Felip

Thanks. So far there are no further issues. Thanks for your help.

Please close this ticket.

Cheers
Damien
Thanks Damien,

Glad you have solved the situation. If you have any further questions, you know, here we are ;)

Closing it!