Ticket 12604 - Can't start slurmdbd after upgrade
Summary: Can't start slurmdbd after upgrade
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting (show other tickets)
Version: 21.08.1
Hardware: Linux Linux
: 2 - High Impact
Assignee: Tim McMullan
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-10-05 07:30 MDT by Adam
Modified: 2021-10-06 13:01 MDT (History)
3 users (show)

See Also:
Site: Yale
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
scheduler parameters (2.96 KB, text/plain)
2021-10-05 08:22 MDT, Adam
Details
slurmdbd conf (575 bytes, text/plain)
2021-10-05 08:23 MDT, Adam
Details
slurmdbd messages (1.16 MB, text/plain)
2021-10-05 08:23 MDT, Adam
Details

Note You need to log in before you can comment on or make changes to this ticket.
Description Adam 2021-10-05 07:30:27 MDT
going from 20.02.5b to 21.08.1 slurmdbd fails to start, due to:

Oct 05 09:22:22 p2r1u27 slurmdbd[44519]: error: mysql_query failed: 1054 Unknown column 'resv_secs' in 'farnam_usage_day_table'
                                         alter table "farnam_usage_day_table" change resv_secs plan_secs bigint unsigned default 0 not null;
Oct 05 09:22:22 p2r1u27 slurmdbd[44519]: error: Can't update "farnam_usage_day_table" Unknown error 1054
Oct 05 09:22:22 p2r1u27 slurmdbd[44519]: error: issue converting tables before create


Thank you,
Adam
Comment 1 Adam 2021-10-05 07:52:35 MDT
A bit more info:

[ACTIVE@p2r1u27:/etc] /opt/slurm/current/sbin/slurmdbd -D -vvvv
slurmdbd: debug:  Log file re-opened
slurmdbd: debug3: Trying to load plugin /opt/slurm/21.08.1/lib/slurm/auth_munge.so
slurmdbd: debug:  auth/munge: init: Munge authentication plugin loaded
slurmdbd: debug3: Success.
slurmdbd: debug3: Trying to load plugin /opt/slurm/21.08.1/lib/slurm/accounting_storage_mysql.so
slurmdbd: debug2: accounting_storage/as_mysql: init: mysql_connect() called for db slurm_acct_db
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 5.5.68-MariaDB
slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: innodb_buffer_pool_size: 34359738368
slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: innodb_log_file_size: 1073741824
slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: innodb_lock_wait_timeout: 900
slurmdbd: accounting_storage/as_mysql: as_mysql_convert_tables_pre_create: pre-converting usage table for farnam
slurmdbd: error: mysql_query failed: 1054 Unknown column 'resv_secs' in 'farnam_usage_day_table'
alter table "farnam_usage_day_table" change resv_secs plan_secs bigint unsigned default 0 not null;
slurmdbd: error: Can't update "farnam_usage_day_table" Unknown error 1054
slurmdbd: error: issue converting tables before create
slurmdbd: accounting_storage/as_mysql: init: Accounting storage MYSQL plugin failed
slurmdbd: error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
slurmdbd: error: cannot create accounting_storage context for accounting_storage/mysql
slurmdbd: fatal: Unable to initialize accounting_storage/mysql accounting storage plugin

MariaDB [slurm_acct_db]> describe farnam_usage_day_table;
+---------------+---------------------+------+-----+---------+-------+
| Field         | Type                | Null | Key | Default | Extra |
+---------------+---------------------+------+-----+---------+-------+
| creation_time | bigint(20) unsigned | NO   |     | NULL    |       |
| mod_time      | bigint(20) unsigned | NO   |     | 0       |       |
| deleted       | tinyint(4)          | NO   |     | 0       |       |
| id_tres       | int(11)             | NO   | PRI | NULL    |       |
| time_start    | bigint(20) unsigned | NO   | PRI | NULL    |       |
| count         | bigint(20) unsigned | NO   |     | 0       |       |
| alloc_secs    | bigint(20) unsigned | NO   |     | 0       |       |
| down_secs     | bigint(20) unsigned | NO   |     | 0       |       |
| pdown_secs    | bigint(20) unsigned | NO   |     | 0       |       |
| idle_secs     | bigint(20) unsigned | NO   |     | 0       |       |
| plan_secs     | bigint(20) unsigned | NO   |     | 0       |       |
| over_secs     | bigint(20) unsigned | NO   |     | 0       |       |
+---------------+---------------------+------+-----+---------+-------+
12 rows in set (0.01 sec)
Comment 2 Adam 2021-10-05 07:55:47 MDT
..I think the problem has to do with 'plan_secs' as that column is named 'resv_secs' on our other cluster.
Comment 3 Nate Rini 2021-10-05 08:14:06 MDT
Please attach slurm.conf, slurmdbd.conf and the logs from slurmdbd during the upgrade.
Comment 4 Adam 2021-10-05 08:22:46 MDT
Created attachment 21598 [details]
scheduler parameters
Comment 5 Adam 2021-10-05 08:23:08 MDT
Created attachment 21599 [details]
slurmdbd conf
Comment 6 Adam 2021-10-05 08:23:29 MDT
Created attachment 21600 [details]
slurmdbd messages
Comment 7 Tim McMullan 2021-10-05 08:24:02 MDT
Would you also attach the output of these from mysql?

describe farnam_usage_hour_table;
describe farnam_usage_month_table;
Comment 8 Adam 2021-10-05 08:25:19 MDT
Hi Nate: done.

I also tried to rename the farnam_usage_day_table column plan_secs to resv_secs and restarted slurmdbd -- slurmdbd still failed to start but it did manage to rename resv_secs back to plan_secs before it failed.

Thanks,
Adam
Comment 10 Nate Rini 2021-10-05 08:26:54 MDT
(In reply to Adam from comment #8)
> I also tried to rename the farnam_usage_day_table column plan_secs to
> resv_secs and restarted slurmdbd -- slurmdbd still failed to start but it
> did manage to rename resv_secs back to plan_secs before it failed.

What was the Slurm version before upgrading?
Comment 11 Adam 2021-10-05 08:27:23 MDT
As per Tim's request:

MariaDB [slurm_acct_db]> describe farnam_usage_hour_table;
+---------------+---------------------+------+-----+---------+-------+
| Field         | Type                | Null | Key | Default | Extra |
+---------------+---------------------+------+-----+---------+-------+
| creation_time | bigint(20) unsigned | NO   |     | NULL    |       |
| mod_time      | bigint(20) unsigned | NO   |     | 0       |       |
| deleted       | tinyint(4)          | NO   |     | 0       |       |
| id_tres       | int(11)             | NO   | PRI | NULL    |       |
| time_start    | bigint(20) unsigned | NO   | PRI | NULL    |       |
| count         | bigint(20) unsigned | NO   |     | 0       |       |
| alloc_secs    | bigint(20) unsigned | NO   |     | 0       |       |
| down_secs     | bigint(20) unsigned | NO   |     | 0       |       |
| pdown_secs    | bigint(20) unsigned | NO   |     | 0       |       |
| idle_secs     | bigint(20) unsigned | NO   |     | 0       |       |
| plan_secs     | bigint(20) unsigned | NO   |     | 0       |       |
| over_secs     | bigint(20) unsigned | NO   |     | 0       |       |
+---------------+---------------------+------+-----+---------+-------+
12 rows in set (0.00 sec)

MariaDB [slurm_acct_db]> describe farnam_usage_month_table;
+---------------+---------------------+------+-----+---------+-------+
| Field         | Type                | Null | Key | Default | Extra |
+---------------+---------------------+------+-----+---------+-------+
| creation_time | bigint(20) unsigned | NO   |     | NULL    |       |
| mod_time      | bigint(20) unsigned | NO   |     | 0       |       |
| deleted       | tinyint(4)          | NO   |     | 0       |       |
| id_tres       | int(11)             | NO   | PRI | NULL    |       |
| time_start    | bigint(20) unsigned | NO   | PRI | NULL    |       |
| count         | bigint(20) unsigned | NO   |     | 0       |       |
| alloc_secs    | bigint(20) unsigned | NO   |     | 0       |       |
| down_secs     | bigint(20) unsigned | NO   |     | 0       |       |
| pdown_secs    | bigint(20) unsigned | NO   |     | 0       |       |
| idle_secs     | bigint(20) unsigned | NO   |     | 0       |       |
| plan_secs     | bigint(20) unsigned | NO   |     | 0       |       |
| over_secs     | bigint(20) unsigned | NO   |     | 0       |       |
+---------------+---------------------+------+-----+---------+-------+
12 rows in set (0.00 sec)
Comment 12 Adam 2021-10-05 08:27:55 MDT
Hi Nate,

The prior version was 20.02.5b.

Thanks,
Adam
Comment 14 Nate Rini 2021-10-05 08:31:19 MDT
Please also provide:
> describe farnam_usage_day_table;
Comment 15 Adam 2021-10-05 08:31:36 MDT
..which is to say it was 20.02.5 with a patch applied to it.
Comment 16 Adam 2021-10-05 08:32:25 MDT
Sure:

MariaDB [slurm_acct_db]> describe farnam_usage_day_table;
+---------------+---------------------+------+-----+---------+-------+
| Field         | Type                | Null | Key | Default | Extra |
+---------------+---------------------+------+-----+---------+-------+
| creation_time | bigint(20) unsigned | NO   |     | NULL    |       |
| mod_time      | bigint(20) unsigned | NO   |     | 0       |       |
| deleted       | tinyint(4)          | NO   |     | 0       |       |
| id_tres       | int(11)             | NO   | PRI | NULL    |       |
| time_start    | bigint(20) unsigned | NO   | PRI | NULL    |       |
| count         | bigint(20) unsigned | NO   |     | 0       |       |
| alloc_secs    | bigint(20) unsigned | NO   |     | 0       |       |
| down_secs     | bigint(20) unsigned | NO   |     | 0       |       |
| pdown_secs    | bigint(20) unsigned | NO   |     | 0       |       |
| idle_secs     | bigint(20) unsigned | NO   |     | 0       |       |
| plan_secs     | bigint(20) unsigned | NO   |     | 0       |       |
| over_secs     | bigint(20) unsigned | NO   |     | 0       |       |
+---------------+---------------------+------+-----+---------+-------+
12 rows in set (0.00 sec)
Comment 17 Nate Rini 2021-10-05 08:33:15 MDT
(In reply to Adam from comment #5)
> Created attachment 21599 [details]
> slurmdbd conf

Do you have the log from during the upgrade? This log appears to be after the initial upgrade attempt.
Comment 18 Nate Rini 2021-10-05 08:34:44 MDT
Please also call:
> SHOW ENGINE INNODB STATUS;
Comment 19 Adam 2021-10-05 08:36:12 MDT
I'll see if I can get that log data, in the meantime:

MariaDB [slurm_acct_db]> SHOW ENGINE INNODB STATUS;
+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Type   | Name | Status                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| InnoDB |      | 
=====================================
211005 10:35:16 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 10 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 7209 1_second, 7208 sleeps, 673 10_second, 1164 background, 1164 flush
srv_master_thread log flush and writes: 5960
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 5186, signal count 9025
Mutex spin waits 112730, rounds 229136, OS waits 4197
RW-shared spins 2180, rounds 26563, OS waits 617
RW-excl spins 3619, rounds 19772, OS waits 363
Spin rounds per wait: 2.03 mutex, 12.18 RW-shared, 5.46 RW-excl
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0] ,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
402122 OS file reads, 1321859 OS file writes, 33334 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 14, seg size 16, 0 merges
merged operations:
 insert 0, delete mark 0, delete 0
discarded operations:
 insert 0, delete mark 0, delete 0
Hash table size 70803091, node heap has 12638 buffer(s)
5.40 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 740879938289
Log flushed up to   740879938289
Last checkpoint at  740879938289
Max checkpoint age    1738750649
Checkpoint age target 1684414692
Modified age          0
Checkpoint age        0
0 pending log writes, 0 pending chkp writes
8806 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 35265708032; in additional pool allocated 0
Total memory allocated by read views 104
Internal hash tables (constant factor + variable factor)
    Adaptive hash index 773489936 	(566424728 + 207065208)
    Page hash           35402344 (buffer pool 0 only)
    Dictionary cache    141929420 	(141607792 + 321628)
    File system         112480 	(82672 + 29808)
    Lock system         85000272 	(84999896 + 376)
    Recovery system     0 	(0 + 0)
Dictionary memory allocated 321628
Buffer pool size        2097151
Buffer pool size, bytes 34359721984
Free buffers            426710
Database pages          1657803
Old database pages      611960
Modified db pages       0
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 402072, created 1255731, written 1270093
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 1657803, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
1 read views open inside InnoDB
0 transactions active inside InnoDB
0 out of 1000 descriptors used
---OLDEST VIEW---
Normal read view
Read view low limit trx n:o 18EB6B72
Read view up limit trx id 18EB6B72
Read view low limit trx id 18EB6B72
Read view individually stored trx ids:
-----------------
Main thread process no. 13469, id 139833229100800, state: sleeping
Number of rows inserted 23899674, updated 9, deleted 0, read 24088321
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 37.20 reads/s
------------
TRANSACTIONS
------------
Trx id counter 18EB708E
Purge done for trx's n:o < 18EB6AF6 undo n:o < 0
History list length 2103
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 1459, OS thread handle 0x7f35e86ea700, query id 57526 localhost root
SHOW ENGINE INNODB STATUS
----------------------------
END OF INNODB MONITOR OUTPUT
============================
 |
+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Comment 20 Adam 2021-10-05 08:39:59 MDT
Here is the very first slurmctld startup attempt:

Oct  5 08:46:32 p2r1u27.farnam.hpc.yale.internal systemd: Can't open PID file /var/run/slurm/slurmdbd.pid (yet?) after start: No such file or directory
Oct  5 08:46:34 p2r1u27.farnam.hpc.yale.internal slurmctld[14619]: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
Oct  5 08:46:34 p2r1u27.farnam.hpc.yale.internal systemd: Can't open PID file /var/run/slurm/slurmdbd.pid (yet?) after start: No such file or directory
Oct  5 08:46:34 p2r1u27.farnam.hpc.yale.internal slurmdbd[14634]: error: mysql_query failed: 1054 Unknown column 'resv_secs' in 'farnam_usage_day_table'#012alter table "farnam_usage_day_table" change resv_secs plan_secs bigint unsigned default 0 not null;
Oct  5 08:46:34 p2r1u27.farnam.hpc.yale.internal slurmdbd[14634]: error: Can't update "farnam_usage_day_table" Unknown error 1054
Oct  5 08:46:34 p2r1u27.farnam.hpc.yale.internal slurmdbd[14634]: error: issue converting tables before create
Oct  5 08:46:34 p2r1u27.farnam.hpc.yale.internal slurmdbd[14634]: error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
Oct  5 08:46:34 p2r1u27.farnam.hpc.yale.internal slurmdbd[14634]: error: cannot create accounting_storage context for accounting_storage/mysql
Oct  5 08:46:34 p2r1u27.farnam.hpc.yale.internal slurmdbd[14634]: fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
Oct  5 08:46:34 p2r1u27.farnam.hpc.yale.internal systemd: slurmdbd.service: main process exited, code=exited, status=1/FAILURE
Oct  5 08:46:34 p2r1u27.farnam.hpc.yale.internal systemd: Unit slurmdbd.service entered failed state.
Oct  5 08:46:34 p2r1u27.farnam.hpc.yale.internal systemd: slurmdbd.service failed.
Comment 21 Nate Rini 2021-10-05 08:41:06 MDT
(In reply to Adam from comment #19)
> I'll see if I can get that log data, in the meantime:

Please also attach the MySQL error log.
Comment 22 Adam 2021-10-05 08:47:31 MDT
Here is what we have for SQL log data:

211005  8:26:18 [Note] /usr/libexec/mysqld: Normal shutdown
211005  8:26:18 [Note] Event Scheduler: Purging the queue. 0 events
211005  8:26:18  InnoDB: Starting shutdown...
211005  8:26:18  InnoDB: Waiting for 3 pages to be flushed
211005  8:26:24  InnoDB: Shutdown completed; log sequence number 719715784528
211005  8:26:24 [Note] /usr/libexec/mysqld: Shutdown complete

211005 08:26:24 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended
211005 08:45:16 mysqld_safe Starting mysqld daemon with databases from /drbd/var/lib/mysql
211005  8:45:16 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 13469 ...
211005  8:45:16 [Warning] Changed limits: max_open_files: 1024  max_connections: 214  table_cache: 400
211005  8:45:16 InnoDB: The InnoDB memory heap is disabled
211005  8:45:16 InnoDB: Mutexes and rw_locks use GCC atomic builtins
211005  8:45:16 InnoDB: Compressed tables use zlib 1.2.7
211005  8:45:16 InnoDB: Using Linux native AIO
211005  8:45:16 InnoDB: Initializing buffer pool, size = 32.0G
211005  8:45:17 InnoDB: Completed initialization of buffer pool
211005  8:45:17 InnoDB: highest supported file format is Barracuda.
211005  8:45:18  InnoDB: Waiting for the background threads to start
211005  8:45:19 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 719715995881
211005  8:45:19 [Note] Plugin 'FEEDBACK' is disabled.
211005  8:45:19 [Note] Server socket created on IP: '127.0.0.1'.
211005  8:45:19 [Warning] 'proxies_priv' entry '@ root@p2r1u25' ignored in --skip-name-resolve mode.
211005  8:45:19 [Note] Event Scheduler: Loaded 0 events
211005  8:45:19 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.5.68-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server
Comment 23 Tim McMullan 2021-10-05 09:00:37 MDT
Thanks for all these logs Adam.

It looks like the database conversion was only partially completed, so when the slurmdbd starts up its re-trying something that already changed and can't change it again.

Did you take a backup of the database before performing the upgrade?  If so, I think the safest option would be to make sure the slurmdbd is stopped, restore the database from the backup, then start the slurmdbd again from the command line (not with systemd, just in case the service times out and kills the slurmdbd prematurely).
Comment 24 Adam 2021-10-05 09:09:52 MDT
Hi Tim,

That's exactly what we're doing now. slurmdbd probably wasn't given enough time to complete the conversion step on the initial attempt, and now can't continue. We're restoring from an earlier backup and will attempt to run slurmdbd -D -vvvv as soon as the restoration has completed. 

Thanks,
Adam
Comment 25 Tim McMullan 2021-10-05 09:22:56 MDT
(In reply to Adam from comment #24)
> Hi Tim,
> 
> That's exactly what we're doing now. slurmdbd probably wasn't given enough
> time to complete the conversion step on the initial attempt, and now can't
> continue. We're restoring from an earlier backup and will attempt to run
> slurmdbd -D -vvvv as soon as the restoration has completed. 
> 
> Thanks,
> Adam

Sounds good!  Thanks Adam, let us know how that goes!

Thanks,
--Tim
Comment 26 Adam 2021-10-05 09:50:18 MDT
This problem was almost certainly the result of a system managed slurmdbd not being given enough time to start: the time it took to complete the conversion step below would have been considered a failure and aborted by our usual service management methods.

We will update our internal documentation to include a pre-task of making a fresh backup of the DB (always a good idea), and an additional step of manually running slurmdbd in order to allow for the time it takes to complete a (potential) DB-conversion.

[ACTIVE@p2r1u27:/var/log] slurmdbd -D -vvv
slurmdbd: debug:  Log file re-opened
slurmdbd: debug:  auth/munge: init: Munge authentication plugin loaded
slurmdbd: debug2: accounting_storage/as_mysql: init: mysql_connect() called for db slurm_acct_db
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 5.5.68-MariaDB
slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: innodb_buffer_pool_size: 34359738368
slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: innodb_log_file_size: 1073741824
slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: innodb_lock_wait_timeout: 900
slurmdbd: accounting_storage/as_mysql: as_mysql_convert_tables_pre_create: pre-converting usage table for farnam
slurmdbd: accounting_storage/as_mysql: as_mysql_convert_tables_pre_create: pre-converting job table for farnam
slurmdbd: debug:  Table "farnam_usage_day_table" has changed.  Updating...
slurmdbd: debug:  Table "farnam_usage_hour_table" has changed.  Updating...
slurmdbd: debug:  Table "farnam_usage_month_table" has changed.  Updating...
slurmdbd: adding column batch_script after array_task_pending in table "farnam_job_table"
slurmdbd: adding column container after constraints in table "farnam_job_table"
slurmdbd: adding column env_vars after derived_es in table "farnam_job_table"
slurmdbd: adding column submit_line after work_dir in table "farnam_job_table"
slurmdbd: dropping column gres_req from table "farnam_job_table"
slurmdbd: dropping column gres_alloc from table "farnam_job_table"
slurmdbd: debug:  Table "farnam_job_table" has changed.  Updating...
slurmdbd: Warning: Note very large processing time from make table current "farnam_job_table": usec=301272034 began=11:23:58.719
slurmdbd: adding column step_het_comp after id_step in table "farnam_step_table"
slurmdbd: adding column container after consumed_energy in table "farnam_step_table"
slurmdbd: adding column submit_line after req_cpufreq_gov in table "farnam_step_table"
slurmdbd: adding key no_step_comp (job_db_inx, id_step) to table "farnam_step_table"
slurmdbd: debug:  Table "farnam_step_table" has changed.  Updating...
slurmdbd: Warning: Note very large processing time from make table current "farnam_step_table": usec=143188253 began=11:28:59.996
slurmdbd: accounting_storage/as_mysql: as_mysql_convert_tables_post_create: post-converting step table for farnam

slurmdbd: adding column limit_factor after usage_thres in table qos_table
slurmdbd: debug:  Table qos_table has changed.  Updating...
slurmdbd: accounting_storage/as_mysql: as_mysql_convert_non_cluster_tables_post_create: Conversion done: success!
slurmdbd: accounting_storage/as_mysql: init: Accounting storage MYSQL plugin loaded
slurmdbd: debug2: ArchiveDir        = /home/backup/slurmArchiveDir
slurmdbd: debug2: ArchiveScript     = (null)
slurmdbd: debug2: AuthAltTypes      = (null)
slurmdbd: debug2: AuthAltParameters = (null)
slurmdbd: debug2: AuthInfo          = (null)
slurmdbd: debug2: AuthType          = auth/munge
slurmdbd: debug2: CommitDelay       = 0
slurmdbd: debug2: CommunicationParameters       = (null)
slurmdbd: debug2: DbdAddr           = localhost
slurmdbd: debug2: DbdBackupHost     = (null)
slurmdbd: debug2: DbdHost           = localhost
slurmdbd: debug2: DbdPort           = 6819
slurmdbd: debug2: DebugFlags        = (null)
slurmdbd: debug2: DebugLevel        = 6
slurmdbd: debug2: DebugLevelSyslog  = 2
slurmdbd: debug2: DefaultQOS        = (null)
slurmdbd: debug2: LogFile           = (null)
slurmdbd: debug2: MessageTimeout    = 10
slurmdbd: debug2: Parameters        = (null)
slurmdbd: debug2: PidFile           = /run/slurm/slurmdbd.pid
slurmdbd: debug2: PluginDir         = /opt/slurm/21.08.1/lib/slurm
slurmdbd: debug2: PrivateData       = none
slurmdbd: debug2: PurgeEventAfter   = 3 months*
slurmdbd: debug2: PurgeJobAfter     = 3 months*
slurmdbd: debug2: PurgeResvAfter    = 2 months*
slurmdbd: debug2: PurgeStepAfter    = 2 months*
slurmdbd: debug2: PurgeSuspendAfter = 1 months*
slurmdbd: debug2: PurgeTXNAfter = 3 months*
slurmdbd: debug2: PurgeUsageAfter = 3 months*
slurmdbd: debug2: SlurmUser         = slurm(666)
slurmdbd: debug2: StorageBackupHost = (null)
slurmdbd: debug2: StorageHost       = localhost
slurmdbd: debug2: StorageLoc        = slurm_acct_db
slurmdbd: debug2: StorageParameters = (null)
slurmdbd: debug2: StoragePort       = 3306
slurmdbd: debug2: StorageType       = accounting_storage/mysql
slurmdbd: debug2: StorageUser       = slurm
slurmdbd: debug2: TCPTimeout        = 2
slurmdbd: debug2: TrackWCKey        = 0
slurmdbd: debug2: TrackSlurmctldDown= 0
slurmdbd: debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: acct_storage_p_get_connection: request new connectio
n 1
slurmdbd: debug2: Attempting to connect to localhost:3306
...
slurmdbd: Warning: Note very large processing time from _post_user_list: usec=3167468 began=11:39:36.287
slurmdbd: slurmdbd version 21.08.1 started
slurmdbd: debug2: running rollup at Tue Oct 05 11:39:39 2021
slurmdbd: debug2: Attempting to connect to localhost:3306
slurmdbd: debug2: _slurm_connect: failed to connect to 127.0.0.1:6817: Connection refused
slurmdbd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817: Connection refused
slurmdbd: debug2: accounting_storage/as_mysql: _cluster_rollup_usage: No need to roll cluster farnam this day 1633406400 <= 163340
6400
slurmdbd: debug2: accounting_storage/as_mysql: _cluster_rollup_usage: No need to roll cluster farnam this month 1633060800 <= 1633
060800
slurmdbd: debug2: accounting_storage/as_mysql: as_mysql_roll_usage: Got 1 of 1 rolled up
slurmdbd: debug2: accounting_storage/as_mysql: as_mysql_roll_usage: Everything rolled up
Comment 27 Tim McMullan 2021-10-05 10:20:05 MDT
Thanks for the update, glad to see that the conversion completed!

We are going to update our documentation as well to suggest a manual start of the process for the upgrade as well (we've suggested it in the past, but it didn't make it into the documentation).

It sounds like you have things handled, but before I resolve this just let me know if you need anything else from me on this!

Thanks,
--Tim
Comment 28 Jason Booth 2021-10-05 11:12:39 MDT
Downgrading pending a reply.
Comment 29 Tim McMullan 2021-10-06 11:10:47 MDT
Hey Adam,

Just wanted to check in and make sure you are all set!

Thanks,
--Tim
Comment 31 Tim McMullan 2021-10-06 13:01:03 MDT
Awesome!  Thanks Adam, I'll resolve this ticket now.

Thanks!
--Tim