| Summary: | Can't start slurmdbd after upgrade | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Adam <adam.munro> |
| Component: | Accounting | Assignee: | Tim McMullan <mcmullan> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | eric.peskin, mcmullan, nate |
| Version: | 21.08.1 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Yale | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: |
scheduler parameters
slurmdbd conf slurmdbd messages |
||
A bit more info: [ACTIVE@p2r1u27:/etc] /opt/slurm/current/sbin/slurmdbd -D -vvvv slurmdbd: debug: Log file re-opened slurmdbd: debug3: Trying to load plugin /opt/slurm/21.08.1/lib/slurm/auth_munge.so slurmdbd: debug: auth/munge: init: Munge authentication plugin loaded slurmdbd: debug3: Success. slurmdbd: debug3: Trying to load plugin /opt/slurm/21.08.1/lib/slurm/accounting_storage_mysql.so slurmdbd: debug2: accounting_storage/as_mysql: init: mysql_connect() called for db slurm_acct_db slurmdbd: debug2: Attempting to connect to localhost:3306 slurmdbd: accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 5.5.68-MariaDB slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: innodb_buffer_pool_size: 34359738368 slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: innodb_log_file_size: 1073741824 slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: innodb_lock_wait_timeout: 900 slurmdbd: accounting_storage/as_mysql: as_mysql_convert_tables_pre_create: pre-converting usage table for farnam slurmdbd: error: mysql_query failed: 1054 Unknown column 'resv_secs' in 'farnam_usage_day_table' alter table "farnam_usage_day_table" change resv_secs plan_secs bigint unsigned default 0 not null; slurmdbd: error: Can't update "farnam_usage_day_table" Unknown error 1054 slurmdbd: error: issue converting tables before create slurmdbd: accounting_storage/as_mysql: init: Accounting storage MYSQL plugin failed slurmdbd: error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed slurmdbd: error: cannot create accounting_storage context for accounting_storage/mysql slurmdbd: fatal: Unable to initialize accounting_storage/mysql accounting storage plugin MariaDB [slurm_acct_db]> describe farnam_usage_day_table; +---------------+---------------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +---------------+---------------------+------+-----+---------+-------+ | creation_time | bigint(20) unsigned | NO | | NULL | | | mod_time | bigint(20) unsigned | NO | | 0 | | | deleted | tinyint(4) | NO | | 0 | | | id_tres | int(11) | NO | PRI | NULL | | | time_start | bigint(20) unsigned | NO | PRI | NULL | | | count | bigint(20) unsigned | NO | | 0 | | | alloc_secs | bigint(20) unsigned | NO | | 0 | | | down_secs | bigint(20) unsigned | NO | | 0 | | | pdown_secs | bigint(20) unsigned | NO | | 0 | | | idle_secs | bigint(20) unsigned | NO | | 0 | | | plan_secs | bigint(20) unsigned | NO | | 0 | | | over_secs | bigint(20) unsigned | NO | | 0 | | +---------------+---------------------+------+-----+---------+-------+ 12 rows in set (0.01 sec) ..I think the problem has to do with 'plan_secs' as that column is named 'resv_secs' on our other cluster. Please attach slurm.conf, slurmdbd.conf and the logs from slurmdbd during the upgrade. Created attachment 21598 [details]
scheduler parameters
Created attachment 21599 [details]
slurmdbd conf
Created attachment 21600 [details]
slurmdbd messages
Would you also attach the output of these from mysql? describe farnam_usage_hour_table; describe farnam_usage_month_table; Hi Nate: done. I also tried to rename the farnam_usage_day_table column plan_secs to resv_secs and restarted slurmdbd -- slurmdbd still failed to start but it did manage to rename resv_secs back to plan_secs before it failed. Thanks, Adam (In reply to Adam from comment #8) > I also tried to rename the farnam_usage_day_table column plan_secs to > resv_secs and restarted slurmdbd -- slurmdbd still failed to start but it > did manage to rename resv_secs back to plan_secs before it failed. What was the Slurm version before upgrading? As per Tim's request: MariaDB [slurm_acct_db]> describe farnam_usage_hour_table; +---------------+---------------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +---------------+---------------------+------+-----+---------+-------+ | creation_time | bigint(20) unsigned | NO | | NULL | | | mod_time | bigint(20) unsigned | NO | | 0 | | | deleted | tinyint(4) | NO | | 0 | | | id_tres | int(11) | NO | PRI | NULL | | | time_start | bigint(20) unsigned | NO | PRI | NULL | | | count | bigint(20) unsigned | NO | | 0 | | | alloc_secs | bigint(20) unsigned | NO | | 0 | | | down_secs | bigint(20) unsigned | NO | | 0 | | | pdown_secs | bigint(20) unsigned | NO | | 0 | | | idle_secs | bigint(20) unsigned | NO | | 0 | | | plan_secs | bigint(20) unsigned | NO | | 0 | | | over_secs | bigint(20) unsigned | NO | | 0 | | +---------------+---------------------+------+-----+---------+-------+ 12 rows in set (0.00 sec) MariaDB [slurm_acct_db]> describe farnam_usage_month_table; +---------------+---------------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +---------------+---------------------+------+-----+---------+-------+ | creation_time | bigint(20) unsigned | NO | | NULL | | | mod_time | bigint(20) unsigned | NO | | 0 | | | deleted | tinyint(4) | NO | | 0 | | | id_tres | int(11) | NO | PRI | NULL | | | time_start | bigint(20) unsigned | NO | PRI | NULL | | | count | bigint(20) unsigned | NO | | 0 | | | alloc_secs | bigint(20) unsigned | NO | | 0 | | | down_secs | bigint(20) unsigned | NO | | 0 | | | pdown_secs | bigint(20) unsigned | NO | | 0 | | | idle_secs | bigint(20) unsigned | NO | | 0 | | | plan_secs | bigint(20) unsigned | NO | | 0 | | | over_secs | bigint(20) unsigned | NO | | 0 | | +---------------+---------------------+------+-----+---------+-------+ 12 rows in set (0.00 sec) Hi Nate, The prior version was 20.02.5b. Thanks, Adam Please also provide:
> describe farnam_usage_day_table;
..which is to say it was 20.02.5 with a patch applied to it. Sure: MariaDB [slurm_acct_db]> describe farnam_usage_day_table; +---------------+---------------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +---------------+---------------------+------+-----+---------+-------+ | creation_time | bigint(20) unsigned | NO | | NULL | | | mod_time | bigint(20) unsigned | NO | | 0 | | | deleted | tinyint(4) | NO | | 0 | | | id_tres | int(11) | NO | PRI | NULL | | | time_start | bigint(20) unsigned | NO | PRI | NULL | | | count | bigint(20) unsigned | NO | | 0 | | | alloc_secs | bigint(20) unsigned | NO | | 0 | | | down_secs | bigint(20) unsigned | NO | | 0 | | | pdown_secs | bigint(20) unsigned | NO | | 0 | | | idle_secs | bigint(20) unsigned | NO | | 0 | | | plan_secs | bigint(20) unsigned | NO | | 0 | | | over_secs | bigint(20) unsigned | NO | | 0 | | +---------------+---------------------+------+-----+---------+-------+ 12 rows in set (0.00 sec) (In reply to Adam from comment #5) > Created attachment 21599 [details] > slurmdbd conf Do you have the log from during the upgrade? This log appears to be after the initial upgrade attempt. Please also call:
> SHOW ENGINE INNODB STATUS;
I'll see if I can get that log data, in the meantime:
MariaDB [slurm_acct_db]> SHOW ENGINE INNODB STATUS;
+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Type | Name | Status |
+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| InnoDB | |
=====================================
211005 10:35:16 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 10 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 7209 1_second, 7208 sleeps, 673 10_second, 1164 background, 1164 flush
srv_master_thread log flush and writes: 5960
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 5186, signal count 9025
Mutex spin waits 112730, rounds 229136, OS waits 4197
RW-shared spins 2180, rounds 26563, OS waits 617
RW-excl spins 3619, rounds 19772, OS waits 363
Spin rounds per wait: 2.03 mutex, 12.18 RW-shared, 5.46 RW-excl
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0] ,
ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
402122 OS file reads, 1321859 OS file writes, 33334 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 14, seg size 16, 0 merges
merged operations:
insert 0, delete mark 0, delete 0
discarded operations:
insert 0, delete mark 0, delete 0
Hash table size 70803091, node heap has 12638 buffer(s)
5.40 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 740879938289
Log flushed up to 740879938289
Last checkpoint at 740879938289
Max checkpoint age 1738750649
Checkpoint age target 1684414692
Modified age 0
Checkpoint age 0
0 pending log writes, 0 pending chkp writes
8806 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 35265708032; in additional pool allocated 0
Total memory allocated by read views 104
Internal hash tables (constant factor + variable factor)
Adaptive hash index 773489936 (566424728 + 207065208)
Page hash 35402344 (buffer pool 0 only)
Dictionary cache 141929420 (141607792 + 321628)
File system 112480 (82672 + 29808)
Lock system 85000272 (84999896 + 376)
Recovery system 0 (0 + 0)
Dictionary memory allocated 321628
Buffer pool size 2097151
Buffer pool size, bytes 34359721984
Free buffers 426710
Database pages 1657803
Old database pages 611960
Modified db pages 0
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 402072, created 1255731, written 1270093
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 1657803, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
1 read views open inside InnoDB
0 transactions active inside InnoDB
0 out of 1000 descriptors used
---OLDEST VIEW---
Normal read view
Read view low limit trx n:o 18EB6B72
Read view up limit trx id 18EB6B72
Read view low limit trx id 18EB6B72
Read view individually stored trx ids:
-----------------
Main thread process no. 13469, id 139833229100800, state: sleeping
Number of rows inserted 23899674, updated 9, deleted 0, read 24088321
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 37.20 reads/s
------------
TRANSACTIONS
------------
Trx id counter 18EB708E
Purge done for trx's n:o < 18EB6AF6 undo n:o < 0
History list length 2103
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0, not started
MySQL thread id 1459, OS thread handle 0x7f35e86ea700, query id 57526 localhost root
SHOW ENGINE INNODB STATUS
----------------------------
END OF INNODB MONITOR OUTPUT
============================
|
+--------+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
Here is the very first slurmctld startup attempt: Oct 5 08:46:32 p2r1u27.farnam.hpc.yale.internal systemd: Can't open PID file /var/run/slurm/slurmdbd.pid (yet?) after start: No such file or directory Oct 5 08:46:34 p2r1u27.farnam.hpc.yale.internal slurmctld[14619]: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd Oct 5 08:46:34 p2r1u27.farnam.hpc.yale.internal systemd: Can't open PID file /var/run/slurm/slurmdbd.pid (yet?) after start: No such file or directory Oct 5 08:46:34 p2r1u27.farnam.hpc.yale.internal slurmdbd[14634]: error: mysql_query failed: 1054 Unknown column 'resv_secs' in 'farnam_usage_day_table'#012alter table "farnam_usage_day_table" change resv_secs plan_secs bigint unsigned default 0 not null; Oct 5 08:46:34 p2r1u27.farnam.hpc.yale.internal slurmdbd[14634]: error: Can't update "farnam_usage_day_table" Unknown error 1054 Oct 5 08:46:34 p2r1u27.farnam.hpc.yale.internal slurmdbd[14634]: error: issue converting tables before create Oct 5 08:46:34 p2r1u27.farnam.hpc.yale.internal slurmdbd[14634]: error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed Oct 5 08:46:34 p2r1u27.farnam.hpc.yale.internal slurmdbd[14634]: error: cannot create accounting_storage context for accounting_storage/mysql Oct 5 08:46:34 p2r1u27.farnam.hpc.yale.internal slurmdbd[14634]: fatal: Unable to initialize accounting_storage/mysql accounting storage plugin Oct 5 08:46:34 p2r1u27.farnam.hpc.yale.internal systemd: slurmdbd.service: main process exited, code=exited, status=1/FAILURE Oct 5 08:46:34 p2r1u27.farnam.hpc.yale.internal systemd: Unit slurmdbd.service entered failed state. Oct 5 08:46:34 p2r1u27.farnam.hpc.yale.internal systemd: slurmdbd.service failed. (In reply to Adam from comment #19) > I'll see if I can get that log data, in the meantime: Please also attach the MySQL error log. Here is what we have for SQL log data: 211005 8:26:18 [Note] /usr/libexec/mysqld: Normal shutdown 211005 8:26:18 [Note] Event Scheduler: Purging the queue. 0 events 211005 8:26:18 InnoDB: Starting shutdown... 211005 8:26:18 InnoDB: Waiting for 3 pages to be flushed 211005 8:26:24 InnoDB: Shutdown completed; log sequence number 719715784528 211005 8:26:24 [Note] /usr/libexec/mysqld: Shutdown complete 211005 08:26:24 mysqld_safe mysqld from pid file /var/run/mariadb/mariadb.pid ended 211005 08:45:16 mysqld_safe Starting mysqld daemon with databases from /drbd/var/lib/mysql 211005 8:45:16 [Note] /usr/libexec/mysqld (mysqld 5.5.68-MariaDB) starting as process 13469 ... 211005 8:45:16 [Warning] Changed limits: max_open_files: 1024 max_connections: 214 table_cache: 400 211005 8:45:16 InnoDB: The InnoDB memory heap is disabled 211005 8:45:16 InnoDB: Mutexes and rw_locks use GCC atomic builtins 211005 8:45:16 InnoDB: Compressed tables use zlib 1.2.7 211005 8:45:16 InnoDB: Using Linux native AIO 211005 8:45:16 InnoDB: Initializing buffer pool, size = 32.0G 211005 8:45:17 InnoDB: Completed initialization of buffer pool 211005 8:45:17 InnoDB: highest supported file format is Barracuda. 211005 8:45:18 InnoDB: Waiting for the background threads to start 211005 8:45:19 Percona XtraDB (http://www.percona.com) 5.5.61-MariaDB-38.13 started; log sequence number 719715995881 211005 8:45:19 [Note] Plugin 'FEEDBACK' is disabled. 211005 8:45:19 [Note] Server socket created on IP: '127.0.0.1'. 211005 8:45:19 [Warning] 'proxies_priv' entry '@ root@p2r1u25' ignored in --skip-name-resolve mode. 211005 8:45:19 [Note] Event Scheduler: Loaded 0 events 211005 8:45:19 [Note] /usr/libexec/mysqld: ready for connections. Version: '5.5.68-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 3306 MariaDB Server Thanks for all these logs Adam. It looks like the database conversion was only partially completed, so when the slurmdbd starts up its re-trying something that already changed and can't change it again. Did you take a backup of the database before performing the upgrade? If so, I think the safest option would be to make sure the slurmdbd is stopped, restore the database from the backup, then start the slurmdbd again from the command line (not with systemd, just in case the service times out and kills the slurmdbd prematurely). Hi Tim, That's exactly what we're doing now. slurmdbd probably wasn't given enough time to complete the conversion step on the initial attempt, and now can't continue. We're restoring from an earlier backup and will attempt to run slurmdbd -D -vvvv as soon as the restoration has completed. Thanks, Adam (In reply to Adam from comment #24) > Hi Tim, > > That's exactly what we're doing now. slurmdbd probably wasn't given enough > time to complete the conversion step on the initial attempt, and now can't > continue. We're restoring from an earlier backup and will attempt to run > slurmdbd -D -vvvv as soon as the restoration has completed. > > Thanks, > Adam Sounds good! Thanks Adam, let us know how that goes! Thanks, --Tim This problem was almost certainly the result of a system managed slurmdbd not being given enough time to start: the time it took to complete the conversion step below would have been considered a failure and aborted by our usual service management methods. We will update our internal documentation to include a pre-task of making a fresh backup of the DB (always a good idea), and an additional step of manually running slurmdbd in order to allow for the time it takes to complete a (potential) DB-conversion. [ACTIVE@p2r1u27:/var/log] slurmdbd -D -vvv slurmdbd: debug: Log file re-opened slurmdbd: debug: auth/munge: init: Munge authentication plugin loaded slurmdbd: debug2: accounting_storage/as_mysql: init: mysql_connect() called for db slurm_acct_db slurmdbd: debug2: Attempting to connect to localhost:3306 slurmdbd: accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 5.5.68-MariaDB slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: innodb_buffer_pool_size: 34359738368 slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: innodb_log_file_size: 1073741824 slurmdbd: debug2: accounting_storage/as_mysql: _check_database_variables: innodb_lock_wait_timeout: 900 slurmdbd: accounting_storage/as_mysql: as_mysql_convert_tables_pre_create: pre-converting usage table for farnam slurmdbd: accounting_storage/as_mysql: as_mysql_convert_tables_pre_create: pre-converting job table for farnam slurmdbd: debug: Table "farnam_usage_day_table" has changed. Updating... slurmdbd: debug: Table "farnam_usage_hour_table" has changed. Updating... slurmdbd: debug: Table "farnam_usage_month_table" has changed. Updating... slurmdbd: adding column batch_script after array_task_pending in table "farnam_job_table" slurmdbd: adding column container after constraints in table "farnam_job_table" slurmdbd: adding column env_vars after derived_es in table "farnam_job_table" slurmdbd: adding column submit_line after work_dir in table "farnam_job_table" slurmdbd: dropping column gres_req from table "farnam_job_table" slurmdbd: dropping column gres_alloc from table "farnam_job_table" slurmdbd: debug: Table "farnam_job_table" has changed. Updating... slurmdbd: Warning: Note very large processing time from make table current "farnam_job_table": usec=301272034 began=11:23:58.719 slurmdbd: adding column step_het_comp after id_step in table "farnam_step_table" slurmdbd: adding column container after consumed_energy in table "farnam_step_table" slurmdbd: adding column submit_line after req_cpufreq_gov in table "farnam_step_table" slurmdbd: adding key no_step_comp (job_db_inx, id_step) to table "farnam_step_table" slurmdbd: debug: Table "farnam_step_table" has changed. Updating... slurmdbd: Warning: Note very large processing time from make table current "farnam_step_table": usec=143188253 began=11:28:59.996 slurmdbd: accounting_storage/as_mysql: as_mysql_convert_tables_post_create: post-converting step table for farnam slurmdbd: adding column limit_factor after usage_thres in table qos_table slurmdbd: debug: Table qos_table has changed. Updating... slurmdbd: accounting_storage/as_mysql: as_mysql_convert_non_cluster_tables_post_create: Conversion done: success! slurmdbd: accounting_storage/as_mysql: init: Accounting storage MYSQL plugin loaded slurmdbd: debug2: ArchiveDir = /home/backup/slurmArchiveDir slurmdbd: debug2: ArchiveScript = (null) slurmdbd: debug2: AuthAltTypes = (null) slurmdbd: debug2: AuthAltParameters = (null) slurmdbd: debug2: AuthInfo = (null) slurmdbd: debug2: AuthType = auth/munge slurmdbd: debug2: CommitDelay = 0 slurmdbd: debug2: CommunicationParameters = (null) slurmdbd: debug2: DbdAddr = localhost slurmdbd: debug2: DbdBackupHost = (null) slurmdbd: debug2: DbdHost = localhost slurmdbd: debug2: DbdPort = 6819 slurmdbd: debug2: DebugFlags = (null) slurmdbd: debug2: DebugLevel = 6 slurmdbd: debug2: DebugLevelSyslog = 2 slurmdbd: debug2: DefaultQOS = (null) slurmdbd: debug2: LogFile = (null) slurmdbd: debug2: MessageTimeout = 10 slurmdbd: debug2: Parameters = (null) slurmdbd: debug2: PidFile = /run/slurm/slurmdbd.pid slurmdbd: debug2: PluginDir = /opt/slurm/21.08.1/lib/slurm slurmdbd: debug2: PrivateData = none slurmdbd: debug2: PurgeEventAfter = 3 months* slurmdbd: debug2: PurgeJobAfter = 3 months* slurmdbd: debug2: PurgeResvAfter = 2 months* slurmdbd: debug2: PurgeStepAfter = 2 months* slurmdbd: debug2: PurgeSuspendAfter = 1 months* slurmdbd: debug2: PurgeTXNAfter = 3 months* slurmdbd: debug2: PurgeUsageAfter = 3 months* slurmdbd: debug2: SlurmUser = slurm(666) slurmdbd: debug2: StorageBackupHost = (null) slurmdbd: debug2: StorageHost = localhost slurmdbd: debug2: StorageLoc = slurm_acct_db slurmdbd: debug2: StorageParameters = (null) slurmdbd: debug2: StoragePort = 3306 slurmdbd: debug2: StorageType = accounting_storage/mysql slurmdbd: debug2: StorageUser = slurm slurmdbd: debug2: TCPTimeout = 2 slurmdbd: debug2: TrackWCKey = 0 slurmdbd: debug2: TrackSlurmctldDown= 0 slurmdbd: debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: acct_storage_p_get_connection: request new connectio n 1 slurmdbd: debug2: Attempting to connect to localhost:3306 ... slurmdbd: Warning: Note very large processing time from _post_user_list: usec=3167468 began=11:39:36.287 slurmdbd: slurmdbd version 21.08.1 started slurmdbd: debug2: running rollup at Tue Oct 05 11:39:39 2021 slurmdbd: debug2: Attempting to connect to localhost:3306 slurmdbd: debug2: _slurm_connect: failed to connect to 127.0.0.1:6817: Connection refused slurmdbd: debug2: Error connecting slurm stream socket at 127.0.0.1:6817: Connection refused slurmdbd: debug2: accounting_storage/as_mysql: _cluster_rollup_usage: No need to roll cluster farnam this day 1633406400 <= 163340 6400 slurmdbd: debug2: accounting_storage/as_mysql: _cluster_rollup_usage: No need to roll cluster farnam this month 1633060800 <= 1633 060800 slurmdbd: debug2: accounting_storage/as_mysql: as_mysql_roll_usage: Got 1 of 1 rolled up slurmdbd: debug2: accounting_storage/as_mysql: as_mysql_roll_usage: Everything rolled up Thanks for the update, glad to see that the conversion completed! We are going to update our documentation as well to suggest a manual start of the process for the upgrade as well (we've suggested it in the past, but it didn't make it into the documentation). It sounds like you have things handled, but before I resolve this just let me know if you need anything else from me on this! Thanks, --Tim Downgrading pending a reply. Hey Adam, Just wanted to check in and make sure you are all set! Thanks, --Tim Yes, we’re all good now. Thanks! Adam From: "bugs@schedmd.com" <bugs@schedmd.com> Date: Wednesday, October 6, 2021 at 1:10 PM To: "Munro, Adam" <adam.munro@yale.edu> Subject: [Bug 12604] Can't start slurmdbd after upgrade Comment # 29<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugs.schedmd.com%2Fshow_bug.cgi%3Fid%3D12604%23c29&data=04%7C01%7Cadam.munro%40yale.edu%7Ca5722f4c5a304c37b79208d988ec3eac%7Cdd8cbebb21394df8b4114e3e87abeb5c%7C0%7C0%7C637691370515428299%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=oAv2kylfitgHdLeDkKn2dcGj4cPF%2Besi9Gj1sa3tLgs%3D&reserved=0> on bug 12604<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugs.schedmd.com%2Fshow_bug.cgi%3Fid%3D12604&data=04%7C01%7Cadam.munro%40yale.edu%7Ca5722f4c5a304c37b79208d988ec3eac%7Cdd8cbebb21394df8b4114e3e87abeb5c%7C0%7C0%7C637691370515438293%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=VjDOoBnADU9bC5RSMGqbxv%2F7G2S5EfxaBazD8exoFPM%3D&reserved=0> from Tim McMullan<mailto:mcmullan@schedmd.com> Hey Adam, Just wanted to check in and make sure you are all set! Thanks, --Tim ________________________________ You are receiving this mail because: * You reported the bug. Awesome! Thanks Adam, I'll resolve this ticket now. Thanks! --Tim |
going from 20.02.5b to 21.08.1 slurmdbd fails to start, due to: Oct 05 09:22:22 p2r1u27 slurmdbd[44519]: error: mysql_query failed: 1054 Unknown column 'resv_secs' in 'farnam_usage_day_table' alter table "farnam_usage_day_table" change resv_secs plan_secs bigint unsigned default 0 not null; Oct 05 09:22:22 p2r1u27 slurmdbd[44519]: error: Can't update "farnam_usage_day_table" Unknown error 1054 Oct 05 09:22:22 p2r1u27 slurmdbd[44519]: error: issue converting tables before create Thank you, Adam