Hi SLURM Support We are planning to upgrade SLURM to 17.11.4 from 16.05.4 in our live production cluster, but are concerned about the very long upgrade time for its DB updates, as mentioned in the SLURM forum. Our DB currently has 1152045 records. What should we watch out for? Or should we just do it and try to resolve any problems later? Kindly advise. Thanks. Cheers Damien
Hi Damien, As a reference point, one of our databases with 3M jobs on an NVMe drive took 12 minutes to finish the upgrade. The exact same database on different hardware took around 45 minutes, so hardware matters a lot. As another reference, a customer with 5M records took about 2 hours. Upgrading shouldn't impact any running jobs, and the queue should work just fine; records won't be written to the database during the update, but they'll be queued up and written once the update is complete. What kind of hardware do you have for your database? If you are highly concerned about this, you could back up your database and test the upgrade on another test server with similar or slower hardware; then you will have an estimate of how much time it will take. Another point to take into consideration: if you're running slurmdbd under systemd and it takes a long time to convert or roll up, it could be detected as a timeout and the process killed. One option here is to run slurmdbd in the foreground (using the slurmdbd -D flag) so it runs separately from systemd. Run it at least at the info log level so you can see this message: "Conversion done: success!" Finally, one more question: which operating system are you running Slurm on? If you are using rpms, slurm.spec has been rewritten and will not work on RHEL versions older than 7, in which case you would have to use slurm-legacy.spec. Hope it helps.
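[Editor's note] Running slurmdbd in the foreground as described above is the simplest route. As a hedged alternative sketch (the unit name and drop-in path are assumptions about your setup), a systemd drop-in can raise the start timeout so the unit is not killed mid-conversion:

```ini
# /etc/systemd/system/slurmdbd.service.d/override.conf
# Give the unit an effectively unlimited start timeout for the upgrade;
# revert or remove this drop-in afterwards.
[Service]
TimeoutStartSec=infinity
```

After creating the drop-in, run 'systemctl daemon-reload' before starting the service.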
Hi Felip Thanks for the replies. We are running our MySQL and slurmdbd on separate VMs of 2 CPU cores each. Yes, it is very lightweight. Do you think it will be a serious concern? We are planning to implement a downtime for this upgrade to SLURM 17.11.4, but are still concerned about the conversion time. Cheers Damien
(In reply to Damien from comment #2) > Hi Felip > > Thanks for the replies. > > We are running our mySQL and slurmDB on separate VMs of 2 CPU-cores each. > Yes, it is very light weight. > > Do you think it will be a serious concern ? > > We are planning to implement a downtime for this upgrade to SLURM 17.11.4, > but still concern of the conversion time. > > > > Cheers > > > Damien 2 CPU cores is not a lot; does it have enough memory? We generally do not recommend VMs for slurmctld or slurmdbd, and I cannot estimate a precise time, but given your record count (1M) it shouldn't be a huge problem. If you want to be 100% sure, here is what I would do: Try to clone your VMs, isolate the machine, and try the conversion. That may be the quickest method. Or, with a little more work, you can do it on your workstation; just install a Slurm instance: 1. mysqldump your database and restore it 2. git clone the slurm code 3. ./configure --prefix=/your/install/basedir && make -j install 4. cd /your/install/basedir and configure a minimal slurmdbd.conf and slurmctld.conf 5. Just start /your/install/basedir/sbin/slurmdbd -Dvvv manually; no need to start slurmctld or the slurmds. The advantage here is that if it works on the testbed and something then goes wrong in production, you already have a converted database to import into the new Slurm. Of course... consider also increasing the power of these VMs or, even better, getting a bare-metal machine for it.
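[Editor's note] The five steps above can be sketched as a script. This is a hedged sketch: the repository URL, credentials, database names, and paths are assumptions and must be adapted. It is written to a file and syntax-checked rather than executed, so it can be reviewed first:

```shell
# Hedged sketch of the sandbox conversion test; all paths, the repo URL,
# and DB credentials are assumptions. Saved to a file for review rather
# than executed directly.
cat > /tmp/slurm_convert_test.sh <<'EOF'
#!/bin/bash
set -euo pipefail
# 1. Dump the production DB and restore it into the test MySQL instance
mysqldump -u slurm -p slurm_acct_db > /tmp/slurm_acct_db.sql
mysql -u slurm -p slurm_acct_db < /tmp/slurm_acct_db.sql
# 2-3. Build the target Slurm release from source
git clone https://github.com/SchedMD/slurm.git /tmp/slurm-src
cd /tmp/slurm-src
./configure --prefix=/your/install/basedir && make -j install
# 4-5. With a minimal slurmdbd.conf in place, run and time the conversion
cd /your/install/basedir
time ./sbin/slurmdbd -Dvvv
EOF
bash -n /tmp/slurm_convert_test.sh && echo "syntax OK"
```

Timing the foreground run gives the conversion estimate the thread is after.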
Hi Damien, Do you have any other questions about this matter? If not, is it OK for you to close the bug? Thank you
Hi Felip In regard to this: Can we do a 'sacctmgr archive dump' without defining a purge? If this works, how should we load these events/jobs back? 'sacctmgr archive load'? We have not done this before, and searching the web for this, there aren't many examples or success stories. Kindly comment. Thanks. Cheers Damien
(In reply to Damien from comment #5) > Hi Felip > > In regard to this: > > Can we do a 'sacctmgr archive dump' ? without defining a purge ? > > If this works, How should we load back this events/jobs ? sacctmgr archive > load ? > > > We have not done this before, and searching for this on the web, there isn't > much example or success stories of this in the web. > > > Kindly comment. Thanks. > > > Cheers > > Damien Are you doing this to try to start a Slurm instance on your laptop, as I suggested in comment 3? If so, I was referring to a raw mysqldump command, not a sacctmgr dump: https://dev.mysql.com/doc/refman/5.7/en/mysqldump.html If what you are trying to do is reduce the size of your database while archiving data, you must define purge and archive values in your conf file, and then let a rollup happen automatically. Using sacctmgr for archiving is not recommended. But first, tell me why you want to archive/purge data.
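[Editor's note] For reference, a sketch of the slurmdbd.conf parameters that drive the internal purge/archive process mentioned above. The values shown are illustrative assumptions only; see the slurmdbd.conf man page for the full set:

```
# slurmdbd.conf fragment (illustrative values, adapt to your retention policy)
ArchiveDir=/var/spool/slurm/archive   # where archive files are written
ArchiveJobs=yes
ArchiveSteps=yes
ArchiveEvents=yes
PurgeJobAfter=12month                 # job rows older than this are purged
PurgeStepAfter=12month
PurgeEventAfter=6month
PurgeSuspendAfter=6month
```

With these set, the nightly rollup archives and purges old records without any sacctmgr involvement.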
Hi Felip Following your advice, I am constructing a minimal slurmdbd with MariaDB on the same host, to load, view and convert the existing database. I am hitting this error. I have seen it before. [root@slurmdb-dl slurm]# [root@slurmdb-dl slurm]# /usr/local/slurm/sbin/slurmdbd -Dvvv slurmdbd: debug: Log file re-opened slurmdbd: debug: Null authentication plugin loaded slurmdbd: error: Couldn't find the specified plugin name for accounting_storage/mysql looking at all files slurmdbd: error: cannot find accounting_storage plugin for accounting_storage/mysql slurmdbd: error: cannot create accounting_storage context for accounting_storage/mysql slurmdbd: fatal: Unable to initialize accounting_storage/mysql accounting storage plugin [root@slurmdb-dl slurm]# I believed mariadb-devel was not installed, but it is installed: [root@slurmdb-dl slurm]# yum install mariadb-server mariadb-devel Loaded plugins: fastestmirror Loading mirror speeds from cached hostfile * epel: mirror.as24220.net Package 1:mariadb-server-5.5.56-2.el7.x86_64 is obsoleted by MariaDB-server-10.2.13-1.el7.centos.x86_64 which is already installed Package 1:mariadb-devel-5.5.56-2.el7.x86_64 is obsoleted by MariaDB-devel-10.2.13-1.el7.centos.x86_64 which is already installed Nothing to do Any clues to refresh my memory? Kindly advise. Thanks. Cheers Damien > > 2-cpu cores is not a lot, does it have enough memory? We use not to > recommend VMs for slurmctld or dbd and I cannot estimate a precise time, but > given your records (1M) it shouldn't be a huge problem. > > > If you want to be 100% sure, what I would do: > > Try to clone your VMs, isolate the machine, and try the conversion. May be > the quickest method. > > > Or with little more work you can do it in your workstation, just install a > Slurm instance: > > 1. mysqldump your database and restore it > > 2. git clone slurm code > > 3. ./configure --prefix=/your/install/basedir && make -j install > > 4. 
cd /your/install/basedir and configure a minimal slurmdbd.conf and > slurmctld.conf > > 5. Just start manually /your/install/basedir/sbin/slurmdbd -Dvvv , no need > to start slurmctld or slurmds. > > > The advantadge here is if that it works on the testbed and something goes > wrong, you have an already converted database to import on the new slurm. > > > Of course... consider also increasing the power of these VMs or even better, > get a bare metal for it.
Damien, Maybe you need mysql-devel instead of mariadb-devel? You should look at the ./configure output to see if it detects the library or not. This is my *minimal* list of development packages for CentOS 7; pick what you need from here: munge hwloc libssh2 ncurses expect numactl libcurl psmisc kernel-devel gcc cpp autoconf automake gdb mariadb-server libevent-devel hwloc-devel openssl-devel libssh2-devel mysql-devel pam-devel gtk2-devel ncurses-devel lua-devel rrdtool-devel rpm-build bzip2-devel munge-devel libcurl-devel numactl-devel readline-devel perl-ExtUtils-MakeMaker (In reply to Damien from comment #7) > Hi Felip > > As taken from your advice. > > I am constructing a minimal slurmdb with mariaDB in the same host, to load, > view and convert the exist database. > > > I am hit with this error. I have seen this before. > > [root@slurmdb-dl slurm]# > [root@slurmdb-dl slurm]# /usr/local/slurm/sbin/slurmdbd -Dvvv > slurmdbd: debug: Log file re-opened > slurmdbd: debug: Null authentication plugin loaded > slurmdbd: error: Couldn't find the specified plugin name for > accounting_storage/mysql looking at all files > slurmdbd: error: cannot find accounting_storage plugin for > accounting_storage/mysql > slurmdbd: error: cannot create accounting_storage context for > accounting_storage/mysql > slurmdbd: fatal: Unable to initialize accounting_storage/mysql accounting > storage plugin > [root@slurmdb-dl slurm]# > > > > I believe mariadb-devel is not install, but it is been installed. 
> > > [root@slurmdb-dl slurm]# yum install mariadb-server mariadb-devel > Loaded plugins: fastestmirror > Loading mirror speeds from cached hostfile > * epel: mirror.as24220.net > Package 1:mariadb-server-5.5.56-2.el7.x86_64 is obsoleted by > MariaDB-server-10.2.13-1.el7.centos.x86_64 which is already installed > Package 1:mariadb-devel-5.5.56-2.el7.x86_64 is obsoleted by > MariaDB-devel-10.2.13-1.el7.centos.x86_64 which is already installed > Nothing to do > > > > Any clues to refresh my memory? > > > Kindly advise. Thanks. > > > Cheers > > Damien > > > > > > > > > > > > 2-cpu cores is not a lot, does it have enough memory? We use not to > > recommend VMs for slurmctld or dbd and I cannot estimate a precise time, but > > given your records (1M) it shouldn't be a huge problem. > > > > > > If you want to be 100% sure, what I would do: > > > > Try to clone your VMs, isolate the machine, and try the conversion. May be > > the quickest method. > > > > > > Or with little more work you can do it in your workstation, just install a > > Slurm instance: > > > > 1. mysqldump your database and restore it > > > > 2. git clone slurm code > > > > 3. ./configure --prefix=/your/install/basedir && make -j install > > > > 4. cd /your/install/basedir and configure a minimal slurmdbd.conf and > > slurmctld.conf > > > > 5. Just start manually /your/install/basedir/sbin/slurmdbd -Dvvv , no need > > to start slurmctld or slurmds. > > > > > > The advantadge here is if that it works on the testbed and something goes > > wrong, you have an already converted database to import on the new slurm. > > > > > > Of course... consider also increasing the power of these VMs or even better, > > get a bare metal for it.
Sorry, on my Fedora I have the mariadb packages, not the mysql ones: ]$ rpm -qa|grep mariad mariadb-common-10.1.26-2.fc25.x86_64 mariadb-config-10.1.26-2.fc25.x86_64 mariadb-server-10.1.26-2.fc25.x86_64 mariadb-libs-10.1.26-2.fc25.x86_64 mariadb-10.1.26-2.fc25.x86_64 mariadb-server-utils-10.1.26-2.fc25.x86_64 mariadb-devel-10.1.26-2.fc25.x86_64 mariadb-errmsg-10.1.26-2.fc25.x86_64 Take a look at the configure output. I think you may have some conflicts in your system: > Loading mirror speeds from cached hostfile > * epel: mirror.as24220.net > Package 1:mariadb-server-5.5.56-2.el7.x86_64 is obsoleted by > MariaDB-server-10.2.13-1.el7.centos.x86_64 which is already installed
Hi Felip Spending more time troubleshooting resolved my issue. We can now mysqldump the Slurm database and review its records from another location/server. Can you share with us why using sacctmgr for archiving is not recommended? We were planning to use 'sacctmgr archive dump' and 'sacctmgr archive load' to dynamically load and unload old records when necessary. Do you have the syntax or an example for 'sacctmgr archive dump' and 'sacctmgr archive load'? Kindly advise. Thanks. Cheers Damien
(In reply to Damien from comment #10) > Hi Felip > > Spending more time troubleshooting resolved my issue. We can now mysql dump > the Slurm database and review it's records from another location/server. > > Can you share with us why using sacctmgr for archiving is not recommended ? > > We were planning to use sacctmgr archive dump and sacctmgr archive load to > dynamic load and unload old records while necessary. > > > Do you have my syntax or example for sacctmgr archive dump and sacctmgr > archive load ? > > > > > Kindly advise. Thanks. > > Cheers > > Damien We don't recommend using 'sacctmgr archive' since there can be issues if your archive takes too long to complete. If that happens, your 'sacctmgr' command, depending on MessageTimeout and other internal timers, can time out and cut the communication with the archiving process. This can lead to a dangerous situation where the archive is not completed correctly. There are also problems with the locking mechanism used by sacctmgr, which is not the same as the one used by the internal archiving process. Finally, this command can block the database, affecting production. This command must be considered a 'development' tool not intended for production usage. We have opened a bug to clarify this situation. So, I recommend you always use the internal archiving process. On your testing server, in order to trigger a rollup and therefore an archive, modify your last_ran_table like this: 1. Stop slurmdbd 2. Modify last_ran_table update <yourcluster>_last_ran_table SET hourly_rollup = UNIX_TIMESTAMP('2018-03-11 00:00:00'), daily_rollup = UNIX_TIMESTAMP('2018-03-11 00:00:00'), monthly_rollup = UNIX_TIMESTAMP('2018-03-11 00:00:00'); 3. Start slurmdbd Of course you have to set the appropriate parameters in the slurmdbd conf file. One more thing: we recently discovered an issue with MySQL 5.1 that causes the upgrade process to be slow or to never end. 
This is probably the cause of the 'very long upgrading time' you are concerned about. Although we have not yet identified the exact cause, we have verified that with 5.1 the upgrade consistently either fails to complete or takes far too long, so I strongly recommend you check your MySQL version before upgrading. If you still want to use sacctmgr, there is some explanation here: https://slurm.schedmd.com/sacctmgr.html#lbBL
Hi Slurm Support We encountered problems with slurmdbd during the 17.11.04 update on our production cluster. This production cluster was marked down for a 12-hour maintenance window for this upgrade. At the start of the upgrade, we let slurmdbd run the conversion from 16.05.04; after almost 40 minutes, we observed these messages: --- slurmdbd: debug: Table qos_table has changed. Updating... slurmdbd: debug: Table user_table has changed. Updating... slurmdbd: converting QOS table slurmdbd: Conversion done: success! slurmdbd: Accounting storage MYSQL plugin loaded slurmdbd: debug3: Success. --- We believed the database had been updated successfully, so we moved on to slurmctld and the slurmd nodes. Two hours later we were happy with the whole 17.11.04 upgrade; the Slurm cluster was put back online and we began our initial testing. Within the next hour, we released the whole Slurm cluster back to our users. About 16 hours later, slurmdbd crashed. slurmctld and the respective slurmd daemons were still doing their tasks properly. 
A closer examination of the slurmdbd logs shows lots of: -- [2018-03-15T04:03:24.389] error: It looks like the storage has gone away trying to reconnect [2018-03-15T04:03:29.441] error: It looks like the storage has gone away trying to reconnect [2018-03-15T04:03:34.481] error: It looks like the storage has gone away trying to reconnect [2018-03-15T04:03:39.521] error: It looks like the storage has gone away trying to reconnect (... the same error repeated roughly every 5 seconds, through at least [2018-03-15T04:06:41.218] ...) -- sreport is not working either (this was tested successfully right after the update): -- sreport cluster Utilization cluster=m3 start=2018-03-14T00:00:00 end=2018-03-14T23:59:59 sreport: error: slurmdbd: Unknown error 1054 
Problem with cluster query. -- Double-checking the MySQL (MariaDB) instance, it was still running. We also discovered more updating of tables more than 12 hours after the upgrade, see: --- [2018-03-15T18:52:32.063] debug2: DBD_JOB_START: START CALL ID:1736077 NAME:de_e5_1303 INX:0 [2018-03-15T18:52:32.064] debug2: as_mysql_slurmdb_job_start() called [2018-03-15T18:52:32.064] error: mysql_query failed: 1054 Unknown column 'pack_job_id' in 'field list' insert into "m3_job_table" (id_job, mod_time, id_array_job, id_array_task, pack_job_id, pack_job_offset, id_assoc, id_qos, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, nodes_alloc, mem_req, account, `partition`, node_inx, gres_req, gres_alloc, array_task_str, array_task_pending, tres_alloc, tres_req, work_dir) values (1736077, UNIX_TIMESTAMP(), 0, 4294967294, 0, 4294967294, 1166, 1, 10540, 10025, 'm3c005', 0, 4320, 1521048532, 1521048532, 1521048533, 'de_e5_1303', 0, 5, 18240, 1, 1, 9223372036854779904, 'qe26', 'm3c', '27', 'gpu:2', 'gpu:2', NULL, 0, '1=2,2=8192,3=18446744073709551614,4=1,5=2,1001=2', '1=1,2=4096,4=1,1001=2', '/projects/qe26/autoqg') on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_assoc=1166, id_user=10540, id_group=10025, nodelist='m3c005', id_resv=0, timelimit=4320, time_submit=1521048532, time_eligible=1521048532, time_start=1521048533, mod_time=UNIX_TIMESTAMP(), job_name='de_e5_1303', track_steps=0, id_qos=1, state=greatest(state, 5), priority=18240, cpus_req=1, nodes_alloc=1, mem_req=9223372036854779904, id_array_job=0, id_array_task=4294967294, pack_job_id=0, pack_job_offset=4294967294, account='qe26', `partition`='m3c', node_inx='27', gres_req='gpu:2', gres_alloc='gpu:2', array_task_str=NULL, array_task_pending=0, tres_alloc='1=2,2=8192,3=18446744073709551614,4=1,5=2,1001=2', tres_req='1=1,2=4096,4=1,1001=2', work_dir='/projects/qe26/autoqg' [2018-03-15T18:52:32.064] error: It looks like the storage has gone away trying to reconnect [2018-03-15T18:52:32.064] error: mysql_query failed: 1054 Unknown column 'pack_job_id' in 'field list' (... the exact same insert query repeated verbatim on the retry ...) and lots of: [2018-03-15T22:36:36.117] debug: Table "m3_job_table" has changed. Updating... 
[2018-03-15T22:37:22.538] error: mysql_query failed: 1060 Duplicate column name 'admin_comment' alter table "m3_job_table" modify `job_db_inx` bigint unsigned not null auto_increment, modify `mod_time` bigint unsigned default 0 not null, modify `deleted` tinyint default 0 not null, modify `account` tinytext, add `admin_comment` text after account, modify `array_task_str` text, modify `array_max_tasks` int unsigned default 0 not null, modify `array_task_pending` int unsigned default 0 not null, modify `cpus_req` int unsigned not null, modify `derived_ec` int unsigned default 0 not null, modify `derived_es` text, modify `exit_code` int unsigned default 0 not null, modify `job_name` tinytext not null, modify `id_assoc` int unsigned not null, modify `id_array_job` int unsigned default 0 not null, modify `id_array_task` int unsigned default 0xfffffffe not null, modify `id_block` tinytext, modify `id_job` int unsigned not null, modify `id_qos` int unsigned default 0 not null, modify `id_resv` int unsigned not null, modify `id_wckey` int unsigned not null, modify `id_user` int unsigned not null, modify `id_group` int unsigned not null, add `pack_job_id` int unsigned not null after id_group, add `pack_job_offset` int unsigned not null after pack_job_id, modify `kill_requid` int default -1 not null, add `mcs_label` tinytext default '' after kill_requid, modify `mem_req` bigint unsigned default 0 not null, modify `nodelist` text, modify `nodes_alloc` int unsigned not null, modify `node_inx` text, modify `partition` tinytext not null, modify `priority` int unsigned not null, modify `state` int unsigned not null, modify `timelimit` int unsigned default 0 not null, modify `time_submit` bigint unsigned default 0 not null, modify `time_eligible` bigint unsigned default 0 not null, modify `time_start` bigint unsigned default 0 not null, modify `time_end` bigint unsigned default 0 not null, modify `time_suspended` bigint unsigned default 0 not null, modify `gres_req` text not null 
default '', modify `gres_alloc` text not null default '', modify `gres_used` text not null default '', modify `wckey` tinytext not null default '', add `work_dir` text not null default '' after wckey, modify `track_steps` tinyint not null, modify `tres_alloc` text not null default '', modify `tres_req` text not null default '', drop primary key, add primary key (job_db_inx), drop index id_job, add unique index (id_job, time_submit), add key old_tuple (id_job, id_assoc, time_submit), drop key rollup, add key rollup (time_eligible, time_end), drop key rollup2, add key rollup2 (time_end, time_eligible), drop key nodes_alloc, add key nodes_alloc (nodes_alloc), drop key wckey, add key wckey (id_wckey), drop key qos, add key qos (id_qos), drop key association, add key association (id_assoc), drop key array_job, add key array_job (id_array_job), add key pack_job (pack_job_id), drop key reserv, add key reserv (id_resv), drop key sacct_def, add key sacct_def (id_user, time_start, time_end), drop key sacct_def2, add key sacct_def2 (id_user, time_end, time_eligible); [2018-03-15T22:37:22.538] Accounting storage MYSQL plugin failed [2018-03-15T22:37:22.539] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed [2018-03-15T22:37:22.539] error: cannot create accounting_storage context for accounting_storage/mysql [2018-03-15T22:37:22.539] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin --- We had no choice but to restart slurmdbd a couple of times before it stabilised. It has been alright so far (1.5 days). What could have gone wrong? Did we miss any tasks or steps? Kindly advise. Thanks. Cheers Damien
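[Editor's note] A quick way to quantify a reconnect flood like the one above is to bucket the errors per minute, which shows when the database became unreachable. A sketch, demonstrated on an inline sample; point it at the real slurmdbd log instead:

```shell
# Count "storage has gone away" reconnect errors per minute in a
# slurmdbd log. The bracketed timestamp is field 2 when splitting on
# '[' and ']'; its first 16 chars are the minute bucket.
count_reconnects() {
    awk -F'[][]' '/storage has gone away/ {print substr($2, 1, 16)}' "$1" \
        | sort | uniq -c
}

# Inline sample standing in for /var/log/slurmdbd.log (path assumed)
cat > /tmp/slurmdbd.sample.log <<'EOF'
[2018-03-15T04:03:24.389] error: It looks like the storage has gone away trying to reconnect
[2018-03-15T04:03:29.441] error: It looks like the storage has gone away trying to reconnect
[2018-03-15T04:04:04.769] error: It looks like the storage has gone away trying to reconnect
EOF
count_reconnects /tmp/slurmdbd.sample.log
```

On the sample this yields two buckets, 04:03 with two hits and 04:04 with one; correlating the busiest buckets with MySQL and system logs is what the follow-up comments suggest.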
(In reply to Damien from comment #12) > slurmdbd: debug: Table qos_table has changed. Updating... > slurmdbd: debug: Table user_table has changed. Updating... > slurmdbd: converting QOS table > slurmdbd: Conversion done: success! > slurmdbd: Accounting storage MYSQL plugin loaded > slurmdbd: debug3: Success. > --- I assume you also saw other tables' messages, like: debug: Table "m3_job_table" has changed. Updating... > We believe the database has been updated successfully, so we move on to > slurmctl and the other slurmd. This was correct, as this was shown: > slurmdbd: Conversion done: success! > In about the next16 hours, the slurmdb crashes. slurmctl and the respective > slurmd are still doing their tasks properly. > A closer examinations on slurmdb logs shows: > --lots > 2018-03-15T04:03:24.389] error: It looks like the storage has gone away > trying to reconnect This is very strange. The messages you show mean that slurmdbd cannot reach the MySQL instance... this can be a MySQL problem or a communication problem. The latter doesn't seem likely since it is on the same host, unless you changed your DNS, added a wrong iptables rule, etc. Other possibilities are that the Slurm user was blocked for some reason in SQL, e.g. for exceeding the number of connections, an error in the underlying storage, a mistake in the password in slurmdbd.conf, or simply that MySQL was so loaded that it couldn't serve the requests. I recommend looking at the system logs and MySQL logs around that time; tell me if you see something strange. It is also possible that systemd has some influence here; I have seen cases where it sent a SIGTERM to slurmdbd processes because they were not daemonized. See bug 3532. > We also discovered more updating of tables after 12 hours of the upgrades, > see: > --- Maybe because some rollback happened at that time due to the failures discussed above. Send me back the full slurmdbd log. 
> We have no choice but to restart slurmDB a couple of times before this > stabilise. It is alright so far (1 1/2 days). What could have went wrong ? If it is stabilized now, everything should be fine. I recommend dumping MySQL now and keeping a backup, just in case. > Did we miss any tasks or steps ? I don't think so; it seems more like an uncontrolled failure at some point. Awaiting your response.
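[Editor's note] While reviewing the MySQL side, it may also be worth checking the InnoDB settings. A hedged my.cnf fragment along the lines of the values recommended in the Slurm accounting documentation (sizes are illustrative; tune to your RAM and workload):

```ini
# my.cnf fragment for the slurmdbd database (illustrative values)
[mysqld]
innodb_buffer_pool_size = 1024M   # large enough to hold the working set
innodb_log_file_size = 64M
innodb_lock_wait_timeout = 900    # generous timeout for long conversions
```

An undersized buffer pool or a short lock wait timeout can make large schema conversions and rollups far slower.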
The slurmdbd log is huge. I will send it when I'm in the office (7 hours from now).
Created attachment 6424 [details] slurmdb logs The full slurmdbd log of that period. Please gunzip the file first.
Hi Felip I have uploaded the slurmdbd logs; it is a huge log file. The upgrade started on the morning of 14 Mar 2018. Kindly investigate. Thanks. Cheers Damien
(In reply to Damien from comment #16) > Hi Felip > > I have uploaded the slurmdb logs, it is a huge log file. > > The upgrading start at the Morning 14th Mar 2018. > > > Kindly investigate. Thanks. > > > Cheers > > Damien Hi Damien, Warning, long explanation coming: ------------------------------------ The log starts after the conversion @ [2018-03-14T13:11:14.337], so I cannot see how the first conversion went. What I see is that, for whatever reason, the conversion was not fully completed or was reverted at some point by MySQL. In your logs, problems start at the nightly rollup at 2018-03-15 04:00:00. The server tries to roll up the tables but finds that the database is not fully converted. It tries to roll up every hour until you restart slurmdbd one day later at 2018-03-15T22:32:26.423, when the conversion starts again. [2018-03-15T22:32:26.423] debug: Log file re-opened [2018-03-15T22:32:26.423] debug3: Trying to load plugin /opt/slurm-17.11.4/lib/slurm/auth_munge.so [2018-03-15T22:32:26.424] debug: Munge authentication plugin loaded [2018-03-15T22:32:26.424] debug3: Success. [2018-03-15T22:32:26.424] debug3: Trying to load plugin /opt/slurm-17.11.4/lib/slurm/accounting_storage_mysql.so [2018-03-15T22:32:26.425] debug2: mysql_connect() called for db slurm_acct_db [2018-03-15T22:32:26.439] debug4: 0(as_mysql_convert.c:816) query select version from convert_version_table [2018-03-15T22:32:26.439] debug4: as_mysql_convert_get_bad_tres: No conversion needed, Horray! [2018-03-15T22:32:26.439] debug4: as_mysql_convert_tables_pre_create: No conversion needed, Horray! [2018-03-15T22:32:26.443] debug: Table "m3_assoc_table" has changed. Updating... [2018-03-15T22:32:26.522] debug: Table "m3_assoc_usage_day_table" has changed. Updating... [2018-03-15T22:32:27.104] debug: Table "m3_assoc_usage_hour_table" has changed. Updating... 
[2018-03-15T22:32:35.804] Warning: Note very large processing time from make table current "m3_assoc_usage_hour_table": usec=8699682 began=22:32:27.104 [2018-03-15T22:32:35.811] debug: Table "m3_assoc_usage_month_table" has changed. Updating... [2018-03-15T22:32:35.883] debug: Table "m3_usage_day_table" has changed. Updating... [2018-03-15T22:32:35.949] debug: Table "m3_usage_hour_table" has changed. Updating... [2018-03-15T22:32:36.859] debug: Table "m3_usage_month_table" has changed. Updating... [2018-03-15T22:32:37.033] debug: Table "m3_event_table" has changed. Updating... [2018-03-15T22:32:37.140] adding column admin_comment after account in table "m3_job_table" [2018-03-15T22:32:37.140] adding column pack_job_id after id_group in table "m3_job_table" [2018-03-15T22:32:37.140] adding column pack_job_offset after pack_job_id in table "m3_job_table" [2018-03-15T22:32:37.140] adding column mcs_label after kill_requid in table "m3_job_table" [2018-03-15T22:32:37.140] adding column work_dir after wckey in table "m3_job_table" [2018-03-15T22:32:37.140] adding key old_tuple (id_job, id_assoc, time_submit) to table "m3_job_table" [2018-03-15T22:32:37.140] adding key pack_job (pack_job_id) to table "m3_job_table" Then I see that while the conversion was running, you restarted slurmdbd again: [2018-03-15T22:32:37.140] debug: Table "m3_job_table" has changed. Updating... [2018-03-15T22:35:00.635] debug: Log file re-opened And again later: [2018-03-15T22:35:00.686] debug: Table "m3_job_table" has changed. Updating... [2018-03-15T22:36:36.066] debug: Log file re-opened Then we get an error since the job table was partially converted: [2018-03-15T22:36:36.117] debug: Table "m3_job_table" has changed. Updating... 
[2018-03-15T22:37:22.538] error: mysql_query failed: 1060 Duplicate column name 'admin_comment'

Therefore it fails to start, and you start slurmdbd again:

[2018-03-15T22:37:22.538] Accounting storage MYSQL plugin failed
[2018-03-15T22:37:22.539] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
[2018-03-15T22:37:22.539] error: cannot create accounting_storage context for accounting_storage/mysql
[2018-03-15T22:37:22.539] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
[2018-03-15T22:41:29.313] debug: Log file re-opened

This time you let it run long enough to finish correctly; it takes about 10 minutes:

[2018-03-15T22:51:29.028] Accounting storage MYSQL plugin loaded

Then you get some normal errors; these correspond to jobs not being updated while slurmdbd was unresponsive, and should be fixed automatically:

[2018-03-15T22:51:30.261] error: We have more allocated time than is possible (6341464 > 5990400) for cluster m3(1664) from 2018-03-15T15:00:00 - 2018-03-15T16:00:00 tres 1
..
[2018-03-15T23:00:00.597] debug3: WARNING: Unused wall is less than zero; this should never happen. Setting it to zero for resv id = 94, start = 1519780166.

Then you restart again:

[2018-03-15T23:01:42.528] debug3: starting mysql cleaning up
[2018-03-15T23:01:42.528] debug3: finished mysql cleaning up
[2018-03-15T23:01:58.488] debug: Log file re-opened
...
[2018-03-15T23:01:58.538] debug4: as_mysql_convert_get_bad_tres: No conversion needed, Horray!
[2018-03-15T23:01:58.538] debug4: as_mysql_convert_tables_pre_create: No conversion needed, Horray!
[2018-03-15T23:01:58.608] debug4: as_mysql_convert_tables_post_create: No conversion needed, Horray!
[2018-03-15T23:01:58.632] debug4: as_mysql_convert_non_cluster_tables_post_create: No conversion needed, Horray!
[2018-03-15T23:01:58.633] Accounting storage MYSQL plugin loaded

And everything seems fine from there on.
-------------------------------------------------

To summarize:

1. Thanks for the huge log, it clarifies the situation.

2. My guess is that a MySQL rollback happened between 2018-03-14T13:11:14.337 and 2018-03-15 04:00:00, which undid the conversion done by slurmdbd. To be sure of this we would need to inspect the logs of the first conversion, prior to 2018-03-14, and the MySQL logs for the first conversion and for that time period, if there are any.

3. It seems everything is fine now, but, to be sure, please run:

sacctmgr show runaway

Don't fix anything yet; just paste me the output.

4. Urgent/very important: you have 'DebugLevel=9' set, which corresponds to 'debug4'. You should urgently lower it to at least 'DebugLevel=debug'. If this is a production server, slurmdbd performance will be affected, and that can lead to problems.
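For reference, a minimal sketch of the change in slurmdbd.conf (the DebugLevel parameter is real; the file's location varies per install, commonly alongside slurm.conf):

```
# slurmdbd.conf
# Was: DebugLevel=9 (equivalent to debug4 -- logs every query and hurts throughput)
DebugLevel=debug
```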
Hi Felip

Thanks for your investigation. Here goes:

sacctmgr show runaway
NOTE: Runaway jobs are jobs that don't exist in the controller but are still considered running in the database

          ID       Name  Partition    Cluster      State           TimeStart             TimeEnd
------------ ---------- ---------- ---------- ---------- ------------------- -------------------
     1734720    desktop        m3c         m3    PENDING             Unknown             Unknown
     1735765  Rmmu_chn2        m3a         m3    PENDING             Unknown             Unknown
     1731790    desktop        m3f         m3    RUNNING 2018-03-10T10:05:01             Unknown
     1709218   md.slurm        m3h         m3    RUNNING 2018-03-11T18:14:37             Unknown
     1734160      BEASt        m3h         m3    RUNNING 2018-03-13T10:56:14             Unknown
     1735736    desktop        m3c         m3    RUNNING 2018-03-13T23:59:42             Unknown
     1735744  Rmmu_chn2        m3a         m3    RUNNING 2018-03-14T01:57:48             Unknown
     1735760  Dmmu_chn2        com         m3    RUNNING 2018-03-14T02:02:49             Unknown
     1735761   mmu_chn2        com         m3    RUNNING 2018-03-14T02:03:16             Unknown
     1735762  Dmmu_chn2        com         m3    RUNNING 2018-03-14T02:03:57             Unknown
     1735763   mmu_chn2        com         m3    RUNNING 2018-03-14T02:09:00             Unknown
     1735764  Dmmu_chn2        com         m3    RUNNING 2018-03-14T02:10:30             Unknown
     1735746  Rmmu_chn2        m3a         m3    RUNNING 2018-03-14T02:15:26             Unknown
     1735755  Rmmu_chn2        m3a         m3    RUNNING 2018-03-14T02:37:27             Unknown
     1735767  Dmmu_chn2        com         m3    RUNNING 2018-03-14T03:07:31             Unknown
     1735768  Dmmu_chn2        com         m3    RUNNING 2018-03-14T03:07:31             Unknown
     1735769   mmu_chn2        com         m3    RUNNING 2018-03-14T03:07:31             Unknown
     1735770  Dmmu_chn2        com         m3    RUNNING 2018-03-14T03:07:31             Unknown
     1735771  Dmmu_chn2        com         m3    RUNNING 2018-03-14T03:07:31             Unknown
     1735772  Dmmu_chn2        com         m3    RUNNING 2018-03-14T03:08:31             Unknown
     1735773  Dmmu_chn2        com         m3    RUNNING 2018-03-14T03:09:31             Unknown
     1735774  Dmmu_chn2        com         m3    RUNNING 2018-03-14T03:27:02             Unknown
     1735759  Rmmu_chn2        m3a         m3    RUNNING 2018-03-14T03:27:36             Unknown
     1735776   mmu_chn2        com         m3    RUNNING 2018-03-14T03:27:36             Unknown
     1735777  Dmmu_chn2        com         m3    RUNNING 2018-03-14T03:40:36             Unknown
     1735778   mmu_chn2        com         m3    RUNNING 2018-03-14T03:40:36             Unknown

Would you like to fix these runaway jobs?
(This will set the end time for each job to the latest out of the start, eligible, or submit times, and set the state to completed. Once corrected, this will trigger the rollup to reroll usage from before the oldest runaway job.)
(You have 30 seconds to decide)
(N/y):

--

You mentioned you suspected MySQL did a rollback the night before. Have you seen these scenarios before? Is this common?

Kindly advise. Thanks.

Cheers
Damien
Created attachment 6426 [details]
Old slurmdb log

Hi Felip

This is the earlier slurmdb log (newly attached).

The 'Conversion done' message is at '2018-03-14T11:06:10'.

I hope this is helpful.

Cheers
Damien
(In reply to Damien from comment #19)
> Created attachment 6426 [details]
> Old slurmdb log
>
> Hi Felip
>
> This is the earlier slurmdb log (new attached)
>
> The 'conversion done' message is at '2018-03-14T11:06:10'
>
> I hope that this is helpful.
>
> Cheers
> Damien

Hi Damien,

There's a failure in the middle of the conversion. Everything is working fine until somebody or something stops slurmdbd at:

[2018-03-14T09:54:31.377] Terminate signal (SIGINT or SIGTERM) received

* Then slurmdbd is started again, I guess with the new version, so the conversion starts:

[2018-03-14T09:57:15.629] debug4: Table convert_version_table doesn't exist, adding
..
[2018-03-14T09:57:15.825] debug4: 0(as_mysql_convert.c:816) query select version from convert_version_table
[2018-03-14T09:57:15.825] debug4: (as_mysql_convert.c:838) query insert into convert_version_table (version) values (0);
[2018-03-14T09:57:15.827] pre-converting job table for m3
[2018-03-14T09:57:15.864] adding column admin_comment after account in table "m3_job_table"
[2018-03-14T09:57:15.864] debug: Table "m3_job_table" has changed. Updating...

* At this point something kills slurmdbd, and the m3_job_table is left not fully converted. After some seconds, slurmdbd is started again:

[2018-03-14T10:00:09.508] debug: Log file re-opened
[2018-03-14T10:00:09.508] debug3: Trying to load plugin /opt/slurm-17.11.4/lib/slurm/auth_munge.so
[2018-03-14T10:00:09.508] debug: Munge authentication plugin loaded
[2018-03-14T10:00:09.508] debug3: Success.
[2018-03-14T10:00:09.508] debug3: Trying to load plugin /opt/slurm-17.11.4/lib/slurm/accounting_storage_mysql.so
[2018-03-14T10:00:09.510] debug2: mysql_connect() called for db slurm_acct_db
[2018-03-14T10:00:09.653] debug4: 0(as_mysql_convert.c:816) query select version from convert_version_table
[2018-03-14T10:00:09.654] pre-converting job table for m3
[2018-03-14T10:00:09.907] adding column admin_comment after account in table "m3_job_table"
[2018-03-14T10:00:09.907] debug: Table "m3_job_table" has changed. Updating...
[2018-03-14T10:01:38.972] error: mysql_query failed: 1060 Duplicate column name 'admin_comment'

* This second time, since the conversion was only partially done, it fails and stops:

[2018-03-14T10:01:38.972] error: issue converting tables before create
[2018-03-14T10:01:38.972] Accounting storage MYSQL plugin failed
[2018-03-14T10:01:38.973] error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
[2018-03-14T10:01:38.973] error: cannot create accounting_storage context for accounting_storage/mysql
[2018-03-14T10:01:38.973] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin

* After ~50 minutes, slurmdbd is started again, the conversion runs again, and this time it seems to work for all tables:

[2018-03-14T10:53:38.995] debug: Log file re-opened
...
[2018-03-14T11:06:10.762] Conversion done: success!

* Some time later, slurmdbd is started again and everything seems fine, but the old process was not killed first, which means a new slurmdbd started while the old one was still running:

[2018-03-14T11:22:22.815] debug4: as_mysql_convert_get_bad_tres: No conversion needed, Horray!
[2018-03-14T11:22:22.815] debug4: as_mysql_convert_tables_pre_create: No conversion needed, Horray!
[2018-03-14T11:22:22.889] debug4: as_mysql_convert_tables_post_create: No conversion needed, Horray!
[2018-03-14T11:22:22.904] debug4: as_mysql_convert_non_cluster_tables_post_create: No conversion needed, Horray!
[2018-03-14T11:22:22.905] Accounting storage MYSQL plugin loaded
...
[2018-03-14T11:22:22.905] Killing old slurmdbd[11144]
[2018-03-14T11:22:22.905] Terminate signal (SIGINT or SIGTERM) received

I then see more slurmdbd restarts up to this point without problems, but it is possible that these starts/stops in the middle of the inserts caused a rollback in MySQL at some point that reverted some tables without reverting the convert version table, which caused the errors when all the columns were queried in the rollup at 04:00.

So first, I would ask whether you are aware that these restarts could have happened, or whether they could have been caused by something like systemd. And second, if you can find the MySQL logs, they could confirm this scenario.

For now I would say that everything is fine, and it is fairly clear this happened because of the starts/stops during conversion. Re-run 'sacctmgr show runaway' and fix the wrong jobs, decrease the debug level, and keep going.

Does it make sense to you?
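If you do locate the MySQL error log (the path varies by distro; /var/log/mysql/error.log and /var/log/mysqld.log are common), a quick scan for InnoDB crash-recovery and rollback messages around the conversion window would confirm or rule this out. A sketch of that check, run here against fabricated sample lines rather than a real log:

```shell
# The log lines below are fabricated for illustration; point the grep at
# your real MySQL error log instead of this temp file.
cat > /tmp/sample_mysql_error.log <<'EOF'
2018-03-14T13:40:02 [Note] InnoDB: Starting crash recovery.
2018-03-14T13:40:05 [Note] InnoDB: Rolling back trx with id 123456
2018-03-14T13:41:09 [Note] InnoDB: Rollback of non-prepared transactions completed
2018-03-14T13:41:10 [Note] mysqld: ready for connections.
EOF

# Count lines that look like a recovery/rollback event (case-insensitive).
grep -icE 'rollback|rolling back|crash recovery' /tmp/sample_mysql_error.log
```

Any hits timestamped between the first conversion and the failing 04:00 rollup would support the rollback theory.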
Hi Felip

Thanks for your explanation. It does make sense.

1) Is there any danger in removing these runaway jobs?

2) How can I lower the slurmdbd debug level dynamically, like the debug level for slurmd, where I can use 'scontrol update' or 'scontrol reconfig'? Is there a 'sacctmgr update' or 'sacctmgr reconfig'?

Cheers
Damien
(In reply to Damien from comment #21)
> Hi Felip
>
> Thanks for your explanation.
>
> It does make sense.
>
> 1) Is there any danger to remove there runaway jobs ?

No, there's not. Fixing runaway jobs just sets the end_time and the state to completed. It does not remove anything.

> 2) To lower the slurmDBdebug level, How can I do this dynamically ? like
> the debug level for slurmd, which I can use the 'scontrol update' or
> scontrol reconfig. Is there a sacctmgr update or sacctmgr reconfig ?

Yes. To reconfigure, just modify slurmdbd.conf and send a SIGHUP signal:

#] kill -HUP <slurmdbd pid>

Note: the SIGHUP signal reconfigures the daemon; SIGUSR2 is the log rotate signal.

Tell me when you are fine with closing the bug, or if you have more questions.
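As a generic illustration of this SIGHUP-reconfigure pattern, here is a small stand-in script (not slurmdbd itself): the process traps SIGHUP and runs a handler instead of exiting, which is exactly what lets a daemon re-read its config on the fly.

```shell
#!/bin/sh
# Stand-in for a daemon's reconfigure handler: on SIGHUP, set a flag.
# A real daemon (like slurmdbd) would re-read its config file here.
got_hup=0
trap 'got_hup=1' HUP

# Simulate an operator running: kill -HUP <pid>
( sleep 0.2; kill -HUP $$ ) &
wait            # returns early when the trapped signal arrives
echo "got_hup=$got_hup"
```

For slurmdbd itself the equivalent is simply `kill -HUP <slurmdbd pid>` (or `kill -HUP $(pidof slurmdbd)` if pidof is available on your system).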
Hi Felip

Thanks. So far there are no further issues. Thanks for your help.

Please close this ticket.

Cheers
Damien
Thanks Damien,

Glad you have solved the situation. If you have any further questions, you know, here we are ;)

Closing it!