Hello, I made a backup copy of my database, installed slurmdbd 17.0.2.1, and ran: service slurmdbd start. I then started tailing the slurmdbd log in another window (because "service slurmdbd start" was blocking). After about 2 minutes systemd got anxious, decided the service had timed out (since slurmdbd doesn't daemonize until the database conversion has finished), and then sigterm'd slurmdbd. This left our job table half converted, which then led to conversion errors later. I restored the database and instead just ran /usr/sbin/slurmdbd directly (as I had during our practice run earlier this week), and that seems to be running fine. I'd suggest that it might be better for slurmdbd to daemonize more quickly so that systemd isn't tempted to time it out and sigterm it. Thanks, -Doug
Hey Doug, I can understand your frustration. Perhaps I can look into making the conversion process its own thread to keep systemd happy. In the meantime perhaps we can update the quickstart guide to note this. As this only happens once per major release, hopefully it isn't that painful. I will note that if we do spin this off onto its own thread and the process has an issue, slurmdbd will just die, but systemd will be happy I suppose :). I would have expected the conversion to be done inside a transaction, so it could be run many times and no harm would come from it not finishing, just wasted cycles. You note though that this wasn't the case for you. Could you elaborate on the "errors later" you had?
Hi Danny,
Following the failed conversion (by systemd death), I ran it manually:
corique01:/var/tmp/slurm # /usr/sbin/slurmdbd -Dvvv
slurmdbd: debug: Munge authentication plugin loaded
slurmdbd: debug2: mysql_connect() called for db cori_slurm_acct_db
slurmdbd: debug2: It appears the table conversions have already taken place, hooray!
slurmdbd: adding column admin_comment after account in table "cori_job_table"
slurmdbd: debug: Table "cori_job_table" has changed. Updating...
slurmdbd: error: mysql_query failed: 1060 Duplicate column name 'admin_comment'
alter table "cori_job_table" modify `job_db_inx` bigint unsigned not null auto_increment, modify `mod_time` bigint unsigned default 0 not null, modify `deleted` tinyint default 0 not null, modify `account` tinytext, add `admin_comment` text after account, modify `array_task_str` text, modify `array_max_tasks` int unsigned default 0 not null, modify `array_task_pending` int unsigned default 0 not null, modify `cpus_req` int unsigned not null, modify `derived_ec` int unsigned default 0 not null, modify `derived_es` text, modify `exit_code` int unsigned default 0 not null, modify `job_name` tinytext not null, modify `id_assoc` int unsigned not null, modify `id_array_job` int unsigned default 0 not null, modify `id_array_task` int unsigned default 0xfffffffe not null, modify `id_block` tinytext, modify `id_job` int unsigned not null, modify `id_qos` int unsigned default 0 not null, modify `id_resv` int unsigned not null, modify `id_wckey` int unsigned not null, modify `id_user` int unsigned not null, modify `id_group` int unsigned not null, modify `kill_requid` int default -1 not null, modify `mem_req` bigint unsigned default 0 not null, modify `nodelist` text, modify `nodes_alloc` int unsigned not null, modify `node_inx` text, modify `partition` tinytext not null, modify `priority` int unsigned not null, modify `state` int unsigned not null, modify `timelimit` int unsigned default 0 not null, modify `time_submit` bigint unsigned default 0 not null, modify `time_eligible` bigint unsigned default 0 not null, modify `time_start` bigint unsigned default 0 not null, modify `time_end` bigint unsigned default 0 not null, modify `time_suspended` bigint unsigned default 0 not null, modify `gres_req` text not null default '', modify `gres_alloc` text not null default '', modify `gres_used` text not null default '', modify `wckey` tinytext not null default '', modify `track_steps` tinyint not null, modify `tres_alloc` text not null default '', modify `tres_req` text not null default '', drop primary key, add primary key (job_db_inx), drop index id_job, add unique index (id_job, id_assoc, time_submit), drop key rollup, add key rollup (time_eligible, time_end), drop key rollup2, add key rollup2 (time_end, time_eligible), drop key nodes_alloc, add key nodes_alloc (nodes_alloc), drop key wckey, add key wckey (id_wckey), drop key qos, add key qos (id_qos), drop key association, add key association (id_assoc), drop key array_job, add key array_job (id_array_job), drop key reserv, add key reserv (id_resv), drop key sacct_def, add key sacct_def (id_user, time_start, time_end), drop key sacct_def2, add key sacct_def2 (id_user, time_end, time_eligible);
slurmdbd: Accounting storage MYSQL plugin failed
slurmdbd: error: Couldn't load specified plugin name for accounting_storage/mysql: Plugin init() callback failed
slurmdbd: error: cannot create accounting_storage context for accounting_storage/mysql
slurmdbd: fatal: Unable to initialize accounting_storage/mysql accounting storage plugin
Aborted (core dumped)
corique01:/var/tmp/slurm #
I have that core dump and once I get a bit more free can possibly send data from it. I restored the database from backup and then slurmdbd (manually run) worked fine.
-Doug
---- Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer, National Energy Research Scientific Computing Center <http://www.nersc.gov>, dmjacobsen@lbl.gov
On Fri, Mar 3, 2017 at 10:11 AM, <bugs@schedmd.com> wrote: > Danny Auble <da@schedmd.com> changed bug 3532 <https://bugs.schedmd.com/show_bug.cgi?id=3532>: Severity 4 - Minor Issue -> 5 - Enhancement; Assignee support@schedmd.com -> dev-unassigned@schedmd.com
Yeah, I can see how that happened outside a transaction. Good thing you had a backup :), though we could probably have worked around it if you hadn't. I'll see what we can do. The core is likely from the change Tim gave you that makes fatals abort() instead of exit(1), so you can just throw that one away ;). Good to know the patch works as expected.
Perhaps the timeout logic could just be disabled in the slurmdbd service file. From systemd.service:
"""
TimeoutStartSec= Configures the time to wait for start-up. If a daemon service does not signal start-up completion within the configured time, the service will be considered failed and will be shut down again. Takes a unit-less value in seconds, or a time span value such as "5min 20s". Pass "0" to disable the timeout logic. Defaults to DefaultTimeoutStartSec= from the manager configuration file, except when Type=oneshot is used, in which case the timeout is disabled by default (see systemd-system.conf(5)).
"""
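A minimal sketch of that suggestion as a systemd drop-in, rather than an edit to the packaged unit. It assumes the unit is named slurmdbd.service and that the drop-in lives at the conventional path /etc/systemd/system/slurmdbd.service.d/override.conf; neither is confirmed in this ticket. The value 0 follows the man page text quoted above; newer systemd releases document "infinity" for the same purpose.

```ini
# Assumed path: /etc/systemd/system/slurmdbd.service.d/override.conf
# (hypothetical file name; any *.conf in that directory is read)
[Service]
# Disable the start-up timeout so a long database conversion is not
# interrupted by systemd. "0" is the value named in the quoted man page.
TimeoutStartSec=0
```

After adding the file, "systemctl daemon-reload" is needed before the next "service slurmdbd start" for the override to take effect. This only stops systemd from killing the daemon during start-up; it does not make an interrupted conversion safe to resume.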
Or possibly allow slurmdbd to daemonize and fully start up from systemd's perspective, and then start the conversion. I guess that would break with the current convention. But if there are potentially long-running tasks that, if terminated early, might leave the database unusable on future boots, then giving systemd a mechanism to automatically terminate slurmdbd is probably a bad thing.
(In reply to Doug Jacobsen from comment #5)
> or possibly allow slurmdbd to daemonize and fully startup from systemd's
> perspective, and then start the conversion. I guess that would break with
> the current convention. But if there are potentially long running tasks
> that, if terminated early might cause the database to be unusable on future
> boots, it seems like allowing a mechanism for systemd to automatically
> terminate slurmdbd is probably a bad thing.
This was done ahead of the 18.08 release. Marking closed as a duplicate of an internal ticket that tracked that change.
*** This ticket has been marked as a duplicate of ticket 5247 ***