Ticket 4807

Summary: SlurmDBD Upgrade
Product: Slurm
Reporter: John Villa <jv2575>
Component: slurmdbd
Assignee: Tim Wickberg <tim>
Status: RESOLVED INFOGIVEN
Severity: 1 - System not usable
Version: 17.11.2
Hardware: Linux
OS: Linux
Site: Columbia University

Description John Villa 2018-02-19 12:15:34 MST
Hello,
We are running into an issue after upgrading slurmdbd. According to the logs, a table in our accounting database has changed and is being updated:

slurmdbd: debug:  Log file re-opened
slurmdbd: debug:  Munge authentication plugin loaded
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: pre-converting job table for habanero
slurmdbd: debug:  Table "habanero_job_table" has changed.  Updating...


This appears to be taking a rather long time. Here is the relevant row from the database process list:

*************************** 6. row ***************************
      Id: 53
    User: slurm
    Host: roll.cm.cluster:33872
      db: slurm_acct_db
 Command: Query
    Time: 2274
   State: copy to tmp table
    Info: alter table "habanero_job_table" modify `job_db_inx` bigint unsigned not null auto_increment, modify
Progress: 34.525

We have over 5 million records. Could you give us an estimate of when this might finish? Is there a way for us to skip this step? Any advice would be helpful, as we scheduled this downtime and did not foresee this. Please cc hpc-admin@columbia.edu on all correspondence here.

Thanks,
John Villa
Comment 3 Marshall Garey 2018-02-19 14:15:19 MST
We had one customer with a database of 3 million job records that took 45 minutes. Here at SchedMD, Danny's identical database took 15 minutes. So how long it takes depends largely on the hardware.
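
As a rough cross-check, the Time and Progress values in the process list row from the description can be extrapolated linearly. This is only a sketch: it assumes Progress is a percentage of the whole ALTER TABLE and that it advances at a constant rate, neither of which is guaranteed. The function name is made up for illustration.

```python
def estimate_remaining_seconds(elapsed_s: float, progress_pct: float) -> float:
    """Linearly extrapolate the time left from elapsed seconds and percent done.

    Assumption: progress advances at a constant rate, which a real
    ALTER TABLE may not do.
    """
    if not 0 < progress_pct <= 100:
        raise ValueError("progress_pct must be in (0, 100]")
    total_s = elapsed_s / (progress_pct / 100.0)
    return total_s - elapsed_s


# Values taken from the process list row in the description:
# Time: 2274, Progress: 34.525
remaining = estimate_remaining_seconds(2274, 34.525)
print(f"about {remaining / 60:.0f} minutes remaining")
```

With those numbers the estimate comes out at a bit over an hour of remaining work, so treat it as an order-of-magnitude figure rather than a prediction.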

You should be able to let the update run and it shouldn't affect running jobs or the job queue at all. Records won't be written to the database during the update, but they'll be queued up and written to the database once the update is complete.

You can't skip updating the database.

I'm unable to CC hpc-admin@columbia.edu: Bugzilla doesn't recognize that address, and I can't add it. Today is a holiday, which is why we didn't respond right away; I just happen to be working, but everyone else is out and we typically aren't around on holidays. Tomorrow I can ask internally whether that address can be added.

Can you let us know how long it took to update when the process is complete?

Did you shut down slurmctld during the update, or is it still running?
Comment 4 Marshall Garey 2018-02-19 14:34:14 MST
As an FYI: when slurmdbd takes a long time to convert or roll up, systemd sometimes kills it. Run slurmdbd in the foreground (using the slurmdbd -D flag) so that it runs separately from systemd, and run it at the info log level or higher so you can see this message:

info("Conversion done: success!");
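
One way to follow that advice is a small wrapper that starts slurmdbd in the foreground and flags the completion message when it appears. This is a minimal sketch, not an official tool: the helper names are hypothetical, it assumes `slurmdbd` is on the PATH and is run by a user allowed to start it, and it assumes `-D -v` produces at least info-level output (the level can also be set with DebugLevel in slurmdbd.conf).

```python
import subprocess
from typing import Iterable

# Text of the message quoted above; the prefix slurmdbd puts in front of it
# in the log output is not assumed here, only the message itself.
DONE_MARKER = "Conversion done: success!"


def watch_for_conversion(lines: Iterable[str]) -> bool:
    """Echo every log line and report whether the completion message was seen."""
    seen = False
    for line in lines:
        print(line, end="")
        if not seen and DONE_MARKER in line:
            seen = True
            print(">>> slurmdbd reported that the conversion finished")
    return seen


def run_slurmdbd_foreground() -> bool:
    """Start slurmdbd in the foreground and watch its output until it exits."""
    proc = subprocess.Popen(
        ["slurmdbd", "-D", "-v"],   # -D: stay in the foreground, -v: more verbose
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # merge both streams into one pipe
        text=True,
    )
    return watch_for_conversion(proc.stdout)
```

Calling `run_slurmdbd_foreground()` keeps echoing the log until slurmdbd exits, so the daemon's output pipe is always drained; once the conversion is done, stop it and start the service normally.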
Comment 5 John Villa 2018-02-19 14:44:05 MST
Hello,
Thank you for the update. The update finished. It took a couple of hours
but everything seems fine now in regards to slurmdbd. Please feel free to
close this bug.
Sincerely,
John Villa

Comment 6 Marshall Garey 2018-02-20 16:50:07 MST
We're glad everything is working fine. Closing as resolved/infogiven.
Comment 7 Marshall Garey 2018-02-20 17:00:00 MST
For adding people to the CC list: see bug 4748 comment 2 - Tim said,

"He'd need to setup an account within our Bugzilla instance; once that's done he can add himself as a CC, or you can do it as well."
Comment 8 John Villa 2018-02-21 10:46:12 MST
Hello,
We noticed the absence of the "sview" application under "bin" or "sbin"
when we compiled Slurm 17.11.2. Is there any reason this would be
missing? We have gtk2 and gtk3 installed. Please advise.
Thanks,
John Villa

Comment 9 Brian Christiansen 2018-02-21 11:28:37 MST
Hey John,

Will you open a new bug for your question? That way we can keep things separate. Also, FYI for the future, if you ever need to reopen a bug, be sure to change the status back to "Unconfirmed" from the website. This will allow the bug to show up on our lists and help prevent responses from being lost.

Thanks,
Brian
Comment 10 John Villa 2018-02-21 11:32:12 MST
Brian,
Can you create a new bug on my behalf?
Thank you,
John Villa

Comment 11 Brian Christiansen 2018-02-21 12:02:31 MST
I'd prefer that you do it. That way I'm not tagged as the reporter.