Ticket 6434 - Upgrade help
Summary: Upgrade help
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 17.11.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Ben Roberts
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-01-30 16:51 MST by Wei Feinstein
Modified: 2019-05-22 11:35 MDT

See Also:
Site: LBNL - Lawrence Berkeley National Laboratory
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm configuration file (14.54 KB, text/plain)
2019-01-30 16:51 MST, Wei Feinstein

Description Wei Feinstein 2019-01-30 16:51:57 MST
Created attachment 9053
slurm configuration file

We will be upgrading on Feb 12th and wanted to know which version you would recommend: 17.11.13 or 18.08.5?  Could you also review our slurm.conf file and let us know if you see any issues that we might encounter during the upgrade?  Can you help us identify parameters that are no longer used in the new version and what changes are recommended?

Thanks

Jackie
Comment 2 Ben Roberts 2019-01-31 10:56:16 MST
Hi Jackie,

I would recommend upgrading to 18.08.5 rather than 17.11.13.  We continue to provide updates for the latest two versions, so when version 19.05 is released the 17.11 branch will not receive any more updates.  By going to 18.08 you ensure that you won't be required to do a major upgrade just to get a bug fix.

There are a few things to be aware of that have changed from 17 to 18.  I've put together some of the relevant notes from the NEWS file:
 -- Remove support for "ChosLoc" configuration parameter.
 -- Remove AdminComment += syntax from 'scontrol update job'.
 -- Remove --immediate option from sbatch.
 -- Remove drain on node when reboot nextstate used.
 -- Completely remove "gres" field from step record. Use "tres_per_node",
    "tres_per_socket", etc.
 -- slurmd and slurmctld will now fatal if two incompatible mechanisms for
    enforcing memory limits are set. This makes incompatible the use of
    task/cgroup memory limit enforcing (Constrain[RAM|Swap]Space=yes) with
    MemLimitEnforce=yes or with JobAcctGatherParams=OverMemoryKill, which
    could cause problems when a task is killed by one of them while the other
    is at the same time managing that task. The NoOverMemoryKill setting has
    been deprecated in favour of OverMemoryKill, since now the default is
    *NOT* to have any memory enforcement mechanism.
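To make that last note concrete, the fragment below shows the kind of combination that 18.08 refuses to start with.  This is an illustrative sketch, not taken from your slurm.conf, and the surrounding parameter values are assumptions:

```
# slurm.conf (fragment) -- this combination is fatal at slurmd/slurmctld
# startup in 18.08: two memory-enforcement mechanisms enabled at once.
TaskPlugin=task/cgroup              # with ConstrainRAMSpace=yes in cgroup.conf
JobAcctGatherParams=OverMemoryKill  # second, incompatible enforcement mechanism
```

Pick one mechanism instead, e.g. keep the cgroup enforcement and remove JobAcctGatherParams=OverMemoryKill (and don't set MemLimitEnforce=yes).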


I reviewed your config file and don't see that you're using the ChosLoc parameter.  I'm not sure from the config file whether you currently use the other options mentioned.  

When doing the upgrade you would want to upgrade slurmdbd first.  You should make a backup of the database before beginning the upgrade in case something happens and you need to roll back.  You would also want to back up the job state files since the upgrade will make changes to them and the updated files won't be recognized if you need to roll back to the previous version.  

One other thing to be aware of is that the SlurmdTimeout and SlurmctldTimeout values could cause nodes to be marked DOWN if the upgrade takes longer than these parameters allow.  You can increase the timeouts for these daemons long enough to complete the upgrade to avoid this being an issue.  Here are the recommended steps listed in the upgrade guide:

1     Shutdown the slurmdbd daemon
2     Dump the Slurm database using mysqldump (in case of any possible failure) and increase innodb_buffer_pool_size in my.cnf to 128M
3     Upgrade the slurmdbd daemon
4     Restart the slurmdbd daemon
5     Increase configured SlurmdTimeout and SlurmctldTimeout values and execute "scontrol reconfig" for them to take effect
6     Shutdown the slurmctld daemon(s)
7     Shutdown the slurmd daemons on the compute nodes
8     Copy the contents of the configured StateSaveLocation directory (in case of any possible failure)
9     Upgrade the slurmctld and slurmd daemons
10    Restart the slurmd daemons on the compute nodes
11    Restart the slurmctld daemon(s)
12    Validate proper operation
13    Restore configured SlurmdTimeout and SlurmctldTimeout values and execute "scontrol reconfig" for them to take effect
14    Destroy backup copies of database and/or state files
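As a concrete sketch of the backup steps (2 and 8) above, something like the following could be run before the upgrade.  This is a hedged example: the paths and database name are assumptions and must be adapted to your site, and a throwaway directory stands in for StateSaveLocation so the commands are safe to try anywhere:

```shell
#!/bin/sh
# Stand-ins for site-specific locations (assumptions, not from your config)
STATE_SAVE_DIR=$(mktemp -d)   # would be your configured StateSaveLocation
BACKUP_DIR=$(mktemp -d)

# Simulate a couple of state files so the copy below has something to do
touch "$STATE_SAVE_DIR/job_state" "$STATE_SAVE_DIR/node_state"

# Step 2 (sketch): dump the accounting database before touching slurmdbd.
# Uncomment and adjust the database name/credentials on a real system:
# mysqldump --single-transaction slurm_acct_db > "$BACKUP_DIR/slurm_acct_db.sql"

# Step 8: copy the state save directory so a rollback remains possible
cp -a "$STATE_SAVE_DIR" "$BACKUP_DIR/statesave"

ls "$BACKUP_DIR/statesave"
```

On a real system you would of course stop slurmdbd (step 1) before the dump, and keep both backups until step 12 has validated the upgrade.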


For reference, you can find these steps and additional details of the things I mentioned here:

https://slurm.schedmd.com/quickstart_admin.html#upgrade

Let me know if you have additional questions about this.

Thanks,
Ben
Comment 3 Wei Feinstein 2019-01-31 19:21:35 MST
Awesome, thanks!

I will review this tomorrow. I do have a question about the gres change (see
also my bug #6225, which I can't get to work anyway). What does this
statement mean: "Completely remove "gres" field from step record. Use
"tres_per_node", "tres_per_socket", etc."? Can you give me an example of how
this will change?

Thanks Jackie

Comment 4 Ben Roberts 2019-02-01 09:18:16 MST
Hi Jackie,

Sure, here's a quick example of what that statement is referring to.  It's talking about the way scontrol reports the usage of the gres associated with a job step.  I'll submit a job that requests 2 of a generic resource called 'gpu':

$ salloc -A a2 -q blue -n2 --gres=gpu:2
salloc: Granted job allocation 1913


Inside of that job I'll use srun to initiate a job step:

$ srun sleep 300 &
[1] 6975


I can see that there is the primary job and a job step by looking at sacct:

$ sacct -j 1913
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1913               bash      debug         a2          2    RUNNING      0:0 
1913.0            sleep                    a2          2    RUNNING      0:0 


If I use scontrol to look at the details of the job step I can see that the last line shows "TresPerNode" and shows the value of the requested gres:

$ scontrol show step 1913.0
StepId=1913.0 UserId=1000 StartTime=2019-02-01T10:10:22 TimeLimit=UNLIMITED
   State=RUNNING Partition=debug NodeList=node01
   Nodes=1 CPUs=2 Tasks=2 Name=sleep Network=(null)
   TRES=cpu=2,gres/gpu=2,mem=0,node=1
   ResvPorts=(null) Checkpoint=0 CheckpointDir=/var/slurm/checkpoint
   CPUFreqReq=Default Dist=Block
   SrunHost:Pid=ben-XPS-15-9570:6975
   TresPerNode=gpu:2


Previously that value was reported as a "gres" field, so the note is just letting you know that if you have a script that relies on that field you need to adjust it accordingly.
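If a script was reading the old field, the updated field can be pulled out of the scontrol output in much the same way.  A minimal sketch, using the sample output above as canned input rather than a live scontrol call:

```shell
#!/bin/sh
# Canned 'scontrol show step' output from the example above; a real script
# would instead capture: scontrol show step 1913.0
step_output='StepId=1913.0 UserId=1000 StartTime=2019-02-01T10:10:22 TimeLimit=UNLIMITED
   State=RUNNING Partition=debug NodeList=node01
   TRES=cpu=2,gres/gpu=2,mem=0,node=1
   TresPerNode=gpu:2'

# Extract the value of TresPerNode (the field that replaces "gres" in 18.08)
tres_per_node=$(printf '%s\n' "$step_output" | sed -n 's/^ *TresPerNode=//p')
echo "$tres_per_node"
```

Running this prints `gpu:2`.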

Thanks,
Ben
Comment 5 Wei Feinstein 2019-02-01 11:45:51 MST
Thanks. That really does help.

Jackie

Comment 6 Ben Roberts 2019-02-05 08:22:34 MST
Hi Jackie,

It sounds like the information I sent has answered your questions.  I'll close this ticket, but feel free to update it if you have an additional question as you prepare for your upgrade.

Thanks,
Ben
Comment 7 Wei Feinstein 2019-02-05 08:25:19 MST
Yes it has. Thank you so very much.

Thanks

Jackie Scoggins

Comment 8 Wei Feinstein 2019-05-15 16:23:47 MDT
I am reopening this because we are planning to upgrade Slurm on May 21st. We are currently running 17.11.3-2, which I know is an old version.  What version do you recommend we upgrade to, and do you have any documentation to provide me for the upgrade process?

Thanks

Jackie
Comment 9 Jason Booth 2019-05-15 16:33:56 MDT
Hi Jackie,


>I am reopening this because we are planning to upgrade Slurm on May 21st. We are currently running 17.11.3-2 which I know is an old version.  What version do you recommend we upgrade to and do you have any documentation to provide me for the upgrade process. 


Slurm 17.11 is quickly approaching EOL; the good news, however, is that you can upgrade directly from that version to 18.08. We do suggest that you test the upgrade process on a test system so that you are prepared if anything should go wrong.


Our upgrade procedure can be found here:

https://slurm.schedmd.com/quickstart_admin.html#upgrade

It would also be good to know whether you have built Slurm with PMIx support and whether you have any SPANK plugins in use.
Comment 10 Ben Roberts 2019-05-20 11:19:36 MDT
Hi Jackie,

I just wanted to send a quick update since you said your plan was to upgrade tomorrow.  Let me know if you have any questions about specifics before your upgrade.  

Thanks,
Ben
Comment 11 Wei Feinstein 2019-05-20 12:32:26 MDT
Ok thanks. So far I think we're good.  As long as I have someone to
communicate with tomorrow in case there are issues Im fine.

Thanks

Jackie

Comment 12 Ben Roberts 2019-05-20 14:03:28 MDT
Sounds good.  I'll be watching for updates tomorrow and will respond promptly if anything comes up.

Thanks,
Ben
Comment 13 Wei Feinstein 2019-05-21 11:36:44 MDT
Hello,

Upgrading slurmdbd right now.  Just curious: how long should it take for
the step_table to transition from the 17.x schema to the 18.x schema?  I
also saw the following error as it was processing:

[2019-05-21T10:32:55.891] error: Could not execute statement 1206 The total number of locks exceeds the lock table size

Is this something to worry about?

Start time: began=09:40:46.215

Thanks

Jackie

Comment 14 Ben Roberts 2019-05-21 11:45:26 MDT
Hi Jackie,

The time to do the database migration depends on the size of the database and the hardware the migration is being performed on.  It's hard to put a number on it.  

Let me look into the message you posted.

Thanks,
Ben
Comment 15 Wei Feinstein 2019-05-21 11:58:50 MDT
The step_table completed its update. It is now working on the job_table.  I
will keep you posted if I see anything strange. I think we're ok for now.

Thanks for your help.

Jackie

Comment 16 Ben Roberts 2019-05-21 12:08:29 MDT
I'm glad to hear the step table completed; that and the job table are usually among the larger tables.

A similar message has shown up during other database migrations.  There was a report in Bug 5852, and it looks like you ran into it previously in Bug 4906.  That particular error doesn't prevent the migration from completing successfully, but it is related to the innodb_buffer_pool_size.  You can get into mysql and run the following query to see what the size is currently set to, in MB:
SELECT @@innodb_buffer_pool_size/1024/1024;

If you would like to increase the innodb_buffer_pool_size, the MySQL documentation has a good write up on how to do so: 
https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool-resize.html
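For a persistent change the setting lives in my.cnf; an illustrative fragment follows.  The 1G figure is only an example value, not a recommendation for your database; size it to the accounting database and the RAM available on the host:

```
# /etc/my.cnf (path varies by distro)
[mysqld]
innodb_buffer_pool_size = 1G    # example value; default is 128M
innodb_lock_wait_timeout = 900  # also commonly raised for slurmdbd
```

A restart of mysqld (or, on MySQL 5.7+, a dynamic SET GLOBAL followed by the my.cnf edit) is needed for the new pool size to take effect.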

In Bug 4906 it looks like you were on sl6 and upgraded to MySQL 5.7.  Are those still valid versions for your system?  

Thanks,
Ben
Comment 17 Wei Feinstein 2019-05-21 12:29:16 MDT
It was set to the value you recommended. Maybe different size databases
need a larger value:

+-------------------------------------+
| @@innodb_buffer_pool_size/1024/1024 |
+-------------------------------------+
|                        128.00000000 |
+-------------------------------------+


But everything completed and is looking good.  Slurmdbd and slurmctld have
been updated. Working on slurmd on the nodes now.


Thanks


Jackie

Comment 18 Wei Feinstein 2019-05-22 10:10:19 MDT
The upgrade is complete and it went very well.  You can close the ticket. I
will open a new one if there are any issues in the future.

Thanks for your help.

Jackie

Comment 19 Ben Roberts 2019-05-22 11:35:34 MDT
Thanks Jackie, I'm glad to hear the upgrade completed without further problems.  I'll proceed to close this ticket.