Ticket 8762

Summary: Increase timeout for slurmctld/slurmd communication.
Product: Slurm Reporter: Brad Viviano <viviano.brad>
Component: Configuration    Assignee: Gavin D. Howard <gavin>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 19.05.5   
Hardware: Linux   
OS: Linux   
Site: EPA Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: sdiag
slurmctld from primary controller
slurmctld log file from backup controller
slurm.conf from cluster.
job_submit plugin
slurmctld_prolog and slurmd_pro|epi logs.
Slurmctld from atmos-mgmt1
Slurmctld from atmos-mgmt2
Slurmd from one of the compute nodes.
Slurmctld off testing cluster, node luna-mgmt...
Slurmctld off testing cluster, node luna2...

Description Brad Viviano 2020-03-31 07:15:49 MDT
Today, during a failover event on our slurmctld manager nodes (which are also our cluster manager nodes), the primary and backup slurmctld were knocked offline longer than normal.  This caused some job failures on the compute nodes:

[2020-03-31T05:27:22.224] launch task 884848.235 request from UID:1862 GID:187 HOST:172.20.11.3 PORT:23246
[2020-03-31T05:27:22.529] [884848.235] set oom_score_adj of task2 (pid 10098) to 500
[2020-03-31T05:54:56.253] [884848.235] done with job
[2020-03-31T05:54:58.223] launch task 884848.236 request from UID:1862 GID:187 HOST:172.20.11.3 PORT:2256
[2020-03-31T05:54:58.559] [884848.236] set oom_score_adj of task2 (pid 12688) to 500
[2020-03-31T06:22:44.139] [884848.236] done with job
[2020-03-31T06:23:01.061] error: Unable to register: Unable to contact slurm controller (connect failure)

In the above, it caused the user job to fail:

srun: error: Unable to confirm allocation for job 884848: Unable to contact slurm controller (connect failure)

srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 884848


While this event isn't likely to happen very often, we do sometimes have problems with cluster manager failover, and it's not immediately apparent from the slurm.conf manual where to set the timeout for slurmctld/slurmd communication.  Where do I set that timeout, so we can hopefully avoid this problem in the future?

Thanks.
Comment 2 Gavin D. Howard 2020-03-31 10:35:53 MDT
Thank you for the report.

Can you send me your slurm.conf, the slurmctld logs (for both the primary and backup), and the specs of the nodes that both slurmctlds are on?
Comment 3 Gavin D. Howard 2020-03-31 10:36:52 MDT
Oh, if you could also send me the output of sdiag, I would appreciate it.
Comment 4 Brad Viviano 2020-03-31 10:39:37 MDT
Created attachment 13543 [details]
sdiag
Comment 5 Brad Viviano 2020-03-31 10:40:32 MDT
Created attachment 13544 [details]
slurmctld from primary controller
Comment 6 Brad Viviano 2020-03-31 10:40:56 MDT
Created attachment 13545 [details]
slurmctld log file from backup controller
Comment 7 Brad Viviano 2020-03-31 10:41:23 MDT
Created attachment 13546 [details]
slurm.conf from cluster.
Comment 8 Brad Viviano 2020-03-31 10:45:55 MDT
> the specs of the nodes that both slurmctld's are on?

Both systems are identical

Dell R730
- RHEL 7.7, latest updates
- 2 x Intel 5-2690 v4 @ 2.60GHz (14 core)
- 8 x 32GB ECC Memory
- 2 x 10Gb Ethernet LAG connected to core switch stack (Dell Force10 S4048) used for Slurm and PXE boot of compute nodes
- 2 x 100Gb Infiniband
Comment 9 Gavin D. Howard 2020-03-31 15:15:17 MDT
After looking at your timeout values, it isn't clear where the problem is; a timeout of 300 seconds on both should be fine.

Can you send me a couple of slurmd logs with the failures?
Comment 10 Brad Viviano 2020-04-01 04:46:20 MDT
What I put in the opening notes is from /var/log/slurmd on the compute node, and it was all that was in the log pertaining to the issue.

The node in question was running a job array, and it was running those job steps fine:

> [2020-03-31T05:27:22.224] launch task 884848.235 request from UID:1862 GID:187 HOST:172.20.11.3 PORT:23246
> [2020-03-31T05:27:22.529] [884848.235] set oom_score_adj of task2 (pid 10098) to 500
> [2020-03-31T05:54:56.253] [884848.235] done with job
> [2020-03-31T05:54:58.223] launch task 884848.236 request from UID:1862 GID:187 HOST:172.20.11.3 PORT:2256
> [2020-03-31T05:54:58.559] [884848.236] set oom_score_adj of task2 (pid 12688) to 500
> [2020-03-31T06:22:44.139] [884848.236] done with job

But at around 6:20am I performed the failover of my cluster manager nodes, and slurmctld had some issues.  It was down for longer than normal for a failover, which caused the error below:

> [2020-03-31T06:23:01.061] error: Unable to register: Unable to contact slurm controller (connect failure)

Other jobs on the cluster continued to run without issue because their slurmd didn't need to check in with slurmctld for their next job array step at the time the slurmctld failover was happening.

I'd really like to know if there is a way for me to control the timeout on the slurmctld/slurmd communication side, from slurmd's viewpoint.

The man page says the SlurmdTimeout value is a setting for slurmctld:

> The interval, in seconds, that the Slurm controller waits for 
> slurmd to respond before configuring that node's state to DOWN.

I can't find anything in the slurm.conf man page that says how long slurmd waits to get a response from slurmctld if it's slurmctld that's down, before giving the "Unable to register" error.

What setting is that?

Thanks.
Comment 11 Gavin D. Howard 2020-04-01 12:14:32 MDT
I am sorry. I did not quite understand what you needed. I think I may now.

The relevant setting is `MessageTimeout`. However, before we change it from the default of 10 seconds, which is a good default, let's try to figure out whether something else might be causing the issue.
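For reference, this is a one-line change in slurm.conf (the value shown is illustrative, not a recommendation yet):

```
# slurm.conf
# MessageTimeout: seconds to wait for a Slurm message response
# before declaring a communication failure (default: 10).
MessageTimeout=60
```

It should be the same in the slurm.conf on every node, and the daemons need a reconfigure or restart to pick it up.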

Looking at your backup controller logs carefully, I realized that it had been terminated before the failover. That might have contributed or been the cause.

Also, your prologs, epilogs, and the group_account job submit plugin could have contributed, because the root user is sending a lot of RPCs, which makes me suspect them.

Can you send me the contents/source code of your prologs (`/usr/local/slurm/var/slurmctld_prolog`, `/usr/local/slurm/var/slurmd_prolog`), your epilogs (`/usr/local/slurm/var/slurmd_epilog`), and your group_account job submit plugin?
Comment 12 Brad Viviano 2020-04-01 13:32:57 MDT
> Looking at your backup controller logs carefully, I realized that it 
> had been terminated before the failover. That might have contributed
> or been the cause.

Yes, we're using an HA solution (similar to clusterlabs) that manages failover between our two cluster management nodes.  We only run 1 slurmctld at a time and rely on the HA software to start slurmctld on whichever node is the master.

So in a failover event where we switch the master from mgmt1->mgmt2 or mgmt2->mgmt1, the HA software stops slurmctld on the active master, switches the master node, and then starts slurmctld back up.  It also does other things (updating MariaDB master/slave settings, moving a shared IP address, etc.).

We set it up this way because of the takeover delay when switching from primary to backup.  If we run two slurmctlds at the same time, with primary and backup on our two management nodes, then when we patch/reboot the primary server we experience the SlurmctldTimeout delay until the backup figures out what's going on, unless I remember to go over to the backup slurmctld and issue "scontrol takeover".

Using only one slurmctld instance inside our HA solution, with only one controller running, we can easily have the HA software:

Stop slurmctld on Primary
Start slurmctld on Backup
Issue "scontrol takeover" on Backup

This has the benefit that, when HA failover works as expected (3-5 seconds), there is minimal or no impact to users running s* commands.

The issue, of course, is if we experience a problem during HA failover, which is what happened this last time.  The failover failed because of an issue external to Slurm; slurmctld was stopped on the master, but because the failover didn't succeed, it never started back up on the new master.

So if I increase the timeout on slurmd -> slurmctld communication, I think that will resolve the issue.

For completeness: I originally set this up with Slurm 16, so if there is something new in Slurm since then that better handles the "scontrol takeover" delay, I can revisit this setup.

I saw, for example, that 18.08 added SlurmctldHost and SlurmctldPrimaryOnProg, but that information isn't well detailed anywhere, and I don't know whether it would solve the delay-on-primary-shutdown issue that caused us to set things up like we did.

In an ideal world I would have slurmctld running on both management nodes and there would be a command that would let me appoint one slurmctld as the primary until I tell it otherwise.  That way I could have my HA solution simply mark the HA master as the Slurm Primary controller and have it stay that way until I change the HA master.

I don't see anything in the manuals that would allow that (short of modifying slurm.conf each time I do an HA failover and restarting both slurmctlds).  The manuals seem quite clear that Slurm will always prefer the primary server if it's online, and only use the backup if the primary can't be reached within the timeout.

> Can you send me the contents/source code of your prologs
> (`/usr/local/slurm/var/slurmctld_prolog`,
> `/usr/local/slurm/var/slurmd_prolog`), your epilogs
> (`/usr/local/slurm/var/slurmd_epilog`), and your group_account job submit plugin?

I can, but we don't do anything RPC-related in them.  The slurmctld_prolog is empty right now; the slurmd_* scripts just do some sanity checks on the node at job start/stop time.  The group_account plugin just makes sure the user has specified an --account option matching their group membership, so our accounting is accurate.


Thanks.
Comment 13 Brad Viviano 2020-04-01 13:33:33 MDT
Created attachment 13560 [details]
job_submit plugin
Comment 14 Brad Viviano 2020-04-01 13:34:06 MDT
Created attachment 13561 [details]
slurmctld_prolog and slurmd_pro|epi logs.
Comment 15 Gavin D. Howard 2020-04-01 15:23:02 MDT
In that case, it might be best to just increase MessageTimeout to 60, or maybe even as high as 100.
Comment 16 Brad Viviano 2020-04-02 05:36:49 MDT
Thanks, I've increased the MessageTimeout to 100 and restarted the services.

If you can point me at a good whitepaper, KB article, or document that gives better direction on using the new SlurmctldHost directives and best practices for failover with a multi-slurmctld setup, I'd be happy to review it and see if there is a better way to deploy at my site.

I just couldn't find anything on best practices for multi-slurmctld setup and got stung in the following scenario:

ControlMachine=atmos-mgmt1
ControlAddr=atmos-mgmt1
BackupController=atmos-mgmt2
BackupAddr=atmos-mgmt2

- Need to patch/reboot atmos-mgmt1
- Failed HA over to atmos-mgmt2
- Applied firmware updates to atmos-mgmt1 and rebooted
- Slurmctld hung for the timeout period (or until I remember to "scontrol takeover" on BackupController)
- Once atmos-mgmt1 comes up, slurmctld starts and immediately it becomes the Primary again
- Install OS updates on atmos-mgmt1 and reboot
- Slurmctld hangs for the timeout period (or until I remember to "scontrol takeover" on the BackupController)

Because of the complexity of the management nodes, it's not uncommon that during a maintenance window I need to do 3 or 4 reboots of the node (between the updates for firmware, RHEL7, GPFS, etc.).  Each time atmos-mgmt1 came back up from a reboot and slurmctld started, it immediately took over as the primary controller.  Then on the next reboot I'd have to go through the timeout again, or go over to atmos-mgmt2 and do "scontrol takeover".

It's doable on my part to issue "scontrol takeover" on the BackupController with each reboot, or to simply "systemctl disable slurmctld.service" on atmos-mgmt1 during maintenance.  But I wanted an automatic, don't-have-to-think-or-remember solution, because I'm trying to focus on the management node I'm doing maintenance on.

That's why I came up with what I did.  I have the HA service start/stop slurmctld based on which node is the HA master and use systemd's "ExecStartPost" directive to run "scontrol takeover":

[root@atmos-mgmt2 ~]# cat /etc/systemd/system/slurmctld.service.d/takeover.conf 
[Service]
ExecStartPost=/usr/sbin/slurmctld_takeover

[root@atmos-mgmt2 ~]# cat /usr/sbin/slurmctld_takeover 
#!/bin/bash

/bin/sleep 3 # Make sure slurmctld is fully up and listening
/usr/local/bin/scontrol takeover

exit 0

This works perfectly as long as HA doesn't have an issue on failover :).

Ideally, there would be a scontrol/dynamic way of telling Slurm 

> atmos-mgmt2 is now the ControlMachine and atmos-mgmt1 is now the BackupController
>  stay that way until I tell you differently

That way I could easily assign atmos-mgmt1 or atmos-mgmt2 as the ControlMachine as part of my HA failover without having to stop and start slurmctld like I do now.

Is there anything like that available with scontrol?

Thanks.
Comment 17 Gavin D. Howard 2020-04-02 12:32:46 MDT
There is no way for scontrol to do what you want, but there is a way to modify your ExecStartPost script and slurm.conf to do it.

First, in your slurm.conf, remove these lines:

> ControlMachine=atmos-mgmt1
> ControlAddr=atmos-mgmt1
> BackupController=atmos-mgmt2
> BackupAddr=atmos-mgmt2

And put the following line instead:

> include /usr/local/etc/slurmctlds.conf

(The path may be whatever you wish.)

Then, create `/usr/local/etc/slurmctlds.conf` (or whatever you chose) and put the following contents in it:

> SlurmctldHost=atmos-mgmt1
> SlurmctldHost=atmos-mgmt2

`SlurmctldHost` is the replacement for `ControlMachine`, `ControlAddr`, `BackupController`, and `BackupAddr`. The order the entries appear in determines which host is the primary and which is the backup.

I am having you put them in a separate file because that file will be edited on failover, in the ExecStartPost script.

This is what the script should look like:

> #!/bin/bash
> 
> /bin/sleep 3 # Make sure slurmctld is fully up and listening
> /usr/local/bin/scontrol takeover
> 
> # As you probably know, scontrol takeover will terminate the primary, so we
> # can assume it is down at this point.
> # Edit the /usr/local/etc/slurmctlds.conf to switch the backup to primary and
> # the primary to backup.
> backup=$(head -n 1 /usr/local/etc/slurmctlds.conf)   # old primary -> new backup
> primary=$(sed -n 2p /usr/local/etc/slurmctlds.conf)  # old backup  -> new primary
> printf '%s\n%s\n' "$primary" "$backup" > /usr/local/etc/slurmctlds.conf
> # Reload the new config
> /usr/local/bin/scontrol reconfig
> exit 0
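The swap logic can be exercised in isolation on a throwaway copy of the file; this is a sketch of what the script does to the real config (the file name here is a stand-in):

```shell
#!/bin/bash
# Demonstrate the two-line swap on a scratch copy of slurmctlds.conf.
conf=./slurmctlds.conf.demo
printf 'SlurmctldHost=atmos-mgmt1\nSlurmctldHost=atmos-mgmt2\n' > "$conf"

backup=$(head -n 1 "$conf")   # line 1: old primary, which becomes the backup
primary=$(sed -n 2p "$conf")  # line 2: old backup, which becomes the primary
printf '%s\n%s\n' "$primary" "$backup" > "$conf"

cat "$conf"
```

Running it prints the two entries in the new order, atmos-mgmt2 first.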

After that script runs, you should be fine to restart the new backup (old primary), and it won't take over because it will not be first in the list.

Yes, this is a hack, and if it weren't for the "include" ability in slurm.conf, I couldn't recommend this. But I think it might work well for you. No promises, though.
Comment 18 Brad Viviano 2020-04-03 05:34:08 MDT
This is an interesting angle on the problem I hadn't considered.

I think the best way to handle it would be:

1) Modify slurm.conf to include the /usr/local/etc/slurmctlds.conf as suggested
2) Enable slurmctld.service on both cluster manager nodes, outside of HA, so they are always running on both
3) Create a "slurmctld_update_controller.service" that is a "oneshot" service that does the following:

- Rebuild /usr/local/etc/slurmctlds.conf based on which node it runs on
  -- atmos-mgmt1
SlurmctldHost=atmos-mgmt1
SlurmctldHost=atmos-mgmt2

  -- atmos-mgmt2
SlurmctldHost=atmos-mgmt2
SlurmctldHost=atmos-mgmt1

- Run /usr/local/bin/scontrol reconfig

4) Put slurmctld_update_controller.service into the HA setup in place of slurmctld.service

This way, slurmctld is always running on both nodes, but the HA service just updates /usr/local/etc/slurmctlds.conf on whichever node is made the master.  If neither is made the master (i.e., HA fails, like it did this week), then nothing happens to slurmctld; the old master stays in place until I resolve the HA problem.
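As a sketch of the rebuild step, assuming the conf path and hostnames from above (in production the node would be detected with $(hostname -s); it's hard-coded here for illustration):

```shell
#!/bin/bash
# Rebuild slurmctlds.conf so the node this runs on is listed first
# (primary) and its peer second (backup). Paths/names are examples.
conf=./slurmctlds.conf            # production: /usr/local/etc/slurmctlds.conf
self=atmos-mgmt2                  # production: $(hostname -s)
peer=atmos-mgmt1

printf 'SlurmctldHost=%s\nSlurmctldHost=%s\n' "$self" "$peer" > "$conf"

# Then tell the running controllers to re-read the config:
# /usr/local/bin/scontrol reconfig
cat "$conf"
```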


Before I implement this into our HA setup, I'd like to test it manually.  Is there a way to query Slurmctld as to which host is the master, so I can verify as I issue:

/bin/systemctl start slurmctld_update_controller.service

on atmos-mgmt1 and atmos-mgmt2, that the update and scontrol reconfig worked as expected?

Thanks.
Comment 19 Brad Viviano 2020-04-03 09:03:43 MDT
Ok,
   That didn't work.  I started with

SlurmctldHost=atmos-mgmt1
SlurmctldHost=atmos-mgmt2

Verified with "scontrol ping"

[root@atmos-mgmt1 ~]# scontrol ping
Slurmctld(primary) at atmos-mgmt1 is UP
Slurmctld(backup) at atmos-mgmt2 is UP

Then changed to

SlurmctldHost=atmos-mgmt2
SlurmctldHost=atmos-mgmt1

and issued "scontrol reconfig" and ALL hell broke loose.

While the ping looked correct:

[root@atmos-mgmt1 ~]# scontrol ping
Slurmctld(primary) at atmos-mgmt2 is UP
Slurmctld(backup) at atmos-mgmt1 is UP

slurmctld on atmos-mgmt2 started getting RPC errors:

[2020-04-03T10:41:54.641] slurmctld version 19.05.5 started on cluster atmos
[2020-04-03T10:41:54.646] [job_submit/group_account] Submit Plugin Loaded.
[2020-04-03T10:41:54.648] slurmctld running in background mode
[2020-04-03T10:42:58.510] error: Invalid RPC received 1002 while in standby mode
[2020-04-03T10:42:58.510] error: Invalid RPC received 1002 while in standby mode
[2020-04-03T10:42:58.520] error: Invalid RPC received 1002 while in standby mode
[2020-04-03T10:42:58.521] error: Invalid RPC received 1002 while in standby mode
[2020-04-03T10:42:58.523] error: Invalid RPC received 1002 while in standby mode
[2020-04-03T10:42:58.523] error: Invalid RPC received 1002 while in standby mode


The compute nodes FREAKED OUT and we lost a few jobs on the cluster.  From slurmd on one of the nodes that lost a job:

[2020-04-03T10:43:19.833] [885867.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2020-04-03T10:43:19.838] [885867.batch] done with job
[2020-04-03T10:43:43.582] error: Error responding to health check: Transport endpoint is not connected
[2020-04-03T10:43:43.583] launch task 885867.173 request from UID:1854 GID:187 HOST:172.20.12.35 PORT:6372
[2020-04-03T10:43:43.586] error: Invalid job credential from 1854@172.20.12.35: Job credential expired
[2020-04-03T10:43:43.586] error: stepd_connect to 885867.173 failed: No such file or directory
[2020-04-03T10:43:43.586] error: stepd_connect to 885867.173 failed: No such file or directory
[2020-04-03T10:43:43.586] error: stepd_connect to 885867.173 failed: No such file or directory
[2020-04-03T10:43:43.586] error: stepd_connect to 885867.173 failed: No such file or directory
[2020-04-03T10:43:43.586] error: stepd_connect to 885867.173 failed: No such file or directory
[2020-04-03T10:43:43.587] error: _rpc_launch_tasks: unable to send return code to address:port=172.20.12.35:58392 msg_type=6001: Transport endpoint is not connected


And atmos-mgmt1 thought it was still in charge.

I am attaching slurmctld from atmos-mgmt1/2 and slurmd from one of the compute nodes that freaked.

I think the issue is, from "man slurm.conf"

> Slurm daemons should be shutdown and restarted if any of these
> parameters are to be changed: AuthType, BackupAddr, BackupController,
> ControlAddr, ControlMachine, PluginDir, StateSaveLocation, SlurmctldPort
> or SlurmdPort.

So, while "SlurmctldHost" is not explicitly listed in the man page, it probably groups into the same level as "ControlAddr" and "BackupAddr".
Comment 20 Brad Viviano 2020-04-03 09:04:55 MDT
Created attachment 13592 [details]
Slurmctld from atmos-mgmt1
Comment 21 Brad Viviano 2020-04-03 09:05:15 MDT
Created attachment 13593 [details]
Slurmctld from atmos-mgmt2
Comment 22 Brad Viviano 2020-04-03 09:05:35 MDT
Created attachment 13594 [details]
Slurmd from one of the compute nodes.
Comment 23 Gavin D. Howard 2020-04-03 09:35:14 MDT
Did you do an `scontrol takeover` before switching the SlurmctldHost entries? I believe you will need to do that. And since that command takes down the old primary, you will need to bring it back up after the scontrol reconfig.
Comment 24 Brad Viviano 2020-04-03 10:20:06 MDT
We have a small test cluster that we use to test upgrades to Slurm and other OS components, before putting them onto the main system.

It doesn't have a backup Slurmctld setup currently.

To prevent additional issues on my production cluster, let me go ahead and get a backup controller configuration set up on my test cluster; then I can test the failover procedure without worrying about causing another user impact.

I'll get back to you on Monday or Tuesday once I've got that setup and done some testing.

But, please confirm the following workflow as what I need to do/test:

Assuming I start with this in /usr/local/slurm/etc/slurmctlds.conf:

SlurmctldHost=atmos-mgmt1
SlurmctldHost=atmos-mgmt2

1) On atmos-mgmt2 I would "scontrol takeover"
2) On atmos-mgmt2 I would update /usr/local/slurm/etc/slurmctlds.conf to:

SlurmctldHost=atmos-mgmt2
SlurmctldHost=atmos-mgmt1

3) On atmos-mgmt2 I would "scontrol reconfigure"
4) On atmos-mgmt1 I would "/bin/systemctl start slurmctld"


/usr/local/* is on a shared filesystem across the cluster, so updating it on any one node will update it on all of them.

Thanks.
Comment 25 Gavin D. Howard 2020-04-03 11:27:37 MDT
Yes, that is the procedure I was thinking about.
Comment 26 Brad Viviano 2020-04-03 11:50:59 MDT
Ok,
   So that doesn't work.  I set up the backup controller on my test cluster and tried what you suggested, and Slurm just falls apart.

[root@luna-mgmt ~]# cat /usr/local/slurm/etc/slurmctld.conf 
SlurmctldHost=luna-mgmt
SlurmctldHost=luna2

[root@luna-mgmt ~]# cat make_luna-mgmt_active.sh 
#!/bin/bash

/usr/local/bin/scontrol takeover

/bin/cat > /usr/local/slurm/etc/slurmctld.conf << EOF
SlurmctldHost=luna-mgmt
SlurmctldHost=luna2
EOF

/usr/local/bin/scontrol reconfigure

exit 0

[root@luna2 ~]# cat make_luna2_active.sh 
#!/bin/bash
/usr/local/bin/scontrol takeover

/bin/cat > /usr/local/slurm/etc/slurmctld.conf << EOF
SlurmctldHost=luna2
SlurmctldHost=luna-mgmt
EOF

/usr/local/bin/scontrol reconfigure

exit 0


I am attaching the slurmctld from luna-mgmt and luna2.

The system was so confused that NONE of the s* commands would do anything other than time out:

> error: Unable to contact slurm controller (connect failure)

Even though slurmctld was running on both nodes.

I think the man page is right: if you want to change the primary and backup controllers, you HAVE to restart the Slurm daemons, all of them.
Comment 27 Brad Viviano 2020-04-03 11:51:27 MDT
Created attachment 13598 [details]
Slurmctld off testing cluster, node luna-mgmt...
Comment 28 Brad Viviano 2020-04-03 11:51:47 MDT
Created attachment 13599 [details]
Slurmctld off testing cluster, node luna2...
Comment 29 Gavin D. Howard 2020-04-03 12:24:50 MDT
In that case, you were right: the documentation is incomplete.  I will get on fixing that.

You will need to shut down and restart both controllers in order to do the kind of failover that you want, unfortunately.
Comment 30 Gavin D. Howard 2020-04-03 12:29:14 MDT
One thing I forgot to say is that the logs indicate that slurmd on the compute nodes would have to be restarted as well for this procedure, which I was not aware of before.  So I don't know how useful this will be.
Comment 31 Brad Viviano 2020-04-06 09:26:01 MDT
> the logs indicate that compute nodes will have to be restarted as well 
> for this procedure, which I was not aware of before. So I don't know how
> useful this will be.

Yeah, restarting slurmd on all our compute nodes on HA failover isn't going to be workable.  We'll stick with what we're currently doing and hope that setting MessageTimeout to a higher value is enough to resolve the problem when we have an HA issue.

Thanks.
Comment 32 Gavin D. Howard 2020-04-06 10:08:38 MDT
You're welcome, and I apologize.

Closing.