Today, during a failover event on our slurmctld manager nodes (which are also our cluster manager nodes), the slurmctld primary and backup were knocked offline longer than normal. This caused some job failures on the compute nodes:

[2020-03-31T05:27:22.224] launch task 884848.235 request from UID:1862 GID:187 HOST:172.20.11.3 PORT:23246
[2020-03-31T05:27:22.529] [884848.235] set oom_score_adj of task2 (pid 10098) to 500
[2020-03-31T05:54:56.253] [884848.235] done with job
[2020-03-31T05:54:58.223] launch task 884848.236 request from UID:1862 GID:187 HOST:172.20.11.3 PORT:2256
[2020-03-31T05:54:58.559] [884848.236] set oom_score_adj of task2 (pid 12688) to 500
[2020-03-31T06:22:44.139] [884848.236] done with job
[2020-03-31T06:23:01.061] error: Unable to register: Unable to contact slurm controller (connect failure)

In turn, the user's job failed:

srun: error: Unable to confirm allocation for job 884848: Unable to contact slurm controller (connect failure)
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 884848

While this event isn't likely to happen very often, we sometimes have problems with cluster manager failover, and it's not immediately apparent from the slurm.conf manual where to set the timeout for slurmctld/slurmd communications so we can hopefully avoid this problem in the future. Thanks.
Thank you for the report. Can you send me your slurm.conf, the slurmctld logs from both the primary and backup, and the specs of the nodes that both slurmctld's run on?
Oh, if you could also send me the output of sdiag, I would appreciate it.
Created attachment 13543 [details] sdiag
Created attachment 13544 [details] slurmctld from primary controller
Created attachment 13545 [details] slurmctld log file from backup controller
Created attachment 13546 [details] slurm.conf from cluster.
> the specs of the nodes that both slurmctld's are on?

Both systems are identical Dell R730s:
- RHEL 7.7, latest updates
- 2 x Intel E5-2690 v4 @ 2.60GHz (14 core)
- 8 x 32GB ECC memory
- 2 x 10Gb Ethernet LAG connected to the core switch stack (Dell Force10 S4048), used for Slurm and PXE boot of compute nodes
- 2 x 100Gb InfiniBand
After looking at your timeout values, it isn't clear where the problem is, because a timeout of 300 on both should be fine. Can you send me a couple of slurmd logs showing the failures?
What I put in the opening notes is from /var/log/slurmd on the compute node, and that was all there was in the log pertaining to the issue. The node in question was running a job array and had been running its job steps fine:

> [2020-03-31T05:27:22.224] launch task 884848.235 request from UID:1862 GID:187 HOST:172.20.11.3 PORT:23246
> [2020-03-31T05:27:22.529] [884848.235] set oom_score_adj of task2 (pid 10098) to 500
> [2020-03-31T05:54:56.253] [884848.235] done with job
> [2020-03-31T05:54:58.223] launch task 884848.236 request from UID:1862 GID:187 HOST:172.20.11.3 PORT:2256
> [2020-03-31T05:54:58.559] [884848.236] set oom_score_adj of task2 (pid 12688) to 500
> [2020-03-31T06:22:44.139] [884848.236] done with job

But at around 6:20am I performed the failover of my cluster manager nodes and slurmctld had some issues. It was down for longer than normal on a failover, and this caused the error below:

> [2020-03-31T06:23:01.061] error: Unable to register: Unable to contact slurm controller (connect failure)

Other jobs on the cluster continued to run without issue, because their slurmd didn't need to check in with slurmctld for their next job array step while the slurmctld failover was happening.

I'd really like to know if there is a way for me to control the timeout on the slurmctld/slurmd communication side, from slurmd's viewpoint. The man page says the SlurmdTimeout value is a setting for slurmctld:

> The interval, in seconds, that the Slurm controller waits for
> slurmd to respond before configuring that node's state to DOWN.

I can't find anything in the slurm.conf man page that says how long slurmd waits for a response from slurmctld, when it's slurmctld that's down, before giving the "Unable to register" error. What setting is that? Thanks.
I am sorry, I did not quite understand what you needed; I think I do now. The relevant setting is `MessageTimeout`.

However, before we change that from the default (10 seconds), which is a good default, let's try to figure out whether something else might be causing the issue. Looking at your backup controller logs carefully, I realized that it had been terminated before the failover. That might have contributed or been the cause. Your prologs, epilogs, and the group_account job submit plugin could also have contributed, because the root user is sending a lot of RPCs, which makes me suspect them.

Can you send me the contents/source code of your prologs (`/usr/local/slurm/var/slurmctld_prolog`, `/usr/local/slurm/var/slurmd_prolog`), your epilogs (`/usr/local/slurm/var/slurmd_epilog`), and your group_account job submit plugin?
> Looking at your backup controller logs carefully, I realized that it
> had been terminated before the failover. That might have contributed
> or been the cause.

Yes, we're using an HA solution (similar to ClusterLabs) that manages failover between our two cluster management nodes. We only run one slurmctld at a time and rely on the HA software to start slurmctld on whichever node is the master. So in a failover event where we switch the master from mgmt1->mgmt2 or mgmt2->mgmt1, the HA software stops slurmctld on the active master, switches the master node, and then starts slurmctld back up. It does other things too (like updating MariaDB master/slave settings, moving a shared IP address, etc.).

We set it up this way because of the takeover delay when switching from Primary to Backup. If we try to run two slurmctld's at the same time, with Primary and Backup on our two management nodes, then when we patch/reboot the primary server, unless I remember to go over to the backup slurmctld and issue "scontrol takeover", we experience the SlurmctldTimeout delay until the backup figures out what's going on. Using only one slurmctld instance inside our HA solution, with only one controller running, we can easily have the HA software:

- Stop slurmctld on the Primary
- Start slurmctld on the Backup
- Issue "scontrol takeover" on the Backup

This has the benefit that, when HA failover works as expected (3-5 seconds), there is minimal to no impact on users running s* commands. The issue, of course, is if we experience a problem during HA failover, which is what happened this last time. The failover failed because of an issue external to Slurm, which caused slurmctld to stop on the master; but because the failover didn't succeed, it didn't start back up on the new master.

So if I increase the timeout on slurmd -> slurmctld communication, I think that will resolve the issue. For completeness:
I originally set this up with Slurm 16, so if there is something new in Slurm since 16 that will better handle the "scontrol takeover" delay, I can revisit this setup. I saw, for example, that 18.08 added SlurmctldHost and SlurmctldPrimaryOnProg, but that information isn't well detailed anywhere, and I don't know if it would solve the delay-on-primary-shutdown issue that caused us to set things up the way we did.

In an ideal world, I would have slurmctld running on both management nodes and there would be a command that would let me appoint one slurmctld as the primary until I tell it otherwise. That way I could have my HA solution simply mark the HA master as the Slurm primary controller and have it stay that way until I change the HA master. I don't see anything in the manuals that would allow that (short of modifying slurm.conf each time I make the HA failover and restarting both slurmctld's). The manuals seem quite clear that it will always prefer the Primary server if it's online and only use the Backup if it can't reach the Primary after the delay is reached.

> Can you send me the contents/source code of your prologs
> (`/usr/local/slurm/var/slurmctld_prolog`,
> `/usr/local/slurm/var/slurmd_prolog`), your epilogs
> (`/usr/local/slurm/var/slurmd_epilog`), and your group_account job submit plugin?

I can, but we don't do anything RPC related in them. The slurmctld_prolog is empty right now, and the slurmd_* scripts just do some sanity checks on the node at start/stop time. The group_account plugin just makes sure the user has specified an --account option equal to their group membership so our accounting is accurate.

Thanks.
Created attachment 13560 [details] job_submit plugin
Created attachment 13561 [details] slurmctld_prolog and slurmd prolog/epilog scripts
In that case, it might be best to just increase MessageTimeout to 60, or maybe even as high as 100.
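For reference, this is a one-line change in slurm.conf; a minimal sketch of the relevant fragment (the comment text is mine, not from the man page):

```
# slurm.conf (fragment)
# MessageTimeout: how long, in seconds, Slurm waits for an RPC response
# before declaring a communication failure. The default is 10; raising it
# gives slurmd more slack while slurmctld is failing over.
MessageTimeout=100
```

The daemons then need to be told about the change (scontrol reconfig, or a restart of the services).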
Thanks, I've increased MessageTimeout to 100 and restarted the services.

If you can point me at a good whitepaper, KB article, or document that gives better direction on using the new SlurmctldHost directives and best practices for failover with a multi-slurmctld setup, I'd be happy to review it and see if there is a better way to deploy for my site. I just couldn't find anything on best practices for a multi-slurmctld setup, and got stung in the following scenario:

ControlMachine=atmos-mgmt1
ControlAddr=atmos-mgmt1
BackupController=atmos-mgmt2
BackupAddr=atmos-mgmt2

- Need to patch/reboot atmos-mgmt1
- Failed HA over to atmos-mgmt2
- Applied firmware updates to atmos-mgmt1 and rebooted
- Slurmctld hung for the timeout period (or until I remembered to "scontrol takeover" on the BackupController)
- Once atmos-mgmt1 came up, slurmctld started and immediately became the Primary again
- Installed OS updates on atmos-mgmt1 and rebooted
- Slurmctld hung for the timeout period (or until I remembered to "scontrol takeover" on the BackupController)

Because of the complexity of the management nodes, it's not uncommon that during a maintenance window I might need to do 3 or 4 reboots of the node (between the updates for firmware, RHEL7, GPFS, etc.). Each time atmos-mgmt1 came back up from a reboot and slurmctld started, it immediately took over as the Primary controller. Then on the next reboot I'd have to go through the timeout again, or go over to atmos-mgmt2 and do "scontrol takeover".

It's doable on my part to issue "scontrol takeover" on the BackupController with each reboot, or to simply "systemctl disable slurmctld.service" on atmos-mgmt1 during maintenance. But I wanted an automatic, don't-have-to-think/remember solution, because I'm trying to focus on the management node I am doing maintenance on. That's why I came up with what I did.
I have the HA service start/stop slurmctld based on which node is the HA master, and use systemd's "ExecStartPost" directive to run "scontrol takeover":

[root@atmos-mgmt2 ~]# cat /etc/systemd/system/slurmctld.service.d/takeover.conf
[Service]
ExecStartPost=/usr/sbin/slurmctld_takeover

[root@atmos-mgmt2 ~]# cat /usr/sbin/slurmctld_takeover
#!/bin/bash
/bin/sleep 3 # Make sure slurmctld is fully up and listening
/usr/local/bin/scontrol takeover
exit 0

This works perfectly as long as HA doesn't have an issue on failover :). Ideally, there would be an scontrol/dynamic way of telling Slurm:

> atmos-mgmt2 is now the ControlMachine and atmos-mgmt1 is now the BackupController;
> stay that way until I tell you differently

That way I could easily assign atmos-mgmt1 or atmos-mgmt2 as the ControlMachine as part of my HA failover, without having to stop and start slurmctld like I do now. Is there anything like that available with scontrol? Thanks.
There is no way for scontrol to do what you want, but there is a way to modify your ExecStartPost script and slurm.conf to do it.

First, in your slurm.conf, remove these lines:

> ControlMachine=atmos-mgmt1
> ControlAddr=atmos-mgmt1
> BackupController=atmos-mgmt2
> BackupAddr=atmos-mgmt2

and put the following line in instead:

> include /usr/local/etc/slurmctlds.conf

(The path may be whatever you wish.) Then create `/usr/local/etc/slurmctlds.conf` (or whatever you chose) with the following contents:

> SlurmctldHost=atmos-mgmt1
> SlurmctldHost=atmos-mgmt2

`SlurmctldHost` is the replacement for `ControlMachine`, `ControlAddr`, `BackupController`, and `BackupAddr`. The order they appear in determines which is primary and which is backup. I am having you put them in a separate file because that file will be edited on failover, in the ExecStartPost script. This is what the script should look like:

> #!/bin/bash
>
> /bin/sleep 3 # Make sure slurmctld is fully up and listening
> /usr/local/bin/scontrol takeover
>
> # As you probably know, scontrol takeover will terminate the primary, so we
> # can assume it is down at this point.
> # Edit /usr/local/etc/slurmctlds.conf to switch the backup to primary and
> # the primary to backup.
> backup=$(head -n 1 /usr/local/etc/slurmctlds.conf)
> primary=$(head -n 2 /usr/local/etc/slurmctlds.conf | tail -n 1)
> printf '%s\n%s\n' "$primary" "$backup" > /usr/local/etc/slurmctlds.conf
> # Reload the new config
> /usr/local/bin/scontrol reconfig
> exit 0

After that script runs, you should be fine to restart the new backup (old primary), and it won't take over because it will not be first in the list. Yes, this is a hack, and if it weren't for the "include" ability in slurm.conf, I couldn't recommend this. But I think it might work well for you. No promises, though.
This is an interesting angle on the problem I hadn't considered. I think the best way to handle it would be:

1) Modify slurm.conf to include /usr/local/etc/slurmctlds.conf as suggested
2) Enable slurmctld.service on both cluster manager nodes, outside of HA, so slurmctld is always running on both
3) Create a "slurmctld_update_controller.service" that is a "OneShot" service that does the following:
   - Rebuild /usr/local/etc/slurmctlds.conf based on which node it runs on:
     -- on atmos-mgmt1:
        SlurmctldHost=atmos-mgmt1
        SlurmctldHost=atmos-mgmt2
     -- on atmos-mgmt2:
        SlurmctldHost=atmos-mgmt2
        SlurmctldHost=atmos-mgmt1
   - Run /usr/local/bin/scontrol reconfig
4) Put slurmctld_update_controller.service into the HA setup in place of slurmctld.service

This way, slurmctld is always running on both nodes, but the HA service just updates /usr/local/etc/slurmctlds.conf on whichever node is made the master. If neither is made the master (i.e. HA fails, like happened this week), then nothing happens to slurmctld; the old master stays in place until I resolve the HA problem.

Before I implement this into our HA setup, I'd like to test it manually. Is there a way to query slurmctld as to which host is the master, so that as I issue:

/bin/systemctl start slurmctld_update_controller.service

on atmos-mgmt1 and atmos-mgmt2, I can verify that the update and scontrol reconfig worked as expected? Thanks.
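While wiring this up, the conf-rewriting step of that OneShot service could be sketched roughly as below. The function name and parameterization are mine (hypothetical, not part of any Slurm tooling), kept generic so the file-writing logic can be exercised outside the cluster:

```shell
#!/bin/bash
# Hypothetical helper for slurmctld_update_controller: write the included
# conf file with the local (HA master) node listed first, since the first
# SlurmctldHost line is treated as the primary.
# usage: write_slurmctlds_conf <active-host> <standby-host> <conf-path>
write_slurmctlds_conf() {
    printf 'SlurmctldHost=%s\nSlurmctldHost=%s\n' "$1" "$2" > "$3"
}

# The service on atmos-mgmt2 would then do something like:
#   write_slurmctlds_conf atmos-mgmt2 atmos-mgmt1 /usr/local/etc/slurmctlds.conf
#   /usr/local/bin/scontrol reconfig
```

Whether "scontrol reconfig" alone is enough to make the reordering take effect is exactly what the manual test below should confirm.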
Ok, that didn't work. I started with:

SlurmctldHost=atmos-mgmt1
SlurmctldHost=atmos-mgmt2

Verified with "scontrol ping":

[root@atmos-mgmt1 ~]# scontrol ping
Slurmctld(primary) at atmos-mgmt1 is UP
Slurmctld(backup) at atmos-mgmt2 is UP

Then changed to:

SlurmctldHost=atmos-mgmt2
SlurmctldHost=atmos-mgmt1

and issued "scontrol reconfig", and ALL hell broke loose. While the ping looked correct:

[root@atmos-mgmt1 ~]# scontrol ping
Slurmctld(primary) at atmos-mgmt2 is UP
Slurmctld(backup) at atmos-mgmt1 is UP

slurmctld on atmos-mgmt2 started getting RPC errors:

[2020-04-03T10:41:54.641] slurmctld version 19.05.5 started on cluster atmos
[2020-04-03T10:41:54.646] [job_submit/group_account] Submit Plugin Loaded.
[2020-04-03T10:41:54.648] slurmctld running in background mode
[2020-04-03T10:42:58.510] error: Invalid RPC received 1002 while in standby mode
[2020-04-03T10:42:58.510] error: Invalid RPC received 1002 while in standby mode
[2020-04-03T10:42:58.520] error: Invalid RPC received 1002 while in standby mode
[2020-04-03T10:42:58.521] error: Invalid RPC received 1002 while in standby mode
[2020-04-03T10:42:58.523] error: Invalid RPC received 1002 while in standby mode
[2020-04-03T10:42:58.523] error: Invalid RPC received 1002 while in standby mode

The compute nodes FREAKED OUT and we lost a few jobs on the cluster (from slurmd on one of the nodes that lost a job):
[2020-04-03T10:43:19.833] [885867.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2020-04-03T10:43:19.838] [885867.batch] done with job
[2020-04-03T10:43:43.582] error: Error responding to health check: Transport endpoint is not connected
[2020-04-03T10:43:43.583] launch task 885867.173 request from UID:1854 GID:187 HOST:172.20.12.35 PORT:6372
[2020-04-03T10:43:43.586] error: Invalid job credential from 1854@172.20.12.35: Job credential expired
[2020-04-03T10:43:43.586] error: stepd_connect to 885867.173 failed: No such file or directory
[2020-04-03T10:43:43.586] error: stepd_connect to 885867.173 failed: No such file or directory
[2020-04-03T10:43:43.586] error: stepd_connect to 885867.173 failed: No such file or directory
[2020-04-03T10:43:43.586] error: stepd_connect to 885867.173 failed: No such file or directory
[2020-04-03T10:43:43.586] error: stepd_connect to 885867.173 failed: No such file or directory
[2020-04-03T10:43:43.587] error: _rpc_launch_tasks: unable to send return code to address:port=172.20.12.35:58392 msg_type=6001: Transport endpoint is not connected

And atmos-mgmt1 thought it was still in charge. I am attaching the slurmctld logs from atmos-mgmt1/2 and the slurmd log from one of the compute nodes that freaked. I think the issue is, from "man slurm.conf":

> Slurm daemons should be shutdown and restarted if any of
> these parameters are to be changed: AuthType, BackupAddr, BackupController,
> ControlAddr, ControlMachine, PluginDir, StateSaveLocation, SlurmctldPort or
> SlurmdPort.

So, while "SlurmctldHost" is not explicitly listed in the man page, it probably falls into the same category as "ControlAddr" and "BackupAddr".
Created attachment 13592 [details] Slurmctld from atmos-mgmt1
Created attachment 13593 [details] Slurmctld from atmos-mgmt2
Created attachment 13594 [details] Slurmd from one of the compute nodes.
Did you do an `scontrol takeover` before switching the SlurmctldHost entries? I believe you will need to do that. And since that command takes down the old primary, you will need to bring it back up after the scontrol reconfig.
We have a small test cluster that we use to test upgrades to Slurm and other OS components before putting them onto the main system. It doesn't currently have a backup slurmctld setup. To prevent additional issues on my production cluster, let me go ahead and get a backup controller configuration set up on my test cluster; then I can test the failover procedure without worrying about causing another user impact. I'll get back to you on Monday or Tuesday once I've got that set up and done some testing.

But please confirm that the following workflow is what I need to do/test. Assuming I start with this in /usr/local/slurm/etc/slurmctlds.conf:

SlurmctldHost=atmos-mgmt1
SlurmctldHost=atmos-mgmt2

1) On atmos-mgmt2 I would "scontrol takeover"
2) On atmos-mgmt2 I would update /usr/local/slurm/etc/slurmctlds.conf to:
   SlurmctldHost=atmos-mgmt2
   SlurmctldHost=atmos-mgmt1
3) On atmos-mgmt2 I would "scontrol reconfigure"
4) On atmos-mgmt1 I would "/bin/systemctl start slurmctld"

/usr/local/* is on a shared filesystem across the cluster, so updating it on any one node updates it on all of them. Thanks.
Yes, that is the procedure I was thinking about.
Ok, so that doesn't work. I set up the backup controller on my test cluster and tried what you suggested, and Slurm just falls apart.

[root@luna-mgmt ~]# cat /usr/local/slurm/etc/slurmctld.conf
SlurmctldHost=luna-mgmt
SlurmctldHost=luna2

[root@luna-mgmt ~]# cat make_luna-mgmt_active.sh
#!/bin/bash
/usr/local/bin/scontrol takeover
/bin/cat > /usr/local/slurm/etc/slurmctld.conf << EOF
SlurmctldHost=luna-mgmt
SlurmctldHost=luna2
EOF
/usr/local/bin/scontrol reconfigure
exit 0

[root@luna2 ~]# cat make_luna2_active.sh
#!/bin/bash
/usr/local/bin/scontrol takeover
/bin/cat > /usr/local/slurm/etc/slurmctld.conf << EOF
SlurmctldHost=luna2
SlurmctldHost=luna-mgmt
EOF
/usr/local/bin/scontrol reconfigure
exit 0

I am attaching the slurmctld logs from luna-mgmt and luna2. The system was so confused that NONE of the s* commands would do anything other than time out:

> error: Unable to contact slurm controller (connect failure)

even though slurmctld was running on both nodes. I think the man page is right: if you want to make changes to the Primary and Backup controllers, you HAVE to restart the Slurm daemons. All of them.
Created attachment 13598 [details] Slurmctld from testing cluster, node luna-mgmt...
Created attachment 13599 [details] Slurmctld from testing cluster, node luna2...
In that case, you were right: the documentation is wrong in omitting SlurmctldHost from that list, and I will get that fixed. You will need to shut down and restart both controllers in order to do the kind of failover that you want, unfortunately.
One thing I forgot to say is that the logs indicate that compute nodes will have to be restarted as well for this procedure, which I was not aware of before. So I don't know how useful this will be.
> the logs indicate that compute nodes will have to be restarted as well
> for this procedure, which I was not aware of before. So I don't know how
> useful this will be.

Yeah, restarting slurmd on all our compute nodes on HA failover isn't going to be workable. We'll stick with what we're currently doing and hope that setting MessageTimeout to a higher value is enough to resolve the problem when we have an HA issue. Thanks.
You're welcome, and I apologize. Closing.