Sometimes, when our compute nodes run slurmd to register with slurmctld using the "dynamic future node" feature (https://slurm.schedmd.com/slurm.conf.html#OPT_Dynamic-Future-Nodes), the slurmd process daemonizes into the background and returns a successful exit code even though the node never actually registered with slurmctld, so the node is not doing any work yet. We know slurmctld did not register the new node, because `scontrol show nodes` does not list it. Sometimes slurmctld simply drops the attempted registration, and we have to go to the compute node and restart slurmd. The slurmd logs show nothing indicating that slurmd failed to register with slurmctld. Is there any way to query the slurmd process to see whether it has registered with slurmctld? Or a mode in which slurmd returns a failure exit code if it doesn't register successfully? Or is the only option to query `scontrol show nodes` on the slurmctld side?
I would expect to see something similar to this in the logs if the node wasn't able to register with the controller correctly:

[2022-07-22T13:31:45.025] error: Unable to register: Unable to contact slurm controller (connect failure)

Or are you seeing the registration request show up in the controller logs and it's just not responded to? If that's the case, I would be curious to know whether you've changed SlurmdTimeout from the default value of 300 seconds: https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout

I'm afraid there isn't a way to query slurmd to see if it has registered with the controller successfully. You can start slurmd in the foreground, and I would expect it to be pretty clear if it wasn't communicating with the controller, but that's not a process you can automate easily, which I'm sure is what you're looking for. The same holds true for debug logs.

A possible alternative to 'scontrol' might be 'sinfo'. It will give you information about nodes that you can use to verify that a node has registered.

Are you seeing nodes fail to register frequently?

Thanks, Ben
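Since the controller's view (sinfo or scontrol) is the only reliable signal, that check can be automated. Here is a minimal sketch, assuming the slurm client tools are on PATH; the node name, timeout, and poll interval are illustrative, not values from this thread:

```shell
# Poll the controller-side view (sinfo) until the node appears, since the
# slurmd exit code can't be trusted here. Returns 0 once slurmctld knows
# the node, 1 if the deadline passes first.
wait_for_registration() {
    node="$1"
    timeout="${2:-300}"   # seconds to wait before giving up
    interval="${3:-5}"    # seconds between polls
    deadline=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        # `sinfo -N -h -n <node>` prints one line per matching node,
        # and nothing at all if the node is unknown to slurmctld
        if sinfo -N -h -n "$node" 2>/dev/null | grep -q "$node"; then
            return 0
        fi
        sleep "$interval"
    done
    return 1
}
```

Usage would look like `wait_for_registration compute123 300 && echo registered`. Note this only reports what slurmctld believes, which is exactly the point: it sidesteps slurmd's unreliable exit code.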
A few things:

- Yes, we've modified SlurmdTimeout, from the default of 300 down to 60. Do you think we should change it back to 300?
- After more experiments, I've verified that this strange behavior happens very frequently only after an upgrade to slurm v22.05.2. Of the hundreds of nodes we try to register with slurmd, only ~25% succeed. It happens very infrequently on slurm v20.11 (<1% of the time).
- We're not seeing
```
[2022-07-22T13:31:45.025] error: Unable to register: Unable to contact slurm controller (connect failure)
```
in the slurmd logs, even with debug5 logging enabled. Slurmd is able to contact slurmctld.
- I suspect it is the case that "the registration request shows up in the controller logs and it's just not responded to". I will try again to see what kind of output shows up in slurmctld with debug enabled, and paste it here in a subsequent reply.

We do need to upgrade to slurm v22.05.2, so right now we're pursuing a strategy of introducing jitter into the slurmd requests, so they don't all thundering-herd slurmctld at once, and also adding slurmd restart retries. This is non-ideal, though, as we don't want to accidentally restart slurmd while it's on its way to a successful registration, which would introduce thrashing.
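For concreteness, the jitter-plus-retry workaround we're pursuing looks roughly like the sketch below. The node name, jitter window, and retry/poll counts are illustrative assumptions, and registration is confirmed via sinfo because the slurmd exit code is not reliable here:

```shell
#!/usr/bin/env bash
# Stagger slurmd startup with random jitter, then retry a bounded number
# of times if the node never shows up on the controller.
start_slurmd_with_retries() {
    node="$1"
    max_attempts="${2:-3}"
    jitter="${3:-30}"   # max random startup delay, seconds
    polls="${4:-60}"    # how many 5s polls before declaring failure;
                        # deliberately generous, so we don't restart a slurmd
                        # that is on its way to a successful registration
    for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
        sleep $(( RANDOM % jitter ))   # spread out the thundering herd
        systemctl restart slurmd
        for (( i = 0; i < polls; i++ )); do
            # success = slurmctld actually lists the node, not slurmd's exit code
            if sinfo -N -h -n "$node" 2>/dev/null | grep -q "$node"; then
                return 0
            fi
            sleep 5
        done
    done
    return 1
}
```

The long poll window before each retry is the knob that trades registration latency against the thrashing risk mentioned above.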
I do think that increasing the SlurmdTimeout back to 300 seconds will alleviate some of the failed registration attempts that you're seeing. You mention upgrading to 22.05.2 from 20.11. Is this an upgrade that has been partially performed? If so I assume that you have slurmdbd and slurmctld upgraded to 22.05 and maybe some of the nodes. Is that right? If you are able to collect some debug logs from the controller I would be interested in seeing them. Thanks, Ben
Our production environment is still all on v20.11. I am testing out v22.05.2 in a dev environment, and I am able to reproduce this "failed registration" very reliably. I will try increasing SlurmdTimeout to 300 to see what effect that has, and will also grab some debug logs from the controller to share with you.
I've got one additional question for you. Since this is on your dev environment, is it fully upgraded to 22.05.2, or are you practicing a rolling upgrade on that system to match what the upgrade process will look like on your production system? When you are able to collect some debug logs I would be glad to review them. Thanks, Ben
Hey Ben, I have results for you.

> Since this is on your dev environment, is it fully upgraded to 22.05.2, or are you practicing a rolling upgrade on that system to match what the upgrade process will look like on your production system?

This is a dev environment fully upgraded to 22.05.2, so slurmdbd, slurmctld, slurmd, and all tools (sinfo, sacct, etc.) are slurm 22.05.2. What I've done is an A/B experiment with the exact same configuration (nearly the same slurm.conf apart from slurm 22.05 changes, same instance type running slurmctld, same instance types running slurmd, etc.). A is the control on slurm 20.11.7, and B is on slurm 22.05.2.

The results: in experiment A (20.11.7), 84/84 of the compute nodes successfully registered with slurmctld (they show up in sinfo and in 'scontrol show nodes'). In experiment B (22.05.2), only 36/84 of the compute nodes successfully registered with slurmctld. Five minutes later, I restarted slurmd on all the failed compute nodes, and they all subsequently registered successfully. It seems that the more nodes there are trying to register, the worse this ratio becomes. So either there's some performance regression between slurm 20.11 and slurm 22.05, or we've tripped that regression via some unknown interaction. I can reproduce this every time I run this experiment, so please let me know if you'd like me to try anything else.

Some more information about our configuration:
- I disabled the SlurmdTimeout override, allowing it to default to 300 as you suggested
- both slurmctld and slurmd have debug5 logging enabled
- we use dynamic future nodes, as you can see from our slurm.conf
- both experiments request 84 r5.12xlarge instances from AWS, and when they become active, they run slurmd to try to register
- slurmd nodes launch slurmd using the following command: '/opt/slurm/sbin/slurmd -Fshape.r5.12xlarge:default-spot -f /etc/slurm/slurm.conf -M'
- I've included all the relevant log/config files I can think of (slurmctld with debug5, slurm.conf), with a prefix of 2011 for control experiment A on slurm 20.11.7, and a prefix of 2205 for experiment B
- I've also included 2205slurmd_failed_retried.log, which shows an experiment B slurmd compute node that failed to register fully with slurmctld (but reported success anyway), and then, 5 minutes later upon slurmd restart, registered successfully
Created attachment 26115 [details] Experiment B v22.05.2 slurmctld.log with debug5 logging
Created attachment 26116 [details] Experiment B v22.05.2 slurm.conf
Created attachment 26117 [details] Experiment B v22.05 slurmd.log This is a compute node's slurmd.log with debug5 logging. When it first starts up, slurmd seems to think it successfully registered with slurmctld, and the slurmd process returns a successful exit code of 0. Despite this, the node does not show up in sinfo, nor in 'scontrol show nodes'. It is only on a subsequent 'systemctl restart slurmd' run 5 minutes later that the node successfully registers with slurmctld and shows up in sinfo.
Created attachment 26118 [details] Experiment A v20.11.7 slurm.conf Control experiment, base slurm.conf, with debug5 enabled.
Created attachment 26119 [details] Experiment A v20.11.7 slurmctld.log Base control experiment showing slurmctld.log with debug5 enabled. All 84/84 compute instances successfully used slurmd to register with slurmctld on their first attempt.
Comment on attachment 26115 [details] Experiment B v22.05.2 slurmctld.log with debug5 logging With slurm v22.05.2, only 36/84 of the slurmd compute nodes were able to register with slurmctld on their first attempt (the bad nodes still had slurmd report a successful exit code of 0, but a subsequent sinfo or 'scontrol show nodes' did not show these nodes). 5 minutes later, I restarted slurmd on all the remaining compute nodes, and they all registered successfully.
Thank you for collecting that information. I'm still working through the logs, but I have a couple of questions about them. I'm trying to correlate the events in the logs from experiment B, but the timestamps don't overlap: the slurmctld logs run from 12:35 - 12:44, while the slurmd logs go from 13:49 - 13:57. Should these cover the same time period with the timestamps off, or were the slurmd logs from much later? I also couldn't see which node these slurmd logs were for. I would like to be able to correlate events in the slurmd logs with what's happening at the same time in the slurmctld logs. If you don't have logs for the same period of time readily available, could I have you run this test one more time? I'm also looking at where some of these logs are coming from, and there should be some additional information that shows up with the POWER DebugFlag. If you do need to run the test one more time, it would be nice to have that flag enabled as well. Thanks, Ben
Hi Ben, apologies for the slurmd log mixup. I accidentally sent you logs for a slurmd compute node from a period much later than the slurmctld logs. I've re-run the experiment with DebugFlags=power, everything else the same as experiment B. In this run, 39/84 nodes registered successfully. Let's call this experiment C. I've included the following files:
- c_slurm.conf
- c_slurmctld.log
- c_slurmd.log
The c_slurmd.log is for node compute3646, which didn't register successfully on the first slurmd attempt, but came in successfully when I retried it 5 minutes later. We really appreciate you taking a look at this; let us know if you need any further information!
Created attachment 26125 [details] Experiment C slurmd.log This node, compute3646, didn't register successfully on the first slurmd attempt, but succeeded 5 minutes later.
Created attachment 26126 [details] Experiment C slurmctld.log
Created attachment 26127 [details] Experiment C slurm.conf
Thank you for collecting that information again. I can see that the node actually attempted to contact the controller 3 times. The first two times, the controller was in the middle of handling another node's registration, and compute3646's registration attempt just didn't go through. The third time it attempted to register, no other nodes were registering, and it registered correctly. One thing I would like to have you try is configuring ResumeRate to limit the number of nodes slurm attempts to resume at once. Can you change 'ResumeRate=0' to 'ResumeRate=60' in your slurm.conf? I'm not necessarily saying that you have to keep it that way, but I'd like you to try it and let me know if it makes a difference in the behavior. Thanks, Ben
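For reference, that's a one-line change in the slurm.conf; ResumeRate is expressed in nodes per minute, and 0 disables the limit entirely (60 here is just the trial value, to be tuned afterward):

```
# slurm.conf fragment: ResumeRate caps how many nodes per minute
# slurmctld will return to service. 0 means no limit.
ResumeRate=60    # was ResumeRate=0
```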
I ran the experiment again with ResumeRate=60, and it didn't have any effect. Still only 33/84 successfully registered.
I've got one more quick question. You stated in comment 7 that the more nodes you resume at once the worse the ratio of failures becomes. Can you find the point at which you start seeing failures? In other words, does everything work fine with a 15 node job, but start failing at 16 nodes? Or wherever that point is where you start consistently seeing failures. Thanks, Ben
I've got a quick update. I discussed this with one of our developers who has done quite a bit of work with elastic computing. He is able to reproduce behavior that looks very similar to what you are seeing and is digging into what is causing the problem. Thanks, Ben
> Can you find the point at which you start seeing failures? It's hard to find an exact threshold, because every run varies a little. But when requesting 9 instances, I'm seeing only 4/9 of them successfully register. Which implies the threshold for this issue is very low. I can try to find a case when it happens when I'm only requesting 3-4 instances. > He is able to reproduce behavior that looks very similar to what you are seeing and is digging into what is causing the problem. Sounds great! We eagerly await those results.
We were able to track the problem down to a change where node names could be mapped to multiple machines as they were coming up. A fix for this problem was checked in and will be available in 22.05.3. We are planning on releasing 22.05.3 tomorrow, though there is still a possibility that the release will be pushed back a little later. Since the fix is checked in I'll close this ticket, but please let us know if you still see similar behavior in 22.05.3. Thanks, Ben
Great! I will give 22.05.3 a try when it's ready, and get back to you if we still see issues. Much Thanks, Matthew
FYI, I managed to test the new 22.05.3, and it has fixed our issue. When requesting a lot of nodes with dynamic future mode, they all register correctly, and we no longer experience the thundering herd issue. Much thanks for the quick turnaround! - Matthew