After a routine maintenance window, with no changes to the Slurm configuration or the network, we are seeing the following error when initiating a command such as:

srun -N 200 -n 200 --reservation=PreventMaint

slurmstepd: error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection timed out

Jobs of more than 200 nodes reliably reproduce the condition; at 100 nodes or fewer it is intermittent. Additionally, we occasionally observe another 'Connection timed out' when a job is completing: the job waits in the cleanup phase because it never receives the task-complete message. This occurs at a similar frequency and scale.

These job launches are done on a quiet system. There is no observable load on the master (where slurmctld, slurmdbd, and the database run) via top. The connection timeouts are reported in slurmctld.log as well as in slurmd.log on the offending compute node.

The sysctl.conf contains:

net.core.netdev_max_backlog = 5000
net.core.rmem_max = 2147483647
net.core.wmem_max = 2147483647
net.ipv4.tcp_rmem = 4096 65536 2147483647
net.ipv4.tcp_wmem = 4096 65536 2147483647
net.ipv4.tcp_mtu_probing = 1
net.ipv4.max_syn_backlog = 65536
net.core.somaxconn = 65535
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 1

We tried increasing these values, with no effect. Runs of more than 200 nodes generate the error very quickly.

Related error message:

srun: error: task <x> launch failed: Slurmd could not connect IO

No other network errors are being reported. Specifically, netstat -in shows no TX/RX errors. Our network debugging procedures are not finding any other network problems. Setting DebugFlags=Steps and raising the log level to debug3 does not produce any new errors, just dispatch and scheduling messages. (Note: this system's logs are not directly available.)

However, running salloc or sbatch does *not* reproduce this condition. Only srun invoked directly, outside an allocation, does.
Do sruns inside a bash script produce the same issue?
Sorry if my question is redundant, since you said: "running salloc or sbatch does *not* reproduce this condition. Only srun invoked directly, outside an allocation, does." We are looking into it.
Have you tried restarting the ncmd daemon (likely on sdb)?
This is not a Cray system. It is a standard architecture, with the master hosting slurmctld, slurmdbd, and the database.
Did anything change in your maintenance window? Any changes to hardware/software/configuration?
What is your Slurm message timeout (see "scontrol show config | grep MessageTimeout")? It would be possible to increase that and probably get jobs running again, but it would likely mask some other issue.
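For reference, a minimal sketch of checking and temporarily raising the timeout (assuming slurm.conf is at /etc/slurm/slurm.conf and kept identical on all nodes):

scontrol show config | grep MessageTimeout
# In /etc/slurm/slurm.conf, for example:
MessageTimeout=120
# Then have the daemons re-read the configuration:
scontrol reconfigure

Again, that would only hide whatever is actually delaying the messages, so I'd treat it as a diagnostic step rather than a fix.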
MessageTimeout is set to 60 seconds. We see the messages show up within a few seconds during a srun. Usually around 7 or 8 seconds into the run.
(In reply to Joseph 'Joshi' Fullop from comment #7)
> MessageTimeout is set to 60 seconds. We see the messages show up within a
> few seconds during a srun. Usually around 7 or 8 seconds into the run.

That is a great clue. Each of Slurm's compute node daemons reads its local configuration file for parameters like the message timeout. Is there any chance of different slurm.conf files on some of the compute nodes? Perhaps the daemon was started before some file system was mounted and read an old/vestigial configuration file?
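A quick way to check for that (a sketch assuming pdsh/dshbak are installed and slurm.conf lives at /etc/slurm/slurm.conf; the node range below is only an example) is to compare checksums across the cluster:

pdsh -w nid[00001-00200] md5sum /etc/slurm/slurm.conf | dshbak -c

Any node reporting a different hash than the controller's copy would be a suspect.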
Another thing that does not appear in the original post: when srun-ing a hostname job, the task on the node actually completes, and the output for the node reporting the error shows up in stdout. It appears that only the reporting of the launch is failing. For example, 'srun -N100 -n100 -l --reservation=x hostname' will output all 100 hosts' hostnames, but then sometimes throws one or more 'task#: slurmstepd: error: _send_launch_resp: Failed to send RESPONSE_LAUNCH_TASKS: Connection timed out' errors. Similarly, the reporting of task completion sometimes also fails. Both cases cause the job to not complete correctly. The slurm.conf files are all the same. We did a fresh reboot and have no reports of differing configs.
Would it be possible to get your Slurm configuration files? Did you look at the slurmctld and slurmd log files? Did anything there stand out? If the configuration files differ between nodes, you should see messages like this in the slurmctld log file: error: Node nid00001 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
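If it helps, a quick grep for that message (assuming slurmctld writes to /var/log/slurmctld.log; adjust to your SlurmctldLogFile setting) would be:

grep -i "different slurm.conf" /var/log/slurmctld.log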
I don't know if this helps, but here is more information about what causes the error:

The srun command sends a launch RPC to Slurm's slurmd daemons on the compute nodes. When the launch is completed by a slurmstepd process (spawned to manage the local application I/O, accounting, etc.), the slurmstepd initiates an RPC back to the srun. That message from slurmstepd to the srun is what is timing out.

Perhaps starting the srun with more verbose logging (add "-vvvv" to the srun command line) would provide some more information.
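Something along these lines (the reservation name is just the placeholder from your earlier example), capturing the output for later review:

srun -vvvv -N100 -n100 -l --reservation=x hostname 2>&1 | tee srun-debug.log

The verbose output should show the individual launch and I/O connection steps, which may help narrow down where the timeout occurs.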
Will you also try running:

sbatch -N100 -n100 --reservation=x --wrap="srun -l hostname"

I'm wondering if it has something to do with the communications between the compute nodes and the login nodes (unless you are submitting your sruns from a compute node).
(In reply to Moe Jette from comment #11)
> The srun command sends a launch RPC to Slurm's slurmd daemons on the
> compute nodes. When the launch is completed by a slurmstepd process
> (spawned to manage the local application I/O, accounting, etc.), the
> slurmstepd initiates an RPC back to the srun. That message from slurmstepd
> to the srun is what is timing out.

To add to this - srun needs to open ephemeral TCP ports for the slurmstepd process to connect back to. I'm guessing there's some sort of firewall between the login nodes and the compute nodes in this cluster?

If you're running srun outside an allocation, this implies that the login node will need to allow connections back from the compute nodes on any arbitrary TCP port. With larger jobs, there's a higher chance that this may have fallen outside a "normal" range that you may have permitted through a firewall, as it'll be opening more ports than normal.

Setting SrunPortRange in slurm.conf will at least restrict the range that srun listens on, and would make it easier to configure the appropriate firewall settings.
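A sketch of what that might look like (the port range and compute subnet below are only examples, not recommendations for your site):

# slurm.conf, on every node where srun runs
SrunPortRange=60001-63000

# On the login nodes, allow the compute subnet back in on that range:
iptables -A INPUT -p tcp -s 10.1.0.0/16 --dport 60001:63000 -j ACCEPT

With SrunPortRange set, only that range needs to be reachable for the slurmstepd-to-srun callbacks rather than the whole ephemeral port space.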
(In reply to Brian Christiansen from comment #12)
> Will you also try running:
> sbatch -N100 -n100 --reservation=x --wrap="srun -l hostname"

I ran this with -N200 -n200 since that's what reliably reproduces the error on srun. The job completed with no errors. Removing the --wrap= part and changing sbatch back to srun causes the errors to appear.
(In reply to Michael Jennings from comment #14)
> I ran this with -N200 -n200 since that's what reliably reproduces the error
> on srun. The job completed with no errors. Removing the --wrap= part and
> changing sbatch back to srun causes the errors to appear.

The slurmstepd process needs to talk to the srun command. In the sbatch case, the srun runs on the first compute node of the job's allocation. When executing srun directly, the communications are going back to the login node.

Can you think of any hardware, software or configuration differences between the login nodes and the compute nodes that might account for the failure when executing srun directly?
We are testing the network between the computes and the front ends now. Will update with results soon.
We have confirmed packets being dropped on the front end/login nodes and are investigating further.
We have found the problem. There were firewall rules relating to flood protection on the FE (front-end/login) nodes that were causing the drops when sruns were executed directly from the front ends, which is also why the problem only appeared at a certain scale. Additionally, likely due to the complexity of the firewall rules, when the firewall was dropped (and again after it was corrected) we saw the 300-node hostname runs drop from 20+ seconds to under 1 second. Thank you for your assistance in narrowing down our root cause. I think we understand the mechanics a bit better now, too.
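For anyone who lands here with similar symptoms, a quick check for rate-limit/flood-protection rules on a login node (iptables syntax shown; adjust for your firewall tooling) is something like:

iptables -L INPUT -v -n | grep -Ei 'limit|recent|hashlimit'

Watching the packet counters on any matching rules while reproducing with a large srun makes the drops easy to spot.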
Created attachment 5144 [details] Failure log for "srun -vvvv" as requested I went through the process of having someone DC the "srun -vvvv" that Moe asked for, so I'm going to go ahead and attach it here anyway in case it might reveal something else that's going on or just for future reference. Hope that's okay. The errors we were seeing about "IO" and RESPONSE_LAUNCH_TASKS are in there, so if they could potentially indicate anything apart from packet loss or network throttling, please let us know. If nothing else, at least I'm set up for going through that process a bit faster now! :-)