Hi,

A few weeks ago a change to the cgroup.conf file was made that resulted in a bad interaction with the Intel OPA driver, causing massive node outages.

The original cgroup.conf was:

CgroupAutomount=yes
CgroupMountpoint=/cgroup
ConstrainCores=yes
ConstrainRamSpace=no
ConstrainSwapSpace=no
TaskAffinity=no

And it was changed to:

CgroupAutomount=yes
CgroupMountpoint=/cgroup
ConstrainCores=yes
ConstrainRamSpace=no
ConstrainSwapSpace=no
TaskAffinity=no
# added 2019/05/30 https://bugs.schedmd.com/show_bug.cgi?id=5485
ConstrainKmemSpace=no

After making this change, the nodes in the cluster were set to do a rolling reboot to pick up the new configuration, which took most of a week to reach the majority of the nodes. Then on June 12 several nodes went unresponsive, or responded to pings but not to SSH requests or the serial console. After working with Intel to verify that the network hardware was actually not the cause of the issue (including a reboot of the OPA chassis switch), they could not find anything majorly wrong with the switch or our configuration.

They did direct us to change some sysctl parameters: tcp_mtu_probing=1 and net.core.somaxconn=2048 (previously 1024).

After two more rounds of 80+ nodes going unresponsive, the original change was reverted so that the ConstrainKmemSpace line is now commented out, and the 80+ nodes were rebooted. This kept the cluster stable for 24 hours, at which point we started rolling the reverted configuration out to all the nodes.

Now, ConstrainKmemSpace=no is not supposed to affect anything if ConstrainRamSpace is set to no, and "no" is the default setting unless set to yes, per the documentation. However, it does seem to cause something that simulates a network issue, resulting in ping loss, network file-system instability, and/or system hangs. We have not had this issue reoccur since the change was reverted.
Slurm branch:
https://github.com/SchedMD/slurm/tree/slurm-18-08-3-1

From https://github.com/SchedMD/slurm/blob/slurm-18-08-3-1/doc/man/man5/cgroup.conf.5:

ConstrainKmemSpace
    If configured to "yes" then constrain the job's Kmem RAM usage in addition to RAM usage. Only takes effect if ConstrainRAMSpace is set to "yes". The default value is "no". If set to yes, the job's Kmem limit will be set to AllowedKmemSpace if set; otherwise, the job's Kmem limit will be set to its RAM limit.
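For contrast, here is a hedged cgroup.conf fragment in which ConstrainKmemSpace would actually take effect per that man page excerpt; the AllowedKmemSpace value is purely illustrative, not a recommendation:

```
ConstrainRAMSpace=yes        # prerequisite: the kmem constraint only engages alongside the RAM constraint
ConstrainKmemSpace=yes       # kmem limit = AllowedKmemSpace if set, else the job's RAM limit
AllowedKmemSpace=1073741824  # illustrative value in bytes; omit to fall back to the RAM limit
```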
Please attach your current slurm.conf.

(In reply to Adam Hough from comment #0)
> After making this change, the nodes in the cluster were set to do a rolling
> reboot to pick up the new configuration, which took most of a week to reach
> the majority of the nodes. Then on June 12 several nodes went unresponsive,
> or responded to pings but not to SSH requests or the serial console. After
> working with Intel to verify that the network hardware was actually not the
> cause of the issue (including a reboot of the OPA chassis switch), they
> could not find anything majorly wrong with the switch or our configuration.
>
> They did direct us to change some sysctl parameters: tcp_mtu_probing=1 and
> net.core.somaxconn=2048 (previously 1024).

I find it odd that you would need Path MTU discovery on a homogeneous Omni-Path network. Do you have IPoFabric enabled with an 8KB MTU on all nodes?

Do you have syncookies enabled?
> $ sysctl net.ipv4.tcp_syncookies

Please see our large cluster guide here:
https://slurm.schedmd.com/big_sys.html

> After 2 more rounds of 80+ nodes going unresponsive, the original change was
> reverted so that the ConstrainKmemSpace line is now commented out, and the
> 80+ nodes were rebooted.
> This made the cluster stable for 24 hours, at which
> time we started rolling back all the nodes with the previous change.

Were the nodes rebooted after changing this setting?

Can you please call this as root on one of the afflicted nodes and attach the output to this ticket?
> grep -HR . /sys/fs/cgroup/memory/slurm*/ /cgroup/memory/slurm*/

Is the cluster currently stable with the change reverted?

> Now ConstrainKmemSpace=no is not supposed to affect anything if
> ConstrainRamSpace is set to no, and is the default setting unless set to yes
> per the documentation. However, it does seem to cause something to happen
> that simulates a network issue resulting in ping loss, network file-system
> instability, and/or system hangs.
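A hedged sketch of collecting that cgroup output from several nodes at once, for diffing a healthy node against an afflicted one; the node list, root SSH access, and output directory are assumptions for illustration, not site values:

```shell
#!/bin/sh
# Collect the Slurm memory-cgroup pseudo-files from each node into one
# file per node, so limits can be compared between healthy and bad nodes.
# NODES and OUTDIR are illustrative assumptions.
NODES="n2078 n2370"
OUTDIR=${OUTDIR:-/tmp/cgroup-dumps}
mkdir -p "$OUTDIR"
for n in $NODES; do
    ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$n" \
        'grep -HR . /sys/fs/cgroup/memory/slurm*/ /cgroup/memory/slurm*/' \
        > "$OUTDIR/$n.txt" 2> "$OUTDIR/$n.err"
done
ls "$OUTDIR"
```

The per-node files can then be compared with a plain `diff` to spot any kmem limit that was set despite ConstrainRamSpace=no.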
> We have not had this issue reoccur since the change was reverted.

Would it be possible to test this on a single node or a test system with the same Omni-Path hardware?

> Slurm branch:
> https://github.com/SchedMD/slurm/tree/slurm-18-08-3-1

Please consider upgrading to the latest Slurm 18.08 patchset; it should be safe to do rolling upgrades.
We are planning to move to 18.08.07 next Tuesday.

IPoIB is enabled on all the nodes in the fabric.

Yes, the nodes that were drained or down in Slurm were rebooted to pick up the reverted change. We are using xCAT's stateless booting, and the nodes pick up their Slurm configuration on boot through a post-boot script. For some changes that require a restart of slurmd, it is easier to make the change to the master file and then reboot the nodes to pick up the new change.

The cluster is currently stable with this setting reverted on most of the nodes. There are still 3 nodes with the bad setting that have not been rebooted, as they are currently running a job. I suspect that it was a combination of that setting and a user's job that was hitting something to cause the mass outage.

We can set up a single test node with this setting changed to test with, although the effects seem to need enough of the cluster to have the changed configuration before issues start. If needed we can set up a teleconference / screen-sharing session to allow someone to look at a system with this setting in place at either the 18.08.03 or 18.08.07 patch level.

As for Intel's recommended settings, we changed them to match what Intel recommends in their documentation:
https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_OP_Performance_Tuning_UG_H93143_v13_0.pdf

While I also find the recommended setting of "tcp_mtu_probing=1" odd, it does not affect the stability and did not make a change to the instability we encountered. The IPoIB MTU is the default of 2044 for the OPA IPoIB and was working okay before the change.
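A hedged sketch for auditing those tuning-guide sysctls on a node; the expected values are the ones from this ticket, and `check` is a hypothetical helper, not Intel or Slurm tooling:

```shell
#!/bin/sh
# Compare current sysctl values against the values this ticket settled on.
check() {
    cur=$(sysctl -n "$1" 2>/dev/null) || { echo "$1: unavailable"; return 0; }
    if [ "$cur" = "$2" ]; then
        echo "$1: ok ($cur)"
    else
        echo "$1: got $cur, want $2"
    fi
}
check net.ipv4.tcp_mtu_probing 1
check net.core.somaxconn 2048
```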
[root@n2078 ~]# ip link show ib0
5: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1024
    link/infiniband 80:81:00:02:fe:80:00:00:00:00:00:00:00:11:75:01:01:7b:e5:c4 brd 00:ff:ff:ff:ff:12:40:1b:80:01:00:00:00:00:00:00:ff:ff:ff:ff

Syncookies do look to be enabled on a node with the old setting, and this is the default for all the compute nodes.

[root@n2078 ~]# sysctl net.ipv4.tcp_syncookies
net.ipv4.tcp_syncookies = 1
[root@n2078 ~]# cat /proc/sys/fs/file-max
26120039
(This might need to get increased again but will need some research and testing.)
[root@n2078 ~]# cat /proc/sys/net/ipv4/tcp_max_syn_backlog
2048

From a node with the setting still set:

[root@n2078 ~]# ls /sys/fs/cgroup/memory/
cgroup.clone_children  memory.kmem.failcnt             memory.kmem.tcp.max_usage_in_bytes  memory.memsw.limit_in_bytes      memory.pressure_level       notify_on_release
cgroup.event_control   memory.kmem.limit_in_bytes      memory.kmem.tcp.usage_in_bytes      memory.memsw.max_usage_in_bytes  memory.soft_limit_in_bytes  release_agent
cgroup.procs           memory.kmem.max_usage_in_bytes  memory.kmem.usage_in_bytes          memory.memsw.usage_in_bytes      memory.stat                 tasks
cgroup.sane_behavior   memory.kmem.slabinfo            memory.limit_in_bytes               memory.move_charge_at_immigrate  memory.swappiness
memory.failcnt         memory.kmem.tcp.failcnt         memory.max_usage_in_bytes           memory.numa_stat                 memory.usage_in_bytes
memory.force_empty     memory.kmem.tcp.limit_in_bytes  memory.memsw.failcnt                memory.oom_control               memory.use_hierarchy
[root@n2078 ~]# ls /cgroup/
cpuset  freezer

The grep command did not return anything, which is expected since ConstrainRamSpace and ConstrainSwapSpace are set to "no".

One of the symptoms we saw during the outage is that some of the nodes randomly lost ping packets. After the change was reverted and enough nodes were rebooted, this issue also went away.

[root@n2370 ~]# ping -f -c 100 mox1-opa
PING mox1-opa.hyak.local (10.3.0.6) 56(84) bytes of data.
..............
--- mox1-opa.hyak.local ping statistics ---
100 packets transmitted, 86 received, 14% packet loss, time 241ms
rtt min/avg/max/mdev = 0.023/0.031/0.113/0.015 ms, ipg/ewma 2.435/0.036 ms
[root@n2370 ~]# ping -f -c 1000 mox1-opa
PING mox1-opa.hyak.local (10.3.0.6) 56(84) bytes of data.
........................................................................................................................
--- mox1-opa.hyak.local ping statistics ---
1000 packets transmitted, 880 received, 12% packet loss, time 2302ms
rtt min/avg/max/mdev = 0.024/0.036/0.154/0.021 ms, ipg/ewma 2.304/0.042 ms
[root@n2370 ~]# ping -f -c 1000 mox2-opa
PING mox2-opa.hyak.local (10.3.0.7) 56(84) bytes of data.
................................................................................................................
--- mox2-opa.hyak.local ping statistics ---
1000 packets transmitted, 888 received, 11% packet loss, time 2162ms
rtt min/avg/max/mdev = 0.028/0.041/0.180/0.022 ms, ipg/ewma 2.164/0.035 ms

- Adam
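When sweeping many nodes, the loss percentage in that iputils summary line can be pulled out mechanically; a minimal sketch, where `parse_loss` is a hypothetical helper and the awk field positions match the output format shown above:

```shell
#!/bin/sh
# Extract the packet-loss percentage from an iputils ping summary line,
# e.g. "1000 packets transmitted, 880 received, 12% packet loss, time 2302ms".
parse_loss() {
    # Split on ", "; the third field is "12% packet loss"; strip "% ..." off.
    awk -F', ' '/packet loss/ { sub(/%.*/, "", $3); print $3 }'
}
# Sample line taken verbatim from the output above:
echo "1000 packets transmitted, 880 received, 12% packet loss, time 2302ms" \
    | parse_loss
# prints: 12
```

In practice one would feed each node's `ping -f -c 1000 <node>` output into it and flag any node whose loss is above 0.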
Created attachment 10806 [details] UW Hyak Mox Slurm.conf Our current autogenerated slurm.conf.
(In reply to Adam Hough from comment #0)
> # added 2019/05/30 https://bugs.schedmd.com/show_bug.cgi?id=5485
> ConstrainKmemSpace=no

Since 18.08 (commit 32fabc5e006b8f41), the default for ConstrainKmemSpace is "no", so adding that to your config should have effectively changed nothing, if I understood your cgroup.conf changes correctly.

(In reply to Adam Hough from comment #0)
> Then on June 12 several nodes went unresponsive, or responded to pings but
> not to SSH requests or the serial console.

I assume you mean unresponsive in Slurm? Can you please attach your slurmctld logs? Which nodes went unresponsive? A list would be nice for looking at the logs, to avoid confusing other unrelated issues.

(In reply to Adam Hough from comment #4)
> Yes, the nodes that were drained or down in Slurm were rebooted to pick up
> the reverted change. We are using xCAT's stateless booting, and the nodes
> pick up their Slurm configuration on boot through a post-boot script. For
> some changes that require a restart of slurmd, it is easier to make the
> change to the master file and then reboot the nodes to pick up the new
> change.

That avoids any lingering cgroup configuration questions as a bonus.

> The cluster is currently stable with this setting reverted on most of the
> nodes.

Can we reduce this ticket to SEV3 since the cluster is currently stable?

Since setting ConstrainKmemSpace=no is a no-op, it is unclear how reverting the setting could have changed anything.

> there are still 3 nodes with the bad setting that have not been
> rebooted as they are currently running a job. I suspect that it was a
> combination of that setting and a user's job that was hitting something to
> cause the mass outage.

Can you please provide dmesg and slurmd logs from these nodes, along with the output of the grep command requested in comment #1.

> The IPoIB MTU is the default of 2044 for the OPA IPoIB and was working okay
> before the change.

Which change exactly are you referring to?
> Syncookies do look to be enabled on a node with the old setting, and this
> is the default for all the compute nodes.
> [root@n2078 ~]# sysctl net.ipv4.tcp_syncookies
> net.ipv4.tcp_syncookies = 1
> [root@n2078 ~]# cat /proc/sys/net/ipv4/tcp_max_syn_backlog
> 2048

With syncookies enabled, the backlog number is basically ignored by the kernel. The somaxconn limit will be enforced by the kernel, though.

> [root@n2078 ~]# cat /proc/sys/fs/file-max
> 26120039

Are you seeing any errors about too many file descriptors open? Can you please check the limits of the slurmctld and slurmd (on compute) processes?
> $ pgrep slurmctld | xargs -i grep -H . /proc/{}/limits
> $ pgrep slurmd | xargs -i grep -H . /proc/{}/limits

> From a node with the setting still set:
> [root@n2078 ~]# ls /sys/fs/cgroup/memory/

Calling ls is not helpful since the files are always the same; I need the contents of each file.

> The grep command did not return anything, which is expected since
> ConstrainRamSpace and ConstrainSwapSpace are set to "no".

It should have returned the contents of all the pseudo-files. Can you please attach that output?

> One of the symptoms we saw during the outage is that some of the nodes
> randomly lost ping packets. After the change was reverted and enough nodes
> were rebooted, this issue also went away.

Slurm has timeouts to catch when packets are timing out that we can increase. The ping output was fast enough that we shouldn't need to change that yet.
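A hedged sketch of that limits check wrapped in a reusable form; `limit_for` is a hypothetical helper, and it reads the standard /proc/<pid>/limits layout where the soft and hard values are the 4th and 5th whitespace-separated fields of the "Max open files" line:

```shell
#!/bin/sh
# Report the open-files limits for a named daemon via /proc/<pid>/limits,
# falling back gracefully when the process is not running.
limit_for() {
    pid=$(pgrep -o -x "$1" 2>/dev/null) || { echo "$1: not running"; return 0; }
    awk -v name="$1" '/Max open files/ { print name ": soft=" $4 " hard=" $5 }' \
        "/proc/$pid/limits"
}
limit_for slurmctld
limit_for slurmd
```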
Adam, Any updates? --Nate
Hi, we updated to Slurm 19.05.0 last week and then to 19.05.1-2 this week, so we were a little busy with that process.

Yes, you can lower it to the lowest severity level, as the cluster is stable. This ticket was mainly to make the Slurm team aware of our previous issue with that setting. I have no idea how to test it without changing the entire cluster back to the previous setting, which we cannot do on a production system.

I agree that explicitly setting "ConstrainKmemSpace=no" should have done absolutely nothing, but that was not the observed case: the cluster stabilized after ~20% of the nodes were rebooted with that setting commented out of cgroup.conf, and the ping packet loss decreased to 0% again for all the nodes once enough of the nodes were rebooted to remove the setting.

By "unresponsive", I mean the nodes were unresponsive to SSH or the serial-over-LAN console, and some nodes were also not ping-able on their OPA connection, Ethernet connection, or both. There were a handful of nodes that just lost their OPA connection, which meant those nodes lost communications with the slurmctld daemon, since in slurm.conf we have NodeAddr=<NodeHostname>-opa. The OPA-only loss state was rare.

Will work on the other information tomorrow.
(In reply to Adam Hough from comment #8) > Yes, you can lower it to the lowest severity level as the cluster is stable. Lowering severity per your response.
(In reply to Adam Hough from comment #8)
> This ticket was mainly to make the Slurm
> team aware of our previous issue with that setting. I have no idea how to
> test it without changing the entire cluster back to the previous setting,
> which we cannot do on a production system.

I will attempt to replicate this issue using a VM cluster instead. Can you please provide the output of the following:
> lsb_release -a
> uname -a

Replicating issues related to the Omni-Path drivers will likely not be possible using VMs.

> I agree that explicitly setting "ConstrainKmemSpace=no" should have done
> absolutely nothing, but that was not the observed case: the cluster
> stabilized after ~20% of the nodes were rebooted with that setting commented
> out of cgroup.conf, and the ping packet loss decreased to 0% again for
> all the nodes once enough of the nodes were rebooted to remove the setting.
>
> By "unresponsive", I mean the nodes were unresponsive to SSH or the serial
> over LAN console, and some nodes were also not ping-able on their OPA
> connection, Ethernet connection, or both.

If general non-Slurm communications were lost, I would guess that this is an OPA bug that is triggered by setting ConstrainKmemSpace. If you have a test cluster with OPA hardware that can replicate this issue, it would surely speed up the debugging process, but I also suspect that this is an issue for Intel.

> There were a handful of
> nodes that just lost their OPA connection, which meant those nodes lost
> communications with the slurmctld daemon, since in slurm.conf we have
> NodeAddr=<NodeHostname>-opa. The OPA-only loss state was rare.

If SSH isn't working, I wouldn't expect Slurm to work either. Were there any errors in the kernel logs (dmesg) during these periods?
I am closing this since we had another similar outage that was job-related this time. So the previous time might have been job-related as well, but we never found the bad job combination.