Ticket 7341 - ConstrainKmemSpace=no setting affected Intel Omnipath (OPA) Network
Summary: ConstrainKmemSpace=no setting affected Intel Omnipath (OPA) Network
Status: RESOLVED CANNOTREPRODUCE
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 18.08.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Nate Rini
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-07-02 17:34 MDT by Adam Hough
Modified: 2019-07-31 16:44 MDT
CC: 4 users

See Also:
Site: University of Washington


Attachments
UW Hyak Mox Slurm.conf (86.20 KB, text/plain)
2019-07-04 12:02 MDT, Adam Hough
Details

Description Adam Hough 2019-07-02 17:34:58 MDT
Hi,

A few weeks ago a change to the cgroup.conf file was made that resulted in a bad interaction with the Intel OPA driver that caused massive node outages.  

The original cgroup.conf file was:
CgroupAutomount=yes
CgroupMountpoint=/cgroup
ConstrainCores=yes
ConstrainRamSpace=no
ConstrainSwapSpace=no
TaskAffinity=no

And it was changed to:
CgroupAutomount=yes
CgroupMountpoint=/cgroup
ConstrainCores=yes
ConstrainRamSpace=no
ConstrainSwapSpace=no
TaskAffinity=no
# added 2019/05/30 https://bugs.schedmd.com/show_bug.cgi?id=5485
ConstrainKmemSpace=no

After making this change, the nodes in the cluster were set to do a rolling reboot to pick up the new configuration, which took most of a week to reach the majority of the nodes.  Then on June 12 several nodes went unresponsive, or responded to pings but not to SSH requests or the serial console. After working with Intel to verify that the network hardware was not the cause of the issue (including a reboot of the OPA chassis switch), they could not find anything majorly wrong with the switch or our configuration.

They did direct us to change some sysctl parameters: tcp_mtu_probing=1 and net.core.somaxconn=2048 (previously 1024).

After 2 more rounds of 80+ nodes going unresponsive, the original change was reverted so that the ConstrainKmemSpace line is now commented out, and the 80+ nodes were rebooted.  This made the cluster stable for 24 hours, at which point we started rolling the reverted configuration out to the rest of the nodes.

Now, per the documentation, ConstrainKmemSpace=no is not supposed to affect anything if ConstrainRamSpace is set to no, and "no" is the default unless explicitly set to yes.  However, it does seem to cause something that simulates a network issue, resulting in ping loss, network file-system instability, and/or system hangs.

We have not had this issue reoccur since the change was reverted.

Slurm branch:
https://github.com/SchedMD/slurm/tree/slurm-18-08-3-1

https://github.com/SchedMD/slurm/blob/slurm-18-08-3-1/doc/man/man5/cgroup.conf.5
ConstrainKmemSpace
If configured to "yes" then constrain the job's Kmem RAM usage in addition to
RAM usage. Only takes effect if ConstrainRAMSpace is set to "yes". The default
value is "no". If set to yes, the job's Kmem limit will be set to
AllowedKmemSpace if set; otherwise, the job's Kmem limit will be set to its RAM
limit.
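The documented behavior above can be sanity-checked directly on a node. A minimal sketch, assuming a cgroup-v1 memory controller on a typical x86_64 kernel; the helper name and the "unlimited" sentinel value (PAGE_COUNTER_MAX) are assumptions for illustration, not part of Slurm:

```shell
# Hedged sketch: report whether a kmem limit is actually applied in a
# memory cgroup directory.  On cgroup v1, an unconstrained
# memory.kmem.limit_in_bytes typically reads back as 9223372036854771712
# (PAGE_COUNTER_MAX on x86_64) -- treated here as "unlimited".
check_kmem_limit() {
    # $1 = path to a memory cgroup directory (e.g. a slurm job cgroup)
    local limit
    limit=$(cat "$1/memory.kmem.limit_in_bytes")
    if [ "$limit" -ge 9223372036854771712 ]; then
        echo "unlimited"
    else
        echo "$limit"
    fi
}
```

Running this against a job's cgroup directory (e.g. under /cgroup/memory/slurm/) would show whether any kmem cap was set despite the "no" configuration.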
Comment 1 Nate Rini 2019-07-02 18:04:14 MDT
Please attach your current slurm.conf.

(In reply to Adam Hough from comment #0)
> After making this change then nodes in the cluster were set to do a rolling
> reboot to pick up this new configuration which took most of a week to get
> the majority of the nodes.  Then on June 12 there were several nodes that
> went unresponsive or responded to pings but not to ssh request or the serial
> console. After working with Intel to verify that then network hardware was
> actually not the cause of the issue (including a reboot of the OPA chassis
> switch), they could not find anything majorly wrong with then switch or our
> configuration.
> 
> They did direct us to change some sysctl parameters of tcp_mtu_probing=1 and
> net.core.somaxconn=2048 (previously 1024).

I find it odd that you would need Path MTU discovery on a homogeneous Omnipath network. Do you have IPoFabric enabled with 8KB MTU on all nodes?

Do you have syncookies enabled?
> $ sysctl net.ipv4.tcp_syncookies

Please see our large cluster guide here: https://slurm.schedmd.com/big_sys.html

> After 2 more rounds of 80+ nodes going unresponsive, the original change was
> reverted so that the ConstrainKmemSpace line is now commented out and the
> 80+ nodes were rebooted.
> This made the cluster stable for 24 hours at which
> time we started rolling back all the nodes with the previous change. 
Were the nodes rebooted after changing this setting? Can you please run this as root on one of the afflicted nodes and attach the output to this ticket?
> grep -HR . /sys/fs/cgroup/memory/slurm*/ /cgroup/memory/slurm*/

Is the cluster currently stable with the change reverted?
 
> Now ConstrainKmemSpace=no is not suppose to affect anything if
> ConstrainRamSpace is set to no and is the default setting unless set to yes
> per the documentation.  However, it does seem to cause something to happen
> that simulates a network issue resulting in ping loss, network file-system
> instability and or system hangs.
> 
> We have not had this issue reoccur since then change was reverted.  
Would it be possible to test this on a single node or a test system with same omnipath hardware?

> Slurm branch:
> https://github.com/SchedMD/slurm/tree/slurm-18-08-3-1
Please consider upgrading to the latest Slurm 18.08 patchset; it should be safe to do rolling upgrades.
Comment 4 Adam Hough 2019-07-04 12:00:26 MDT
In reply to 

We are planning to move to 18.08.07 next Tuesday. IPoIB is enabled on all the nodes in the fabric.

Yes, the nodes that were drained or down in Slurm were rebooted to pick up the reverted change. We are using xCAT's stateless booting, and the nodes pick up their Slurm configuration on boot through a post-boot script. For some changes that require a restart of slurmd, it is easier to make the change to the master file and then reboot the nodes to pick up the new change.

The cluster is currently stable with this setting reverted on most of the nodes.  There are still 3 nodes with the bad setting that have not been rebooted, as they are currently running a job. I suspect that it was a combination of that setting and a user's job hitting something that caused the mass outage.

We can set up a single test node with this setting changed to test with, though the effects seem to require enough of the cluster to have the changed configuration before issues start. If needed, we can set up a teleconference / screen-sharing session to allow someone to look at a system with this setting in place at either the 18.08.03 or 18.08.07 patch level.

As for Intel's recommended settings, we changed them to match what is recommended in their documentation:
https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_OP_Performance_Tuning_UG_H93143_v13_0.pdf

While I also find the recommended setting of "tcp_mtu_probing=1" odd, it does not affect stability and made no difference to the instability we encountered.

The IPoIB MTU is the default of 2044 for the OPA IPoIB and was working okay before the change.

[root@n2078 ~]# ip link show ib0
5: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1024
    link/infiniband 80:81:00:02:fe:80:00:00:00:00:00:00:00:11:75:01:01:7b:e5:c4 brd 00:ff:ff:ff:ff:12:40:1b:80:01:00:00:00:00:00:00:ff:ff:ff:ff

Syncookies do look to be enabled on a node with the old setting, and this is the default for all the compute nodes.
[root@n2078 ~]# sysctl net.ipv4.tcp_syncookies
net.ipv4.tcp_syncookies = 1

[root@n2078 ~]# cat /proc/sys/fs/file-max
26120039

(This might need to get increased again but will need to do some research and testing.)
[root@n2078 ~]# cat /proc/sys/net/ipv4/tcp_max_syn_backlog
2048

From a node with the setting still set:
[root@n2078 ~]# ls  /sys/fs/cgroup/memory/
cgroup.clone_children  memory.kmem.failcnt             memory.kmem.tcp.max_usage_in_bytes  memory.memsw.limit_in_bytes      memory.pressure_level       notify_on_release
cgroup.event_control   memory.kmem.limit_in_bytes      memory.kmem.tcp.usage_in_bytes      memory.memsw.max_usage_in_bytes  memory.soft_limit_in_bytes  release_agent
cgroup.procs           memory.kmem.max_usage_in_bytes  memory.kmem.usage_in_bytes          memory.memsw.usage_in_bytes      memory.stat                 tasks
cgroup.sane_behavior   memory.kmem.slabinfo            memory.limit_in_bytes               memory.move_charge_at_immigrate  memory.swappiness
memory.failcnt         memory.kmem.tcp.failcnt         memory.max_usage_in_bytes           memory.numa_stat                 memory.usage_in_bytes
memory.force_empty     memory.kmem.tcp.limit_in_bytes  memory.memsw.failcnt                memory.oom_control               memory.use_hierarchy
[root@n2078 ~]# ls  /cgroup/
cpuset  freezer

The grep command did not return anything, which is expected since ConstrainRamSpace and ConstrainSwapSpace are set to "no".

One of the symptoms we saw during the outage is that some of the nodes seemed to randomly lose ping packets. After the change was reverted and enough nodes were rebooted, this issue also went away.

[root@n2370 ~]# ping -f -c 100 mox1-opa
PING mox1-opa.hyak.local (10.3.0.6) 56(84) bytes of data.
..............
--- mox1-opa.hyak.local ping statistics ---
100 packets transmitted, 86 received, 14% packet loss, time 241ms
rtt min/avg/max/mdev = 0.023/0.031/0.113/0.015 ms, ipg/ewma 2.435/0.036 ms
[root@n2370 ~]# ping -f -c 1000 mox1-opa
PING mox1-opa.hyak.local (10.3.0.6) 56(84) bytes of data.
........................................................................................................................
--- mox1-opa.hyak.local ping statistics ---
1000 packets transmitted, 880 received, 12% packet loss, time 2302ms
rtt min/avg/max/mdev = 0.024/0.036/0.154/0.021 ms, ipg/ewma 2.304/0.042 ms
[root@n2370 ~]# ping -f -c 1000 mox2-opa
PING mox2-opa.hyak.local (10.3.0.7) 56(84) bytes of data.
................................................................................................................
--- mox2-opa.hyak.local ping statistics ---
1000 packets transmitted, 888 received, 11% packet loss, time 2162ms
rtt min/avg/max/mdev = 0.028/0.041/0.180/0.022 ms, ipg/ewma 2.164/0.035 ms 


- Adam
Comment 5 Adam Hough 2019-07-04 12:02:58 MDT
Created attachment 10806 [details]
UW Hyak Mox Slurm.conf

Our current autogenerated slurm.conf.
Comment 6 Nate Rini 2019-07-05 10:16:53 MDT
(In reply to Adam Hough from comment #0)
> # added 2019/05/30 https://bugs.schedmd.com/show_bug.cgi?id=5485
> ConstrainKmemSpace=no

Since 18.08 (commit 32fabc5e006b8f41), the default for ConstrainKmemSpace is "no", so adding that to your config should have effectively changed nothing, if I understood your cgroup.conf changes correctly.

(In reply to Adam Hough from comment #0)
> Then on June 12 there were several nodes that
> went unresponsive or responded to pings but not to ssh request or the serial
> console.

I assume you mean unresponsive in Slurm? Can you please attach your slurmctld logs?

Which nodes went unresponsive? A list would be nice for looking at the logs to avoid confusing other unrelated issues.

(In reply to Adam Hough from comment #4)
> Yes, the nodes that were drained or down in slurm were rebooted to pick up
> then reverted change. We are using Xcat's stateless booting and the nodes
> pickup their slurm configuration on boot through a post boot script. For
> some changes that require a restart of slurmd, it is easier to make the
> change to the master file and then reboot the nodes to pick-up the new
> change.
That avoids any lingering cgroup configuration questions as a bonus.

> The cluster is currently stable with this setting reverted on most of the
> nodes.
Can we reduce this ticket to SEV3 since the cluster is currently stable?

Since setting ConstrainKmemSpace=no is a no-op, it is unclear how reverting the setting could have changed anything.

> there are still 3 nodes with then bad setting that have not been
> rebooted as they currently are running a job. I suspect that it was a
> combination of that setting and a users job that was hitting something to
> cause the mass outage.
Can you please provide dmesg and slurmd logs from these nodes along with the grep command requested in comment #1.

> The IPoIB MTU is the default of 2044 for the OPA IPoIB and was working okay
> before then change.
Which change exactly are you referring to?
 
> Syncookies does look to be enabled on a node with the old setting and is
> default for all the compute nodes. 
> [root@n2078 ~]# sysctl net.ipv4.tcp_syncookies
> net.ipv4.tcp_syncookies = 1
> [root@n2078 ~]# cat /proc/sys/net/ipv4/tcp_max_syn_backlog
> 2048
With syncookies enabled, the backlog number is basically ignored by the kernel. The somaxconn limit will still be enforced by the kernel, though.
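As a side note on how these two knobs interact: the kernel silently caps the backlog a server passes to listen() at net.core.somaxconn, so somaxconn is the binding limit on the accept queue. A minimal sketch of that relationship (the helper name is hypothetical, for illustration only):

```shell
# The kernel truncates listen() backlogs larger than net.core.somaxconn,
# so the effective accept-queue depth is the smaller of the two values.
effective_accept_queue() {
    # $1 = net.core.somaxconn, $2 = backlog requested by the application
    local somaxconn=$1 requested=$2
    if [ "$requested" -gt "$somaxconn" ]; then
        echo "$somaxconn"
    else
        echo "$requested"
    fi
}
```

This is why raising somaxconn from 1024 to 2048 only helps daemons that actually request a backlog larger than 1024.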

> [root@n2078 ~]# cat /proc/sys/fs/file-max
> 26120039

Are you seeing any errors about too many file descriptors open?

Can you please check the limits of the slurmctld and slurmd (on compute) processes?
> $ pgrep slurmctld | xargs -i grep -H . /proc/{}/limits
> $ pgrep slurmd | xargs -i grep -H . /proc/{}/limits

> From an node with the setting still set:
> [root@n2078 ~]# ls  /sys/fs/cgroup/memory/
Calling ls is not helpful since the file names are always the same; I need the contents of each file.

> The grep command did not return anything which is expected since
> ConstrainRamSpace and ConstrainSwapSpace are being set to "no".
It should have returned the contents of all the pseudo files. Can you please attach that output?
 
> One of the symptoms we saw during the outage is that some of the nodes
> randomly it seemed lost ping packets. After then change was reverted and
> enough nodes were rebooted then this issue also went away. 
Slurm has timeouts, which we can increase, to catch when packets are timing out. The ping output was fast enough that we shouldn't need to change them yet.
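For reference, the timeouts in question are tunable in slurm.conf; a hedged sketch of the relevant parameters (the values shown are the stock defaults, not recommendations for this cluster):

```
# slurm.conf fragment - illustrative default values only
MessageTimeout=10     # seconds allowed for slurmctld/slurmd RPC round trips
SlurmdTimeout=300     # seconds before an unreachable slurmd is marked DOWN
```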
Comment 7 Nate Rini 2019-07-16 15:37:08 MDT
Adam,

Any updates?

--Nate
Comment 8 Adam Hough 2019-07-16 19:46:40 MDT
Hi, we updated to Slurm 19.05.0 last week, then 19.05.1-2 this week, so we were a little busy with that process.  Yes, you can lower it to the lowest severity level, as the cluster is stable.  This ticket was mainly to make the Slurm team aware of our previous issue with that setting.  I have no idea how to test it without changing the entire cluster back to the previous setting, which we cannot do to a production system.

I agree that explicitly setting "ConstrainKmemSpace=no" should have done absolutely nothing, but that was not the observed case: the cluster stabilized after ~20% of the nodes were rebooted with that setting commented out of cgroup.conf, and the ping packet loss dropped back to 0% for all the nodes once enough of them were rebooted to remove the setting.

By "unresponsive", I mean the nodes were unresponsive to SSH or serial-over-LAN console, and some nodes were not ping-able on their OPA connection, their Ethernet connection, or both.  There were a handful of nodes that just lost their OPA connection, which meant they lost communication with the slurmctld daemon, since in slurm.conf we have NodeAddr=<NodeHostname>-opa. The OPA-only loss was rare.

Will work on the other information tomorrow.
Comment 9 Nate Rini 2019-07-16 19:54:28 MDT
(In reply to Adam Hough from comment #8)
> Yes, you can lower it to the lowest severity level as the cluster is stable.
Lowering severity per your response.
Comment 10 Nate Rini 2019-07-16 19:59:58 MDT
(In reply to Adam Hough from comment #8)
> This ticket was mainly to make the Slurm
> team aware of our previous issue with that setting.  I have no idea how to
> test it without changing the entire cluster back to the previous setting
> which we cannot do to a production system.  
I will attempt to replicate this issue using a VM cluster instead. Can you please provide the output of the following:
> lsb_release -a
> uname -a

Replicating issues related to the omnipath drivers will likely not be possible using VMs.

> I agree that explicitly setting "ConstrainKmemSpace=no" should have done
> absolutely nothing but that was not the observed case since the cluster
> stabilized after ~20% of the nodes were rebooted with that setting commented
> out of the cgroups.conf and the ping packet loss decreased to 0% again for
> all the nodes once enough of the nodes were rebooted to remove the setting.
> 
> By "unresponsive", I mean then nodes were unresponsive to SSH or serial over
> lan console and some nodes were either not ping-able on their OPA
> connection, ethernet connection, or both as well.
If general non-Slurm communications were lost, I would guess that this is an OPA bug that is triggered by setting ConstrainKmemSpace. If you have a test cluster with OPA hardware that you can replicate this issue with, it would surely speed up the debug process but I also suspect that this is an issue for Intel.

> There were a handful of
> nodes that just lost their OPA connection which meant the nodes lost
> communications with the slurmctld daemon as in the slurm.conf we have
> NodeAddr=<NodeHostname>-opa. The loss of the OPA only state was rare.
If SSH isn't working, I wouldn't expect Slurm to work either. Were there any errors in the kernel logs (dmesg) during these periods?
Comment 11 Adam Hough 2019-07-31 16:44:15 MDT
I am closing this since we had another similar outage that was job-related this time. So the previous outage might have been job-related as well, but we never found the bad job combination.