Ticket 9683

Summary: cgroups/oom/failed jobs
Product: Slurm
Reporter: Matt Ezell <ezellma>
Component: Other
Assignee: Nate Rini <nate>
Status: RESOLVED TIMEDOUT
Severity: 3 - Medium Impact
CC: england
Version: 20.02.0
Hardware: Linux
OS: Linux
Site: ORNL-OLCF
Attachments: logs
Collected data from affected machine
Slurmd.log with debug level 3, job #333 failures.
complete log
slurm_ticket_20200902.tgz

Description Matt Ezell 2020-08-27 10:53:53 MDT
Created attachment 15634 [details]
logs

A coworker at ORNL sent me the following report:

While testing a new machine we started to see some issues. Initially we saw a number of errors like:
 
andes1: /var/log/slurmd.log-20200815.gz:[2020-08-14T09:29:14.437] fatal: _handle_connection: pthread_create error Resource temporarily unavailable
 
In most cases slurmd would restart, or the node would be marked down.
 
slurmctld.log-20200820.gz:[2020-08-19T12:34:11.443] error: Nodes andes2 not responding, setting DOWN
 
The first action I took was to increase the open-file limit with a systemd drop-in:
 
/etc/systemd/system/slurmd.service.d/override.conf
[Service]
LimitNOFILE=500000
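A quick way to confirm a drop-in like this actually took effect is to reload systemd, restart the service, and then read the limits the running daemon inherited from /proc. A minimal sketch; the `systemctl` steps assume the unit is named `slurmd` and are commented out since they need root, and `$$` (the current shell) is used only to illustrate the limits-file format:

```shell
# On the affected node, as root:
#   systemctl daemon-reload && systemctl restart slurmd
#   systemctl show slurmd -p LimitNOFILE   # should now print LimitNOFILE=500000

# Verify what a running process actually inherited by reading its limits
# file. Shown against the current shell ($$) for illustration; on a node,
# substitute $(pgrep -o slurmd) for $$.
grep 'Max open files' /proc/$$/limits
```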
 
This led to an oom message:
 
andes3: /var/log/slurmd.log-20200820.gz:[2020-08-19T15:02:12.792] [226.extern] _oom_event_monitor: oom-kill event count: 1
 
So I disabled the cgroup memory constraint (our default setting was ConstrainRAMSpace=yes):
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=no
TaskAffinity=no
AllowedRAMSpace=95
 
In some cases this caused our health check to fail to get enough memory:
 
andes3: /var/log/slurmd.log-20200820.gz:[2020-08-19T13:36:48.963] error: executing health_check: fork: Cannot allocate memory
andes3: /var/log/slurmd.log-20200820.gz:[2020-08-19T13:36:48.969] error: /opt/bin/goss-health-check.sh: exited with status 0xffffffff
 
This would cause the job to fail. I did not look into it much; I just disabled the health check to see if that helped, and it did: jobs now complete.
 
We are still seeing a lot of this type of error:
 
andes5: /var/log/slurmd.log-20200820.gz:[2020-08-19T17:03:55.224] [238.2] error: slurm_receive_msgs: Socket timed out on send/recv operation
andes5: /var/log/slurmd.log-20200820.gz:[2020-08-19T17:04:26.264] [238.2] error: slurm_receive_msgs: Socket timed out on send/recv operation
 
We are also dealing with some IB fabric issues, which may or may not be related; we now see these messages even when the job completes.
One other piece of information: we don't see these send/recv errors until around 5600 MPI tasks.
 
We think we can reproduce the job failure/oom issue with the OSU MPI Benchmarks by reverting our configs.
 
I have attached the contents of `/etc/slurm`, a sample of logs for a job, and the slurmctld.log for that day.
Comment 1 Jason Booth 2020-08-27 12:07:24 MDT
Matt, can you check to see if you have the following set in the slurmd service?

>Delegate=yes



From systemd documentation:

https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/

Delegation

Service and scope units know a special Delegate boolean property. If set, then the processes inside the scope or service may control their own control group subtree (that means: create subcgroups directly via /sys/fs/cgroup). The effect of the property is that:

    All controllers systemd knows are enabled for that scope/service, if the scope/service runs privileged code.
    Access to the cgroup directory of the scope/service is permitted, and files/and directories are updated to get write access for the user specified in User= if the scope/unit runs unprivileged. Note that in this case access to any controllers is not available.
    systemd will refrain from moving processes across the "delegation" boundary.

Generally, the Delegate property is only useful for services that need to manage their own cgroup subtrees, such as container managers. After creating a unit with this property set, they should use /proc/$PID/cgroup to figure out the cgroup subtree path they may manage (the one from the name=systemd hierarchy!). Managers should refrain from making any changes to the cgroup tree outside of the subtrees for units they created with the Delegate flag turned on.
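The delegation boundary described above can be checked directly on a node. A hedged sketch: the `systemctl` line assumes the unit is named `slurmd` and is left commented out, and `$$` is used only to illustrate the format of the cgroup file the systemd docs point at:

```shell
# Confirm the unit actually carries the Delegate property:
#   systemctl show slurmd -p Delegate      # expect Delegate=yes

# Per the systemd docs, the subtree the daemon may manage is read from
# /proc/$PID/cgroup. Illustrated with the current shell; on a node use
# $(pgrep -o slurmd) in place of $$.
cat /proc/$$/cgroup
```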
Comment 2 Clay England 2020-08-27 14:21:41 MDT
Yes that is set for slurmd.service:

 /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
TasksMax=infinity

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/slurmd.service.d/override.conf
Comment 3 Nate Rini 2020-08-27 19:35:22 MDT
(In reply to Clay England from comment #2)
>  /usr/lib/systemd/system/slurmd.service
> [Service]
> Type=forking

This is unrelated to this bug, but please update the systemd unit file to match: https://github.com/SchedMD/slurm/blob/master/etc/slurmd.service.in (in src/etc post compile). There are fewer surprise issues with systemd with the new config.
Comment 4 Nate Rini 2020-08-27 19:48:19 MDT
(In reply to Matt Ezell from comment #0)
> We think we can reproduce the job failure/oom issue with OSU MPI Benchmarks
> by reverting our configs

Is it possible to get debug3 logs from slurmd during one of these tests?

(In reply to Matt Ezell from comment #0)
> andes1: /var/log/slurmd.log-20200815.gz:[2020-08-14T09:29:14.437] fatal:
> _handle_connection: pthread_create error Resource temporarily unavailable

Were there any OOM events in dmesg or other low memory errors at this time in the system logs?

Can we get the following called on one of these nodes:
> free -h
> cat /proc/meminfo 
> cat /proc/swaps
> cat /proc/slabinfo
> sysctl -a |grep -i -e mem -e oom
> numactl -s -H
> uname -a
> slurmd -V
> pgrep slurmd | xargs -i grep -nH .  /proc/{}/cgroup
Comment 5 Clay England 2020-08-28 07:41:19 MDT
Created attachment 15641 [details]
Collected data from affected machine
Comment 6 Clay England 2020-08-28 07:45:33 MDT
I have a puppet branch ready to test with debug3 set, rolling back all the changes I mentioned except for the NOFILE increase.

There were no system memory or oom messages in dmesg, which I thought was odd, but I figured it was something to do with the cgroups settings.

I have attached the data collected from one of the nodes in the cluster.

I am waiting on some other testing to be completed before gathering the debug info, I will post it here as soon as I can.

Clay
Comment 7 Clay England 2020-08-28 08:57:40 MDT
Created attachment 15642 [details]
Slurmd.log with debug level 3, job #333 failures.

This is the debug level 3 log file that shows the oom behavior, but not the pthread failures, since we saw those when NOFILE was lower.

Also from the coworker running the jobs:
srun -n 6400 -N 200 ./mpi/startup/osu_init

By the way, at 3200 ranks it does work (I ran that before this 6400-rank one).

So this was just osu_init from the OSU MPI Benchmark suite.
Comment 8 Nate Rini 2020-08-28 09:30:35 MDT
(In reply to Clay England from comment #5)
> Created attachment 15641 [details]
> Collected data from affected machine

> ***************
> *  slurmd -V  *
> ***************
> 
> slurm 20.02.0
Please upgrade this machine to 20.02.4 ASAP. The x.0 releases are not suggested for production machines. There was also a fix for a cgroups race condition in .3 that may be related.
Comment 9 Nate Rini 2020-08-28 10:03:15 MDT
(In reply to Clay England from comment #6)
> There were no system memory or oom messages in dmesg which I thought was odd
> but figured it was something to do with the cgroups settings.

The kernel doesn't always report these issues if the OOM is caught via the cgroup event fd.
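On a cgroup-v1 node, the kill counter can be read straight out of the memory controller even when dmesg shows nothing. A sketch; the `/sys/fs/cgroup/memory/slurm` path is an assumption about this site's hierarchy, and the snippet degrades gracefully where that controller isn't mounted:

```shell
# Hypothetical cgroup v1 path for Slurm's memory hierarchy; adjust per node.
CG=/sys/fs/cgroup/memory/slurm
if [ -r "$CG/memory.oom_control" ]; then
    # oom_kill counts kills delivered via the cgroup event interface,
    # which can increment without a corresponding kernel-log entry.
    grep oom_kill "$CG/memory.oom_control"
else
    echo "no cgroup v1 memory controller mounted at $CG"
fi
```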

(In reply to Clay England from comment #7)
> Created attachment 15642 [details]
> Slurmd.log with debug level 3, job #333 failures.
The log appears to be truncated as there are no logs of the job ending. Can you please verify?

> $ sha1sum slurmd.log 
> 1ae62fee220235fd48dce178b8e80a58db6880cd  slurmd.log


> [2020-08-28T10:16:15.477] [333.0] debug2: _file_read_content: unable to open '/sys/fs/cgroup/cpuset/slurm/cpus' for reading : No such file or directory
> [2020-08-28T10:16:15.477] [333.0] debug2: xcgroup_get_param: unable to get parameter 'cpus' for '/sys/fs/cgroup/cpuset/slurm'
Do you see the _file_read_content error on other nodes?

Can you please call this on the node:
> ls -la /sys/fs/cgroup/cpuset/slurm/
Comment 10 Clay England 2020-09-01 11:22:01 MDT
Created attachment 15679 [details]
complete log

I think the log had not finished writing, which is why it was incomplete.  I have attached the full log from that day.

Here is the ls output you requested:
[root@andes1 ~]# ls -la /sys/fs/cgroup/cpuset/slurm/
total 0
drwxr-xr-x 2 root root 0 Sep  1 12:40 .
dr-xr-xr-x 3 root root 0 Sep  1 09:57 ..
-rw-r--r-- 1 root root 0 Sep  1 13:04 cgroup.clone_children
-rw-r--r-- 1 root root 0 Sep  1 13:04 cgroup.procs
-rw-r--r-- 1 root root 0 Sep  1 13:04 cpuset.cpu_exclusive
-rw-r--r-- 1 root root 0 Sep  1 13:04 cpuset.cpus
-r--r--r-- 1 root root 0 Sep  1 13:04 cpuset.effective_cpus
-r--r--r-- 1 root root 0 Sep  1 13:04 cpuset.effective_mems
-rw-r--r-- 1 root root 0 Sep  1 13:04 cpuset.mem_exclusive
-rw-r--r-- 1 root root 0 Sep  1 13:04 cpuset.mem_hardwall
-rw-r--r-- 1 root root 0 Sep  1 13:04 cpuset.memory_migrate
-r--r--r-- 1 root root 0 Sep  1 13:04 cpuset.memory_pressure
-rw-r--r-- 1 root root 0 Sep  1 13:04 cpuset.memory_spread_page
-rw-r--r-- 1 root root 0 Sep  1 13:04 cpuset.memory_spread_slab
-rw-r--r-- 1 root root 0 Sep  1 13:04 cpuset.mems
-rw-r--r-- 1 root root 0 Sep  1 13:04 cpuset.sched_load_balance
-rw-r--r-- 1 root root 0 Sep  1 13:04 cpuset.sched_relax_domain_level
-rw-r--r-- 1 root root 0 Sep  1 13:04 notify_on_release
-rw-r--r-- 1 root root 0 Sep  1 13:04 tasks


I have upgraded the computes but not the server to 20.02.4 and saw the same oom behavior.  I am trying to get all the nodes upgraded so we have a consistent cluster.  Thanks for your patience.
Comment 11 Nate Rini 2020-09-01 17:00:50 MDT
(In reply to Clay England from comment #10)
> I have upgraded the computes but not the server to 20.02.4 and saw the some
> oom behavior.

In general we suggest this order of upgrade: slurmdbd, slurmctld, and then slurmd. Since this is only a patch-level upgrade it should be fine, but please consider this in the future.

> Thanks for your patience.
We are happy to work on your schedule with these tickets.
Comment 12 Nate Rini 2020-09-01 17:11:20 MDT
Please call the following on one of the trouble nodes:
> pgrep slurmd | xargs -i grep . -nH /proc/{}/limits /proc/{}/cgroup
> grep -nHR . /sys/fs/cgroup/cpuset/slurm/
Comment 13 Clay England 2020-09-03 11:55:03 MDT
Created attachment 15732 [details]
slurm_ticket_20200902.tgz

I have attached the output of the requested commands in the tarball.

Current state of the cluster:

All pieces have been upgraded to Slurm 20.02.4.
And currently the cgroups config is set like below:

cat /etc/slurm/cgroup.conf
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=no
TaskAffinity=no
AllowedRAMSpace=95

So we are not seeing the oom messages/dying jobs/slurmds.  I am still very concerned about this, but others are running some tests of the filesystem and need the jobs to complete.
We can flip ConstrainRAMSpace to yes after they are finished.

I have also attached some debug3 output from two osu_init runs: one with 5000 tasks and no send/recv errors, and one with 6000 tasks that does have the send/recv errors:
srun: error: slurm_msg_sendto: address:port=172.30.18.180:6818 msg_type=5029: Socket timed out on send/recv operation

Any ideas about what this might be?  Also, when I put the cgroups config back, is there anything else you would like me to try?

Clay


Comment 14 Nate Rini 2020-09-03 12:54:51 MDT
(In reply to Clay England from comment #13)
> So we are not seeing the oom messages/dying jobs/slurmds.  I am still very
> concerned about this but other are running some tests of the filesystem and
> need the jobs to complete.
> We can flip the ConstrainRAMSpace to yes after they are finied.
Having logs from a failure would be helpful.
 
Once memory constraints are activated again, please call this during one of test jobs:
> grep -nHR . /sys/fs/cgroup/memory/slurm/

 
> I have also attached some debug3 output from an osu_init run one with 5000
> tasks, with no send/recv errors, and one with 6000 tasks that does have the
> send/recv errors?
> srun: error: slurm_msg_sendto: address:port=172.30.18.180:6818
> msg_type=5029: Socket timed out on send/recv operation

The OSU logs have these entries:
> OSU_INIT_391.log:slurmstepd: debug:  Note large processing time from _fork_all_tasks: usec=1087178 began=13:28:44.082
> OSU_INIT_391.log:slurmstepd: Warning: Note very large processing time from _fork_all_tasks: usec=3088324 began=13:28:44.084
This shows that the fork()/execve() calls are slow (3s and 1s respectively). This is usually caused by a slow filesystem. One way to verify this is to call the srun parts of the job multiple times, or run jobs multiple times on the same node, to see if the dentry cache speeds them up.
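That warm-cache check can be scripted; a hedged sketch that times the same command twice on a node. `/bin/true` stands in for the real srun step here (an assumption so the sketch is self-contained), and `date +%s%N` assumes GNU coreutils:

```shell
# Time a command in milliseconds using GNU date's nanosecond format.
time_ms() {
    start=$(date +%s%N)
    "$@" > /dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Run the same step twice on the same node; a markedly faster second run
# points at a cold dentry/page cache (a slow filesystem) rather than Slurm.
first=$(time_ms /bin/true)
second=$(time_ms /bin/true)
echo "first=${first}ms second=${second}ms"
```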

We also see these errors:
>      68 error: slurm_msg_sendto: address:port=............:.... msg_type=....: Socket timed out on send/recv operation
>      80 error: slurm_msg_sendto: address:port=.............:.... msg_type=....: Socket timed out on send/recv operation
This error means that the REQUEST_FORWARD_DATA RPC is way too slow. This RPC is used by X11 forwarding and the PMIx plugin. We can increase this timeout by raising MessageTimeout in slurm.conf; the default is 10s. I suggest MessageTimeout=120 to see if we can at least get this error to stop. This looks like a network issue, with packets getting lost and/or TCP resend being too slow.
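Concretely, that suggestion is a one-line slurm.conf change (120 is the value proposed above; tune as needed), picked up with an `scontrol reconfigure`:

```
# slurm.conf -- raise the RPC timeout from the 10s default while the
# fabric issues are being chased down:
MessageTimeout=120
```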
 
> Any ideas about what this might be?
Looks like a network issue (causing a filesystem issue). It could be that the system is just running out of free pages while caching writes for the filesystem. A good way to make sure this isn't happening is to add this command as a job epilog (Epilog in slurm.conf) and job prolog (Prolog), run as root:
> sync; echo 3 > /proc/sys/vm/drop_caches
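A minimal sketch of how that could be wired up: a tiny root-run script (the path `/etc/slurm/drop_caches.sh` is a hypothetical name) referenced from slurm.conf as both Prolog and Epilog. The write is guarded so the sketch degrades gracefully when not run as root:

```shell
#!/bin/sh
# Hypothetical /etc/slurm/drop_caches.sh, referenced in slurm.conf as:
#   Prolog=/etc/slurm/drop_caches.sh
#   Epilog=/etc/slurm/drop_caches.sh
# Flush dirty pages, then drop the page cache, dentries, and inodes so
# each job starts from a known memory state.
sync
if [ -w /proc/sys/vm/drop_caches ]; then
    echo 3 > /proc/sys/vm/drop_caches
else
    echo "drop_caches skipped: not running as root" >&2
fi
exit 0
```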

> Also when I put the cgroups config back
> anything else you would like me to try.
The cpuset.txt suggests this is not bug#9244, where no memory is given to a cgroup.

Also set "PropagateResourceLimits=NONE" during testing to make sure we aren't getting hit by some unexpected ulimit from the job submission shell.
Comment 15 Nate Rini 2020-09-09 15:51:07 MDT
Clay,

Any updates per the suggestions in comment #14?

Thanks,
--Nate
Comment 16 Clay England 2020-09-09 15:52:58 MDT
Sorry, I have been concentrating on working out issues on the IB side so we have a known-good place to work from.
We are currently tuning the ipoib but I will try and get a run with more debugging on the cgroups issue.

Clay


Comment 17 Clay England 2020-09-15 08:12:37 MDT
We have made good progress on getting the ipoib tuned to be usable.  I am currently rolling that tuning out to the entire cluster and hope to have some more data for you in the next couple of days.

Clay

Comment 19 Clay England 2020-09-24 08:50:50 MDT
Just to give another update, things are still in an odd state with the underlying IB network.  Native IB seems to be at the expected speed, but ipoib was really slow (20 to 30 Gb/s); we are not sure if this is causing the send/recv timeouts we were seeing.  Last week we were able to get the ipoib up to over 100 Gb/s, but this made things unstable.  In simple tests we still saw the send/recv errors, but that seemed to be a misconfiguration of the MPI build.  We are working hard to get the network working as we expect so we can address the Slurm timeouts (if they still exist) and the cgroup oom issue.  Thanks for your time.

Clay

Comment 20 Nate Rini 2020-09-24 09:07:51 MDT
(In reply to Clay England from comment #19)
> Just to give another update, things are still in an odd state with the under
> lying IB network.  Native IB seems to be the expected speed but ipoib was
> really slow 20 to 30 Gb/s not sure if this is causing the send/recv timeouts
> we were seeing.  Last week we were able to get the ipoib up to over 100 Gb/s
> but this made things unstable.  In simple tests we still saw the send/recv
> errors but that seemed to be a misconfiguration of the MPI build.  We are
> working hard to get the network working as we expect so we can address the
> slurm timeouts (if they still exist) and the cgroup oom issue.  Thanks for
> your time.

Thanks for the update.

I'll mark this ticket as timed out; it will reopen automatically on your next response.