Hello,

It looks like our Slurm started having a problem and generating the following messages after the version update to 20.02.4. It happens occasionally on some nodes when a job tries to use more memory than it was allocated.

[2020-09-03T15:37:32.749] [1182781.batch] _oom_event_monitor: oom-kill event count: 1182778
[2020-09-03T15:47:18.847] [1182781.batch] _oom_event_monitor: oom-kill event count: 11182778
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827780
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827781
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827782
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827783
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827784
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827785
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827786
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827787
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827788
[2020-09-03T15:47:57.442] [1182781.batch] _oom_event_monitor: oom-kill event count: 11827789
[2020-09-03T15:57:21.792] [1182781.batch] _oom_event_monitor: oom-kill event count: 21182778

The count keeps going up endlessly. An interesting thing is that job 1182781 in the log had already been cancelled by the user, because he found similar OOM-killer messages in his log file. I confirmed that the job and its processes have already disappeared from the compute node, but slurmd keeps trying to kill it(?) for some reason. This issue persists until the disk partition holding slurmd.log becomes full, or until slurmd is restarted.

Thanks,
Koji
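P.S. For context, the kind of check I mean is roughly the following (illustrative commands only; the actual slurmd.log path depends on the SlurmdLogFile setting):

$ squeue -j 1182781                        # job is no longer listed
$ ps -ef | grep 'slurmstepd: \[1182781'    # no matching slurmstepd processes remain
$ ls -lh /var/log/slurm/slurmd.log         # yet slurmd.log keeps growing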
Hi Koji!

Do you happen to have part of the slurmd log from when the job was canceled? If this is happening consistently on some specific nodes, I'd be curious whether you can capture the logs from around the time the job stops, with the log level at debug or higher. Would you also be able to attach your slurm.conf and cgroup.conf files?

Thanks!
--Tim
Hi Tim,

This problem happens on random nodes with a particular user (who runs ChainerRL jobs) for now. Here are the snippets from slurm.conf that I think are related to this issue:

##### start #####
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
TaskPluginParam=sched
JobAcctGatherType=jobacct_gather/linux
##### end #####

And here's our cgroup.conf:

##### START #####
ConstrainDevices=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
##### END #####

Could there be some kind of bug or incompatibility between jobacct_gather/linux and task/cgroup? I kind of hesitate to put the entire slurm.conf here, so if you'd like to check some other parameters in slurm.conf, please let me know.

Bests,
Koji
Hey Koji,

Sorry about the delayed reply on this! Would you also be able to provide the "PrologFlags" option (if you have it) from your slurm.conf? I've been trying to replicate the issue and have been having some difficulty, but realized that option might be part of it. It may also help indicate whether it's related to the other bugs or not.

Thanks!
--Tim
Hi Tim,

Yes, we have PrologFlags in slurm.conf, set to the following:

PrologFlags = Alloc,Contain,X11

I hope this helps you investigate it, and please let me know if you want to check anything else.

Bests,
Koji
Hi Koji,

Thank you for the additional information! I've been looking through this and I think there are a couple of possibilities, but the common theme is the OOM monitoring thread not exiting. Would you be able to run with "SlurmdDebug=debug" and, the next time this happens, upload the slurmd log for an impacted node? It's unclear why the thread isn't stopping, and the additional logging should help determine whether we get to the point where we try to stop the monitoring thread.

Can you also double-check that all the related slurmstepd processes have stopped? In the output of "ps" you may see them in any/all of the following forms:

slurmstepd: [$JOBID.extern]
slurmstepd: [$JOBID.batch]
slurmstepd: [$JOBID.$STEP]

Thanks,
--Tim
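P.S. As a concrete example of that ps check, something like the following should work (using the job ID from your earlier log; the [s] in the pattern just keeps grep from matching itself):

$ ps -eo pid,ppid,cmd | grep '[s]lurmstepd: \[1182781'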
Thanks, Tim, for your suggestions.

The issue happens on random nodes, and we have many nodes in the cluster, so I have to find a way to reproduce it. Once it becomes reproducible, I'll reserve some nodes and try it with SlurmdDebug=debug.

Also, here's an update from me: the problem happened with certain kinds of jobs on random nodes. I had been communicating with the users, and they eventually found a bug in the application they're using. After that, we stopped seeing the issue. I'll change the "Importance" to "4 - Minor Issue" for now.

Thanks again,
Koji
Thank you for the update! I'm glad fixing a bug in the application seems to have prevented the issue from surfacing.

Something else came up while chatting about this internally: are you spawning slurmd with systemd? And if so, is "Delegate=yes" in the slurmd unit file? If you are using systemd but without "Delegate=yes", it can result in some unexpected behavior in Slurm, and possibly something like this.

For the nodes that experienced issues, were there multiple steps running at the same time from the same job?

Thanks!
--Tim
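P.S. A quick way to check that, assuming the unit is named slurmd.service (adjust if your unit name differs):

$ systemctl cat slurmd.service | grep -i delegate
# should print a "Delegate=yes" line from the [Service] section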
Hi Tim,

We use systemd, and Delegate=yes is in place.

Another update: I could reproduce a similar problem with stress-ng <https://github.com/ColinIanKing/stress-ng> via srun. Here's what I do to try to use 12G while only requesting 4G:

$ srun --mem=4G --pty bash
$ stress-ng -m 1 --vm-bytes 12G

After a little while, slurmstepd starts trying to kill it, but cannot. Here are snippets of the slurmd.log (I restarted slurmd with SlurmdDebug=debug):

[2020-10-07T14:18:41.576] _run_prolog: prolog with lock for job 2263372 ran for 0 seconds
[2020-10-07T14:18:41.617] [2263372.extern] task/cgroup: /slurm/uid_2664/job_2263372: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-10-07T14:18:41.618] [2263372.extern] task/cgroup: /slurm/uid_2664/job_2263372/step_extern: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-10-07T14:18:42.060] launch task 2263372.0 request from UID:2664 GID:1276 HOST:10.145.10.24 PORT:38583
[2020-10-07T14:18:42.060] lllp_distribution jobid [2263372] implicit auto binding: cores,one_thread, dist 1
[2020-10-07T14:18:42.061] _lllp_generate_cpu_bind jobid [2263372]: mask_cpu,one_thread, 0x00000000000000000000000000000001
[2020-10-07T14:18:42.081] [2263372.0] task/cgroup: /slurm/uid_2664/job_2263372: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-10-07T14:18:42.081] [2263372.0] task/cgroup: /slurm/uid_2664/job_2263372/step_0: alloc=4096MB mem.limit=4096MB memsw.limit=unlimited
[2020-10-07T14:18:42.121] [2263372.0] in _window_manager
[2020-10-07T14:18:42.124] [2263372.0] task_p_pre_launch: Using sched_affinity for tasks
[2020-10-07T14:19:44.007] [2263372.0] _oom_event_monitor: oom-kill event count: 1
[2020-10-07T14:20:07.353] [2263372.0] _oom_event_monitor: oom-kill event count: 2
[2020-10-07T14:20:30.727] [2263372.0] _oom_event_monitor: oom-kill event count: 3
:
:
[2020-10-07T14:35:18.308] [2263372.0] _oom_event_monitor: oom-kill event count: 41
[2020-10-07T14:35:42.057] [2263372.0] _oom_event_monitor: oom-kill event count: 42
[2020-10-07T14:36:05.017] [2263372.0] Step 2263372.0 hit memory limit at least once during execution. This may or may not result in some failure.
[2020-10-07T14:36:05.018] [2263372.0] error: Detected 42 oom-kill event(s) in step 2263372.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2020-10-07T14:36:05.106] [2263372.0] done with job
[2020-10-07T14:36:05.107] [2263372.extern] done with job

If I exit from the interactive srun, the counting stops, which means it's not endless. However, this is still a major problem because the 4G limit (--mem=4G) in the example is apparently not working... I'll set the Importance to "2 - High Impact".

Bests,
Koji
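P.S. For reference, this is roughly how I enabled the extra logging for the test (the config line goes in slurm.conf on the test node, followed by a slurmd restart; the scontrol line is just a sanity check that the new level took effect):

SlurmdDebug=debug
$ scontrol show config | grep -i SlurmdDebug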
Tim,

Sorry, my assumption was wrong. I ran stress-ng while running "watch free" in another session, and I confirmed that the cgroup's memory limit is working okay: stress-ng cannot use more than 4G of RAM (while it's trying to use swap). It's just that slurmstepd is unable to kill the job. This issue doesn't impact the other jobs on the same compute node, so the Importance should go down to "4". Sorry...

Then I found the following:

$ cat /sys/fs/cgroup/memory/slurm/memory.oom_control
oom_kill_disable 0
under_oom 0
oom_kill 0

Is it just that I have to set oom_kill to 1 in some way?

Bests,
Koji
Hi Koji,

No worries, that's certainly a strange problem! Slurm actually relies on the system to kill the process that is trying to exceed the cgroup limit. Is there anything in dmesg about it attempting to kill the process?

It is a little counter-intuitive, but in memory.oom_control it's "oom_kill_disable 0" that says OOM killing is enabled in the cgroup (the "oom_kill" field is just a counter of kills that have happened there).

Also, just to be sure that Slurm is setting the steps up correctly, can you check a job with a memory limit (e.g. srun --mem=1G sleep 300) and look at "memory.limit_in_bytes", "memory.soft_limit_in_bytes" and "memory.oom_control" in the step's cgroup?

Thanks!
--Tim
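P.S. Concretely, the checks would look something like this (assuming the cgroup v1 memory hierarchy is mounted at /sys/fs/cgroup/memory; the uid/job path components below are taken from your earlier log and will differ for a new test job):

$ srun --mem=1G sleep 300 &
# then, on the compute node where the step is running:
$ cd /sys/fs/cgroup/memory/slurm/uid_2664/job_2263372/step_0
$ cat memory.limit_in_bytes memory.soft_limit_in_bytes
$ cat memory.oom_control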
Hi Koji,

I've been digging through this, and in both my normal testing and in replicating your results with stress-ng, I have found only one real situation in which this happens. It seems to occur only when a parent process forks a child that allocates a lot of memory and then re-tries that fork after the child is killed, or, very occasionally, with a fork() bomb. I was able to confirm that when a stress-ng worker gets killed, another one is created, so we keep seeing a process in the cgroup that is getting killed. Since stress-ng eventually terminates cleanly, the step does complete properly (eventually), though possibly with a lot of OOM events logged.

Only once did a fork() bomb get a node into a state where I canceled the job but it continued to log for a while, and that also resulted in the node being drained while parts of the program were still executing. I'm currently trying to determine whether there is something actually wrong in Slurm, or whether the oom-killer just isn't handling this well and causing some chaos (which it is wont to do sometimes...).

Just wanted to give you an update on where we are with this!

Thanks,
--Tim
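P.S. To illustrate the pattern I mean, here is a rough, hypothetical reproducer sketch (not stress-ng's actual internals; python3 is only used as a convenient way to allocate and touch a lot of memory): a parent shell loop keeps re-spawning a child that tries to allocate more than the step's cgroup allows, so the child keeps getting OOM-killed while the parent survives and the step keeps running.

$ srun --mem=4G bash -c '
    while true; do
      # each child tries to allocate ~12 GiB and is killed by the cgroup oom-killer;
      # the parent loop then immediately starts another child
      python3 -c "x = bytearray(12 * 1024**3)"
      sleep 1
    done'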
Hi Tim,

Sorry for the long silence on my side, and thanks a lot for the update! If you want me to try or check something on our cluster, please let me know.

Thanks again,
Koji
Hi Koji,

Are you able to share what application you were using when this was a problem? Or was this an application built internally?

Thanks!
--Tim
Hi Tim,

The jobs were running ChainerRL with MuJoCo. Also, stress-ng is something that can produce a similar problem to the one I described.

Thanks,
Koji
(In reply to Koji Tanaka from comment #21)
> Hi Tim,
>
> The jobs were running ChainerRL with MuJoCo. Also, stress-ng is something
> that can produce a similar problem to the one I described.
>
> Thanks,
> Koji

Thanks, Koji! I've mostly been using stress-ng as the test program as well. Do you know what sort of bug was found and fixed that seemed to stop this? I'm largely trying to identify whether it is indeed one behavior pattern that we are looking at here.

Thanks!
--Tim
Hi Tim,

I didn't hear the details, but the bug in the user's program was a memory leak, and Slurm cannot terminate the job when it happens. Also, Chainer seems to fork into multiple processes. So if you solve it for stress-ng's behavior pattern, I think it will solve the problem.

Thanks a lot,
Koji
Hi Koji,

I've done a lot of looking at and talking about this, and I don't think that the behavior we see with stress-ng is a bug. The logging of many OOM errors should be an indication that there is something strange going on with the job, but not necessarily that we should step in and kill it. In a couple of other bugs there are fixes in progress for how these events get reported, including the possibility of them being reported against the wrong job (which may explain the other part of the issue you saw). These patches are largely landing in 20.11.

Thanks!
--Tim
Since we don't believe the behavior from stress-ng is a bug, the possible reporting bugs are being fixed, and we can't reproduce the node getting stuck, I'm going to close this out for now. Please let us know if you have any other issues!

Thanks,
--Tim