Hi. Since Slurm version 17.11, some improvements have been made toward providing better feedback to users in out-of-memory conditions.

Example interactive job:

    $ srun --mem=200 mem_eater
    slurmstepd-compute1: error: Exceeded step memory limit at some point.
    srun: error: compute1: task 0: Out Of Memory

    $ sacct -j 20019
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    20019         mem_eater         p1      acct1          2 OUT_OF_ME+    0:125

    $ scontrol show job 20019 | grep OUT
       JobState=OUT_OF_MEMORY Reason=OutOfMemory Dependency=(null)

Example batch job:

    $ cat slurm-20020.out
    Killed
    slurmstepd-compute1: error: Exceeded step memory limit at some point.

    $ sacct -j 20020
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    20020              wrap         p1      acct1          2 OUT_OF_ME+    0:125
    20020.batch       batch                 acct1          2 OUT_OF_ME+    0:125

    $ scontrol show job 20020 | grep OUT
       JobState=OUT_OF_MEMORY Reason=OutOfMemory Dependency=(null)

Relevant commits:

    3a4479ffa79aed2d8a5aaa5f0fb38a87e77b3d1b  since slurm-17-11-0-0pre1
    2e9b5f69ea2c412e1eac4949e54f9d0b68d792ee  since slurm-17-11-0-0pre1
    9ee03613787c9b7d9c45dfc50a6e3c786f1c424d  since slurm-17-11-0-0pre1
    d218bcb488135f73c8514c3f4f083684639e6ae8  since slurm-17-11-0-0pre1
    3006137d327597abe78e6b42ce033ed8321c5f08  since slurm-17-11-0-0pre1
    2a75b72df7681e24aa6e7b8825ab6387392b6002  since slurm-17-11-0-0pre1
    818a09e802587e68abc6a5c06f0be2d4ecfe97e3  since slurm-17-02-0-1 (introduction of JOB_OOM)

Currently Slurm does not record 'dmesg' details anywhere. If you could elaborate on what details users expect to see, and where, when a job/step is oom-killed, we could study that and potentially proceed to reclassify this as a sev-5 enhancement request. Thanks.

There is also additional information logged to the node's slurmd.log, for example:

    [2017-12-14T14:51:42.257] [20021.0] task/cgroup: /slurm_compute1/uid_1000/job_20021: alloc=200MB mem.limit=200MB memsw.limit=200MB
    [2017-12-14T14:51:42.257] [20021.0] task/cgroup: /slurm_compute1/uid_1000/job_20021/step_0: alloc=200MB mem.limit=200MB memsw.limit=200MB
    [2017-12-14T14:51:42.332] [20021.0] task 0 (12214) exited. Killed by signal 9.
    [2017-12-14T14:51:42.332] [20021.0] error: Exceeded step memory limit at some point.

But again, the details logged by the kernel to the system log are not currently captured or recorded anywhere by Slurm.

Our local Cray engineer reports the following:

"There is still a mechanism to report this information back to workload managers. It is via the kernel job module info and the CONFIG_CRAY_ABORT_INFO option, which should be enabled by default. Alps still uses this method and reports the same error information today. Alps also provides a library, the alpscomm library, that makes that info available. I'm told Slurm may also be able to make use of the job_get_abort_info interface directly. My understanding is that Slurm does not currently support this, however, and that instead they recommend using the Slurm mem options you mentioned. It is unclear to me what it would take for Slurm to be able to support that."

Hi, would appending the output of job_get_abort_info() to the job's stderr be sufficient for your purposes? If so, we can give it a try. We suspect, though, that the output of the function would include more than OOM-related messages and might require some filtering.

That sounds like a good initial mechanism.
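For anyone who wants to correlate the slurmstepd message with the kernel's view while a step is still running, the memory cgroup created by task/cgroup can also be inspected directly on the compute node. A minimal sketch, assuming cgroup v1 mounted at /sys/fs/cgroup/memory and the hierarchy shown in the slurmd.log excerpt above (the actual path depends on the cluster/node name, uid and job/step id, so adjust accordingly):

    $ CG=/sys/fs/cgroup/memory/slurm_compute1/uid_1000/job_20021/step_0
    $ cat $CG/memory.limit_in_bytes      # limit enforced by task/cgroup (200MB in the example above)
    $ cat $CG/memory.max_usage_in_bytes  # high-water mark of the step
    $ cat $CG/memory.failcnt             # how many times the limit was hit
    $ cat $CG/memory.oom_control         # oom_kill_disable / under_oom (plus oom_kill on newer kernels)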
Steven, we've been discussing this internally and we would like to remind you that Slurm isn't ALPS; right now it simply works differently, and you can't expect it to behave exactly like the previous scheduler. In my comments above I explained the OOM information currently provided by Slurm. We're going ahead and reclassifying this as a Sev-5 enhancement request. If you have a set of issues, Cray-specific features, and/or want Slurm to behave differently, we can discuss that further directly by e-mail under a sponsorship context. I think this is a "make native Slurm work like ALPS" request, as in bug 2873 or bug 4006, which were also reclassified as Sev-5 requests. Please let us know what you think. Thank you.

I understand how and why this has been reclassified as an enhancement and do not disagree. However, this is not a request to "make Slurm be ALPS." The ALPS reference was meant to communicate two points only: 1) that this feature is directly relevant to a user-generated use case, and 2) that this appears to be possible with information available to a resource manager. It *is* a request that Slurm provide a mechanism to surface the relevant information to a user job, rather than only to an administrative log.

Somewhat related to this: could you ask Cray to pick up the patch that adds the oom_kill variable to the memory cgroup's memory.oom_control file in their kernel? https://patchwork.kernel.org/patch/9737381/ It looks like it was added just this last May.

    more /sys/fs/cgroup/memory/memory.oom_control
    oom_kill_disable 0
    under_oom 0
    oom_kill 0

Can you provide the related Cray bug number so we can link these together?

Our Cray case number is 200798. I will add Danny's cgroup/kernel post from here to it as soon as they get my new account configured.

The kernel change is available on kachina now:

    nid00047:~ # cat /sys/fs/cgroup/memory/memory.oom_control
    oom_kill_disable 0
    under_oom 0
    oom_kill 0

Awesome, thanks David.

Slurm version 17.11.3 adds additional information on an OOM condition in the following commit: https://github.com/SchedMD/slurm/commit/943c4a130f39dbb1fb

Each OOM killer event is logged to stderr in real time. In addition, the node and task information is also logged to stderr. This new logic works on all systems with cgroups, not just Crays. Here is an example of what that looks like:

    $ srun -n4 -N2 --mem=50 mem_eater
    slurmstepd: error: Detected 1 oom-kill event(s) in step 52.0 cgroup.
    slurmstepd: error: Detected 1 oom-kill event(s) in step 52.0 cgroup.
    srun: error: nid00017: tasks 2-3: Out Of Memory

After having worked with both this new Slurm logic (by Alejandro Sanchez) and Cray's job_get_abort_info() function, this seems to be the best solution. Please let me know if this solution is satisfactory.

What is the availability of a 17.02 backport of the patch provided by https://github.com/SchedMD/slurm/commit/943c4a130f39dbb1fb referenced in Comment 43?

Created attachment 6290 [details]
17.02 local patch

(In reply to S Senator from comment #44)
> What is the availability of a 17.02 backport of the patch provided by
> https://github.com/SchedMD/slurm/commit/943c4a130f39dbb1fb referenced in
> Comment 43?

We've tested the attached standalone patch, which applies cleanly on Slurm 17.02. It's a merge of 5 different commits. Please note that this patch will _NOT_ be included in any future release of 17.02, so you will need to manage it as a local patch. Please try it and let us know whether you need anything else from this bug. Thank you.
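For reference, a quick way to confirm whether a given node's kernel carries the patch referenced above (i.e., exposes the per-cgroup oom_kill counter) is a check along these lines; the path assumes cgroup v1 mounted at the usual location and is only a local sanity check, not something Slurm requires:

    $ grep -qs '^oom_kill ' /sys/fs/cgroup/memory/memory.oom_control \
        && echo "per-cgroup oom_kill counter available" \
        || echo "oom_kill counter not exposed by this kernel"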
Closing bug based upon the commits to Slurm version 17.11.3 and the patch provided for version 17.02. Feel free to re-open if you encounter further issues.

Created attachment 6377 [details]
test results-20180314
Attached some test results (test results-20180314) from some test runs. These tests run a memory exhaustion code first with hugepages loaded, then without. Each set does an srun with no vm-overcommit option specified (the default), with --vm-overcommit=enable, and then with --vm-overcommit=disable. The output captures the last 12 lines of the combined stdout and stderr. What we see now is that there are NO indications in ANY of the cases that an OOM situation occurred. This was tested on 17.02 with the provided patch. We are setting up 17.11.3 to test with it and will provide feedback once that is in place and tested.

The new logic only gets used if the task/cgroup plugin is configured to manage memory. That means your cgroup.conf file must contain ConstrainRAMSpace=yes and/or ConstrainSwapSpace=yes. Could you check whether your cgroup.conf file contains at least one of those? If not, would it be possible/practical for you to add it? If not, we might still be able to monitor for OOM, but that would require somewhat different logic. In any case, the TaskPlugin parameter in slurm.conf would need to include "task/cgroup". (A quick way to check both files is sketched after this exchange.)

Sorry for the absence on this and other topics, as we have been otherwise focused the past couple of weeks. We will hopefully be able to further test and assess in the next few days.

I will also take the potential use of Constrain{RAM|Swap}Space to the larger environments team for discussion.

We have a new baseline on which to test from Cray, as well as the 17.11 version in combination. Once we have that combo available for testing, I will report our findings.

(In reply to Joseph 'Joshi' Fullop from comment #54)

Do you have any update on this?

Created attachment 6612 [details]
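A quick sanity check along these lines can confirm both configuration prerequisites on a node. The file locations below are common defaults (/etc/slurm/) and may differ per site; the output shown is what a configuration with cgroup memory constraining enabled would look like:

    $ grep -i 'Constrain.*Space' /etc/slurm/cgroup.conf
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    $ grep -i '^TaskPlugin=' /etc/slurm/slurm.conf
    TaskPlugin=task/cgroup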
OOM fix test results 20180412
We were finally able to test the proposed fix in 17.11.5 on Cray's UP05. We found that we are able to have OOM messages generated in almost all scenarios (please see "OOM fix test results 20180412", attached). I iterated over various MaxRAMPercent values and was able to get the OOM messages to generate correctly all the way up to 98% in most cases. That level of memory usage is completely OK. However, the oddball case is when --vm-overcommit=enable and Fortran is used WITH hugepages: the OOM message stops being produced somewhere between 40 and 50% MaxRAMPercent. So far, the users have responded that the functionality is acceptable, with the oddball case as the exception. We expect that behavior is being caused by a different issue and we expect to discuss it at our joint meeting next week.

Huge pages are not counted in the memory cgroup, so they won't contribute to the cgroup memory limit.

Thanks for the updates. The cgroup hugetlb directory doesn't have the same oom_control mechanism as the memory cgroup (a quick look at what it does expose is included at the end of this thread). We'll need to investigate this more.

I have opened a new bug for the development of Slurm logic to report OOM events when using hugepages and am closing this bug. Cray's UP07 software release will provide infrastructure that should satisfy this requirement. Closing this bug as it relates to memory OOM (except for hugepages).

Moe,

I wanted to let you know that we have started seeing the OOM messages showing up now WITH hugepages. For whatever reason, we did not see them on our test machine gadget, but they do appear as desired on trinitite and trinity.

This info may or may not preclude you from wanting to handle hugepages in a more appropriate manner, but at least now it is not nearly as important an issue for us.

(In reply to Joseph 'Joshi' Fullop from comment #69)

Thank you for that information. I will try to determine what changed (not something in Slurm, but presumably in Cray's kernel).

(In reply to Moe Jette from comment #70)

From David Gloe (Cray):
The work to get hugepages in memory cgroups isn't done yet. I'm not sure why they would see the OOM messages when using hugepages on trinitite/trinity.
Thanks,
David
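As noted above, the hugetlb controller has no oom_control file, which is why hugepage allocations fall outside the OOM detection path. A quick way to see what it does expose on a cgroup v1 system (the mount point and the 2MB page size here are assumptions and vary by system):

    $ ls /sys/fs/cgroup/hugetlb/
    ... hugetlb.2MB.failcnt  hugetlb.2MB.limit_in_bytes  hugetlb.2MB.max_usage_in_bytes  hugetlb.2MB.usage_in_bytes ...
    $ ls /sys/fs/cgroup/hugetlb/ | grep -c oom_control   # no oom_control equivalent here
    0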
Created attachment 5741 [details]
sample job script to search logs & sources for relevant messages

Our users have started to exercise the variety of nodes on trinity with varying memory limits, with and without vm.overcommit enabled. However, they are not seeing the messages that they expected with the previous scheduler. The attached job script example is meant to dynamically create a prolog and epilog that search the node logs for those messages. We are continuing to refine this and are interested in understanding Slurm best practices to address it. The goal is that this output would be collected from the nodes which encountered a failure and provided back to the user without intervention by a privileged user. Improvements to this and Slurm best-practice guidance are appreciated.
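For anyone iterating on the same idea, a minimal sketch of the epilog side (not the attached script) might look like the following. It assumes it runs on each compute node with SLURM_JOB_ID in its environment and that a user-readable directory is available; OOM_REPORT_DIR and the grep pattern are made up for illustration, and reading the kernel log may require elevated privileges depending on kernel.dmesg_restrict:

    #!/bin/bash
    # Sketch: collect any oom-killer lines the kernel logged on this node and
    # drop them where the job owner can read them. Adjust paths per site.
    OOM_REPORT_DIR=${OOM_REPORT_DIR:-/tmp}
    dmesg -T 2>/dev/null \
        | grep -iE 'out of memory|oom-killer|killed process' \
        > "${OOM_REPORT_DIR}/oom-job${SLURM_JOB_ID:-unknown}-$(hostname).log" || true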