Hi. Since Slurm version 17.11, some improvements have been made toward providing better feedback to users in out-of-memory conditions.

Example interactive job:

    $ srun --mem=200 mem_eater
    slurmstepd-compute1: error: Exceeded step memory limit at some point.
    srun: error: compute1: task 0: Out Of Memory

    $ sacct -j 20019
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    20019         mem_eater         p1      acct1          2 OUT_OF_ME+    0:125

    $ scontrol show job 20019 | grep OUT
       JobState=OUT_OF_MEMORY Reason=OutOfMemory Dependency=(null)

Example batch job:

    $ cat slurm-20020.out
    Killed
    slurmstepd-compute1: error: Exceeded step memory limit at some point.

    $ sacct -j 20020
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    20020              wrap         p1      acct1          2 OUT_OF_ME+    0:125
    20020.batch       batch                 acct1          2 OUT_OF_ME+    0:125

    $ scontrol show job 20020 | grep OUT
       JobState=OUT_OF_MEMORY Reason=OutOfMemory Dependency=(null)

Relevant commits:

    3a4479ffa79aed2d8a5aaa5f0fb38a87e77b3d1b  since slurm-17-11-0-0pre1
    2e9b5f69ea2c412e1eac4949e54f9d0b68d792ee  since slurm-17-11-0-0pre1
    9ee03613787c9b7d9c45dfc50a6e3c786f1c424d  since slurm-17-11-0-0pre1
    d218bcb488135f73c8514c3f4f083684639e6ae8  since slurm-17-11-0-0pre1
    3006137d327597abe78e6b42ce033ed8321c5f08  since slurm-17-11-0-0pre1
    2a75b72df7681e24aa6e7b8825ab6387392b6002  since slurm-17-11-0-0pre1
    818a09e802587e68abc6a5c06f0be2d4ecfe97e3  since slurm-17-02-0-1 (introduction of JOB_OOM)

Currently Slurm does not record 'dmesg' details anywhere. If you could elaborate on what details users expect to see, and where, when a job/step is oom-killed, we could study that and potentially proceed to reclassify this as a sev-5 enhancement request. Thanks.

There is also additional information logged to the node's slurmd.log, for example:

    [2017-12-14T14:51:42.257] [20021.0] task/cgroup: /slurm_compute1/uid_1000/job_20021: alloc=200MB mem.limit=200MB memsw.limit=200MB
    [2017-12-14T14:51:42.257] [20021.0] task/cgroup: /slurm_compute1/uid_1000/job_20021/step_0: alloc=200MB mem.limit=200MB memsw.limit=200MB
    [2017-12-14T14:51:42.332] [20021.0] task 0 (12214) exited. Killed by signal 9.
    [2017-12-14T14:51:42.332] [20021.0] error: Exceeded step memory limit at some point.

But again, the details logged by the kernel to the system log are not currently captured or recorded anywhere by Slurm.

Our local Cray engineer reports the following:

"There is still a mechanism to report this information back to workload managers. It is via the kernel job module info and the CONFIG_CRAY_ABORT_INFO option, which should be enabled by default. Alps still uses this method and reports the same error information today. Alps also provides a library, the alpscomm library, that makes that info available. I'm told Slurm may also be able to make use of the job_get_abort_info interface directly. My understanding is that Slurm does not currently support this, however, and that instead they recommend using the Slurm mem options you mentioned. It is unclear to me what it would take for Slurm to be able to support that."

Hi, would appending the output of job_get_abort_info() to the job's stderr be sufficient for your purposes? If so, we can give it a try. We suspect, though, that the output of the function would include more than OOM-related messages and might require some filtering.

That sounds like a good initial mechanism.
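For anyone who wants to correlate the slurmstepd message with the kernel's view while a step is still running, the memory cgroup created by task/cgroup can also be inspected directly on the compute node. A minimal sketch, assuming cgroup v1 mounted at /sys/fs/cgroup/memory and the hierarchy shown in the slurmd.log excerpt above (the actual path depends on the cluster/node name, uid and job/step id, so adjust accordingly):

    $ CG=/sys/fs/cgroup/memory/slurm_compute1/uid_1000/job_20021/step_0
    $ cat $CG/memory.limit_in_bytes      # limit enforced by task/cgroup (200MB in the example above)
    $ cat $CG/memory.max_usage_in_bytes  # high-water mark of the step
    $ cat $CG/memory.failcnt             # how many times the limit was hit
    $ cat $CG/memory.oom_control         # oom_kill_disable / under_oom (plus oom_kill on newer kernels)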
Steven, we've been discussing this internally and we would like to remind you that Slurm isn't ALPS; right now it simply works differently, and you can't expect it to behave exactly like the previous scheduler. In my comments above I explained the OOM information currently provided by Slurm. We're going ahead and reclassifying this as a Sev-5 enhancement request. If you have a set of issues, Cray-specific features, and/or want Slurm to behave differently, we can discuss that further directly by e-mail under a sponsorship context. I think this is a "make native Slurm work like ALPS" request, as in bug 2873 or bug 4006, which were also reclassified as Sev-5 requests. Please let us know what you think. Thank you.

I understand how and why this has been reclassified as an enhancement and do not disagree. However, this is not a request to "make Slurm be ALPS." The ALPS reference was meant to communicate two points only: 1) that this feature is directly relevant to a user-generated use case, and 2) that this appears to be possible with information available to a resource manager. It *is* a request that Slurm provide a mechanism to surface the relevant information to a user job, rather than only to an administrative log.

Somewhat related to this: could you ask Cray to pick up the patch that adds the oom_kill variable to the memory cgroup's memory.oom_control file in their kernel? https://patchwork.kernel.org/patch/9737381/ It looks like it was added just this last May.

    more /sys/fs/cgroup/memory/memory.oom_control
    oom_kill_disable 0
    under_oom 0
    oom_kill 0

Can you provide the related Cray bug number so we can link these together?

Our Cray case number is 200798. I will add Danny's cgroup/kernel post from here to it as soon as they get my new account configured.

The kernel change is available on kachina now:

    nid00047:~ # cat /sys/fs/cgroup/memory/memory.oom_control
    oom_kill_disable 0
    under_oom 0
    oom_kill 0

Awesome, thanks David.

Slurm version 17.11.3 adds additional information on an OOM condition in the following commit: https://github.com/SchedMD/slurm/commit/943c4a130f39dbb1fb

Each OOM killer event is logged to stderr in real time. In addition, the node and task information is also logged to stderr. This new logic works on all systems with cgroups, not just Crays. Here is an example of what that looks like:

    $ srun -n4 -N2 --mem=50 mem_eater
    slurmstepd: error: Detected 1 oom-kill event(s) in step 52.0 cgroup.
    slurmstepd: error: Detected 1 oom-kill event(s) in step 52.0 cgroup.
    srun: error: nid00017: tasks 2-3: Out Of Memory

After having worked with both this new Slurm logic (by Alejandro Sanchez) and Cray's job_get_abort_info() function, this seems to be the best solution. Please let me know if this solution is satisfactory.

What is the availability of a 17.02 backport of the patch provided by https://github.com/SchedMD/slurm/commit/943c4a130f39dbb1fb referenced in Comment 43?

Created attachment 6290 [details]
17.02 local patch

(In reply to S Senator from comment #44)
> What is the availability of a 17.02 backport of the patch provided by
> https://github.com/SchedMD/slurm/commit/943c4a130f39dbb1fb referenced in
> Comment 43?

We've tested the attached standalone patch, which applies cleanly on Slurm 17.02. It's a merge of 5 different commits. Please note that this patch will _NOT_ be included in any future release of 17.02, so you will need to manage it as a local patch. Please try it and let us know whether you need anything else from this bug. Thank you.
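For reference, a quick way to confirm whether a given node's kernel carries the patch referenced above (i.e., exposes the per-cgroup oom_kill counter) is a check along these lines; the path assumes cgroup v1 mounted at the usual location and is only a local sanity check, not something Slurm requires:

    $ grep -qs '^oom_kill ' /sys/fs/cgroup/memory/memory.oom_control \
        && echo "per-cgroup oom_kill counter available" \
        || echo "oom_kill counter not exposed by this kernel"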
Closing bug based upon the commits to Slurm version 17.11.3 and the patch provided for version 17.02. Feel free to re-open if you encounter further issues.

Created attachment 6377 [details]
test results-20180314
Attached some test results (test results-20180314) from some test runs. These tests run a memory exhaustion code first with hugepages loaded, then without. Each set does an srun with no vm-overcommit option specified (the default), with --vm-overcommit=enable, and then with --vm-overcommit=disable. The output captures the last 12 lines of the combined stdout and stderr. What we see now is that there are NO indications in ANY of the cases that an OOM situation occurred. This was tested on 17.02 with the provided patch. We are setting up 17.11.3 to test with it and will provide feedback once that is in place and tested.

The new logic only gets used if the task/cgroup plugin is configured to manage memory. That means your cgroup.conf file must contain ConstrainRAMSpace=yes and/or ConstrainSwapSpace=yes. Could you check whether your cgroup.conf file contains at least one of those? If not, would it be possible/practical for you to add it? If not, we might still be able to monitor for OOM, but that would require somewhat different logic. In any case, the TaskPlugin parameter in slurm.conf would need to include "task/cgroup". (A quick way to check both files is sketched after this exchange.)

Sorry for the absence on this and other topics, as we have been otherwise focused the past couple of weeks. We will hopefully be able to further test and assess in the next few days.

I will also take the potential use of Constrain{RAM|Swap}Space to the larger environments team for discussion.

We have a new baseline on which to test from Cray, as well as the 17.11 version in combination. Once we have that combo available for testing, I will report our findings.

(In reply to Joseph 'Joshi' Fullop from comment #54)

Do you have any update on this?

Created attachment 6612 [details]
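A quick sanity check along these lines can confirm both configuration prerequisites on a node. The file locations below are common defaults (/etc/slurm/) and may differ per site; the output shown is what a configuration with cgroup memory constraining enabled would look like:

    $ grep -i 'Constrain.*Space' /etc/slurm/cgroup.conf
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    $ grep -i '^TaskPlugin=' /etc/slurm/slurm.conf
    TaskPlugin=task/cgroup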
OOM fix test results 20180412
We were finally able to test the proposed fix in 17.11.5 on Cray's UP05. We found that we are able to have OOM messages generated in almost all scenarios (please see "OOM fix test results 20180412", attached). I iterated over various MaxRAMPercent values and was able to get the OOM messages to generate correctly all the way up to 98% in most cases. That level of memory usage is completely OK. However, the oddball case is when --vm-overcommit=enable and Fortran is used WITH hugepages: the OOM message stops being produced somewhere between 40 and 50% MaxRAMPercent. So far, the users have responded that the functionality is acceptable, with the oddball case as the exception. We expect that behavior is being caused by a different issue and we expect to discuss it at our joint meeting next week.

Huge pages are not counted in the memory cgroup, so they won't contribute to the cgroup memory limit.

Thanks for the updates. The cgroup hugetlb directory doesn't have the same oom_control mechanism as the memory cgroup (a quick look at what it does expose is included at the end of this thread). We'll need to investigate this more.

I have opened a new bug for the development of Slurm logic to report OOM events when using hugepages and am closing this bug. Cray's UP07 software release will provide infrastructure that should satisfy this requirement. Closing this bug as it relates to memory OOM (except for hugepages).

Moe,

I wanted to let you know that we have started seeing the OOM messages showing up now WITH hugepages. For whatever reason, we did not see them on our test machine gadget, but they do appear as desired on trinitite and trinity.

This info may or may not preclude you from wanting to handle hugepages in a more appropriate manner, but at least now it is not nearly as important an issue for us.

(In reply to Joseph 'Joshi' Fullop from comment #69)

Thank you for that information. I will try to determine what changed (not something in Slurm, but presumably in Cray's kernel).

(In reply to Moe Jette from comment #70)

From David Gloe (Cray):
The work to get hugepages in memory cgroups isn't done yet. I'm not sure why they would see the OOM messages when using hugepages on trinitite/trinity.
Thanks,
David
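As noted above, the hugetlb controller has no oom_control file, which is why hugepage allocations fall outside the OOM detection path. A quick way to see what it does expose on a cgroup v1 system (the mount point and the 2MB page size here are assumptions and vary by system):

    $ ls /sys/fs/cgroup/hugetlb/
    ... hugetlb.2MB.failcnt  hugetlb.2MB.limit_in_bytes  hugetlb.2MB.max_usage_in_bytes  hugetlb.2MB.usage_in_bytes ...
    $ ls /sys/fs/cgroup/hugetlb/ | grep -c oom_control   # no oom_control equivalent here
    0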
Created attachment 5741 [details]
sample job script to search logs & sources for relevant messages

Our users have started to exercise the variety of nodes on trinity with varying memory limits, with and without vm.overcommit enabled. However, they are not seeing the messages that they expected with the previous scheduler. The attached job script example is meant to dynamically create a prolog and epilog that search the node logs for those messages. We are continuing to refine this and are interested in understanding Slurm best practices to address it. The goal is that this output would be collected from the nodes which encountered a failure and provided back to the user without intervention by a privileged user. Improvements to this and Slurm best-practice guidance are appreciated.
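For anyone iterating on the same idea, a minimal sketch of the epilog side (not the attached script) might look like the following. It assumes it runs on each compute node with SLURM_JOB_ID in its environment and that a user-readable directory is available; OOM_REPORT_DIR and the grep pattern are made up for illustration, and reading the kernel log may require elevated privileges depending on kernel.dmesg_restrict:

    #!/bin/bash
    # Sketch: collect any oom-killer lines the kernel logged on this node and
    # drop them where the job owner can read them. Adjust paths per site.
    OOM_REPORT_DIR=${OOM_REPORT_DIR:-/tmp}
    dmesg -T 2>/dev/null \
        | grep -iE 'out of memory|oom-killer|killed process' \
        > "${OOM_REPORT_DIR}/oom-job${SLURM_JOB_ID:-unknown}-$(hostname).log" || true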