Summary: | Exceeded step memory limit at some point | ||
---|---|---|---|
Product: | Slurm | Reporter: | Jenny Williams <jennyw> |
Component: | Configuration | Assignee: | Alejandro Sanchez <alex> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | david.matthews, felip.moll, kaizaad, kilian, ryan_cox, tim, twhitehead |
Version: | 17.02.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | University of North Carolina at Chapel Hill | |
Version Fixed: | 17.11.3 | Target Release: | --- |
Ticket Blocks: | 4625 | ||
Attachments:
- scontrol show config |sort and cgroups.conf
- sacct listing for bwa example
- sacct listing for phylogen_tree example
- Revert to pre-17.02 behavior
- Page cache growth
Description
Jenny Williams, 2017-05-17 16:59:16 MDT
There are two different counts that can lead to that, and I'm not sure which one you're seeing: either memory.memsw.failcnt or memory.failcnt (or both) being non-zero will lead to this warning. While I can give you a local patch to suppress that warning, I don't think it's spurious and wouldn't recommend disabling it throughout your system, although it does sound like the application manages to recover from bouncing off one of those two limit values. I'd be really curious what causes that momentary spike in VDATA. Are there any details about the application you can share that may shed some light?

Created attachment 4594 [details]
sacct listing for bwa example
Created attachment 4595 [details]
sacct listing for phylogen_tree example
Comment on attachment 4594 [details]
sacct listing for bwa example
48G runs see "Memory exceeded at some point", but 120G jobs do not.
Comment on attachment 4595 [details]
sacct listing for phylogen_tree example
Uses almost no RSS, but required 220G to run without the "Memory exceeded" error.
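For reference, listings like the attached ones can be produced with sacct; the format string below is only a plausible example, not necessarily the one used to generate the attachments:

# Requested vs. observed memory for a job and its steps (job ID is a placeholder)
sacct -j <jobid> --format=JobID,JobName,ReqMem,MaxRSS,MaxVMSize,State,ExitCode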
What is consistent amongst the dozen or so users reporting the error is that (a) it is generated when using NFS but not when using GPFS, and (b) the I/O is either for very large files or for very many small files. Typically the field is bioinformatics. While we could test against GPFS, that is not a feasible solution in general given our setup.

Created attachment 4629 [details]
Revert to pre-17.02 behavior

I'm attaching a trivial patch which should prevent this while I work to understand the underlying cause. It does appear that something in the Linux I/O model is counting some of the page cache against the job; one other customer is reporting this problem as well (bug 3831). I'm working on reproducing this and seeing if there's some other cgroup stat we could switch to.

We can reliably reproduce the issue if that helps at all. Thank you for the update and patch. - Jenny

*** Ticket 3831 has been marked as a duplicate of this ticket. ***

This is still, unfortunately, in progress on my end. We're looking into a number of fixes to the cgroup code before 17.11 is released, although it's unclear to me if there's an alternative way of gathering the memory usage data that would avoid this issue. In the meantime, that patch should continue to work throughout the 17.02 release. I do not intend to disable that diagnostic message at this time. - Tim

Revisiting this based on some further discussion yesterday. Right now, disabling the cgroup memory enforcement and falling back to only using the periodic memory limit enforcement through jobacct_gather/linux may work around this. Within the cgroup implementation there is now a Kmem limit field that may provide a way to tune the amount of reclaimable kernel page cache associated with a given job. I'm looking into how that interoperates - the kernel documentation is a little vague on how it interacts with the other limits. If you have an application that readily triggers this behavior on NFSv3 mounts, I would love to have that on hand when testing. - Tim

(In reply to Tim Wickberg from comment #12)
> Right now, disabling the cgroup memory enforcement, and falling back to only using the periodic memory limit enforcement through jobacct_gather/linux may work around this.

What's wrong with reverting to the pre-17.02 behaviour (i.e. applying the patch)? The new behaviour is clearly wrong and I certainly wouldn't want to disable the cgroup memory enforcement.

> If you have an application that readily triggers this behaviour on NFSv3 mounts I would love to have that on hand when testing.

I can trigger this problem using a simple dd command, e.g.

dd if=/dev/zero of=/path/to/test/file bs=1MB count=1k conv=fdatasync

I get the same problem writing to local storage or NFSv3.

I've still been playing with this, but don't have a good overall solution yet. I'm asking one of our other engineers to dig into this in further detail and take a look at what the actual underlying behavior is in the Linux cgroup code.

Can you explain why you are not prepared to revert to the previous behaviour? Currently Slurm will kill off a job which completed successfully because it, incorrectly, thinks it ran out of memory. How is this better than the previous behaviour?

This thread appears to give the key points about the cgroup behaviour: https://groups.google.com/forum/#!msg/slurm-users/f359KT-3lsU/yRm8vkxeBwAJ (I included other links in bug 3831). You cannot use memory.failcnt to determine if oom_killer has been activated.
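To illustrate that last point: the counter Slurm checks (memory.failcnt) and the actual OOM-killer state live in different cgroup v1 files. A minimal sketch for inspecting both for a running step on the compute node; the path layout matches the one shown later in this ticket, and the node name, uid, job and step IDs are placeholders:

# cgroup v1 memory controller; adjust the placeholders for your node/uid/job/step
cg=/sys/fs/cgroup/memory/slurm_<nodename>/uid_<uid>/job_<jobid>/step_<stepid>

cat "$cg/memory.failcnt"      # times usage hit memory.limit_in_bytes; reclaim alone can raise this
cat "$cg/memory.oom_control"  # shows oom_kill_disable and under_oom; newer kernels also report an oom_kill count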
Created attachment 5795 [details]
Page cache growth

This is a personal interpretation of what I think is happening (so I might be wrong), but the more tests I do, the more convinced I am of this reading. First, I want to recall a few relevant things that might be slightly off-topic but are worth mentioning:

1. There are two mechanisms in Slurm to enforce job memory limits: the first interprets the memory samples collected by the JobAcctGatherType plugin, and the second uses the TaskPlugin (with at least 'task/cgroup' configured and ConstrainRAMSpace=yes in cgroup.conf). We recommend enforcing memory only with the TaskPlugin mechanism and using the JobAcctGatherType plugin solely to collect job stats for accounting, not for memory enforcement. To accomplish that, I'd recommend adding 'NoOverMemoryKill' to JobAcctGatherParams (a sketch of the relevant settings follows the dd examples below). One disadvantage of using the JobAcctGather plugin to enforce memory is that peaks of memory between samples may not be caught, depending on the sampling frequency; the TaskPlugin works more like an 'interrupt' than a 'poll' mechanism.

2. For accounting purposes, it is better to use jobacct_gather/linux instead of jobacct_gather/cgroup. Both should produce the same results, but cgroup is slower.

After this preamble, back on topic. I can reproduce what you mention, using the 'dd' command to write zeros to a local disk (no need for NFS to reproduce, as you said). What I don't agree with, and this can be discussed further, is that the behavior is incorrect. I think it is more a question of what the kernel accounts as memory once memory.limit_in_bytes is set.

If I submit the following job request:

alex@ibiza:~/t$ srun --mem=990M dd if=/dev/zero of=/home/alex/t/file bs=1MB count=1k conv=fdatasync
1024+0 records in
1024+0 records out
1024000000 bytes (1.0 GB, 977 MiB) copied, 1.33275 s, 768 MB/s
alex@ibiza:~/t$

I don't see any memory being enforced. Note I request --mem=990M, and the 'dd' command reports copying 977M. If instead I submit a slightly different dd request, now requesting --mem=979M:

alex@ibiza:~/t$ srun --mem=979M dd if=/dev/zero of=/home/alex/t/file bs=1MB count=1k conv=fdatasync
1024+0 records in
1024+0 records out
1024000000 bytes (1.0 GB, 977 MiB) copied, 1.27991 s, 800 MB/s
slurmstepd-compute1: error: Exceeded step memory limit at some point.
srun: error: compute1: task 0: Out Of Memory
alex@ibiza:~/t$

I get the Out Of Memory.
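Picking up the recommendation in point 1 above, here is a minimal sketch of the relevant settings; the exact TaskPlugin list and the swap constraint are assumptions about a typical setup, not this site's actual configuration:

# slurm.conf (fragment)
TaskPlugin=task/affinity,task/cgroup     # task/cgroup applies the cgroup memory limit
JobAcctGatherType=jobacct_gather/linux   # gather stats for accounting only
JobAcctGatherParams=NoOverMemoryKill     # don't kill on sampled usage; leave enforcement to cgroups

# cgroup.conf (fragment)
ConstrainRAMSpace=yes                    # sets memory.limit_in_bytes for the job/step
ConstrainSwapSpace=yes                   # optional: also sets memory.memsw.limit_in_bytes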
Back to the dd test: to see what the cgroup is actually accounting, right after submitting the srun I executed the following monitoring script:

alex@ibiza:~/t$ cat monitor_memory.sh
#!/bin/bash
# Dump the step's memory.stat every 0.2s for as long as the step cgroup exists
stats_file=/sys/fs/cgroup/memory/slurm_compute1/uid_1000/job_$1/step_0/memory.stat
while [ -e "$stats_file" ]; do
    echo -e "\n"
    cat $stats_file
    sleep 0.2
done
alex@ibiza:~/t$
alex@ibiza:~/t$ ./monitor_memory.sh 20060 > /home/alex/t/mem.stats.txt

Inspecting mem.stats.txt (see it attached), you can see that our requested --mem=979M is properly set:

hierarchical_memory_limit 1026555904    // 1026555904 / 1024 / 1024 = 979M = --mem=979M
hierarchical_memsw_limit 1026555904

and that the total_rss used by 'dd' stays constant at 1101824 bytes (/1024/1024 = ~1MB):

alex@ibiza:~/t$ grep -w "total_rss" mem.stats.txt
total_rss 1101824
total_rss 1101824
total_rss 1101824
total_rss 1101824
total_rss 1101824
total_rss 1101824
alex@ibiza:~/t$

BUT I see rapid growth in the total_cache and total_dirty stats:

alex@ibiza:~/t$ egrep "total_cache|total_dirty" mem.stats.txt
total_cache 203399168
total_dirty 203399168
total_cache 675606528
total_dirty 600109056
total_cache 986255360    <- around here memory.limit_in_bytes is reached
total_dirty 676917248
total_cache 1022705664
total_dirty 503906304
total_cache 1022742528
total_dirty 309915648
total_cache 1022754816
total_dirty 84475904
alex@ibiza:~/t$

Short answer: the Page Cache is accounted, so even though dd's RSS footprint stays at barely 1MB, you can rapidly get into an Out Of Memory situation depending on your application. If you look at https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt (2.2.1 Accounting details), you can read: "All mapped anon pages (RSS) and cache pages (Page Cache) are accounted." And if you look at https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-memory you can read: "memory.limit_in_bytes sets the maximum amount of user memory (including file cache)."

Long answer, drawing on the ideas in this nice answer: https://serverfault.com/questions/516074/why-are-applications-in-a-memory-limited-lxc-container-writing-large-files-to-di

-----------------------------------------------------
The problem that you're hitting here has to do with how the kernel manages I/O. When you make a write to your filesystem, that write isn't immediately committed to disk; that would be incredibly inefficient. Instead, writes are cached in an area of memory referred to as the page cache, and periodically written in chunks out to disk. The "dirty" section of your log describes the size of this page cache that hasn't been written out to disk yet:

> So what empties this dirty cache? Why isn't it doing its job?

'Flush' on Linux is responsible for writing dirty pages out to disk. It's a daemon that wakes up periodically to determine if writes to disk are required, and, if so, performs them. If you are a C type of guy, start here. Flush is incredibly efficient; it does a great job of flushing stuff to disk when needed. And it's working exactly how it is supposed to.

... A program starts to write a big file, it buffers... buffers... and eventually hits its hard limit, and starts calling the OOM manager. This isn't a failure of any particular component; it's expected behavior. Kind of. This sort of thing should be handled by cgroups, but it doesn't seem like it is.

> Ok, that makes sense. But it's useless. I still need to write a big file.
There are a few options here for things you can try to tweak:

/proc/sys/vm/dirty_expire_centiseconds
This controls how long a page can be held in the dirty cache before being written to disk. By default it's 30 seconds; try setting it lower to start pushing writes out faster.

/proc/sys/vm/dirty_background_ratio
This controls what percentage of active memory flush is allowed to fill up before it starts forcing writes. There is a bit of fiddling that goes into sorting out the exact total here, but the easiest explanation is to just look at your total memory. By default it's 10% (on some distros it's 5%). Set this lower; it'll force writes out to disk sooner.

> Can't I just screw with the filesystem a bit?

Well, yeah. But make sure you test this out... you could affect performance. On the mounts in /etc/fstab where you'll be writing, add the 'sync' mount option.

> Original answer:

Try reducing the blocksize used by dd:

dd if=/dev/zero of=test2 bs=512 count=1024000

You can only write one sector at a time (512 bytes on older HDDs, 4096 on newer). If dd is pushing writes to disk faster than the disk can accept them, it will start caching the writes in memory. That's why your file cache is growing.
-----------------------------------------------------

... So the Page Cache contributes to memory.limit_in_bytes; that's how the kernel works, and for now I don't think it is Slurm's fault. We can discuss how to mitigate this, though, and potentially reclassify as sev-5 so that new config options can be added to change these /proc/sys/vm/ parameters from Slurm; otherwise I am open to proposals. Hope this is helpful, and let me know if I am not on the right path with my thinking.
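A minimal sketch of how one might inspect and lower the writeback tunables quoted above on a compute node; the values are illustrative only, and note that the kernel spells the first parameter dirty_expire_centisecs:

# Current settings
cat /proc/sys/vm/dirty_expire_centisecs     # age, in centiseconds, before dirty pages must be flushed
cat /proc/sys/vm/dirty_background_ratio     # % of memory that may be dirty before background writeback starts

# Lower them temporarily (as root); persist via /etc/sysctl.d/ only if the change actually helps
sysctl -w vm.dirty_expire_centisecs=1000
sysctl -w vm.dirty_background_ratio=5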
(In reply to David Matthews from comment #16)
> Can you explain why you are not prepared to revert to the previous behaviour? [...]
> You cannot use memory.failcnt to determine if oom_killer has been activated.

I agree, and I think this is the gist of the problem we are facing: using memory.failcnt to determine an OOM condition is incorrect, especially if a program processes large files and "bounces up" against the cgroup memory limit _without_ being OOM'd, because the kernel is able to reclaim its page cache. This change seems to have been introduced here: https://bugs.schedmd.com/show_bug.cgi?id=1963#c2

thanks
-k

I am out of the office until Jan 2nd, 2018. If you have questions that require a reply please email research@unc.edu. Regards, Virginia Williams, Systems Administrator, Research Computing, UNC Chapel Hill

Ok, I see your point now. There are a few points I agree could be improved to provide better feedback on Out of Memory conditions:

1. Since commit https://github.com/SchedMD/slurm/commit/707268a521ea9beb85, a failcnt > 0 is logged as an error() instead of an info().
2. Since commit https://github.com/SchedMD/slurm/commit/818a09e802587e68ab, a failcnt > 0 causes the step/job to be recorded as OutOfMemory.

According to the kernel memory documentation[1], "failcnt shows the number of times that a usage counter hit its limit. When a memory cgroup hits a limit, failcnt increases and memory under it will be reclaimed."

So yes, I agree that failcnt > 0 doesn't necessarily imply that the OOM killer actually killed the process. Under out-of-memory conditions the kernel may reclaim unused pages, and that can resolve the situation without any kill being involved. I think the best approach would be to reclassify this as sev-5, targeted for 18.08, and make some changes: if failcnt > 0, don't consider it an error, but simply log that usage hit the limit failcnt times. Perhaps also report the disaggregated memory.stat values back, so that users can discern how much RSS, Page Cache and the other factors contribute to the usage (that could be optional). Then implement a mechanism using memory.oom_control to register notifications, which should be more reliable than the failcnt value - perhaps something similar to Ryan Cox's oom notifierd daemon.

[1] https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

Hi Jenny. I've committed the following patch to 17.11.2+: https://github.com/SchedMD/slurm/commit/38e15bf74b14e8cbfc8b09305a which will clarify the messages reported under out-of-memory conditions. Basically, this will tell users that failcnt > 0 does not necessarily mean a problem happened, just that the memory[+swap] limit was hit at some point during execution. With this commit I am going ahead and closing this bug, since that was the original concern from the user community. I've also opened a new Sev-5 bug 4578 with some ideas for providing better and more detailed feedback to users/admins under OOM conditions: catching when the oom-killer actually kills a process, potentially appending the kernel syslog info when a process is killed, and reporting the disaggregated memory.stat info when these events happen.

(In reply to Alejandro Sanchez from comment #23)
> With this commit I am going ahead and close this bug since that was the original concern with the user community.

Our issue is not just the message - it's the fact that jobs are failing which should have succeeded. Please see bug 3831, which was closed as a duplicate of this bug. This is certainly not a sev 5 issue for us. I suggest 3831 should be re-opened.

(In reply to David Matthews from comment #24)
> Our issue is not just the message - it's the fact that jobs are failing which should have succeeded. [...] I suggest 3831 should be re-opened.

What is wrong with the updated message?

alex@ibiza:~/t$ srun --mem=900M dd if=/dev/zero of=/home/alex/t/file bs=1MB count=1k conv=fdatasync
1024+0 records in
1024+0 records out
1024000000 bytes (1.0 GB, 977 MiB) copied, 3.40383 s, 301 MB/s
slurmstepd-compute1: error: Step 20002.0 hit memory+swap limit at least once during execution. This may or may not result in some failure.
srun: error: compute1: task 0: Out Of Memory
alex@ibiza:~/t$ sacct -j 20002
JobID        JobName    Partition  Account    AllocCPUS  State      ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
20002        dd         p1         acct1      2          OUT_OF_ME+ 0:125
alex@ibiza:~/t$

It clearly states that the usage hit the limit at some point (which is totally true), but it also states that this may or may _not_ result in some failure (which is also true).
It doesn't state that this is 100% an error; it depends on whether the kernel succeeded in reclaiming unused pages so that the OOM-killer didn't actually trigger. Note that the sacct State OUT_OF_MEMORY doesn't mean the job failed, just that the usage hit the limit at some point during execution, and note also that the ExitCode is 0 (success). So I don't understand why you state that the jobs are failing.

I know the feedback could be improved as per the comments in bug 4578, but right now what the message states and what sacct reports in the State and ExitCode fields doesn't seem incorrect to me. Anyhow, I am open to suggestions.

(In reply to Alejandro Sanchez from comment #25)
> Note that the sacct State OUT_OF_MEMORY doesn't mean the job failed [...] Anyhow, I am open to suggestions.

Please refer to bug 3831. The problem I reported results in jobs in the FAILED state. This has not been addressed as far as I can see.

I've read bug 3831. The problem with current 17.11.2 is that you would like these messages:

slurmstepd-compute1: error: Step 20002.0 hit memory+swap limit at least once during execution. This may or may not result in some failure.
srun: error: compute1: task 0: Out Of Memory

to continue being presented in the user output, but instead of being prepended with the word 'error', just shown as an info message. Is that correct?

(In reply to David Matthews from comment #26)
> Please refer to bug 3831. The problem I reported results in jobs in the FAILED state. This has not been addressed as far as I can see.

With 17.11.2 I don't see the job recorded as FAILED, but instead as OUT_OF_MEMORY with ExitCode 0:125, as I showed in my previous comment.

(In reply to Alejandro Sanchez from comment #27)
> I've read bug 3831. The problem with current 17.11.2 is that you would like these messages [...] just shown as an info message. Is that correct?

I would prefer them to be an info message, yes (because they don't necessarily indicate an error and users find this confusing). However, that is not my main concern.

> With 17.11.2 I don't see the job recorded as FAILED, but instead as OUT_OF_MEMORY with ExitCode 0:125, as I showed in my previous comment.

I'm not currently in a position to test 17.11. With 17.02 the situation remains unchanged. For the moment I'll continue with 17.02 (patched) and will look at this again when we upgrade to 17.11.

(In reply to David Matthews from comment #28)
> I would prefer them to be an info message, yes (because they don't necessarily indicate an error and users find this confusing). However, that is not my main concern.

There is one problem, though. If we change these messages from error() to info(), they will not be displayed by default in the output of srun. From the srun man page for the --slurmd-debug option [1]:
"The slurmd debug information is copied onto the stderr of the job. By default only errors are displayed. This option applies to job and step allocations."

[1] https://slurm.schedmd.com/srun.html#OPT_slurmd-debug

Since most users won't even know about this option, I am more inclined to leave the messages as error()s and instruct the user community so they know that the message doesn't necessarily indicate an application failure - it may or may not. What do you think?

> > With 17.11.2 I don't see the job recorded as FAILED, but instead as OUT_OF_MEMORY with ExitCode 0:125, as I showed in my previous comment.
> I'm not currently in a position to test 17.11. With 17.02 the situation remains unchanged.

Ok, I still don't understand where you see the job recorded as FAILED and/or in which version. In 17.11.2 it is OUT_OF_MEMORY with ExitCode 0:125.

I think I can come up with an intermediate solution for the info() problem. Let me discuss it internally and I'll come back to you.

(In reply to Alejandro Sanchez from comment #30)
> I think I can come up with an intermediate solution for the info() problem. Let me discuss it internally and I'll come back to you.

Just to note (as discussed in bug 3831): at 15.08 users would receive the memory exceeded warning message without "error:" preceding it. I'd just like this behaviour restored. Thanks.

(In reply to Alejandro Sanchez from comment #25)
> Note that the sacct State OUT_OF_MEMORY doesn't mean the job failed [...] Anyhow, I am open to suggestions.
I'd like to add that even though the exit code is 0, if you have "#SBATCH --mail-type=FAIL" you will still get an e-mail, so Slurm still thinks the job has failed somehow.

thanks
-k

(In reply to Kaizaad from comment #32)
> I'd like to add that even though the exit code is 0, if you have "#SBATCH --mail-type=FAIL" you will still get an e-mail, so Slurm still thinks the job has failed somehow.

Furthermore, if you also have a job dependency chain that uses "afterok", it will still consider the job failed and not continue with the job chain.

thanks
-k

A quick note that on a Lustre filesystem the job can actually be killed when it bumps the limit, due to a known bug in Lustre: https://jira.hpdd.intel.com/browse/LU-5801

This has been fixed in the following commit: https://github.com/SchedMD/slurm/commit/943c4a130f39dbb1fb which will be available starting at 17.11.3. Here's an example of the current behavior and output:

1. Job actually being oom-killed:

alex@ibiza:~/t$ srun --mem=200 ./mem_eater
slurmstepd-compute1: error: Detected 1 oom-kill event(s) in step 20014.0 cgroup.
srun: error: compute1: task 0: Out Of Memory
alex@ibiza:~/t$ sacct -j 20014
JobID        JobName    Partition  Account    AllocCPUS  State      ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
20014        mem_eater  p2         acct1      2          OUT_OF_ME+ 0:125
2. Job memory usage hitting the limit but not being oom-killed:

alex@ibiza:~/t$ srun --slurmd-debug=info --mem=900M dd if=/dev/zero of=/home/alex/t/file bs=1MB count=1k conv=fdatasync
...
1024+0 records in
1024+0 records out
1024000000 bytes (1.0 GB, 977 MiB) copied, 2.01502 s, 508 MB/s
slurmstepd-compute1: task 0 (9852) exited with exit code 0.
slurmstepd-compute1: Step 20016.0 hit memory+swap limit at least once during execution. This may or may not result in some failure.
slurmstepd-compute1: debug: _oom_event_monitor: No oom events detected.
slurmstepd-compute1: debug: _oom_event_monitor: stopping.
...
alex@ibiza:~/t$ sacct -j 20016
JobID        JobName    Partition  Account    AllocCPUS  State      ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
20016        dd         p2         acct1      2          COMPLETED  0:0
alex@ibiza:~/t$

3. A batch job whose tasks are oom-killed:

alex@ibiza:~/t$ cat test.bash
#!/bin/bash
#SBATCH --mem=1500
./mem_eater &
./mem_eater &
srun --mem=200 ./mem_eater &
wait
alex@ibiza:~/t$ sbatch test.bash
Submitted batch job 20018
alex@ibiza:~/t$ ls -ltr slurm-20018.out
-rw-r--r-- 1 alex alex 459 Jan 23 14:47 slurm-20018.out
alex@ibiza:~/t$ cat slurm-20018.out
srun: error: compute1: task 0: Out Of Memory
slurmstepd-compute1: error: Detected 1 oom-kill event(s) in step 20018.0 cgroup.
/home/alex/slurm/17.11/ibiza/spool/slurmd-compute1/job20018/slurm_script: line 6: 10056 Killed ./mem_eater
/home/alex/slurm/17.11/ibiza/spool/slurmd-compute1/job20018/slurm_script: line 6: 10057 Killed ./mem_eater
slurmstepd-compute1: error: Detected 2 oom-kill event(s) in step 20018.batch cgroup.
alex@ibiza:~/t$ sacct -j 20018
JobID        JobName    Partition  Account    AllocCPUS  State      ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
20018        test.bash  p2         acct1      2          OUT_OF_ME+ 0:125
20018.batch  batch                 acct1      2          OUT_OF_ME+ 0:125
20018.0      mem_eater             acct1      1          OUT_OF_ME+ 0:125
alex@ibiza:~/t$

I think this better reflects what's actually happening under OOM conditions. Note also that this fixes the 'afterok' dependencies (if the job is oom-killed the dependent job won't start, but if it only hits the limit it will; bug 4625). It also fixes the --mail-type=FAIL concern (comment 32). BYU HPC sites might consider switching from their notifierd daemon to this mechanism.

I'll leave this bug open for a few days in case you want to add further questions and/or issues. If nothing else is raised, I'll close it and leave the sev-5 bug 4578 for further work in 18.08 (for instance, adding a SystemComment when the usage hits the limit). Thanks for your patience.

Thanks - I won't be able to test it for a while, but it looks like a significant improvement.

Marking this as resolved/fixed as per the provided patch. Future work will be tracked in bug 4578.
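As an aside, a quick way to confirm the 'afterok' behaviour described above on 17.11.3+ is to chain a dependent job off the dd example; this is only a sketch with hypothetical --wrap payloads and paths, not part of the fix:

# Job that hits the memory limit via page cache but completes (COMPLETED with the fix)
jobid=$(sbatch --parsable --mem=900M \
        --wrap="dd if=/dev/zero of=/tmp/ddtest bs=1MB count=1k conv=fdatasync")

# With the fix this dependent job should run; it would never start if the first job were actually oom-killed
sbatch --dependency=afterok:"$jobid" --wrap="echo dependency satisfied"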