Summary: | OOM While Job Completed | ||
---|---|---|---|
Product: | Slurm | Reporter: | Yang Liu <yang_liu2> |
Component: | Limits | Assignee: | Albert Gil <albert.gil> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | felip.moll |
Version: | 20.02.6 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=12127, https://bugs.schedmd.com/show_bug.cgi?id=10255, https://bugs.schedmd.com/show_bug.cgi?id=9737 | |
Site: | Brown Univ | |
Description
Yang Liu
2021-07-26 11:56:40 MDT
Hi Yang,

Can you post the OS and kernel versions that node1635 is running? Have you done any OS/kernel update recently? As you can see in the "See Also" bugs, we have faced similar cases of OOMs detected by cgroups, so we would like to rule that out first.

Regards,
Albert

Hi Albert,

No recent upgrade.

    [yliu385@node1635 ~]$ cat /etc/*release*
    cat: /etc/lsb-release.d: Is a directory
    NAME="Red Hat Enterprise Linux Server"
    VERSION="7.7 (Maipo)"
    ID="rhel"
    ID_LIKE="fedora"
    VARIANT="Server"
    VARIANT_ID="server"
    VERSION_ID="7.7"
    PRETTY_NAME="Red Hat Enterprise Linux Server 7.7 (Maipo)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:redhat:enterprise_linux:7.7:GA:server"
    HOME_URL="https://www.redhat.com/"
    BUG_REPORT_URL="https://bugzilla.redhat.com/"
    REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
    REDHAT_BUGZILLA_PRODUCT_VERSION=7.7
    REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
    REDHAT_SUPPORT_PRODUCT_VERSION="7.7"
    Red Hat Enterprise Linux Server release 7.7 (Maipo)
    Red Hat Enterprise Linux Server release 7.7 (Maipo)
    cpe:/o:redhat:enterprise_linux:7.7:ga:server

    [yliu385@node1635 ~]$ uname -r
    3.10.0-1062.el7.x86_64

    Linux version 3.10.0-1062.el7.x86_64 (mockbuild@x86-040.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Thu Jul 18 20:25:13 UTC 2019

Yang

Hi Yang,

We have detected some scenarios that could lead to a false OOM like yours appears to be. We are investigating them all together and are close to confirming the root cause of all of them. Maybe it's a bit late, but could you attach the dmesg/syslog of node node1635 on 2021-07-20? Anyway, we'll keep you posted on our internal progress.

Regards,
Albert

Hi Yang,

> Maybe it's a bit late, but could you attach the dmesg/syslog of node
> node1635 on 2021-07-20?

I see that you already mentioned in comment 0 that there was no OOM in dmesg, sorry. Then this ticket actually looks like a duplicate of bug 10255, which was already fixed in newer versions >= 20.02.7. Do you see similar behavior in other jobs, or only on 1796948_176?

Regards,
Albert

Hi Albert,

Thank you for pointing out bug 10255. I tested the test job script and 'sbatch --wrap "sleep 10"', and didn't receive an OOM for the test jobs (4). Our user also reran her job and didn't receive the OOM again. In bug 10255 the OOM state is always on the extern step, so it seems our case is different from bug 10255.

Thanks,
Yang
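For reference, a minimal sketch of this kind of reproduction test, assuming only the standard sbatch/sacct client tools on the PATH; the number of probe jobs and the sleep length are arbitrary choices, not taken from this ticket:

```bash
#!/bin/bash
# Sketch: submit a few trivial jobs and check whether sacct reports any
# OUT_OF_MEMORY step for them once they have finished.
jobs=()
for i in $(seq 1 4); do
  jid=$(sbatch --parsable --wrap "sleep 10")
  jobs+=("$jid")
done
sleep 60   # give the jobs time to finish and land in accounting
for jid in "${jobs[@]}"; do
  sacct -j "$jid" --format=JobID,JobName,State,ExitCode --parsable2 |
    grep -i OUT_OF_MEMORY && echo "job $jid: false OOM reproduced"
done
```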
Hi Yang,

Just to avoid confusion:
- Can you confirm that you are still using version 20.02.6, especially for the slurmds?
- Are you able to reproduce the issue somehow?

Thanks,
Albert

Hi Yang,

Are you using Bright? Can you show me the output of cat /proc/mounts on node1635?

Regards,
Albert

Hi Albert,

1. Yes, slurmd is 20.02, from 'sinfo -o "%v"'.
2. I am asking the user whether there is a job which always receives the OOM error.
3. I don't think we are using Bright. We have GPFS + Slurm + InfiniBand.
4.

    [yliu385@node1635 ~]$ cat /proc/mounts
    rootfs / rootfs rw,size=98303864k,nr_inodes=24575966 0 0
    sysfs /sys sysfs rw,relatime 0 0
    proc /proc proc rw,relatime 0 0
    devtmpfs /dev devtmpfs rw,nosuid,size=98312196k,nr_inodes=24578049,mode=755 0 0
    securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
    tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
    devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
    tmpfs /run tmpfs rw,nosuid,nodev,mode=755 0 0
    tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
    cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 0 0
    pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
    efivarfs /sys/firmware/efi/efivars efivarfs rw,nosuid,nodev,noexec,relatime 0 0
    cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
    cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0
    cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_prio,net_cls 0 0
    cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
    cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
    cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
    cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
    cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
    cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
    cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
    configfs /sys/kernel/config configfs rw,relatime 0 0
    rootfs / tmpfs rw,relatime,mode=755 0 0
    rw /.sllocal/log tmpfs rw,relatime 0 0
    rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw,relatime 0 0
    debugfs /sys/kernel/debug debugfs rw,relatime 0 0
    hugetlbfs /dev/hugepages hugetlbfs rw,relatime 0 0
    mqueue /dev/mqueue mqueue rw,relatime 0 0
    systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=35,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=232861 0 0
    binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
    gpfs /gpfs gpfs rw,relatime 0 0
    mgt5:/install /install nfs4 ro,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.20.212.35,local_lock=none,addr=172.20.0.6 0 0
    /etc/auto.misc /misc autofs rw,relatime,fd=5,pgrp=14327,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=276510 0 0
    -hosts /net autofs rw,relatime,fd=11,pgrp=14327,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=276516 0 0
    /etc/auto.cvmfs /cvmfs autofs rw,relatime,fd=17,pgrp=14327,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=276521 0 0
    auto.direct /nfs/scratch autofs rw,relatime,fd=23,pgrp=14327,timeout=300,minproto=5,maxproto=5,direct,pipe_ino=276525 0 0
    auto.direct /truenas autofs rw,relatime,fd=23,pgrp=14327,timeout=300,minproto=5,maxproto=5,direct,pipe_ino=276525 0 0
    auto.direct /nfs/data autofs rw,relatime,fd=23,pgrp=14327,timeout=300,minproto=5,maxproto=5,direct,pipe_ino=276525 0 0
    auto.direct /nfs/jbailey5/baileyweb autofs rw,relatime,fd=23,pgrp=14327,timeout=300,minproto=5,maxproto=5,direct,pipe_ino=276525 0 0
    auto.tserre_cifs /cifs/data autofs rw,relatime,fd=28,pgrp=14327,timeout=600,minproto=5,maxproto=5,indirect,pipe_ino=276529 0 0
    auto.tserre_lrs /cifs/data/tserre_lrs autofs rw,relatime,fd=34,pgrp=14327,timeout=600,minproto=5,maxproto=5,indirect,pipe_ino=276533 0 0
    auto.lrs /lrs autofs rw,relatime,fd=40,pgrp=14327,timeout=600,minproto=5,maxproto=5,indirect,pipe_ino=276537 0 0
    fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
    tmpfs /run/user/0 tmpfs rw,nosuid,nodev,relatime,size=19666796k,mode=700 0 0
    tmpfs /run/user/140348764 tmpfs rw,nosuid,nodev,relatime,size=19666796k,mode=700,uid=140348764,gid=601 0 0
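For the Bright/cgroup check above, only the cgroup lines of that dump matter; a small sketch to filter them out on any node, assuming the usual /proc/mounts column layout:

```bash
#!/bin/bash
# Sketch: show only the cgroup controller mounts from /proc/mounts,
# which is the part relevant to the false-OOM investigation above.
awk '$3 == "cgroup" || $3 == "cgroup2" {print $2, "(" $4 ")"}' /proc/mounts
```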
Hi Yang,

> 1. Yes, slurmd is 20.02, from 'sinfo -o "%v"'.

Actually I'm more interested in the minor release number. Note that the problem in bug 10255 was actually a regression introduced in 20.02.6 and fixed in 20.02.7. This ticket is marked as "Version 20.02.6", so I'm wondering whether that user faced the issue because of that, but perhaps you have already updated Slurm to a version >= 20.02.7, which would explain why the bug is no longer reproduced. You can also run "slurmd -V" on node1635 to check it.

> 2. I am asking the user whether there is a job which always receives the OOM error.

Thanks.

> 3. I don't think we are using Bright. We have GPFS + Slurm + InfiniBand.
>
> 4.
> [yliu385@node1635 ~]$ cat /proc/mounts

OK, this rules out some possible sources of problems.

Thanks,
Albert

Hi Albert,

1.

    [yliu385@node945 yliu385]$ slurmd -V
    slurm 20.02.6

    [yliu385@node945 yliu385]$ sbatch --wrap "sleep 10"
    Submitted batch job 2052515

    [yliu385@node945 yliu385]$ sacct -j 2052515
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    2052515            wrap      batch    default          1  COMPLETED      0:0
    2052515.bat+      batch               default          1  COMPLETED      0:0
    2052515.ext+     extern               default          1  COMPLETED      0:0

Bug 10255 listed a Slurm version of 20.02.5.

Yang

Hi Yang,

> Bug 10255 listed a Slurm version of 20.02.5.

Yes, but although the reporter of that ticket set the version to .5, during our debugging we actually identified the issue as an undesired side effect of this commit, which landed in .6:

- https://github.com/SchedMD/slurm/commit/f93b16670f3b07f6209099c24425036f9c54d136

That undesired effect was solved by this commit, which landed in .7:

- https://github.com/SchedMD/slurm/commit/272c636d507e1dc59d987da478d42f6713d88ae1

> 1. [yliu385@node945 yliu385]$ slurmd -V
> slurm 20.02.6

Thanks for double-checking.

> [yliu385@node945 yliu385]$ sbatch --wrap "sleep 10"
> Submitted batch job 2052515
>
> [yliu385@node945 yliu385]$ sacct -j 2052515
> JobID JobName Partition Account AllocCPUS State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 2052515 wrap batch default 1 COMPLETED 0:0
> 2052515.bat+ batch default 1 COMPLETED 0:0
> 2052515.ext+ extern default 1 COMPLETED 0:0

I assume that all your nodes are running slurmd 20.02.6 but that you are not able to reproduce the issue, right? And you are also using jobacct_gather/cgroup, right?

If so, then I agree that this is not *exactly* what the users in bug 10255 experienced (they were able to reproduce it quite easily), but we still cannot rule out that the same Slurm version (20.02.6) behaves slightly differently on different kernel versions (cgroups sometimes behave slightly differently between kernel versions, which is why I asked for your kernel version in comment 1). Note that the known bug was essentially a race condition between Slurm and the kernel's cgroups, so just a faster or slower kernel can make it happen always, or never.

So, although we cannot be 100% sure that bug 10255 is the root cause of your issue, we cannot rule it out either. Therefore, my recommendation would be to update to the latest 20.02 (or, if you are able, better to the latest 20.11). This way we can rule out the known issue in 20.02.6 that could be impacting you (even if not as severely as for other users).

Once you update, if you are not able to reproduce the issue in >= 20.02.7, then I think we should close this as a duplicate of bug 10255. If you are able to reproduce the issue, then we should investigate further knowing that bug 10255 is not the root cause (and with some more clues about how to reproduce it).

I hope that this is OK for you too.

Regards,
Albert
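As a quick cross-check before or after such an update, the per-node slurmd version can be pulled from the same sinfo "%v" field used above; a rough sketch (the output layout may vary slightly between Slurm releases):

```bash
#!/bin/bash
# Sketch: list nodes whose running slurmd still reports the affected
# 20.02.6 release, using sinfo's "%v" (slurmd version) format field.
sinfo -h -N -o "%N %v" | sort -u | awk '$2 == "20.02.6" {print $1}'
```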
Hi Albert,

Yes, all nodes have slurmd 20.02.6:

    [yliu385@node945 10255]$ slurmd -V
    slurm 20.02.6

Surprisingly, our Slurm is configured with jobacct_gather/linux, not jobacct_gather/cgroup. Are there any impacts from this configuration?

    [yliu385@node945 10255]$ scontrol show config|grep -i jobacct
    JobAcctGatherFrequency  = 30
    JobAcctGatherType       = jobacct_gather/linux
    JobAcctGatherParams     = (null)

Yang

Hi Albert,

Our user reported that the issue still persists randomly. I read the slurmd log and it seems that the <job_id>.batch step exited before <job_id>.extern. Is that an error which caused the OOM for <job_id>.extern?

    [2021-07-20T15:24:39.748] [1802599.batch] task 0 (83664) exited with exit code 0.
    [2021-07-20T15:24:39.748] [1802599.batch] debug: _oom_event_monitor: No oom events detected.
    [2021-07-20T15:24:39.748] [1802599.batch] debug: _oom_event_monitor: stopping.
    [2021-07-20T15:24:39.749] [1802599.batch] debug: step_terminate_monitor_stop signaling condition
    [2021-07-20T15:24:39.809] [1802599.batch] job 1802599 completed with slurm_rc = 0, job_rc = 0
    [2021-07-20T15:24:39.809] [1802599.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
    [2021-07-20T15:24:39.812] [1802599.batch] debug: Message thread exited
    [2021-07-20T15:24:39.812] [1802599.batch] done with job
    [2021-07-20T15:24:39.813] debug: credential for job 1802599 revoked
    [2021-07-20T15:24:39.813] [1802599.extern] debug: Handling REQUEST_SIGNAL_CONTAINER
    [2021-07-20T15:24:39.813] [1802599.extern] debug: _handle_signal_container for step=1802599.4294967295 uid=508 signal=18
    [2021-07-20T15:24:39.814] [1802599.extern] Sent signal 18 to 1802599.4294967295
    [2021-07-20T15:24:39.814] [1802599.extern] debug: Handling REQUEST_SIGNAL_CONTAINER
    [2021-07-20T15:24:39.814] [1802599.extern] debug: _handle_signal_container for step=1802599.4294967295 uid=508 signal=15
    [2021-07-20T15:24:39.814] [1802599.extern] debug: step_terminate_monitor_stop signaling condition
    [2021-07-20T15:24:39.814] [1802599.extern] Sent signal 15 to 1802599.4294967295
    [2021-07-20T15:24:39.814] [1802599.extern] debug: _oom_event_monitor: stopping.
    [2021-07-20T15:24:39.814] [1802599.extern] error: Detected 1 oom-kill event(s) in step 1802599.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
    [2021-07-20T15:24:39.814] [1802599.extern] debug: task_g_post_term: task/cgroup: Cannot allocate memory
    [2021-07-20T15:24:39.815] [1802599.extern] debug: Handling REQUEST_STATE
    [2021-07-20T15:24:39.830] [1802599.extern] debug: Message thread exited
    [2021-07-20T15:24:39.830] [1802599.extern] done with job

Yang
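A rough way to gauge how often this pattern shows up is to pull the extern-step OOM errors out of the slurmd log; a sketch, assuming the log lives at /var/log/slurmd.log (the path is site-specific) and that the message wording matches the excerpt above:

```bash
#!/bin/bash
# Sketch: list job IDs whose .extern step logged an oom-kill event on this node.
grep -E '\[[0-9]+\.extern\] error: Detected [0-9]+ oom-kill event' /var/log/slurmd.log \
  | sed -E 's/.*\[([0-9]+)\.extern\].*/\1/' \
  | sort -un
```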
Hi Yang,

> Yes, all nodes have slurmd 20.02.6:
> [yliu385@node945 10255]$ slurmd -V
> slurm 20.02.6

Thanks for confirming.

> Surprisingly, our Slurm is configured with jobacct_gather/linux, not
> jobacct_gather/cgroup. Are there any impacts from this configuration?
> [yliu385@node945 10255]$ scontrol show config|grep -i jobacct
> JobAcctGatherFrequency = 30
> JobAcctGatherType = jobacct_gather/linux
> JobAcctGatherParams = (null)

OK, that could also explain some behavior that differs from other customers. I assume that you do use task/cgroup, right? Could you attach your slurm.conf?

> Our user reported that the issue still persists randomly.

OK, assuming that you are impacted by the known bug mentioned before, that would be expected until you update to >= 20.02.7.

> I read the slurmd log and it seems that the <job_id>.batch step exited
> before <job_id>.extern. Is that an error which caused the OOM for
> <job_id>.extern?

Well, the error is not exactly that one step ends before or after another (that is not a problem). It has more to do with how the kernel reports "cgroup events" (which Slurm assumes are OOM events, but which may be other events), and with how 20.02.6 cleans up the job's cgroup hierarchy. In 20.02.7 we fixed the cleanup code in Slurm so that we avoid generating undesired cgroup events, and therefore stop detecting and reporting them as false OOMs. But this is a rather low-level detail.

Do you think that you can manage to update to >= 20.02.7 so we can rule out the known issue?

Regards,
Albert

Hi Yang,

Have you been able to update Slurm so we can rule out that you are impacted by the known issue in 20.02.6?

Regards,
Albert

Hi Yang,

Any news or plans about trying to update, to solve or rule out the known issue in 20.02.6?

Regards,
Albert

Hi Albert,

No, not yet. Our system team will plan and schedule the update. Once Slurm is updated, I will let you know.

Best,
Yang

Hi Yang,

Have you been able to upgrade, to rule out the known issue in 20.02.6 (bug 10255)?

Regards,
Albert

Hi Yang,

If this is OK for you, I'm closing this ticket on the assumption that it's a duplicate of bug 10255 and that an upgrade should fix it. But if you need further related support, please don't hesitate to reopen it.

Regards,
Albert

*** This ticket has been marked as a duplicate of ticket 10255 ***
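For completeness, the cgroup state that the _oom_event_monitor messages above refer to can be inspected directly while a job is still running. A sketch, assuming the default cgroup-v1 memory hierarchy used by task/cgroup (/sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/step_*); the exact path is site-dependent, so adjust it if your CgroupMountpoint differs:

```bash
#!/bin/bash
# Sketch: dump the cgroup-v1 memory controller state for every step of a
# running job. The base path below is an assumption about the default layout.
JOBID=${1:?usage: $0 <jobid>}
USER_NAME=$(squeue -h -j "$JOBID" -o %u)
CG="/sys/fs/cgroup/memory/slurm/uid_$(id -u "$USER_NAME")/job_${JOBID}"
for step in "$CG"/step_*; do
  echo "== $step =="
  cat "$step/memory.oom_control"          # oom_kill_disable / under_oom flags
  cat "$step/memory.max_usage_in_bytes"   # peak memory usage seen by the kernel
done
```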