Description
Anthony DelSorbo
2019-09-30 15:01:01 MDT
Anthony, from your ps output I can see that some processes are waiting in mem_cg kernel functions. Can you invoke ps and return the full WCHAN string? In the output you sent it is truncated. For example:

ps -a -o pid,wchan:42,cmd

You can also check which functions are available for this substring with:

cat /boot/System.map-<yourkernelversion> | grep mem_cg

I am suspicious about a problem with cgroups which cannot kill the processes. I would need your slurm.conf, the slurmd and slurmctld logs, and the output of a "mount" command issued on a failing node.

Sorry, I meant something like:

ps -e --forest -o pid,ppid,user,wchan:42,cmd

(In reply to Felip Moll from comment #2)
> Sorry, I meant something like:
>
> ps -e --forest -o pid,ppid,user,wchan:42,cmd

Felip,

See below for the ps output. As we send our slurmd logs to syslog, I'm going to prune and attach the messages file as well as the slurm.conf file.

========
125742 1 root mem_cgroup_oom_synchronize /apps/slurm/d/sbin/slurmd
87619 125742 root do_exit \_ [slurmd] <defunct>
87631 125742 root do_exit \_ [slurmd] <defunct>
87666 125742 root do_exit \_ [slurmd] <defunct>
87677 125742 root do_exit \_ [slurmd] <defunct>
87689 125742 root do_exit \_ [slurmd] <defunct>
87814 125742 root do_exit \_ [slurmd] <defunct>
87954 125742 root do_exit \_ [slurmd] <defunct>
88031 125742 root do_exit \_ [slurmd] <defunct>
88048 125742 root do_exit \_ [slurmd] <defunct>
88290 125742 root mem_cgroup_oom_synchronize \_ /apps/slurm/d/sbin/slurmd
87605 1 root do_wait slurmstepd: [120564.extern]
87977 87605 root hrtimer_nanosleep \_ sleep 100000000
87608 1 root do_wait slurmstepd: [120570.extern]
88062 87608 root hrtimer_nanosleep \_ sleep 100000000
87613 1 root do_wait slurmstepd: [120568.extern]
88128 87613 root hrtimer_nanosleep \_ sleep 100000000
87626 1 root flock_lock_inode_wait slurmstepd: [120580.extern]
88240 87626 root hrtimer_nanosleep \_ sleep 100000000
87636 1 root mem_cgroup_oom_synchronize slurmstepd: [120574.extern]
88182 87636 root hrtimer_nanosleep \_ sleep 100000000
87645 1 root do_wait slurmstepd: [120579.extern]
88115 87645 root hrtimer_nanosleep \_ sleep 100000000
87676 1 root mem_cgroup_oom_synchronize slurmstepd: [120569.extern]
88287 87676 root mem_cgroup_oom_synchronize \_ sleep 100000000
87686 1 root mem_cgroup_oom_synchronize slurmstepd: [120573.extern]
88237 87686 root hrtimer_nanosleep \_ sleep 100000000
87696 1 root mem_cgroup_oom_synchronize slurmstepd: [120575.extern]
88248 87696 root hrtimer_nanosleep \_ sleep 100000000
87712 1 root do_wait slurmstepd: [120563.extern]
88142 87712 root hrtimer_nanosleep \_ sleep 100000000
87727 1 root do_wait slurmstepd: [120576.extern]
88042 87727 root hrtimer_nanosleep \_ sleep 100000000
87744 1 root do_wait slurmstepd: [120577.extern]
88157 87744 root hrtimer_nanosleep \_ sleep 100000000
87760 1 root do_wait slurmstepd: [120578.extern]
87884 87760 root hrtimer_nanosleep \_ sleep 100000000
87791 1 root do_wait slurmstepd: [120566.extern]
88103 87791 root hrtimer_nanosleep \_ sleep 100000000
87809 1 root do_wait slurmstepd: [120571.extern]
88099 87809 root hrtimer_nanosleep \_ sleep 100000000
87824 1 root flock_lock_inode_wait slurmstepd: [120567.extern]
87833 1 root do_wait slurmstepd: [120565.extern]
88108 87833 root hrtimer_nanosleep \_ sleep 100000000
87847 1 root do_wait slurmstepd: [120572.extern]
88110 87847 root hrtimer_nanosleep \_ sleep 100000000
87965 1 root do_wait slurmstepd: [120588.extern]
88121 87965 root hrtimer_nanosleep \_ sleep 100000000
87972 1 root mem_cgroup_oom_synchronize slurmstepd: [120591.extern]
88233 87972 root hrtimer_nanosleep \_ sleep 100000000
88007 1 root do_wait slurmstepd: [120592.extern]
88118 88007 root hrtimer_nanosleep \_ sleep 100000000
88020 1 root do_wait slurmstepd: [120586.extern]
88137 88020 root hrtimer_nanosleep \_ sleep 100000000
88037 1 root mem_cgroup_oom_synchronize slurmstepd: [120581.extern]
88224 88037 root hrtimer_nanosleep \_ sleep 100000000
88046 1 root do_wait slurmstepd: [120582.extern]
88147 88046 root hrtimer_nanosleep \_ sleep 100000000
88047 1 root do_wait slurmstepd: [120589.extern]
88175 88047 root hrtimer_nanosleep \_ sleep 100000000
88050 1 root flock_lock_inode_wait slurmstepd: [120583.extern]
88219 88050 root hrtimer_nanosleep \_ sleep 100000000
88051 1 root do_wait slurmstepd: [120590.extern]
88164 88051 root hrtimer_nanosleep \_ sleep 100000000
88053 1 root do_wait slurmstepd: [120587.extern]
88123 88053 root hrtimer_nanosleep \_ sleep 100000000
88056 1 root do_wait slurmstepd: [120585.extern]
88154 88056 root hrtimer_nanosleep \_ sleep 100000000
88057 1 root do_wait slurmstepd: [120584.extern]
88222 88057 root hrtimer_nanosleep \_ sleep 100000000
88081 1 root mem_cgroup_oom_synchronize slurmstepd: [120578.batch]
92074 88081 root mem_cgroup_oom_synchronize \_ slurmstepd: [120578.batch]
88106 1 root flock_lock_inode_wait slurmstepd: [120564.batch]
88172 88106 rtruc pipe_wait \_ slurmstepd: [120564.batch]
88119 1 root mem_cgroup_oom_synchronize slurmstepd: [120571.batch]
92070 88119 root mem_cgroup_oom_synchronize \_ slurmstepd: [120571.batch]
88131 1 root mem_cgroup_oom_synchronize slurmstepd: [120572.batch]
92077 88131 root mem_cgroup_oom_synchronize \_ slurmstepd: [120572.batch]
88139 1 root flock_lock_inode_wait slurmstepd: [120579.batch]
88236 88139 rtruc pipe_wait \_ slurmstepd: [120579.batch]
88150 1 root mem_cgroup_oom_synchronize slurmstepd: [120592.batch]
92076 88150 root mem_cgroup_oom_synchronize \_ slurmstepd: [120592.batch]
88163 1 root flock_lock_inode_wait slurmstepd: [120565.batch]
88234 88163 rtruc pipe_wait \_ slurmstepd: [120565.batch]
88171 1 root mem_cgroup_oom_synchronize slurmstepd: [120568.batch]
88266 88171 rtruc pipe_wait \_ slurmstepd: [120568.batch]
88217 1 root futex_wait_queue_me slurmstepd: [120586.batch]
88231 1 root mem_cgroup_oom_synchronize slurmstepd: [120566.batch]
88257 1 root mem_cgroup_oom_synchronize slurmstepd: [120588.batch]

Created attachment 11758 [details]
slurmd messages
Created attachment 11759 [details]
Slurm configuration
Created attachment 11760 [details]
mount command output
Created attachment 11761 [details]
mem_cg grep of system.map...
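As an aside on gathering the WCHAN data requested above, here is a minimal sketch that lists only the D-state (uninterruptible) processes with their full wait channel and checks which mem_cg symbols the running kernel exposes; the awk filter and the $(uname -r) substitution are assumptions for illustration, not commands taken from this ticket:

# show only processes stuck in uninterruptible sleep, with the full WCHAN string
ps -eo state,pid,ppid,user,wchan:42,cmd | awk '$1 == "D"'

# list the mem_cg* kernel functions for the currently running kernel
grep mem_cg "/boot/System.map-$(uname -r)"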
The slurmctld logs are nearly 1GB each, compressed. Is there another way I can send them up to you?

Tony.

Tony, I will first analyze what you've already uploaded before you send the slurmctld logs. Do you have a dmesg from the failing node?

The mem_cgroup_oom_synchronize() kernel function is normally invoked when there's an out-of-memory event. It's also seen in the logs:

Sep 27 02:57:24 h25c19 slurmstepd[440940]: _oom_event_monitor: oom-kill event count: 1

If the OOM killer killed slurmstepds randomly, it may lead to what you are seeing. I will respond in more detail once I have analyzed everything, but in the meantime can you investigate this possibility, and can you also attach the cgroup.conf from the node?

From the Linux kernel source, mm/memcontrol.c:

/**
 * mem_cgroup_oom_synchronize - complete memcg OOM handling
 * @handle: actually kill/wait or just clean up the OOM state
 *
 * This has to be called at the end of a page fault if the memcg OOM
 * handler was enabled.
 *
 * Memcg supports userspace OOM handling where failed allocations must
 * sleep on a waitqueue until the userspace task resolves the
 * situation. Sleeping directly in the charge context with all kinds
 * of locks held is not a good idea, instead we remember an OOM state
 * in the task and mem_cgroup_oom_synchronize() has to be called at
 * the end of the page fault to complete the OOM handling.
 *
 * Returns %true if an ongoing memcg OOM situation was detected and
 * completed, %false otherwise.
 */
bool mem_cgroup_oom_synchronize(bool handle)
{
...
}

What I can also see is that on some nodes you have MemSpecLimit configured. In particular, on h25c19 you have MemSpecLimit=2000 and RealMemory=95000. This means the slurmd daemon shouldn't be killed if it exhausts its memory allocation (i.e. the out-of-memory killer is disabled for the daemon's memory cgroup), but I am wondering if this setting could be related to the issue we're seeing.

I also see other nodes with a strange MemSpecLimit: you are granting 128GB to the slurmd daemon, but the system has 192GB, so you're only leaving 64GB for jobs. Is this intentional?

NodeName=hfe[01-12] CPUs=40 CoresPerSocket=20 RealMemory=192000 MemSpecLimit=128000

Thanks

Here is the content of cgroup.conf:

[slurm@bqs1 etc]$ cat cgroup.conf
CgroupAutomount=yes
ConstrainCores=no
# ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
#CgroupMountpoint=/sys/fs/cgroup

It would not be unheard of that I did something stupid in configuring Slurm. In your comment:

"I also see other nodes with a strange MemSpecLimit: you are granting 128GB to the slurmd daemon, but the system has 192GB, so you're only leaving 64GB for jobs. Is this intentional?

NodeName=hfe[01-12] CPUs=40 CoresPerSocket=20 RealMemory=192000 MemSpecLimit=128000"

These hfe nodes are login nodes. We have many users, so I wanted to leave enough memory free for users to compile, edit, etc. Is there a better way?

For the other compute nodes, is 2000 sufficient?

(In reply to Anthony DelSorbo from comment #10)
> Here is the content of cgroup.conf:
>
> [slurm@bqs1 etc]$ cat cgroup.conf
> CgroupAutomount=yes
> ConstrainCores=no
> # ConstrainDevices=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> #CgroupMountpoint=/sys/fs/cgroup
>
> It would not be unheard of that I did something stupid in configuring Slurm.
> In your comment:
>
> "I also see other nodes with a strange MemSpecLimit: you are granting 128GB to
> the slurmd daemon, but the system has 192GB, so you're only leaving 64GB for
> jobs. Is this intentional?
>
> NodeName=hfe[01-12] CPUs=40 CoresPerSocket=20 RealMemory=192000
> MemSpecLimit=128000"
>
> These hfe nodes are login nodes. We have many users, so I wanted to leave
> enough memory free for users to compile, edit, etc. Is there a better way?

Well, just setting RealMemory to the desired lower value would be better. For example, with RealMemory=66000 you could set MemSpecLimit=2000 and the remaining 64GB would be used for job allocations. I don't like the idea of setting the slurmd cgroup memory to 128GB; you're technically allowing slurmd to grow up to 128GB.

> For the other compute nodes, is 2000 sufficient?

I would say this is sufficient, but I encourage you to check your nodes and guide yourself by the greatest RSS value seen.

Besides that, can you provide a dmesg from a failing node?

Created attachment 11776 [details]
dmesg file from h2c07
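To make the RealMemory / MemSpecLimit advice above concrete, here is a minimal slurm.conf sketch of a login-node definition along the lines Felip suggests; the node names match the thread, but the exact values are illustrative assumptions, not a reviewed configuration for this site:

# expose only ~64GB to job allocations and cap the slurmd daemon's cgroup at 2GB
NodeName=hfe[01-12] CPUs=40 CoresPerSocket=20 RealMemory=66000 MemSpecLimit=2000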
I lost the node we were previously using for troubleshooting, so be aware that we're now using h2c07.
Created attachment 11777 [details]
New process table listing for h2c07
Just in case you need it
Felip,

Here's some additional info that may be relevant. The following job completed successfully; it is a single-core job. The user's log does not indicate anything abnormal occurred. However, as you can see, the slurmstepd ended with an _oom_event_monitor message, and although it states an oom-kill occurred, there's no evidence as to what was killed.

Oct 2 16:57:05 h25c24 slurmstepd[453775]: job 161901 completed with slurm_rc = 0, job_rc = 0
Oct 2 16:57:05 h25c24 slurmstepd[453775]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
Oct 2 16:57:05 h25c24 slurmstepd[453775]: done with job
Oct 2 16:57:05 h25c24 slurmstepd[453768]: Sent signal 18 to 161901.4294967295
Oct 2 16:57:05 h25c24 slurmstepd[453768]: Sent signal 15 to 161901.4294967295
Oct 2 16:57:05 h25c24 slurmstepd[453768]: _oom_event_monitor: oom-kill event count: 1
Oct 2 16:57:05 h25c24 slurmstepd[453768]: done with job
Oct 2 16:57:05 h25c24 EPILOG-ROOT: JobInfo: JobId=161901 JobName=RAP_ncl_09_00_zip UserId=Eric.James(5133) GroupId=wrfruc(10019) MCS_label=N/A Priority=206024767 Nice=0 Account=wrfruc QOS=batch JobState=COMPLETING Reason=None Dependency=(null) Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 RunTime=00:00:01 TimeLimit=01:00:00 TimeMin=N/A SubmitTime=2019-10-02T16:57:04 EligibleTime=2019-10-02T16:57:04 AccrueTime=2019-10-02T16:57:04 StartTime=2019-10-02T16:57:04 EndTime=2019-10-02T16:57:05 Deadline=N/A SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-10-02T16:57:04 Partition=hera AllocNode:Sid=hfe04:240106 ReqNodeList=(null) ExcNodeList=(null) NodeList=h25c24 BatchHost=h25c24 NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=1,mem=800M,node=1,billing=1 Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=* MinCPUsNode=1 MinMemoryNode=800M MinTmpDiskNode=0 Features=(null) DelayBoot=00:00:00 OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null) Command=(null) WorkDir=/home/Eric.James Comment=83df185dcab2b6b36ec18d54aee12bd6 StdErr=/scratch2/BMC/wrfruc/ejames/rapretro/RAPv5_feb2019_retro1/log/ncl_201902040900_00_zip.log StdIn=/dev/null StdOut=/scratch2/BMC/wrfruc/ejames/rapretro/RAPv5_feb2019_retro1/log/ncl_201902040900_00_zip.log Power=

In your dmesg I am seeing that job 89098 is being killed by the OOM killer. Specifically, it kills the ISi.ml60.x command. Is that a user executable or some Isilon storage daemon (or other system component)?

[668472.552618] ISi.ml60.x invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[668472.560855] ISi.ml60.x cpuset=/ mems_allowed=0-1
[668472.566094] CPU: 53 PID: 165051 Comm: ISi.ml60.x Tainted: P OE ------------ 3.10.0-957.21.3.el7.x86_64 #1
[668472.577400] Hardware name: Cray Inc. S2600BPB/S2600BPB, BIOS SE5C620.86B.00.01.0016.C0001.020120190930 02/01/2019
[668472.588301] Call Trace:
[668472.591418] [<ffffffffac363107>] dump_stack+0x19/0x1b
[668472.597187] [<ffffffffac35db2a>] dump_header+0x90/0x229
[668472.603125] [<ffffffffabcd6b72>] ? default_wake_function+0x12/0x20
[668472.610014] [<ffffffffabdba386>] ? find_lock_task_mm+0x56/0xc0
[668472.616562] [<ffffffffabe317a8>] ? try_get_mem_cgroup_from_mm+0x28/0x60
[668472.623884] [<ffffffffabdba834>] oom_kill_process+0x254/0x3d0
[668472.630336] [<ffffffffabe35586>] mem_cgroup_oom_synchronize+0x546/0x570
[668472.637656] [<ffffffffabe34a00>] ? mem_cgroup_charge_common+0xc0/0xc0
[668472.644802] [<ffffffffabdbb0c4>] pagefault_out_of_memory+0x14/0x90
[668472.651680] [<ffffffffac35c032>] mm_fault_error+0x6a/0x157
[668472.657874] [<ffffffffac3707c8>] __do_page_fault+0x3c8/0x4f0
[668472.664231] [<ffffffffac370925>] do_page_fault+0x35/0x90
[668472.670243] [<ffffffffac36c768>] page_fault+0x28/0x30
[668472.675995] Task in /slurm/uid_16039/job_89098/step_0/task_0 killed as a result of limit of /slurm/uid_16039/job_89098

As you can see, the mem_cgroup_oom_synchronize function shows up here, which is the same one seen in the 'ps' WCHAN field.

In parallel I can see some Lustre errors:

[1095389.926811] Lustre: 7633:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1569434150/real 0] req@ffff8866fb23e880 x1644529215860368/t0(0) o103->scratch1-OST002f-osc-ffff8873e8f93800@10.186.224.28@o2ib:17/18 lens 328/224 e 0 to 1 dl 1569434157 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[1095389.958596] Lustre: 7633:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
[1095389.969455] Lustre: scratch1-OST002f-osc-ffff8873e8f93800: Connection to scratch1-OST002f (at 10.186.224.28@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[1095389.989152] Lustre: Skipped 7 previous similar messages
[1095391.923466] Lustre: 7673:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1569434151/real 0] req@ffff88709d6563c0 x1644529218064592/t0(0) o103->scratch1-OST005b-osc-ffff8873e8f93800@10.186.224.36@o2ib:17/18 lens 328/224 e 0 to 1 dl 1569434159 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[1095391.954481] Lustre: 7673:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 9468 previous similar messages
[1095392.445297] drop_caches (192500): drop_caches: 3
[1095392.545268] epilog.root: Wed Sep 25 17:55:59 UTC 2019 Finished Job 105336
[1095392.657636] Lustre: scratch1-MDT0000-mdc-ffff8873e8f93800: Connection to scratch1-MDT0000 (at 10.186.224.66@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[1095392.677498] Lustre: Skipped 60 previous similar messages
[1095392.868931] Lustre: scratch1-OST0073-osc-ffff8873e8f93800: Connection restored to 10.186.224.26@o2ib (at 10.186.224.26@o2ib)
[1095392.881834] Lustre: Skipped 10 previous similar messages
[1095395.421728] Lustre: scratch1-MDT0000-mdc-ffff8873e8f93800: Connection restored to 10.186.224.66@o2ib (at 10.186.224.66@o2ib)
[1095395.915843] Lustre: 7681:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1569434151/real 0] req@ffff8868e147f200 x1644529218676544/t0(0) o103->scratch1-OST0064-osc-ffff8873e8f93800@10.186.224.38@o2ib:17/18 lens 328/224 e 0 to 1 dl 1569434159 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[1095395.947253] Lustre: 7681:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 188922 previous similar messages
[1095397.028266] Lustre: scratch1-OST005c-osc-ffff8873e8f93800: Connection restored to 10.186.224.36@o2ib (at 10.186.224.36@o2ib)
[1095468.190495] epilog.root: Wed Sep 25 17:57:15 UTC 2019 Finished Job 106268
[1655908.323319] Lustre: 7627:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1569996005/real 0] req@ffff885c196117c0 x1644529317911328/t0(0) o400->scratch1-OST0007-osc-ffff8873e8f93800@10.186.224.22@o2ib:28/4 lens 224/224 e 0 to 1 dl 1569996012 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[1655908.323689] Lustre: scratch1-OST0008-osc-ffff8873e8f93800: Connection to scratch1-OST0008 (at 10.186.224.22@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[1655908.323690] Lustre: Skipped 1 previous similar message
[1655908.380065] Lustre: 7627:0:(client.c:2116:ptlrpc_expire_one_request()) Skipped 36831 previous similar messages
[1655914.385317] Lustre: 7614:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1569996012/real 0] req@ffff885eeaf5aac0 x1644529317913280/t0(0) o8->scratch1-OST0007-osc-ffff8873e8f93800@10.186.224.22@o2ib:28/4 lens 520/544 e 0 to 1 dl 1569996018 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[1656123.723948] Lustre: scratch1-OST006f-osc-ffff8873e8f93800: Connection restored to 10.186.224.24@o2ib (at 10.186.224.24@o2ib)
[1656123.736981] Lustre: Skipped 60 previous similar messages
[1656124.275015] Lustre: scratch1-OST0007-osc-ffff8873e8f93800: Connection restored to 10.186.224.24@o2ib (at 10.186.224.24@o2ib)
[1656136.813117] Lustre: scratch1-OST000a-osc-ffff8873e8f93800: Connection restored to 10.186.224.24@o2ib (at 10.186.224.24@o2ib)
[1656136.826170] Lustre: Skipped 2 previous similar messages

How do you match the RealMemory setting to the real available memory in the system? Do they match exactly? If so, I would recommend reserving a bit more for the OS by decreasing RealMemory.

From the initial 'ps', note that the slurmd and slurmstepd processes are in D state, which probably means they are waiting on input/output:

F S UID   PID    PPID   C PRI NI ADDR SZ      WCHAN  STIME TTY TIME     CMD
1 D root  125742 1      0 80  0  -    1020759 mem_cg Sep13 ?   00:00:18 /apps/slurm/d/sbin/slurmd
...
1 D root  88290  125742 0 80  0  -    1020759 mem_cg Sep27 ?   00:00:00  \_ /apps/slurm/d/sbin/slurmd

So I am betting that you are having storage issues here, either because daemons are dying (probably because of OOMs) or because Lustre is failing at some point.

Another note: the first failure I see in the logs is at this line:

[496671.216158] epilog.root: Wed Sep 18 19:13:30 UTC 2019 Finished Job 75474
[496719.295392] Lustre: 7620:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1568834051/real 0] req@ffff886de0f4ed40 x1644525101596384/t0(0) o4->scratch1-OST0026-osc-ffff8873e8f93800@10.186.224.26@o2ib:6/4 lens 608/448 e 0 to 1 dl 1568834058 ref 3 fl Rpc:X/0/ffffffff rc 0/-1
[496719.324924] Lustre: scratch1-OST0026-osc-ffff8873e8f93800: Connection to scratch1-OST0026 (at 10.186.224.26@o2ib) was lost; in progress operations using this service will wait for recovery to complete
.....

Just after that, the 'unzip' command got stuck in D state:

[496940.418823] INFO: task unzip:55703 blocked for more than 120 seconds.
[496940.426272] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[496940.435067] unzip D ffff886aa1088640 0 55703 55680 0x00000000

Can you also provide me with 'sysctl -a' from a failing node?

Can you also enter the node, go to /sys/fs/cgroup/memory/slurm*/uid_*/job_*/ (for a job and respective user you see stuck in 'ps') and run:

pwd; find . -name memory.oom_control -exec echo '{}' \; -exec cat '{}' \;

Thanks

(In reply to Felip Moll from comment #17)
> Can you also provide me with 'sysctl -a' from a failing node?

Will attach the file.

> Can you also enter the node, go to /sys/fs/cgroup/memory/slurm*/uid_*/job_*/
> (for a job and respective user you see stuck in 'ps') and run:
>
> pwd; find . -name memory.oom_control -exec echo '{}' \; -exec cat '{}' \;
>
> Thanks

[root@h2c07 job_75246]# cd /sys/fs/cgroup/memory/slurm/uid_3014/job_106255
[root@h2c07 job_106255]# pwd; find . -name memory.oom_control -exec echo '{}' \; -exec cat '{}' \;
/sys/fs/cgroup/memory/slurm/uid_3014/job_106255
./step_batch/task_0/memory.oom_control
oom_kill_disable 0
under_oom 0
./step_batch/memory.oom_control
oom_kill_disable 0
under_oom 0
./step_extern/task_0/memory.oom_control
oom_kill_disable 0
under_oom 0
./step_extern/memory.oom_control
oom_kill_disable 0
under_oom 0
./memory.oom_control
oom_kill_disable 0
under_oom 0

Created attachment 11800 [details]
sysctl -a output on node
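Since the same memory.oom_control check may need to be repeated for other stuck jobs, here is a hedged one-liner that sweeps the job-level memory.oom_control file of every Slurm job cgroup on the node (cgroup v1 layout, matching the paths shown above; the loop itself is an assumption for illustration, not a command from this ticket):

# dump oom_kill_disable / under_oom for every Slurm job cgroup on this node
for f in /sys/fs/cgroup/memory/slurm*/uid_*/job_*/memory.oom_control; do echo "== $f"; cat "$f"; done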
(In reply to Anthony DelSorbo from comment #21)
> Created attachment 11800 [details]
> sysctl -a output on node

Everything seems correct, so please, can you modify RealMemory and MemSpecLimit to better values (RealMemory less than the physical memory on the node reported by free -m)? Can you also comment on comment 15 and comment 16?

Thanks

Anthony, there may be some relation between this bug and bug 7872. This:

Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug2: xcgroup_get_pids: unable to get pids of '(null)'

is the cause of the symptoms of 7872 and can be part of these abandoned slurmstepds. What is happening is that, for some reason, the freezer cgroup is set to 'null' in your proctrack plugin; the task plugin then tries to get the pids to kill the steps from this plugin, and since there's no cgroup path it fails to find the pids.

On a compute node: what does 'cat /apps/slurm/d/etc/slurmd.service' show? I want to see if systemd is involved in messing up your cgroups.

Show me also:

gdbus introspect --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop/systemd1/unit/slurmd_2eservice

and if it doesn't show anything, the full output of:

gdbus introspect --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop/systemd1 --recurse

After more digging, this seems to be just a consequence of trying to kill tasks in a cgroup where no more tasks remain:

Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug2: xcgroup_get_pids: unable to get pids of '(null)'

so this message is also a symptom, not a cause. It may have happened that an OOM killed some stepd that shouldn't have been killed. I encourage you to check my comments 22 and 23, apply the changes and then see what happens. I will wait for your feedback.

Hi Anthony,

Can you refer to my comments 22 and 23?

Thank you.

(In reply to Felip Moll from comment #25)
> Hi Anthony,
>
> Can you refer to my comments 22 and 23?
>
> Thank you.

Felip,

I'm having a hard time understanding why I need to change the values I currently have, and the ramifications thereof.

(In reply to Felip Moll from comment #22)
> (In reply to Anthony DelSorbo from comment #21)
> > Created attachment 11800 [details]
> > sysctl -a output on node
>
> Everything seems correct, so please, can you modify RealMemory and
> MemSpecLimit to better values (RealMemory less than the physical memory on
> the node reported by free -m)?

In our theia system (soon to be decommissioned) we have:

[root@t0268 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          64329        1522       62135         410         671       59828

The node definition in slurm.conf is:

NodeName=t[0001-1012,1027-1198] CPUs=24 CoresPerSocket=12 RealMemory=64000

and

[root@t0268 ~]# scontrol show node t0268 | egrep "RealMemory|MemSpecLimit|CfgTRES"
   RealMemory=64000 AllocMem=0 FreeMem=62133 Sockets=2 Boards=1
   MemSpecLimit=2000
   CfgTRES=cpu=24,mem=62.50G,billing=24

The DefMemPerCPU=2500 on partition theia. Multiply that by 24 CPUs and the total memory available to a job is 60000. I'm not experiencing issues on that system - but it is running 18.08.

On the new hera system, I have:

[root@h6c34 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          95316        1741       90979         757        2595       91669

The node definition in slurm.conf is:

NodeName=h1c[01-52],h[2-5]c[01-56],...,h[25]c[01-52] CPUs=40 CoresPerSocket=20 RealMemory=95000

and

[root@h6c34 ~]# scontrol show node h6c34 | egrep "RealMemory|MemSpecLimit|CfgTRES"
   RealMemory=95000 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   MemSpecLimit=2000
   CfgTRES=cpu=40,mem=95000M,billing=40

The DefMemPerCPU=2300 on partition hera, times 40 CPUs, is 92000. This system is currently running 19.05.1 and tomorrow it will be 19.05.3. The configuration is nearly identical.

So according to my aggregate DefMemPerCPU I'm already 3GB below the RealMemory value of 95000. If I understand you correctly, I should set my RealMemory to less than 90979, so let's say 90000. Then what do I set MemSpecLimit to, and how does that affect DefMemPerCPU? Doesn't that imply reducing the total amount of memory available to the user as well?

Tony.

> Felip, I'm having a hard time understanding why I need to change the values
> I currently have, and the ramifications thereof.

I wanted to set this to safe values to leave the OS enough memory for I/O operations. If you set RealMemory very close to the available system memory, then jobs can allocate all of this memory and not leave any for other processes, which can lead to OOMs on the OS side. That may not be the reason you're seeing the defunct processes, but setting safer values will remove one piece of the problem.

> In our theia system (soon to be decommissioned) we have:
>
> [root@t0268 ~]# free -m
>               total        used        free      shared  buff/cache   available
> Mem:          64329        1522       62135         410         671       59828
>
> The node definition in slurm.conf is:
>
> NodeName=t[0001-1012,1027-1198] CPUs=24 CoresPerSocket=12 RealMemory=64000

Jobs are able to allocate up to 64000MB, leaving only 329MB for the OS. If I run "srun --mem=64000 -N1 ./my_mem_eater" I may trigger an OOM on the OS side.

> I'm not experiencing issues on that system - but it is running 18.08. On
> the new hera system, I have:
>
> [root@h6c34 ~]# free -m
>               total        used        free      shared  buff/cache   available
> Mem:          95316        1741       90979         757        2595       91669
>
> The node definition in slurm.conf is:
>
> NodeName=h1c[01-52],h[2-5]c[01-56],...,h[25]c[01-52] CPUs=40
> CoresPerSocket=20 RealMemory=95000

The same applies in this case if I run 'srun --mem=95000 ...'.

> The DefMemPerCPU=2300 on partition hera, times 40 CPUs, is 92000.

DefMemPerCPU can be overridden by the user's command line with --mem or --mem-per-cpu. It is not a guarantee that jobs won't use more memory.

> So according to my aggregate DefMemPerCPU I'm already 3GB below the
> RealMemory value of 95000.

If users don't specify --mem*, yes. Otherwise they can consume 95000MB per node.

> If I understand you correctly, I should set my RealMemory to less than
> 90979, so let's say 90000.

No, to 95316-[safe_amount_for_OS]. If you consider that Lustre, I/O, system daemons and others will consume, for example, a maximum of 5GB, then I would set RealMemory to, say, 90000MB, leaving 5316MB for the OS.

> Then what do I set MemSpecLimit to, and how does
> that affect DefMemPerCPU?

MemSpecLimit=<amount> will subtract 'amount' from what's available for job allocations. I.e. if you set RealMemory=90000 and MemSpecLimit=2000, then the memory allocatable for jobs will be 88000MB, and 2000MB will be set as the limit for the slurmd cgroup, so the slurmd daemon won't exceed 2000MB of usage.

> Doesn't that imply reducing the total amount of
> memory available to the user as well?

Yes.

Added to this information, and returning to the error seen in dmesg_h2c07.log, I still have unresolved questions from comment 15:

1. ISi.ml60.x: what is that command? Is it a specific job command, or something related to storage?
2. That command was killed due to out of memory in the job's cgroup, followed by an issue in the Lustre storage. I was wondering if killing this command plus the later Lustre I/O issues had something to do with your defunct slurmd processes.
3. Comment 16 refers to unzip being in 'D' state after the Lustre issues. I wonder if this has to do with the issue too.
4. I am still thinking about bug 7872 being a duplicate of this one.

[668484.785112] Memory cgroup out of memory: Kill process 165028 (ISi.ml60.x) score 40 or sacrifice child
[668484.794886] Killed process 165028 (ISi.ml60.x) total-vm:738984kB, anon-rss:208648kB, file-rss:2384kB, shmem-rss:1284kB
[668484.863204] ISi.ml60.x invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[668484.871516] ISi.ml60.x cpuset=/ mems_allowed=0-1
[668484.876797] CPU: 29 PID: 165044 Comm: ISi.ml60.x Tainted: P OE ------------ 3.10.0-957.21.3.el7.x86_64 #1
[668484.888029] Hardware name: Cray Inc. S2600BPB/S2600BPB, BIOS SE5C620.86B.00.01.0016.C0001.020120190930 02/01/2019
[668484.898813] Call Trace:
[668484.901804] [<ffffffffac363107>] dump_stack+0x19/0x1b
[668484.907458] [<ffffffffac35db2a>] dump_header+0x90/0x229
[668484.913273] [<ffffffffabcd6b72>] ? default_wake_function+0x12/0x20
[668484.920037] [<ffffffffabdba386>] ? find_lock_task_mm+0x56/0xc0
[668484.926430] [<ffffffffabe317a8>] ? try_get_mem_cgroup_from_mm+0x28/0x60
[668484.933585] [<ffffffffabdba834>] oom_kill_process+0x254/0x3d0
[668484.939866] [<ffffffffabe35586>] mem_cgroup_oom_synchronize+0x546/0x570
[668484.947002] [<ffffffffabe34a00>] ? mem_cgroup_charge_common+0xc0/0xc0
[668484.953977] [<ffffffffabdbb0c4>] pagefault_out_of_memory+0x14/0x90
[668484.960694] [<ffffffffac35c032>] mm_fault_error+0x6a/0x157
[668484.966711] [<ffffffffac3707c8>] __do_page_fault+0x3c8/0x4f0
[668484.972903] [<ffffffffac370925>] do_page_fault+0x35/0x90
[668484.978759] [<ffffffffac36c768>] page_fault+0x28/0x30
[668484.984349] Task in /slurm/uid_16039/job_89098/step_0/task_12 killed as a result of limit of /slurm/uid_16039/job_89098
[668484.995608] memory: usage 5242880kB, limit 5242880kB, failcnt 155852
[668485.002444] memory+swap: usage 5242880kB, limit 5242880kB, failcnt 694
[668485.009453] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0

Actions on your side:

a. Respond to, or make observations on, questions 1, 2, 3 and 4.
b. Decide whether to set safe values for RealMemory and optionally MemSpecLimit.
c. Check for Lustre issues.
d. Let me know if it is still happening frequently.

Hi, do you have any feedback on my last comment?

Thanks!

(In reply to Felip Moll from comment #28)
> Hi, do you have any feedback on my last comment?
>
> Thanks!

Hi Anthony,

Since I haven't received any feedback in the last month, I am going to close this issue as INFOGIVEN. If you consider it still unresolved, please mark the bug as OPEN again and provide the information requested in my latest comments.

Thanks for your understanding!
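As a closing reference for the RealMemory sizing discussed in this thread, here is a small sketch that derives a candidate value from free -m while reserving memory for the OS; the 5120MB reserve mirrors the ~5GB example given above and is an assumption for illustration, not a recommendation for this site:

# print total MB minus a 5120MB OS reserve as a candidate RealMemory value
free -m | awk '/^Mem:/ {printf "RealMemory=%d\n", $2 - 5120}'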