Description
Ramy Adly
2020-11-04 03:46:29 MST
Ramy, can you provide me with the full slurmd log? Also, are you starting slurmd with systemd? If so, can you paste here the output of:

systemctl cat slurmd

Thanks.

Created attachment 16532 [details]
slurmd log file
Hello Felip,

I have attached the full slurmd log. Also, here is the requested output:

# systemctl cat slurmd
# /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#ConditionPathExists=/etc/slurm//slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/opt/slurm/install/slurm-20.11.0-0rc1-43efb7f754a4b1032dd2641ae45efcd5f0000656-CentOS-7.8.2003-MLNX/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes

[Install]
WantedBy=multi-user.target

---------------------------------
# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-11-04 10:30:06 +03; 1 day 16h ago
 Main PID: 52021 (slurmd)
   CGroup: /system.slice/slurmd.service
           └─52021 /opt/slurm/install/slurm-20.11.0-0rc1-43efb7f754a4b1032dd2641ae45efcd5f0000656-CentOS-7.8.2003-MLNX/sbin/slurmd -D

Nov 04 10:30:06 cn110-23-l systemd[1]: Started Slurm node daemon.
--------------------------------------------------------------

Thanks!

Regards,
Ramy

(In reply to Ramy Adly from comment #8)

While I am looking into that: do you start slurmd with systemd? If so, can we do a test? Can you start slurmd manually (as root) instead of with systemd, understanding, as you said, that the issue is always reproducible?

I am reproducing the issue. No need for you to provide any more info. Will keep you posted. Thanks.

There are two issues in your original description.

The first is:

slurmstepd: error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114467/step_0/memory.force_empty: Device or resource busy

This error is new and only caught since 20.11, so I provided a fix and it will be included as of 20.11.0rc2; see commit a0181c789061508.

The second is:

[2020-11-04T13:35:58.919] [114467.extern] error: Detected 1 oom-kill event(s) in StepId=114467.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

which is being addressed in bug 9737.
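For context, the failing write above is a single byte sent to the step's memory.force_empty file while the step cgroup is being cleaned up, and the same operation can be exercised by hand to observe the EBUSY behaviour. The following is only a sketch: it assumes cgroup v1 with the memory controller mounted at the usual path, and the uid/job/step directory names (copied from the log) must be replaced with a step that still exists on the node.

# Sketch only: paths are illustrative; run on the compute node as root.
cd /sys/fs/cgroup/memory/slurm/uid_170337/job_114467/step_0

# The write slurmstepd attempts: it asks the kernel to reclaim the cgroup's memory,
# and can fail with "Device or resource busy" while tasks remain attached (the error in the log).
echo 1 > memory.force_empty

# Related state worth capturing alongside the error:
cat memory.limit_in_bytes    # memory limit applied to the step
cat memory.oom_control       # oom_kill_disable / under_oom flags
cat tasks                    # PIDs still attached to the step cgroup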
Does this second one happen all the time? Can you still do a test and start slurmd completely outside of systemd? Thanks.

Starting slurmd outside of systemd:

/opt/slurm/install/slurm-20.11.0-0rc1-43efb7f754a4b1032dd2641ae45efcd5f0000656-CentOS-7.8.2003-MLNX/sbin/slurmd -D

The issue seems to persist. Here is the slurmd log:

[2020-11-08T10:20:28.240] [114478.0] error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114478/step_0/memory.force_empty: Device or resource busy
[2020-11-08T10:20:28.299] [114478.0] done with job
[2020-11-08T10:20:28.395] [114478.extern] error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114478/step_extern/memory.force_empty: Device or resource busy
[2020-11-08T10:20:28.418] [114478.extern] task/cgroup: _oom_event_monitor: oom-kill event count: 1
[2020-11-08T10:20:28.439] [114478.extern] error: Detected 1 oom-kill event(s) in StepId=114478.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2020-11-08T10:20:28.450] [114478.extern] done with job
[2020-11-08T10:20:28.537] private-tmpdir: removed /local/tmp//114478.0 (4 files) in 0.000264 seconds

Thank you,
Ramy

(In reply to Ramy Adly from comment #18)

Can you upgrade to the latest master branch and check whether the oom-kill event error still happens? What's your exact kernel version? (uname -a)

Hi Felip.

Can you provide a commit # to pull against? Is it 158408a207285721d809e86ff66a52645ece0167 ?

-greg

(In reply to Greg Wickham from comment #20)

That's the last one, so it is good. I was specifically referring to a0181c789061508, but go with the last instead.

Hello Felip,

We have updated Slurm to 158408a207285721d809e86ff66a52645ece0167. The kernel version is 3.10.0-1062.12.1.el7.x86_64.

The cgroup error has indeed been resolved. However, the OOM error is still occurring:

[2020-11-11T09:41:44.916] [114565.0] done with job
[2020-11-11T09:41:44.960] [114565.extern] task/cgroup: _oom_event_monitor: oom-kill event count: 1
[2020-11-11T09:41:44.983] [114565.extern] error: Detected 1 oom-kill event(s) in StepId=114565.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2020-11-11T09:41:44.993] [114565.extern] done with job
[2020-11-11T09:41:45.031] private-tmpdir: removed /local/tmp//114565.0 (4 files) in 0.000201 seconds

$ sacct -j 114565
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
114565         hostname      batch    default          1  COMPLETED      0:0
114565.exte+     extern               default          1 OUT_OF_ME+    0:125
114565.0       hostname               default          1  COMPLETED      0:0

Regards,
Ramy
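As an aside, since only the extern step is flagged OUT_OF_MEMORY while the job itself completes, it can help to check how much memory each step actually used. A sketch of such a query, using the job ID from the log above (MaxRSS and ReqMem are standard sacct format fields):

# Sketch: per-step memory usage alongside the reported state,
# to see whether the extern step ever approached its memory limit.
sacct -j 114565 --format=JobID,JobName,State,ExitCode,MaxRSS,ReqMem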
(In reply to Ramy Adly from comment #22)

Okay, good. We're working on the '1 oom-kill event(s)' error; it is not unique to you. I will let you know when we find the exact cause.

Just a quick update: I've found an issue, but it may not be directly related. The issue is in the cgroups notify mechanism. When a process in a cgroup hierarchy is OOMed, an event is created and broadcast to all listeners of events in the hierarchy. Since every step is a listener, every step receives OOM events from other cgroups and adds them to its counter.

Since you're only receiving an OOM on the extern cgroup, let's do a test, if it is always reproducible:

1. Run an 'salloc' job on a specific node.
2. Go to /sys/fs/cgroup/memory/slurm/uid_xx/job_yy/step_extern/ and run:
   cat memory.limit_in_bytes
   cat memory.oom_control
   Send me this information.
3. Exit salloc and check that the OOM is registered in sacct.
4. Send me a 'dmesg -T' from the node, and tell me which kernel version your nodes are running.

This issue is the same as bug 10255. If you don't mind, I will put you in CC on the other bug, where more people are affected, and will keep track of things there. So, ignore my last comments.
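For reference, the inspection in steps 2 and 4 above can be scripted roughly as follows. This is only a sketch: uid_xx and job_yy are placeholders for the real IDs of the salloc test job, it assumes cgroup v1, and the oom_kill counter inside memory.oom_control only exists on newer kernels.

# Sketch: collect the cgroup and kernel-side data requested above (run as root on the node).
cg=/sys/fs/cgroup/memory/slurm/uid_xx/job_yy/step_extern

cat "$cg/memory.limit_in_bytes"   # memory limit applied to the extern step
cat "$cg/memory.oom_control"      # oom_kill_disable / under_oom (plus oom_kill count on newer kernels)

# A real OOM kill leaves a trace in the kernel log; a spurious counter bump does not.
dmesg -T | grep -iE 'out of memory|killed process|oom-killer' | tail -n 20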
I reproduced this on my CentOS 7 system and I am investigating the cause in the other bug:

[slurm@moll0 inst]$ sbatch --wrap "sleep 10"
Submitted batch job 34

[slurm@moll0 inst]$ squeue
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
       34     debug     wrap    slurm  R       0:01      1 moll1

[slurm@moll0 inst]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
34                 wrap      debug      slurm          1  COMPLETED      0:0
34.batch          batch                 slurm          1  COMPLETED      0:0
34.extern        extern                 slurm          1 OUT_OF_ME+    0:125

[slurm@moll0 inst]$ uname -a
Linux moll0 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

[slurm@moll0 inst]$ cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)

I am closing this one now.

*** This ticket has been marked as a duplicate of ticket 10255 ***