Hello,

While testing Slurm 20.11.0-0rc1, we have encountered an odd behavior. Although jobs finish successfully, Slurm always produces the following error:

$ srun --time=1:00 --mem=100m hostname
...
cn110-23-l
slurmstepd: error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114467/step_0/memory.force_empty: Device or resource busy

From the slurmd logs, it looks like Slurm always detects an OOM error while terminating any job:

[2020-11-04T13:35:58.800] [114467.0] error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114467/step_0/memory.force_empty: Device or resource busy
[2020-11-04T13:35:58.839] [114467.0] done with job
[2020-11-04T13:35:58.885] [114467.extern] error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114467/step_extern/memory.force_empty: Device or resource busy
[2020-11-04T13:35:58.905] [114467.extern] task/cgroup: _oom_event_monitor: oom-kill event count: 1
[2020-11-04T13:35:58.919] [114467.extern] error: Detected 1 oom-kill event(s) in StepId=114467.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2020-11-04T13:35:58.929] [114467.extern] done with job

From sacct's point of view, the job steps complete successfully, but the .extern step does not:

$ sacct -j 114467
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
114467         hostname      batch    default          1  COMPLETED      0:0
114467.exte+     extern               default          1 OUT_OF_ME+    0:125
114467.0       hostname               default          1  COMPLETED      0:0

The same behavior is encountered with sbatch as well. Please advise.

Regards,
Ramy
Ramy,

Can you provide me with the full slurmd log? Also, are you starting slurmd with systemd? If so, can you paste here the output of:

systemctl cat slurmd

Thanks
Created attachment 16532 [details]
slurmd log file
Hello Felip,

I have attached the full slurmd log. Also, here is the requested output:

# systemctl cat slurmd
# /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#ConditionPathExists=/etc/slurm//slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/opt/slurm/install/slurm-20.11.0-0rc1-43efb7f754a4b1032dd2641ae45efcd5f0000656-CentOS-7.8.2003-MLNX/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes

[Install]
WantedBy=multi-user.target

---------------------------------
# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-11-04 10:30:06 +03; 1 day 16h ago
 Main PID: 52021 (slurmd)
   CGroup: /system.slice/slurmd.service
           └─52021 /opt/slurm/install/slurm-20.11.0-0rc1-43efb7f754a4b1032dd2641ae45efcd5f0000656-CentOS-7.8.2003-MLNX/sbin/slurmd -D

Nov 04 10:30:06 cn110-23-l systemd[1]: Started Slurm node daemon.
--------------------------------------------------------------

Thanks!

Regards,
Ramy
(In reply to Ramy Adly from comment #8)

While I am looking into that: do you start slurmd with systemd? If so, can we do a test? Can you start slurmd manually (as root) instead of with systemd, understanding, as you said, that the issue is always reproducible?
I am able to reproduce the issue, so there is no need for you to provide any more info. Will keep you posted.

Thanks
There are two issues in your original description.

The first is:

slurmstepd: error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114467/step_0/memory.force_empty: Device or resource busy

This error is new and only caught since 20.11, so I provided a fix for the bug; it will be included as of 20.11.0rc2, see commit a0181c789061508.

The second is:

[2020-11-04T13:35:58.919] [114467.extern] error: Detected 1 oom-kill event(s) in StepId=114467.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

which is being addressed in bug 9737. Does this one happen all the time? Can you still do a test and start slurmd completely outside of systemd?

Thanks
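For context, here is a minimal sketch of what the failing operation amounts to at the syscall level: writing "1" to a step's memory.force_empty file asks the kernel to reclaim the cgroup's pages, and that write can come back with EBUSY, which a caller may choose to treat as non-fatal. This is only an illustration, not the code from the commit above; the cgroup path and helper name are hypothetical.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative helper: request reclaim and tolerate EBUSY. */
static int force_empty_best_effort(const char *path)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    ssize_t rc = write(fd, "1", 1);
    int saved_errno = errno;
    close(fd);

    if (rc == 1)
        return 0;                      /* reclaim requested */

    if (saved_errno == EBUSY) {
        /* The kernel can refuse while the cgroup still has tasks or
         * child cgroups attached; the hierarchy is removed later anyway. */
        return 0;
    }

    errno = saved_errno;
    perror("write memory.force_empty");
    return -1;
}

int main(void)
{
    /* Hypothetical step cgroup path, for illustration only. */
    return force_empty_best_effort(
        "/sys/fs/cgroup/memory/slurm/uid_1000/job_1/step_0/memory.force_empty");
}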
Starting slurmd outside of systemd:

/opt/slurm/install/slurm-20.11.0-0rc1-43efb7f754a4b1032dd2641ae45efcd5f0000656-CentOS-7.8.2003-MLNX/sbin/slurmd -D

The issue seems to persist. Here is the slurmd log:

[2020-11-08T10:20:28.240] [114478.0] error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114478/step_0/memory.force_empty: Device or resource busy
[2020-11-08T10:20:28.299] [114478.0] done with job
[2020-11-08T10:20:28.395] [114478.extern] error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114478/step_extern/memory.force_empty: Device or resource busy
[2020-11-08T10:20:28.418] [114478.extern] task/cgroup: _oom_event_monitor: oom-kill event count: 1
[2020-11-08T10:20:28.439] [114478.extern] error: Detected 1 oom-kill event(s) in StepId=114478.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2020-11-08T10:20:28.450] [114478.extern] done with job
[2020-11-08T10:20:28.537] private-tmpdir: removed /local/tmp//114478.0 (4 files) in 0.000264 seconds

Thank you,
Ramy
(In reply to Ramy Adly from comment #18)

Can you upgrade to the latest master branch and check whether the oom-kill event error still happens?

What's your exact kernel version? (uname -a)
Hi Felip,

Can you provide a commit # to pull against? Is it 158408a207285721d809e86ff66a52645ece0167?

-greg
(In reply to Greg Wickham from comment #20)
> Is it 158408a207285721d809e86ff66a52645ece0167 ?

That's the latest one, so it is good. I was specifically referring to a0181c789061508, but go with the latest instead.
Hello Felip,

We have updated Slurm to 158408a207285721d809e86ff66a52645ece0167.

The kernel version is 3.10.0-1062.12.1.el7.x86_64.

The cgroup error has indeed been resolved. However, the OOM error is still occurring:

[2020-11-11T09:41:44.916] [114565.0] done with job
[2020-11-11T09:41:44.960] [114565.extern] task/cgroup: _oom_event_monitor: oom-kill event count: 1
[2020-11-11T09:41:44.983] [114565.extern] error: Detected 1 oom-kill event(s) in StepId=114565.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2020-11-11T09:41:44.993] [114565.extern] done with job
[2020-11-11T09:41:45.031] private-tmpdir: removed /local/tmp//114565.0 (4 files) in 0.000201 seconds

$ sacct -j 114565
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
114565         hostname      batch    default          1  COMPLETED      0:0
114565.exte+     extern               default          1 OUT_OF_ME+    0:125
114565.0       hostname               default          1  COMPLETED      0:0

Regards,
Ramy
(In reply to Ramy Adly from comment #22)

Okay, good. We're working on the '1 oom-kill event(s)' error; it is not unique to you. I will let you know when we have found the exact cause.
Just a quick update: I've found an issue, but it may not be directly related. The issue is in the cgroup notify mechanism: when a process in a cgroup hierarchy is OOMed, an event is created and broadcast to all listeners of events in the hierarchy. Since every step is a listener, every step receives OOM events from other cgroups and adds them to its counter (a sketch of that listener mechanism is at the end of this comment).

Since you're only receiving an OOM on the extern cgroup, let's do a test, given that the issue is always reproducible:

1. Run an 'salloc' job on a specific node.

2. Go to /sys/fs/cgroup/memory/slurm/uid_xx/job_yy/step_extern/ and run:

   cat memory.limit_in_bytes
   cat memory.oom_control

   Send me this information.

3. Exit salloc and check whether the OOM is registered in sacct.

4. Send me the output of 'dmesg -T' from the node.

Also, tell me which kernel version you have on your nodes.
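For reference, below is a minimal sketch of the cgroup v1 OOM notification mechanism described above: a listener registers an eventfd against a cgroup's memory.oom_control via cgroup.event_control and then reads event counts from the eventfd. This is only an illustration of the kernel interface, not Slurm's actual _oom_event_monitor code, and the cgroup path is hypothetical.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical extern-step cgroup path, for illustration only. */
    const char *cg = "/sys/fs/cgroup/memory/slurm/uid_1000/job_1/step_extern";
    char path[512], reg[64];

    /* Open the control file we want OOM notifications for. */
    snprintf(path, sizeof(path), "%s/memory.oom_control", cg);
    int oom_fd = open(path, O_RDONLY);

    /* The eventfd the kernel will signal on each OOM event. */
    int efd = eventfd(0, 0);

    /* Register the listener by writing "<eventfd> <target fd>" to
     * cgroup.event_control. */
    snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
    int ctl_fd = open(path, O_WRONLY);
    snprintf(reg, sizeof(reg), "%d %d", efd, oom_fd);
    if (oom_fd < 0 || efd < 0 || ctl_fd < 0 ||
        write(ctl_fd, reg, strlen(reg)) < 0) {
        perror("register oom listener");
        return 1;
    }

    /* Each read blocks until at least one event has fired and returns the
     * number of events accumulated since the previous read; the question
     * in this ticket is why the extern step's listener ends up with a
     * count of 1 for jobs that never ran out of memory. */
    uint64_t count;
    if (read(efd, &count, sizeof(count)) == sizeof(count))
        printf("oom-kill event count: %llu\n", (unsigned long long)count);

    close(ctl_fd);
    close(oom_fd);
    close(efd);
    return 0;
}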
This issue is the same as bug 10255; if you don't mind, I am putting you in CC on the other bug, where more people are affected, and I will keep track of things there. So, ignore my last comments: I have reproduced this on my CentOS 7 machine and am investigating the cause in the other bug:

[slurm@moll0 inst]$ sbatch --wrap "sleep 10"
Submitted batch job 34
[slurm@moll0 inst]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                34     debug     wrap    slurm  R       0:01      1 moll1
[slurm@moll0 inst]$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
34                 wrap      debug      slurm          1  COMPLETED      0:0
34.batch          batch                 slurm          1  COMPLETED      0:0
34.extern        extern                 slurm          1 OUT_OF_ME+    0:125
[slurm@moll0 inst]$ uname -a
Linux moll0 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[slurm@moll0 inst]$ cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)

I am closing this one now.

*** This ticket has been marked as a duplicate of ticket 10255 ***