Description
Ramy Adly
2020-11-04 03:46:29 MST
Ramy, can you provide me with the full slurmd log? Also, are you starting slurmd with systemd? If so, can you paste here the output of:

systemctl cat slurmd

Thanks.

Created attachment 16532 [details]
slurmd log file
Hello Felip,

I have attached the full slurmd log. Also, here is the requested output:

# systemctl cat slurmd
# /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#ConditionPathExists=/etc/slurm//slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/opt/slurm/install/slurm-20.11.0-0rc1-43efb7f754a4b1032dd2641ae45efcd5f0000656-CentOS-7.8.2003-MLNX/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes

[Install]
WantedBy=multi-user.target

---------------------------------
# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-11-04 10:30:06 +03; 1 day 16h ago
 Main PID: 52021 (slurmd)
   CGroup: /system.slice/slurmd.service
           └─52021 /opt/slurm/install/slurm-20.11.0-0rc1-43efb7f754a4b1032dd2641ae45efcd5f0000656-CentOS-7.8.2003-MLNX/sbin/slurmd -D

Nov 04 10:30:06 cn110-23-l systemd[1]: Started Slurm node daemon.
--------------------------------------------------------------

Thanks!

Regards,
Ramy

(In reply to Ramy Adly from comment #8)

While I am looking into that: do you start slurmd with systemd? If so, can we do a test? Can you start slurmd manually (as root) instead of with systemd, understanding, as you said, that the issue is always reproducible?

I am reproducing the issue. No need for you to provide any more info. Will keep you posted. Thanks.

There are two issues in your original description.

The first is:

slurmstepd: error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114467/step_0/memory.force_empty: Device or resource busy

This error is new and only caught since 20.11, so I provided a fix and it will be included as of 20.11.0rc2; see commit a0181c789061508.

The second is:

[2020-11-04T13:35:58.919] [114467.extern] error: Detected 1 oom-kill event(s) in StepId=114467.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

which is being addressed in bug 9737.
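For context, the failing write above is a single byte sent to the step's memory.force_empty file while the step cgroup is being cleaned up, and the same operation can be exercised by hand to observe the EBUSY behaviour. The following is only a sketch: it assumes cgroup v1 with the memory controller mounted at the usual path, and the uid/job/step directory names (copied from the log) must be replaced with a step that still exists on the node.

# Sketch only: paths are illustrative; run on the compute node as root.
cd /sys/fs/cgroup/memory/slurm/uid_170337/job_114467/step_0

# The write slurmstepd attempts: it asks the kernel to reclaim the cgroup's memory,
# and can fail with "Device or resource busy" while tasks remain attached (the error in the log).
echo 1 > memory.force_empty

# Related state worth capturing alongside the error:
cat memory.limit_in_bytes    # memory limit applied to the step
cat memory.oom_control       # oom_kill_disable / under_oom flags
cat tasks                    # PIDs still attached to the step cgroup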
Does this second one happen all the time? Can you still do a test and start slurmd completely outside of systemd? Thanks.

Starting slurmd outside of systemd:

/opt/slurm/install/slurm-20.11.0-0rc1-43efb7f754a4b1032dd2641ae45efcd5f0000656-CentOS-7.8.2003-MLNX/sbin/slurmd -D

The issue seems to persist. Here is the slurmd log:

[2020-11-08T10:20:28.240] [114478.0] error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114478/step_0/memory.force_empty: Device or resource busy
[2020-11-08T10:20:28.299] [114478.0] done with job
[2020-11-08T10:20:28.395] [114478.extern] error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114478/step_extern/memory.force_empty: Device or resource busy
[2020-11-08T10:20:28.418] [114478.extern] task/cgroup: _oom_event_monitor: oom-kill event count: 1
[2020-11-08T10:20:28.439] [114478.extern] error: Detected 1 oom-kill event(s) in StepId=114478.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2020-11-08T10:20:28.450] [114478.extern] done with job
[2020-11-08T10:20:28.537] private-tmpdir: removed /local/tmp//114478.0 (4 files) in 0.000264 seconds

Thank you,
Ramy

(In reply to Ramy Adly from comment #18)

Can you upgrade to the latest master branch and check whether the oom-kill event error still happens? What's your exact kernel version? (uname -a)

Hi Felip.

Can you provide a commit # to pull against? Is it 158408a207285721d809e86ff66a52645ece0167 ?

-greg

(In reply to Greg Wickham from comment #20)

That's the last one, so it is good. I was specifically referring to a0181c789061508, but go with the last instead.

Hello Felip,

We have updated Slurm to 158408a207285721d809e86ff66a52645ece0167. The kernel version is 3.10.0-1062.12.1.el7.x86_64.

The cgroup error has indeed been resolved. However, the OOM error is still occurring:

[2020-11-11T09:41:44.916] [114565.0] done with job
[2020-11-11T09:41:44.960] [114565.extern] task/cgroup: _oom_event_monitor: oom-kill event count: 1
[2020-11-11T09:41:44.983] [114565.extern] error: Detected 1 oom-kill event(s) in StepId=114565.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2020-11-11T09:41:44.993] [114565.extern] done with job
[2020-11-11T09:41:45.031] private-tmpdir: removed /local/tmp//114565.0 (4 files) in 0.000201 seconds

$ sacct -j 114565
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
114565         hostname      batch    default          1  COMPLETED      0:0
114565.exte+     extern               default          1 OUT_OF_ME+    0:125
114565.0       hostname               default          1  COMPLETED      0:0

Regards,
Ramy
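As an aside, since only the extern step is flagged OUT_OF_MEMORY while the job itself completes, it can help to check how much memory each step actually used. A sketch of such a query, using the job ID from the log above (MaxRSS and ReqMem are standard sacct format fields):

# Sketch: per-step memory usage alongside the reported state,
# to see whether the extern step ever approached its memory limit.
sacct -j 114565 --format=JobID,JobName,State,ExitCode,MaxRSS,ReqMem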
(In reply to Ramy Adly from comment #22)

Okay, good. We're working on the '1 oom-kill event(s)' error; it is not unique to you. I will let you know when we find the exact cause.

Just a quick update: I've found an issue, but it may not be directly related. The issue is in the cgroups notify mechanism. When a process in a cgroup hierarchy is OOMed, an event is created and broadcast to all listeners of events in the hierarchy. Since every step is a listener, every step receives OOM events from other cgroups and adds them to its counter.

Since you're only receiving an OOM on the extern cgroup, let's do a test, if it is always reproducible:

1. Run an 'salloc' job on a specific node.
2. Go to /sys/fs/cgroup/memory/slurm/uid_xx/job_yy/step_extern/ and run:
   cat memory.limit_in_bytes
   cat memory.oom_control
   Send me this information.
3. Exit salloc and check that the OOM is registered in sacct.
4. Send me a 'dmesg -T' from the node, and tell me which kernel version your nodes are running.

This issue is the same as bug 10255. If you don't mind, I will put you in CC on the other bug, where more people are affected, and will keep track of things there. So, ignore my last comments.
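For reference, the inspection in steps 2 and 4 above can be scripted roughly as follows. This is only a sketch: uid_xx and job_yy are placeholders for the real IDs of the salloc test job, it assumes cgroup v1, and the oom_kill counter inside memory.oom_control only exists on newer kernels.

# Sketch: collect the cgroup and kernel-side data requested above (run as root on the node).
cg=/sys/fs/cgroup/memory/slurm/uid_xx/job_yy/step_extern

cat "$cg/memory.limit_in_bytes"   # memory limit applied to the extern step
cat "$cg/memory.oom_control"      # oom_kill_disable / under_oom (plus oom_kill count on newer kernels)

# A real OOM kill leaves a trace in the kernel log; a spurious counter bump does not.
dmesg -T | grep -iE 'out of memory|killed process|oom-killer' | tail -n 20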
I reproduced this on my CentOS 7 system and I am investigating the cause in the other bug:

[slurm@moll0 inst]$ sbatch --wrap "sleep 10"
Submitted batch job 34

[slurm@moll0 inst]$ squeue
    JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
       34     debug     wrap    slurm  R       0:01      1 moll1

[slurm@moll0 inst]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
34                 wrap      debug      slurm          1  COMPLETED      0:0
34.batch          batch                 slurm          1  COMPLETED      0:0
34.extern        extern                 slurm          1 OUT_OF_ME+    0:125

[slurm@moll0 inst]$ uname -a
Linux moll0 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

[slurm@moll0 inst]$ cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)

I am closing this one now.

*** This ticket has been marked as a duplicate of ticket 10255 ***