Ticket 10122 - jobs always trigger "unable to write 1 bytes to cgroup" error
Summary: jobs always trigger "unable to write 1 bytes to cgroup" error
Status: RESOLVED DUPLICATE of ticket 10255
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 20.11.x
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-11-04 03:46 MST by Ramy Adly
Modified: 2020-11-26 09:51 MST
3 users

See Also:
Site: KAUST


Attachments
slurmd log file (9.34 KB, text/x-log)
2020-11-05 17:10 MST, Ramy Adly

Description Ramy Adly 2020-11-04 03:46:29 MST
Hello,

While testing Slurm 20.11.0-0rc1, we have encountered some odd behavior.

Although jobs finish successfully, Slurm always produces the following error:

$ srun  --time=1:00   --mem=100m  hostname 
...
cn110-23-l
slurmstepd: error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114467/step_0/memory.force_empty: Device or resource busy


From the slurmd logs, it looks like Slurm always detects an OOM event while terminating any job:

[2020-11-04T13:35:58.800] [114467.0] error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114467/step_0/memory.force_empty: Device or resource busy
[2020-11-04T13:35:58.839] [114467.0] done with job
[2020-11-04T13:35:58.885] [114467.extern] error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114467/step_extern/memory.force_empty: Device or resource busy
[2020-11-04T13:35:58.905] [114467.extern] task/cgroup: _oom_event_monitor: oom-kill event count: 1
[2020-11-04T13:35:58.919] [114467.extern] error: Detected 1 oom-kill event(s) in StepId=114467.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2020-11-04T13:35:58.929] [114467.extern] done with job



From sacct's point of view, the job steps complete successfully, but the .extern step does not:
$ sacct -j 114467
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
114467         hostname      batch    default          1  COMPLETED      0:0 
114467.exte+     extern               default          1 OUT_OF_ME+    0:125 
114467.0       hostname               default          1  COMPLETED      0:0 



The same behavior is encountered with sbatch as well.
Please advise.


Regards,
Ramy
Comment 6 Felip Moll 2020-11-05 11:22:23 MST
Ramy,

Can you provide me with the full slurmd log?

Also, are you starting slurmd with systemd? If so, can you paste here the output of the following?

systemctl cat slurmd

Thanks
Comment 7 Ramy Adly 2020-11-05 17:10:51 MST
Created attachment 16532
slurmd log file
Comment 8 Ramy Adly 2020-11-05 17:11:26 MST
Hello Felip,


I have attached the full slurmd log.
Also, here is the requested output:
# systemctl cat slurmd
# /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#ConditionPathExists=/etc/slurm//slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/opt/slurm/install/slurm-20.11.0-0rc1-43efb7f754a4b1032dd2641ae45efcd5f0000656-CentOS-7.8.2003-MLNX/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes


[Install]
WantedBy=multi-user.target



---------------------------------
# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-11-04 10:30:06 +03; 1 day 16h ago
 Main PID: 52021 (slurmd)
   CGroup: /system.slice/slurmd.service
           └─52021 /opt/slurm/install/slurm-20.11.0-0rc1-43efb7f754a4b1032dd2641ae45efcd5f0000656-CentOS-7.8.2003-MLNX/sbin/slurmd -D

Nov 04 10:30:06 cn110-23-l systemd[1]: Started Slurm node daemon.
--------------------------------------------------------------

Thanks!


Regards,
Ramy
Comment 9 Felip Moll 2020-11-06 02:36:59 MST
(In reply to Ramy Adly from comment #8)

While I am looking into that:

Do you start slurmd with systemd? If so, can we do a test?
Can you start slurmd manually (as root) instead of with systemd?

I understand, as you said, that the issue is always reproducible.
Comment 10 Felip Moll 2020-11-06 09:09:22 MST
I am able to reproduce the issue. No need for you to provide any more info.

Will keep you posted.

Thanks
Comment 17 Felip Moll 2020-11-07 00:56:34 MST
There are two issues in your original description:

First is:

slurmstepd: error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114467/step_0/memory.force_empty: Device or resource busy

This error is new and is only caught since 20.11. I have provided a fix for the bug, and it will be included as of 20.11.0rc2; see commit a0181c789061508.
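
For reference, below is a rough standalone approximation of the write that produces this message (a simplified sketch, not Slurm's actual _file_write_content; the force_empty() helper name and the path handling are hypothetical). During step teardown slurmstepd writes a single byte (presumably "1") into the step's memory.force_empty file, and on 3.10-era kernels that write can fail with EBUSY while the cgroup still contains tasks or child cgroups, which matches the "Device or resource busy" above.

/* Hypothetical helper, not Slurm's _file_write_content: write "1" into a
 * step cgroup's memory.force_empty and report errno on failure. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int force_empty(const char *step_cgroup_dir)
{
	char path[4096];
	int fd, rc = 0;

	snprintf(path, sizeof(path), "%s/memory.force_empty", step_cgroup_dir);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	if (write(fd, "1", 1) != 1) {
		rc = -errno;	/* typically -EBUSY while tasks/children remain */
		fprintf(stderr, "unable to write 1 bytes to cgroup %s: %s\n",
			path, strerror(errno));
	}
	close(fd);
	return rc;
}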

Second is:

[2020-11-04T13:35:58.919] [114467.extern] error: Detected 1 oom-kill event(s) in StepId=114467.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

which is being addressed in bug 9737.

Does this one happen all the time? Can you still do a test and start slurmd completely outside of systemd?

Thanks
Comment 18 Ramy Adly 2020-11-08 00:27:46 MST
Starting slurmd outside of systemd:
/opt/slurm/install/slurm-20.11.0-0rc1-43efb7f754a4b1032dd2641ae45efcd5f0000656-CentOS-7.8.2003-MLNX/sbin/slurmd -D

The issue seems to persist. Here is the slurmd log:


[2020-11-08T10:20:28.240] [114478.0] error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114478/step_0/memory.force_empty: Device or resource busy
[2020-11-08T10:20:28.299] [114478.0] done with job
[2020-11-08T10:20:28.395] [114478.extern] error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_170337/job_114478/step_extern/memory.force_empty: Device or resource busy
[2020-11-08T10:20:28.418] [114478.extern] task/cgroup: _oom_event_monitor: oom-kill event count: 1
[2020-11-08T10:20:28.439] [114478.extern] error: Detected 1 oom-kill event(s) in StepId=114478.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2020-11-08T10:20:28.450] [114478.extern] done with job
[2020-11-08T10:20:28.537] private-tmpdir: removed /local/tmp//114478.0 (4 files) in 0.000264 seconds

Thank you,
Ramy
Comment 19 Felip Moll 2020-11-09 06:22:24 MST
(In reply to Ramy Adly from comment #18)

Can you upgrade to the latest master branch and check whether the oom-kill event error still happens?

What's your exact kernel version? (uname -a)
Comment 20 Greg Wickham 2020-11-09 09:11:40 MST
Hi Felip.

Can you provide a commit # to pull against?

Is it 158408a207285721d809e86ff66a52645ece0167 ?

   -greg
Comment 21 Felip Moll 2020-11-09 10:07:06 MST
(In reply to Greg Wickham from comment #20)

That's the latest one, so it is good.


I was specifically referring to a0181c789061508, but go with the latest instead.
Comment 22 Ramy Adly 2020-11-10 23:49:54 MST
Hello Felip,

We have updated slurm to 158408a207285721d809e86ff66a52645ece0167.

Kernel version is 3.10.0-1062.12.1.el7.x86_64


The cgroup error has indeed been resolved.

However, the OOM error is still occurring:


[2020-11-11T09:41:44.916] [114565.0] done with job
[2020-11-11T09:41:44.960] [114565.extern] task/cgroup: _oom_event_monitor: oom-kill event count: 1
[2020-11-11T09:41:44.983] [114565.extern] error: Detected 1 oom-kill event(s) in StepId=114565.extern cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
[2020-11-11T09:41:44.993] [114565.extern] done with job
[2020-11-11T09:41:45.031] private-tmpdir: removed /local/tmp//114565.0 (4 files) in 0.000201 seconds


$ sacct -j 114565
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
114565         hostname      batch    default          1  COMPLETED      0:0 
114565.exte+     extern               default          1 OUT_OF_ME+    0:125 
114565.0       hostname               default          1  COMPLETED      0:0 


Regards,
Ramy
Comment 23 Felip Moll 2020-11-11 00:58:43 MST
(In reply to Ramy Adly from comment #22)

Okay, good.

We're working on the '1 oom-kill event(s)' error. It is not unique to your site.

I will let you know when we find the exact cause.
Comment 24 Felip Moll 2020-11-19 11:27:10 MST
Just a quick update.

I've found an issue but it may not be directly related.

The issue is in the cgroup notification mechanism: when a process in a cgroup hierarchy is OOM-killed, an event is created and broadcast to all listeners of events in that hierarchy. Since every step is a listener, every step receives OOM events from other cgroups and adds them to its own counter.
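
To make that concrete, below is a simplified sketch of how a cgroup-v1 OOM listener is typically wired up (illustrative only, not our actual _oom_event_monitor; the command-line argument and printed message are made up). The listener creates an eventfd, opens the step's memory.oom_control, registers the pair by writing "<eventfd> <oom_control fd>" into cgroup.event_control, and then a read on the eventfd blocks until the kernel posts one or more OOM events for the hierarchy being watched.

/* Illustrative cgroup-v1 OOM listener, not Slurm's _oom_event_monitor.
 * Hypothetical usage:
 *   ./oom_listen /sys/fs/cgroup/memory/slurm/uid_x/job_y/step_extern */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(int argc, char **argv)
{
	const char *cg = (argc > 1) ? argv[1] : ".";
	char path[4096], reg[64];
	uint64_t count;
	int efd, oom_fd, ctl_fd;

	efd = eventfd(0, 0);
	snprintf(path, sizeof(path), "%s/memory.oom_control", cg);
	oom_fd = open(path, O_RDONLY);
	snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
	ctl_fd = open(path, O_WRONLY);
	if (efd < 0 || oom_fd < 0 || ctl_fd < 0) {
		perror("setup");
		return 1;
	}

	/* Register the eventfd as a listener for OOM events on this memcg. */
	snprintf(reg, sizeof(reg), "%d %d", efd, oom_fd);
	if (write(ctl_fd, reg, strlen(reg)) < 0) {
		perror("cgroup.event_control");
		return 1;
	}

	/* Blocks until at least one OOM event is posted; the value read is
	 * the number of events accumulated since the last read. */
	if (read(efd, &count, sizeof(count)) == sizeof(count))
		printf("oom-kill event count: %llu\n",
		       (unsigned long long)count);

	return 0;
}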

Since you're only receiving an OOM on the extern cgroup... let's do a test:

- If it is always reproducible:
1. Run an 'salloc' job on a specific node
2. Go to /sys/fs/cgroup/memory/slurm/uid_xx/job_yy/step_extern/ and do:

cat memory.limit_in_bytes
cat memory.oom_control

Send me this information.

3. Exit salloc and check that the OOM is registered in sacct.
4. Send me the output of 'dmesg -T' from the node.

Tell me which kernel version you are running on your nodes.
Comment 25 Felip Moll 2020-11-20 06:58:36 MST
This issue is the same as bug 10255. If you don't mind, I will put you in CC on the other bug, where more people are affected, and I will keep track of things there.

So, ignore my last comments.

I reproduced it on my CentOS 7 system and am investigating the cause in the other bug:

[slurm@moll0 inst]$ sbatch --wrap "sleep 10" 
Submitted batch job 34
[slurm@moll0 inst]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
                34     debug     wrap    slurm  R       0:01      1 moll1 
[slurm@moll0 inst]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
34                 wrap      debug      slurm          1  COMPLETED      0:0 
34.batch          batch                 slurm          1  COMPLETED      0:0 
34.extern        extern                 slurm          1 OUT_OF_ME+    0:125 
[slurm@moll0 inst]$ uname -a
Linux moll0 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[slurm@moll0 inst]$ cat /etc/redhat-release 
CentOS Linux release 7.4.1708 (Core)

I am closing this one now.

*** This ticket has been marked as a duplicate of ticket 10255 ***