Ticket 9891

Summary:	memory.force_empty Device or resource busy
Product:	Slurm	Reporter:	Anthony DelSorbo <anthony.delsorbo>
Component:	slurmstepd	Assignee:	Felip Moll <felip.moll>
Status:	RESOLVED FIXED	QA Contact:
Severity:	4 - Minor Issue
Priority:	---	CC:	shawn.needham, tim
Version:	20.11.x
Hardware:	Linux
OS:	Linux
See Also:	https://bugs.schedmd.com/show_bug.cgi?id=9244 https://bugs.schedmd.com/show_bug.cgi?id=9714 https://bugs.schedmd.com/show_bug.cgi?id=10122
Site:	NOAA	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	NESCC	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	20.02.6 20.11.0pre1
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---
Attachments:	cgroup.conf in 20.11.0pre1 image slurm.conf in 20.11.0pre1 image

Description Anthony DelSorbo 2020-09-24 11:53:28 MDT

We are running pre-release1 of 20.11 and we're seeing this error when running even a relatively simple job:

----
#!/bin/bash             
#SBATCH -A nesccmgmt                       
#SBATCH -t 2   
#SBATCH -N 1            
#SBATCH -p juno            
#SBATCH -o OUT.txt        
hostname
date
--------------------------------
Output:
--------------------------------
j1c13
Thu Sep 24 16:21:19 UTC 2020
slurmstepd: error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memo\
ry/slurm/uid_20115/job_3743911/step_batch/memory.force_empty: Device or resource busy
_______________________________________________________________
Start Epilog v20.07.22 on node j1c13 for job 3743911 :: Thu Sep 24 16:21:20 UTC 2020
Job 3743911 (not serial) finished for user Gyorgy.Fekete in partition juno with exit code 0:0
_______________________________________________________________
End Epilogue v20.07.22 Thu Sep 24 16:21:20 UTC 2020
--------------------------------------------------------

This can be replicated by simply running:

[Anthony.DelSorbo@jfe01:12 simple]$ srun -A nesccmgmt -t 1 -N 1 hostname
j1c12
slurmstepd: error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_1209/job_3743913/step_0/memory.force_empty: Device or resource busy

Below is a snippet of the slurmd log for that job (redirected to syslogs).  Please let me know what else I can provide you to arrive at a root cause.

Tony.

 988 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug:  ../../../../src/slurmd/common/xcgroup.c:1199: _file_write_uint32s: safe_write (6 of 6) failed: No space left on device
 989 Sep 24 17:26:16 j1c12 slurmstepd[71411]: error: _file_write_uint32s: write pid 71411 to /sys/fs/cgroup/cpuset/slurm/uid_1209/job_3743913/step_0/cgroup.procs failed: No space left on device
 990 Sep 24 17:26:16 j1c12 slurmstepd[71411]: error: task_cgroup_cpuset_create: unable to add slurmstepd to cpuset cg '/sys/fs/cgroup/cpuset/slurm/uid_1209/job_3743913/step_0'
 991 Sep 24 17:26:16 j1c12 slurmstepd[71411]: task/cgroup: _memcg_initialize: /slurm/uid_1209/job_3743913: alloc=90000MB mem.limit=90000MB memsw.limit=90000MB
 992 Sep 24 17:26:16 j1c12 slurmstepd[71411]: task/cgroup: _memcg_initialize: /slurm/uid_1209/job_3743913/step_0: alloc=90000MB mem.limit=90000MB memsw.limit=90000MB
 993 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug:  task/cgroup: _oom_event_monitor: started.
 994 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug level is 'error'
 995 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug:  IO handler started pid=71411
 996 Sep 24 17:26:16 j1c12 slurmstepd[71411]: starting 1 tasks
 997 Sep 24 17:26:16 j1c12 slurmstepd[71411]: task 0 (71418) started 2020-09-24T17:26:16
 998 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug:  Setting slurmstepd oom_adj to -1000
 999 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug:  jobacct_gather/cgroup: jobacct_gather_cgroup_cpuacct_attach_task: jobacct_gather_cgroup_cpuacct_attach_task: StepId=3743913.0 taskid 0 max_task_id 0
1000 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug:  jobacct_gather/cgroup: jobacct_gather_cgroup_memory_attach_task: jobacct_gather_cgroup_memory_attach_task: StepId=3743913.0 taskid 0 max_task_id 0
1001 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug:  jobacct_gather/cgroup: jag_common_poll_data: jag_common_poll_data: Task 0 pid 71418 ave_freq = 2687988 mem size/max 0/0 vmem size/max 351879168/351879168, disk read size/max (0/0), disk write s     ize/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
1002 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug:  Sending launch resp rc=0
1003 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug:  mpi type = pmi2
1004 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug:  task/affinity: task_p_pre_launch: affinity StepId=3743913.0, task:0 bind:8448
1005 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug:  [job 3743913] attempting to run slurm task_prolog [/apps/slurm/logues/prolog.task.sh]
1006 Sep 24 17:26:16 j1c12 PROLOG-TASK: Job 3743913: Skipping slurm task prolog for step 0 and procid 0
1007 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug:  Couldn't find SLURM_RLIMIT_CPU in environment
1008 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug:  Couldn't find SLURM_RLIMIT_FSIZE in environment
1009 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug:  Couldn't find SLURM_RLIMIT_DATA in environment
1010 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug:  Couldn't find SLURM_RLIMIT_STACK in environment
1011 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug:  Couldn't find SLURM_RLIMIT_CORE in environment
1012 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug:  Couldn't find SLURM_RLIMIT_RSS in environment
1013 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug:  Couldn't find SLURM_RLIMIT_NPROC in environment
1014 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug:  Couldn't find SLURM_RLIMIT_NOFILE in environment
1015 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug:  Couldn't find SLURM_RLIMIT_MEMLOCK in environment
1016 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug:  Couldn't find SLURM_RLIMIT_AS in environment
1017 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug:  jobacct_gather/cgroup: jag_common_poll_data: jag_common_poll_data: Task 0 pid 71418 ave_freq = 1168212 mem size/max 0/0 vmem size/max 351879168/351879168, disk read size/max (0/0), disk write s     ize/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
1018 Sep 24 17:26:16 j1c12 slurmstepd[71411]: task 0 (71418) exited with exit code 0.
1019 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug:  [job 3743913] attempting to run slurm task_epilog [/apps/slurm/logues/epilog.task.sh]
1020 Sep 24 17:26:16 j1c12 EPILOG-TASK: Job 3743913: Skipping slurm task epilog for step 0 and procid 0
1021 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug:  task/affinity: task_p_post_term: affinity StepId=3743913.0, task 0
1022 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug:  task/cgroup: _oom_event_monitor: No oom events detected.
1023 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug:  task/cgroup: _oom_event_monitor: stopping.
1024 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug:  step_terminate_monitor_stop signaling condition
1025 Sep 24 17:26:17 j1c12 slurmstepd[71411]: debug:  ../../../../src/slurmd/common/xcgroup.c:1296: _file_write_content: safe_write (1 of 1) failed: Device or resource busy
1026 Sep 24 17:26:17 j1c12 slurmstepd[71411]: error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_1209/job_3743913/step_0/memory.force_empty: Device or resource busy
1027 Sep 24 17:26:17 j1c12 slurmstepd[71411]: debug:  Waiting for IO

Comment 2 Jason Booth 2020-09-25 10:11:51 MDT

Anthony - Please note that 20.11 is in an unstable state and should not be run on any production system. Also note that all bugs should be filed as severity 4 if testing 20.11. Also, have you contacted us regarding your desire to test 20.11? We do encourage this but we also want to work with sites by giving them code that is mostly stable or has the list of known issues outlined so as to not duplicate bugs with known issues.

Comment 3 Anthony DelSorbo 2020-09-25 10:44:01 MDT

(In reply to Jason Booth from comment #2)
> Anthony - Please note that 20.11 is in an unstable state and should not be
> run on any production system. Also note that all bugs should be filed as
> severity 4 if testing 20.11. Also, have you contacted us regarding your
> desire to test 20.11? We do encourage this but we also want to work with
> sites by giving them code that is mostly stable or has the list of known
> issues outlined so as to not duplicate bugs with known issues.

Fair enough Jason.  First time addressing a pre-release version.  We weren't certain about the process and just assumed if you wanted to test it you simply go out to the github site and download it.  If there's a certain process we should follow, we'd be glad to comply.

I downgraded the severity to 4.  Not an issue with us.  The pre-release version is not on a production system but in a switchable directory tree, with its own database, on our test environment.

Please let us know if there's a different version we should be downloading to test the (paid) enhancements you've developed for NOAA and for which you're awaiting our feedback.  

Best,

Tony.

Comment 4 Anthony DelSorbo 2020-09-25 10:47:25 MDT

BTW - You added a couple of other "See also" bugs #9244 and #9714.  But neither is readable by me as I get and "unauthorized" message.

Comment 5 Jason Booth 2020-09-25 10:51:47 MDT

Tony 

>BTW - You added a couple of other "See also" bugs #9244 and #9714.  But neither is readable by me as I get and "unauthorized" message.

That is correct. These are bugs we are internally working on which are not public yet. I have added Tim from our product development side of SchedMD. We generally have you work with him when tasting pre-release software.

Comment 6 Tim Wickberg 2020-09-25 11:48:03 MDT

As Jason noted, these are handled at a lower priority while the release is still undergoing development. It looks like there is an internal ticket tracking this issue at the moment, and I've asked that patch be reviewed sooner.

Configs - slurm.conf and cgroup.conf - from this test system would help verify that's the same issue though.

I'm going to hand this off to Felip who's been managing that internal ticket; he should get back to you next week with status on that patch, and be able to confirm if your configs line up with the known issue or not.

- Tim

Comment 7 Anthony DelSorbo 2020-09-25 12:03:10 MDT

Created attachment 16041 [details]
cgroup.conf in 20.11.0pre1 image

Comment 8 Anthony DelSorbo 2020-09-25 12:03:37 MDT

Created attachment 16042 [details]
slurm.conf in 20.11.0pre1 image

Comment 9 Anthony DelSorbo 2020-09-25 12:07:48 MDT

(In reply to Tim Wickberg from comment #6)
> As Jason noted, these are handled at a lower priority while the release is
> still undergoing development.

Thanks Tim.  Uploaded configs per your request.  If there's a different procedure you'd like us to follow for requesting and testing these pre-releases, please let us know.  We'd be glad to help in anyway we can.

Best,

Tony.

Comment 10 Felip Moll 2020-11-02 09:21:07 MST

Anthony,

Sorry for not having responded before. I've been working in the solution for this issue and we've to do a couple of iterations since we detected other issues.

My patch is now ready to be reviewed in the internal bug 9244. The problem basically consisted in an issue in certain distros/kernels where cpuset.mems and cpuset.cpus in the cgroup subsystem could be created empty. If no cpus or memory nodes were allowed to a cgroup, then it was not possible to add pids to it so giving a "No space left on device" error like you've seen:

 989 Sep 24 17:26:16 j1c12 slurmstepd[71411]: error: _file_write_uint32s: write pid 71411 to /sys/fs/cgroup/cpuset/slurm/uid_1209/job_3743913/step_0/cgroup.procs failed: No space left on device
 990 Sep 24 17:26:16 j1c12 slurmstepd[71411]: error: task_cgroup_cpuset_create: unable to add slurmstepd to cpuset cg '/sys/fs/cgroup/cpuset/slurm/uid_1209/job_3743913/step_0'

About the different memory.force_empty, this is set by jobacctgather/cgroup at cgroup creation, and I am not sure why this could happen.

Are you still playing with 20.11 and do you still see this particular error? In any case I think the error is not harmful.

I will inform you when the cpuset patch is accepted and commited.

Sorry again for my delay in the response.

Comment 11 Felip Moll 2020-11-02 09:23:50 MST

(In reply to Anthony DelSorbo from comment #9)
> (In reply to Tim Wickberg from comment #6)
> > As Jason noted, these are handled at a lower priority while the release is
> > still undergoing development.
> 
> Thanks Tim.  Uploaded configs per your request.  If there's a different
> procedure you'd like us to follow for requesting and testing these
> pre-releases, please let us know.  We'd be glad to help in anyway we can.
> 
> Best,
> 
> Tony.

About the pre-release: submitting a bug is OK but note this will be given low priority since pre-releases are considered unsupported and still under active development despite changes in RPC layers and new features are not incorporated anymore.

Of course we are welcoming any report you can have.

Thanks for your interest in testing this one.

Comment 13 Felip Moll 2020-11-03 06:13:50 MST

Anthony

A fix has been applied to:

- 20.02.6 commit 666d2eedebac
- 20.11.0pre1 (master) commit cd20c16b169a

Plese open a new bug or reopen this one if after these patches you still have issues.

Thanks

Comment 14 Felip Moll 2020-11-07 00:52:21 MST

Anthony,

This is a new comment specifically for this message:

slurmstepd: error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_1209/job_3743913/step_0/memory.force_empty: Device or resource busy

I've continued investigating this in bug 10122 and I've found that we were setting the memory.force_empty flag too early in the code. A fix has been commited in 20.11.0rc2 for this issue, see commit a0181c789061508. In 20.02 the fix won't be introduced because this code has been this way since 2012. The difference is now we were catching the error.

So I think everything is ok now.

Thanks for your reports!