We are running pre-release1 of 20.11 and we're seeing this error when running even a relatively simple job:

----
#!/bin/bash
#SBATCH -A nesccmgmt
#SBATCH -t 2
#SBATCH -N 1
#SBATCH -p juno
#SBATCH -o OUT.txt

hostname
date
--------------------------------
Output:
--------------------------------
j1c13
Thu Sep 24 16:21:19 UTC 2020
slurmstepd: error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_20115/job_3743911/step_batch/memory.force_empty: Device or resource busy
_______________________________________________________________
Start Epilog v20.07.22 on node j1c13 for job 3743911 :: Thu Sep 24 16:21:20 UTC 2020
Job 3743911 (not serial) finished for user Gyorgy.Fekete in partition juno with exit code 0:0
_______________________________________________________________
End Epilogue v20.07.22 Thu Sep 24 16:21:20 UTC 2020
--------------------------------------------------------

This can be replicated by simply running:

[Anthony.DelSorbo@jfe01:12 simple]$ srun -A nesccmgmt -t 1 -N 1 hostname
j1c12
slurmstepd: error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_1209/job_3743913/step_0/memory.force_empty: Device or resource busy

Below is a snippet of the slurmd log for that job (redirected to syslogs). Please let me know what else I can provide you to arrive at a root cause.

Tony.

988 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug: ../../../../src/slurmd/common/xcgroup.c:1199: _file_write_uint32s: safe_write (6 of 6) failed: No space left on device
989 Sep 24 17:26:16 j1c12 slurmstepd[71411]: error: _file_write_uint32s: write pid 71411 to /sys/fs/cgroup/cpuset/slurm/uid_1209/job_3743913/step_0/cgroup.procs failed: No space left on device
990 Sep 24 17:26:16 j1c12 slurmstepd[71411]: error: task_cgroup_cpuset_create: unable to add slurmstepd to cpuset cg '/sys/fs/cgroup/cpuset/slurm/uid_1209/job_3743913/step_0'
991 Sep 24 17:26:16 j1c12 slurmstepd[71411]: task/cgroup: _memcg_initialize: /slurm/uid_1209/job_3743913: alloc=90000MB mem.limit=90000MB memsw.limit=90000MB
992 Sep 24 17:26:16 j1c12 slurmstepd[71411]: task/cgroup: _memcg_initialize: /slurm/uid_1209/job_3743913/step_0: alloc=90000MB mem.limit=90000MB memsw.limit=90000MB
993 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug: task/cgroup: _oom_event_monitor: started.
994 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug level is 'error'
995 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug: IO handler started pid=71411
996 Sep 24 17:26:16 j1c12 slurmstepd[71411]: starting 1 tasks
997 Sep 24 17:26:16 j1c12 slurmstepd[71411]: task 0 (71418) started 2020-09-24T17:26:16
998 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug: Setting slurmstepd oom_adj to -1000
999 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug: jobacct_gather/cgroup: jobacct_gather_cgroup_cpuacct_attach_task: jobacct_gather_cgroup_cpuacct_attach_task: StepId=3743913.0 taskid 0 max_task_id 0
1000 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug: jobacct_gather/cgroup: jobacct_gather_cgroup_memory_attach_task: jobacct_gather_cgroup_memory_attach_task: StepId=3743913.0 taskid 0 max_task_id 0
1001 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug: jobacct_gather/cgroup: jag_common_poll_data: jag_common_poll_data: Task 0 pid 71418 ave_freq = 2687988 mem size/max 0/0 vmem size/max 351879168/351879168, disk read size/max (0/0), disk write size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
1002 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug: Sending launch resp rc=0
1003 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug: mpi type = pmi2
1004 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug: task/affinity: task_p_pre_launch: affinity StepId=3743913.0, task:0 bind:8448
1005 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug: [job 3743913] attempting to run slurm task_prolog [/apps/slurm/logues/prolog.task.sh]
1006 Sep 24 17:26:16 j1c12 PROLOG-TASK: Job 3743913: Skipping slurm task prolog for step 0 and procid 0
1007 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug: Couldn't find SLURM_RLIMIT_CPU in environment
1008 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug: Couldn't find SLURM_RLIMIT_FSIZE in environment
1009 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug: Couldn't find SLURM_RLIMIT_DATA in environment
1010 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug: Couldn't find SLURM_RLIMIT_STACK in environment
1011 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug: Couldn't find SLURM_RLIMIT_CORE in environment
1012 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug: Couldn't find SLURM_RLIMIT_RSS in environment
1013 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug: Couldn't find SLURM_RLIMIT_NPROC in environment
1014 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug: Couldn't find SLURM_RLIMIT_NOFILE in environment
1015 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug: Couldn't find SLURM_RLIMIT_MEMLOCK in environment
1016 Sep 24 17:26:16 j1c12 slurmstepd[71418]: debug: Couldn't find SLURM_RLIMIT_AS in environment
1017 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug: jobacct_gather/cgroup: jag_common_poll_data: jag_common_poll_data: Task 0 pid 71418 ave_freq = 1168212 mem size/max 0/0 vmem size/max 351879168/351879168, disk read size/max (0/0), disk write size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
1018 Sep 24 17:26:16 j1c12 slurmstepd[71411]: task 0 (71418) exited with exit code 0.
1019 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug: [job 3743913] attempting to run slurm task_epilog [/apps/slurm/logues/epilog.task.sh]
1020 Sep 24 17:26:16 j1c12 EPILOG-TASK: Job 3743913: Skipping slurm task epilog for step 0 and procid 0
1021 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug: task/affinity: task_p_post_term: affinity StepId=3743913.0, task 0
1022 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug: task/cgroup: _oom_event_monitor: No oom events detected.
1023 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug: task/cgroup: _oom_event_monitor: stopping.
1024 Sep 24 17:26:16 j1c12 slurmstepd[71411]: debug: step_terminate_monitor_stop signaling condition
1025 Sep 24 17:26:17 j1c12 slurmstepd[71411]: debug: ../../../../src/slurmd/common/xcgroup.c:1296: _file_write_content: safe_write (1 of 1) failed: Device or resource busy
1026 Sep 24 17:26:17 j1c12 slurmstepd[71411]: error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_1209/job_3743913/step_0/memory.force_empty: Device or resource busy
1027 Sep 24 17:26:17 j1c12 slurmstepd[71411]: debug: Waiting for IO
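Following up on the log above: in case it helps, this is the sort of on-node snapshot I can grab while a step is still active. It's only a sketch - the paths follow the cgroup v1 layout in the errors, and the uid/job id from this run are just placeholders to substitute for a live job:

JOBCG=slurm/uid_1209/job_3743913/step_0          # substitute a live job's uid/job/step (step_batch for batch steps)
cat /sys/fs/cgroup/cpuset/$JOBCG/cpuset.cpus     # CPUs assigned to the step cgroup (empty would be suspicious)
cat /sys/fs/cgroup/cpuset/$JOBCG/cpuset.mems     # memory nodes assigned to the step cgroup
cat /sys/fs/cgroup/cpuset/$JOBCG/cgroup.procs    # PIDs currently attached to the cpuset cgroup
cat /sys/fs/cgroup/memory/$JOBCG/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/$JOBCG/cgroup.procs    # anything still attached when memory.force_empty is written?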
Anthony - Please note that 20.11 is in an unstable state and should not be run on any production system. Also note that all bugs filed while testing 20.11 should be set to severity 4.

Also, have you contacted us regarding your desire to test 20.11? We do encourage this, but we also want to work with sites by giving them code that is mostly stable, or by outlining the list of known issues, so that new reports don't duplicate issues we already know about.
(In reply to Jason Booth from comment #2)
> Anthony - Please note that 20.11 is in an unstable state and should not be
> run on any production system. Also note that all bugs filed while testing
> 20.11 should be set to severity 4.
>
> Also, have you contacted us regarding your desire to test 20.11? We do
> encourage this, but we also want to work with sites by giving them code
> that is mostly stable, or by outlining the list of known issues, so that
> new reports don't duplicate issues we already know about.

Fair enough, Jason. This is our first time addressing a pre-release version. We weren't certain about the process and just assumed that if you wanted to test it you simply went to the GitHub site and downloaded it. If there's a certain process we should follow, we'd be glad to comply. I downgraded the severity to 4 - not an issue with us.

The pre-release version is not on a production system but in a switchable directory tree, with its own database, in our test environment.

Please let us know if there's a different version we should be downloading to test the (paid) enhancements you've developed for NOAA and for which you're awaiting our feedback.

Best,

Tony.
BTW - You added a couple of other "See also" bugs, #9244 and #9714, but neither is readable by me; I get an "unauthorized" message.
Tony,

> BTW - You added a couple of other "See also" bugs, #9244 and #9714, but
> neither is readable by me; I get an "unauthorized" message.

That is correct. These are bugs we are working on internally which are not public yet.

I have added Tim from the product development side of SchedMD. We generally have you work with him when testing pre-release software.
As Jason noted, these are handled at a lower priority while the release is still undergoing development.

It looks like there is an internal ticket tracking this issue at the moment, and I've asked that the patch be reviewed sooner. Configs - slurm.conf and cgroup.conf - from this test system would help verify it's the same issue, though.

I'm going to hand this off to Felip, who has been managing that internal ticket; he should get back to you next week with status on that patch, and be able to confirm whether your configs line up with the known issue or not.

- Tim
Created attachment 16041 [details] cgroup.conf in 20.11.0pre1 image
Created attachment 16042 [details] slurm.conf in 20.11.0pre1 image
(In reply to Tim Wickberg from comment #6)
> As Jason noted, these are handled at a lower priority while the release is
> still undergoing development.

Thanks, Tim. I've uploaded the configs per your request. If there's a different procedure you'd like us to follow for requesting and testing these pre-releases, please let us know. We'd be glad to help in any way we can.

Best,

Tony.
Anthony,

Sorry for not having responded before. I've been working on the solution for this issue, and we've had to do a couple of iterations since we detected other issues along the way. My patch is now ready to be reviewed in the internal bug 9244.

The problem basically consists of an issue on certain distros/kernels where cpuset.mems and cpuset.cpus in the cpuset cgroup subsystem could be created empty. If no CPUs or memory nodes are allowed to a cgroup, then it is not possible to add pids to it, which gives a "No space left on device" error like the one you've seen:

989 Sep 24 17:26:16 j1c12 slurmstepd[71411]: error: _file_write_uint32s: write pid 71411 to /sys/fs/cgroup/cpuset/slurm/uid_1209/job_3743913/step_0/cgroup.procs failed: No space left on device
990 Sep 24 17:26:16 j1c12 slurmstepd[71411]: error: task_cgroup_cpuset_create: unable to add slurmstepd to cpuset cg '/sys/fs/cgroup/cpuset/slurm/uid_1209/job_3743913/step_0'

About the different memory.force_empty error: this file is set by jobacct_gather/cgroup at cgroup creation, and I am not sure why this could happen. Are you still playing with 20.11, and do you still see this particular error? In any case I think the error is not harmful.

I will let you know when the cpuset patch is accepted and committed. Sorry again for my delay in the response.
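To illustrate the kernel side of this outside of Slurm, here is a minimal sketch of the cpuset behaviour. It assumes a cgroup v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset, is run as root, and uses a throwaway "demo" cgroup that is only an example (not something Slurm creates):

mkdir /sys/fs/cgroup/cpuset/demo
cat /sys/fs/cgroup/cpuset/demo/cpuset.cpus          # empty on kernels/distros that don't inherit from the parent
cat /sys/fs/cgroup/cpuset/demo/cpuset.mems          # empty as well
echo $$ > /sys/fs/cgroup/cpuset/demo/cgroup.procs   # fails: "No space left on device"
echo 0 > /sys/fs/cgroup/cpuset/demo/cpuset.cpus     # allow CPU 0
echo 0 > /sys/fs/cgroup/cpuset/demo/cpuset.mems     # allow memory node 0
echo $$ > /sys/fs/cgroup/cpuset/demo/cgroup.procs   # now succeeds

That is the same failure mode as lines 989-990 above: slurmstepd could not be attached to the step's cpuset cgroup because its cpus/mems had come up empty.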
(In reply to Anthony DelSorbo from comment #9)
> (In reply to Tim Wickberg from comment #6)
> > As Jason noted, these are handled at a lower priority while the release
> > is still undergoing development.
>
> Thanks, Tim. I've uploaded the configs per your request. If there's a
> different procedure you'd like us to follow for requesting and testing
> these pre-releases, please let us know. We'd be glad to help in any way we
> can.
>
> Best,
>
> Tony.

About the pre-release: submitting a bug is fine, but note that it will be given low priority, since pre-releases are considered unsupported and are still under active development, even though RPC-layer changes and new features are no longer being incorporated. Of course, we welcome any reports you may have. Thanks for your interest in testing this one.
Anthony,

A fix has been applied to:

- 20.02.6: commit 666d2eedebac
- 20.11.0pre1 (master): commit cd20c16b169a

Please open a new bug, or reopen this one, if you still have issues after these patches.

Thanks
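If you want to check whether a source tree already carries these commits before rebuilding, something like the following should work (this assumes a local git clone of the Slurm repository; the "slurm" directory name is only an example):

git -C slurm fetch origin
git -C slurm branch -a --contains 666d2eedebac   # should list the 20.02 branch
git -C slurm branch -a --contains cd20c16b169a   # should list master (20.11)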
Anthony,

This is a new comment specifically for this message:

slurmstepd: error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_1209/job_3743913/step_0/memory.force_empty: Device or resource busy

I've continued investigating this in bug 10122, and I've found that we were setting the memory.force_empty flag too early in the code. A fix has been committed in 20.11.0rc2 for this issue; see commit a0181c789061508. The fix won't be introduced in 20.02, because this code has been this way since 2012; the difference is that now we catch the error. So I think everything is OK now.

Thanks for your reports!
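For context on why the ordering matters, here is a rough sketch of the kernel behaviour involved. It assumes the cgroup v1 memory controller on an older 3.10-era kernel (where force_empty refuses to run while tasks are attached), is run as root, and uses a throwaway "demo" cgroup purely as an illustration, not as what the Slurm code does literally:

mkdir /sys/fs/cgroup/memory/demo
echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs        # attach this shell to the cgroup
echo 1 > /sys/fs/cgroup/memory/demo/memory.force_empty   # fails: "Device or resource busy" while tasks remain
echo $$ > /sys/fs/cgroup/memory/cgroup.procs             # move the shell back to the root cgroup
echo 1 > /sys/fs/cgroup/memory/demo/memory.force_empty   # now succeeds and reclaims the charges

That would explain why writing force_empty later in the step teardown, once the step's tasks have left the cgroup, avoids the EBUSY.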