Description
Anthony DelSorbo
2020-09-24 11:53:28 MDT
Anthony - Please note that 20.11 is in an unstable state and should not be run on any production system. Also note that all bugs should be filed as severity 4 if testing 20.11. Also, have you contacted us regarding your desire to test 20.11? We do encourage this but we also want to work with sites by giving them code that is mostly stable or has the list of known issues outlined so as to not duplicate bugs with known issues.

(In reply to Jason Booth from comment #2)
> Anthony - Please note that 20.11 is in an unstable state and should not be
> run on any production system. Also note that all bugs should be filed as
> severity 4 if testing 20.11. Also, have you contacted us regarding your
> desire to test 20.11? We do encourage this but we also want to work with
> sites by giving them code that is mostly stable or has the list of known
> issues outlined so as to not duplicate bugs with known issues.

Fair enough, Jason. This is our first time addressing a pre-release version. We weren't certain about the process and just assumed that if you wanted to test it, you would simply go out to the GitHub site and download it. If there's a certain process we should follow, we'd be glad to comply.

I downgraded the severity to 4. Not an issue with us. The pre-release version is not on a production system but in a switchable directory tree, with its own database, in our test environment.

Please let us know if there's a different version we should be downloading to test the (paid) enhancements you've developed for NOAA and for which you're awaiting our feedback.

Best,

Tony.

BTW - You added a couple of other "See also" bugs #9244 and #9714. But neither is readable by me, as I get an "unauthorized" message.

Tony
> BTW - You added a couple of other "See also" bugs #9244 and #9714. But neither is readable by me, as I get an "unauthorized" message.
That is correct. These are bugs we are working on internally which are not public yet. I have added Tim from our product development side of SchedMD. We generally have you work with him when testing pre-release software.
As Jason noted, these are handled at a lower priority while the release is still undergoing development.

It looks like there is an internal ticket tracking this issue at the moment, and I've asked that patch be reviewed sooner. Configs - slurm.conf and cgroup.conf - from this test system would help verify that's the same issue though.

I'm going to hand this off to Felip who's been managing that internal ticket; he should get back to you next week with status on that patch, and be able to confirm if your configs line up with the known issue or not.

- Tim

Created attachment 16041 [details]
cgroup.conf in 20.11.0pre1 image
Created attachment 16042 [details]
slurm.conf in 20.11.0pre1 image
(In reply to Tim Wickberg from comment #6)
> As Jason noted, these are handled at a lower priority while the release is
> still undergoing development.

Thanks Tim. Uploaded configs per your request. If there's a different procedure you'd like us to follow for requesting and testing these pre-releases, please let us know. We'd be glad to help in any way we can.

Best,

Tony.

Anthony,

Sorry for not having responded before. I've been working on the solution for this issue, and we've had to do a couple of iterations since we detected other issues. My patch is now ready to be reviewed in the internal bug 9244.

The problem was that, on certain distros/kernels, cpuset.mems and cpuset.cpus in the cgroup subsystem could be created empty. If no CPUs or memory nodes are allowed to a cgroup, it is not possible to add pids to it, which gives a "No space left on device" error like the one you've seen:

Sep 24 17:26:16 j1c12 slurmstepd[71411]: error: _file_write_uint32s: write pid 71411 to /sys/fs/cgroup/cpuset/slurm/uid_1209/job_3743913/step_0/cgroup.procs failed: No space left on device
Sep 24 17:26:16 j1c12 slurmstepd[71411]: error: task_cgroup_cpuset_create: unable to add slurmstepd to cpuset cg '/sys/fs/cgroup/cpuset/slurm/uid_1209/job_3743913/step_0'

About the different memory.force_empty error: this is set by jobacctgather/cgroup at cgroup creation, and I am not sure why this could happen. Are you still playing with 20.11, and do you still see this particular error? In any case, I think the error is not harmful.

I will inform you when the cpuset patch is accepted and committed.

Sorry again for my delay in the response.

(In reply to Anthony DelSorbo from comment #9)
> (In reply to Tim Wickberg from comment #6)
> > As Jason noted, these are handled at a lower priority while the release is
> > still undergoing development.
>
> Thanks Tim. Uploaded configs per your request. If there's a different
> procedure you'd like us to follow for requesting and testing these
> pre-releases, please let us know. We'd be glad to help in any way we can.
>
> Best,
>
> Tony.

About the pre-release: submitting a bug is OK, but note that it will be given low priority, since pre-releases are considered unsupported and still under active development, even though changes in RPC layers and new features are no longer being incorporated. Of course, we welcome any report you may have. Thanks for your interest in testing this one.

Anthony,

A fix has been applied to:
- 20.02.6 commit 666d2eedebac
- 20.11.0pre1 (master) commit cd20c16b169a

Please open a new bug or reopen this one if you still have issues after these patches.

Thanks

Anthony,

This is a new comment specifically for this message:

slurmstepd: error: _file_write_content: unable to write 1 bytes to cgroup /sys/fs/cgroup/memory/slurm/uid_1209/job_3743913/step_0/memory.force_empty: Device or resource busy

I've continued investigating this in bug 10122 and found that we were setting the memory.force_empty flag too early in the code. A fix has been committed in 20.11.0rc2 for this issue; see commit a0181c789061508.

The fix won't be introduced in 20.02 because this code has been this way since 2012. The difference is that we are now catching the error.

So I think everything is ok now. Thanks for your reports!
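
For anyone checking whether a node is hitting the cpuset problem described earlier in this thread (cpuset.cpus or cpuset.mems created empty, so that writing a pid to cgroup.procs fails with "No space left on device"), here is a minimal sketch of a check. It is not part of Slurm or of the committed fix. The function name empty_cpuset_files and the script layout are illustrative assumptions; it only reads the two files in a cgroup v1 cpuset directory and reports which are missing or empty.

```python
from pathlib import Path

# Files that must be non-empty before pids can be attached to a cgroup v1 cpuset.
REQUIRED_FILES = ("cpuset.cpus", "cpuset.mems")


def empty_cpuset_files(cgroup_dir):
    """Return the names of the required cpuset files that are missing or empty.

    Hypothetical helper for illustration; cgroup_dir is a cpuset cgroup
    directory such as a job or step directory under /sys/fs/cgroup/cpuset.
    """
    base = Path(cgroup_dir)
    empty = []
    for name in REQUIRED_FILES:
        path = base / name
        # A missing file is treated the same as an empty one for this check.
        if not path.is_file() or not path.read_text().strip():
            empty.append(name)
    return empty


if __name__ == "__main__":
    import sys

    # Each command-line argument is a cpuset cgroup directory to inspect.
    for cgroup_dir in sys.argv[1:]:
        problems = empty_cpuset_files(cgroup_dir)
        status = "empty: " + ", ".join(problems) if problems else "ok"
        print(f"{cgroup_dir}: {status}")
```

Saved under a name of your choosing (check_cpuset.py here is only an example), it could be run against the step directory from the log lines in this thread while the job is still present: python3 check_cpuset.py /sys/fs/cgroup/cpuset/slurm/uid_1209/job_3743913/step_0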