Created attachment 26444 [details]
slurmd log excerpt w/memsw file not found messages
The config archive contains only symlinks (no actual files), except for plugstack.conf. Please update with those config files and validate that they are not just links to the files on disk.

Created attachment 26452 [details]
archive of currently active .conf values in our 22.05.2+cgroups test cluster
(links dereferenced)
Btw, just to say it out loud here in the searchable bug text: we have zero swap space on our nodes.

I'm able to reproduce this on my end and get the same error message.

The message comes from a check that runs after a job/step to see whether any OOM events occurred. As far as I can tell, Slurm does not check whether swap is enabled before running this check, so this could be a bug. I'll discuss this internally to confirm.

(In reply to Ben Glines from comment #5)
> I'm able to reproduce this on my end and get this same error message.
>
> The message is coming from a check that runs after a job/step to see if any
> OOM events occurred. As far as I can tell, it seems that Slurm does not
> check if you have swap enabled before it does this check, so this could
> possibly be a bug. I'll have to discuss this internally more to see if this
> is true.

Thanks much for the analysis and feedback, Ben.

Best,
Lyn

It does appear to be a bug, and I've written a patch that will go through review by another engineer before we decide to change anything. The patch first checks whether memory swap is enabled before checking for any OOM events.
Fortunately, I don't think this will cause you any problems as things stand, even without a patch to suppress the log message. Here's the _failcnt function that produces the message:
> static uint64_t _failcnt(xcgroup_t *cg, char *param)
> {
>         uint64_t value = 0;
>
>         if (xcgroup_get_uint64_param(cg, param, &value) != SLURM_SUCCESS) {
>                 log_flag(CGROUP, "unable to read '%s' from '%s'",
>                          param, cg->path);
>                 value = 0;
>         }
>
>         return value;
> }
The call path for reading memsw is:
_failcnt() -> xcgroup_get_uint64_param() -> common_file_read_uint64s() -> open()
When open() fails, the log message is printed and an error is returned, but when _failcnt detects the error it just sets `value` to 0, which represents no OOM events. Aside from the log messages (which only appear when the Cgroup DebugFlag is set), you shouldn't see any problems from this. The function cgroup_p_step_stop_oom_mgr() will still check for any OOM events related to normal memory usage.
Hi Ben,

Sorry for the late reply; I was off on Friday. I do appreciate your efforts on this, your clear explanation of what's going on, and that I can safely ignore this set of messages. Feel free to update the status of this ticket however you see fit.

All the best,
Lyn

This has been fixed in 22.05.4+ in commits 39648a9447..d10afcba40. Slurm will now check cgroups to see whether swap is enabled on the system before checking for swap-related OOM events.

Thanks for pointing this out to us, Lyn! Closing this bug now.
Created attachment 26442 [details]
archive of currently active .conf values in our 22.05.2+cgroups test cluster

Hi Folks,

Context: we are a week out from upgrading our Discover production cluster from 21.08.2-2 to 22.05.3; during the outage for this upgrade, we will also enable cgroup.conf. We'd appreciate your feedback on the attached config files and slurmd log excerpt, which come from a test cluster that we've already upgraded to 22.05.2 and on which we have enabled and, for the most part, successfully tested cgroup.conf.

You'll see we have ConstrainSwapSpace=no (we had it in there explicitly, then realized it's the default, so commented it out). Nonetheless, with DebugFlags=Cgroup enabled, we get all manner of unhelpful, cascading messages starting with:

[2022-08-23T15:07:49.061] [36501810.extern] cgroup/v1: common_file_read_uint64s: CGROUP: unable to open '/sys/fs/cgroup/memory/slurm/uid_598491810/job_36501810/step_extern/memory.memsw.failcnt' for reading : No such file or directory

Please take a look at the attached .conf files (tarball) and let us know if anything leaps out at you. I'll upload the slurmd log excerpt from which the preceding unable-to-open message was pulled in just a sec.

Thanks much,
Lyn
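For reference, a cgroup.conf along the lines described above might look like this. This is an illustrative sketch only, not the attached file; only the commented-out ConstrainSwapSpace line is taken from the description in the report, and the other constraint settings here are assumptions:

```
# cgroup.conf -- illustrative fragment, not the attached config
ConstrainCores=yes
ConstrainRAMSpace=yes
#ConstrainSwapSpace=no   # the default; left commented out, as described above
```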