Created attachment 26444 [details]
slurmd log excerpt w/memsw file not found messages
The config archive contains only symlinks (no actual files), except for plugstack.conf. Please update with those config files and validate that they are not just links to the files on disk.

Created attachment 26452 [details]
archive of currently active .conf values in our 22.05.2+cgroups test cluster
(links dereferenced)
Btw, just to say it out loud here in the searchable bug text: we have zero swap space on our nodes.

I'm able to reproduce this on my end and get the same error message.

The message comes from a check that runs after a job/step to see whether any OOM events occurred. As far as I can tell, Slurm does not check whether swap is enabled before running this check, so this could be a bug. I'll discuss this internally to confirm.

(In reply to Ben Glines from comment #5)
> I'm able to reproduce this on my end and get this same error message.
>
> The message is coming from a check that runs after a job/step to see if any
> OOM events occurred. As far as I can tell, it seems that Slurm does not
> check if you have swap enabled before it does this check, so this could
> possibly be a bug. I'll have to discuss this internally more to see if this
> is true.

Thanks much for the analysis and feedback, Ben.

Best,
Lyn

It does appear to be a bug, and I've written a patch that will go through review by another engineer before we decide to change anything. The patch first checks whether memory swap is enabled before checking for any OOM events.
Fortunately, I don't think this will cause you any problems as things stand, even without a patch to suppress the log message. Here's the _failcnt function that produces the message:
> static uint64_t _failcnt(xcgroup_t *cg, char *param)
> {
>         uint64_t value = 0;
>
>         if (xcgroup_get_uint64_param(cg, param, &value) != SLURM_SUCCESS) {
>                 log_flag(CGROUP, "unable to read '%s' from '%s'",
>                          param, cg->path);
>                 value = 0;
>         }
>
>         return value;
> }
The call path for reading memsw is:
_failcnt() -> xcgroup_get_uint64_param() -> common_file_read_uint64s() -> open()
When open() fails, the log message is printed and an error is returned, but when _failcnt detects the error it just sets `value` to 0, which represents no OOM events. Aside from the log messages (which only appear when the Cgroup DebugFlag is set), you shouldn't see any problems from this. The function cgroup_p_step_stop_oom_mgr() will still check for any OOM events related to normal memory usage.
Hi Ben,

Sorry for the late reply; I was off on Friday. I do appreciate your efforts on this, your clear explanation of what's going on, and that I can safely ignore this set of messages. Feel free to update the status of this ticket however you see fit.

All the best,
Lyn

This has been fixed in 22.05.4+ in commits 39648a9447..d10afcba40. Slurm will now check cgroups to see whether swap is enabled on the system before checking for swap-related OOM events.

Thanks for pointing this out to us, Lyn! Closing this bug now.
Created attachment 26442 [details]
archive of currently active .conf values in our 22.05.2+cgroups test cluster

Hi Folks,

Context: we are a week out from upgrading our Discover production cluster from 21.08.2-2 to 22.05.3; during the outage for this upgrade, we will also enable cgroup.conf. We'd appreciate your feedback on the attached config files and slurmd log excerpt, which come from a test cluster that we've already upgraded to 22.05.2 and on which we have enabled and, for the most part, successfully tested cgroup.conf.

You'll see we have ConstrainSwapSpace=no (we had it in there explicitly, then realized it's the default, so commented it out). Nonetheless, with DebugFlags=Cgroup enabled, we get all manner of unhelpful, cascading messages starting with:

[2022-08-23T15:07:49.061] [36501810.extern] cgroup/v1: common_file_read_uint64s: CGROUP: unable to open '/sys/fs/cgroup/memory/slurm/uid_598491810/job_36501810/step_extern/memory.memsw.failcnt' for reading : No such file or directory

Please take a look at the attached .conf files (tarball) and let us know if anything leaps out at you. I'll upload the slurmd log excerpt from which the preceding unable-to-open message was pulled in just a sec.

Thanks much,
Lyn
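For reference, a cgroup.conf along the lines described above might look like this. This is an illustrative sketch only, not the attached file; only the commented-out ConstrainSwapSpace line is taken from the description in the report, and the other constraint settings here are assumptions:

```
# cgroup.conf -- illustrative fragment, not the attached config
ConstrainCores=yes
ConstrainRAMSpace=yes
#ConstrainSwapSpace=no   # the default; left commented out, as described above
```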