Ticket 14613

Summary:	Query regarding Cgroup
Product:	Slurm	Reporter:	Bangarusamy <bangarusamy.kumarasamy_ext>
Component:	Configuration	Assignee:	Oriol Vilarrubi <jvilarru>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	4 - Minor Issue
Priority:	---	CC:	jvilarru
Version:	- Unsupported Older Versions
Hardware:	Linux
OS:	Linux
Site:	Novartis	Slinky Site:	---

Description Bangarusamy 2022-07-26 06:00:46 MDT

Hi Team,

We have enabled cgroups for Slurm jobs, and I have a question about them.
When Slurm memory/CPU resources are 100% allocated to Slurm jobs and the resources are updated in the cgroup files, will this cgroup reservation affect the resource allocation for root PIDs or other system user PIDs such as splunk, qualys, adclient, and other OS-related PIDs?

Please let me know if you need any further information.

Comment 1 Oriol Vilarrubi 2022-07-26 07:06:08 MDT

(In reply to Bangarusamy from comment #0)

Hello Bangarusamy,

The cgroup that Slurm creates for jobs only constrains the Slurm jobs; processes outside this cgroup are not affected at all.
What you need to do is take into account the amount of memory that those "non-Slurm" processes will use, subtract it from the physical memory of the node, and set the result as the RealMemory parameter of the node. For example:

Say you have a node with 32GB of RAM, and the OS and all the system processes together use 2GB of memory. Then you need to set the node's RealMemory to 30GB (RealMemory is specified in megabytes in slurm.conf, so 30720).
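
As a rough illustration of that arithmetic, here is a minimal Python sketch. The helper names and the node name "node01" are hypothetical, and it assumes 1GB = 1024MB; only the RealMemory parameter and its megabyte unit come from Slurm itself.

```python
def real_memory_mb(physical_gb: float, system_gb: float) -> int:
    """Memory (in MB) left for Slurm jobs after reserving room for the OS."""
    if system_gb < 0 or system_gb >= physical_gb:
        raise ValueError("system reserve must be between 0 and physical memory")
    # RealMemory in slurm.conf is expressed in megabytes (assuming 1GB = 1024MB).
    return int((physical_gb - system_gb) * 1024)


def node_line(name: str, physical_gb: float, system_gb: float) -> str:
    """Build a partial slurm.conf node line; 'name' is a hypothetical node name."""
    return f"NodeName={name} RealMemory={real_memory_mb(physical_gb, system_gb)}"


# 32GB node with 2GB reserved for the OS and system daemons.
print(node_line("node01", 32, 2))  # NodeName=node01 RealMemory=30720
```

A real node definition would carry more parameters (CPUs, sockets, and so on); this only shows the memory part.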

If you do not do that, Slurm cannot guarantee that no other process gets killed. Using the previous situation as an example: suppose that the OS and all system processes combined use 1.5GB and you have configured all 32GB of the node in Slurm. A job that uses between 30.5GB and 32GB could then cause an OOM event on the node. The OS is in charge of deciding which process gets OOM-killed, so it can be either the job or some system process, hence the need to measure the system memory usage properly.
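
The overcommit scenario can be checked with a small sketch like the one below. It is a deliberately simplified model (it ignores swap, page cache, and kernel memory), and the function name is hypothetical, not part of Slurm.

```python
def headroom_gb(physical_gb: float, system_gb: float, real_memory_gb: float) -> float:
    """Memory left on the node if jobs use all of RealMemory.

    A negative result means jobs plus system processes can exceed physical
    memory, so a node-level OOM event becomes possible.
    """
    return physical_gb - system_gb - real_memory_gb


# All 32GB given to Slurm while the system uses 1.5GB: overcommitted by 1.5GB.
print(headroom_gb(32, 1.5, 32))  # -1.5

# RealMemory reduced to 30GB: 0.5GB of slack remains.
print(headroom_gb(32, 1.5, 30))  # 0.5
```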

I am not sure what you mean by "and resource updated in cgroup files". Do you mean that you change some cgroup parameter, change the memory of the node, or something else?

Regards.

Comment 2 Oriol Vilarrubi 2022-08-05 08:36:18 MDT

Hello Bangarusamy,

Is there anything else I can help you with regarding this cgroup question?

Regards.

Comment 3 Oriol Vilarrubi 2022-08-10 04:59:33 MDT

Hello Bangarusamy,

I'm closing this bug as infogiven. If you have further questions about cgroups, please do not hesitate to reopen it.

Regards.