Ticket 11154

Summary: Extern process isn't putting processes in all cgroups
Product: Slurm Reporter: Mikael Öhman <mikael.ohman>
Component: slurmstepd Assignee: Marcin Stolarek <cinek>
Status: RESOLVED DUPLICATE
Severity: 4 - Minor Issue
Version: 20.02.4
Hardware: Linux
OS: Linux
Site: SNIC
SNIC sites: C3SE

Description Mikael Öhman 2021-03-19 12:24:56 MDT
I use pam_slurm_adopt and cgroups. In slurm.conf:

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

and in cgroup.conf:

CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
AllowedRAMSpace=100
ConstrainSwapSpace=yes
AllowedSwapSpace=0
ConstrainDevices=yes
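
(For context: pam_slurm_adopt can only adopt ssh sessions into a job's "extern" step, which requires that slurm.conf also contains

PrologFlags=contain

That appears to be the case here, since the extern step's PID shows up in a cgroup below.)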

sbatch and srun processes seem to land in the correct cgroups and it all works perfectly, but processes adopted into the extern step when ssh'ing into the node do not.

The cpuset cgroup contains the extern step's PID, as I expected:
/sys/fs/cgroup/cpuset/slurm/uid_xxxxx/job_xxx/cgroup.procs

but it is nowhere to be found in the memory and devices hierarchies:
/sys/fs/cgroup/memory/slurm/uid_xxxxx/job_xxx/cgroup.procs
/sys/fs/cgroup/devices/slurm/uid_xxxxx/job_xxx/cgroup.procs

and running "nvidia-smi" on a shared GPU node I ssh into via pam_slurm_adopt I see all GPUs and not just the ones allocated to the job.
This suggests to me that the ssh shell isn't constrained in terms of memory or gpus.
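
A quick way to confirm this from inside the adopted ssh shell (a sketch; the uid/job values are placeholders, as above):

cat /proc/self/cgroup
# On cgroup v1, each controller line shows the cgroup the shell belongs to;
# here the cpuset line reads .../slurm/uid_xxxxx/job_xxx/... while the
# memory and devices lines do not.

cat /sys/fs/cgroup/devices/slurm/uid_xxxxx/job_xxx/devices.list
# Lists the device rules Slurm set up; the shell is only subject to them if
# its PID appears in the matching cgroup.procs.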

Is this an oversight, a bug, or have I missed a configuration option?
Comment 1 Marcin Stolarek 2021-03-22 05:30:03 MDT
Mikael,

Can you share your pam configuration for sshd?

cheers,
Marcin
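
(For readers finding this ticket: a typical pam_slurm_adopt line in /etc/pam.d/sshd looks roughly like the sketch below; this is illustrative, not the reporter's actual configuration:

account    required    pam_slurm_adopt.so

The module runs in the account stack and moves the incoming sshd process into the cgroups of one of the user's jobs on that node.)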
Comment 2 Mikael Öhman 2021-03-23 03:53:45 MDT
It turns out systemd-logind wasn't disabled and masked on these nodes. The fact that the cpuset cgroup was working threw me off. Sorry for the noise.
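
For reference, the fix amounts to something like this on each compute node (a sketch for systemd-based systems; see the pam_slurm_adopt documentation for details):

systemctl stop systemd-logind
systemctl mask systemd-logind

and ensuring pam_systemd.so is not in the sshd PAM stack, so that logind does not move adopted sessions into its own user-session cgroups.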

Best regards, Mikael

*** This ticket has been marked as a duplicate of ticket 5920 ***