Ticket 11154

Summary: Extern process isn't putting processes in all cgroups
Product: Slurm Reporter: Mikael Öhman <mikael.ohman>
Component: slurmstepd Assignee: Marcin Stolarek <cinek>
Status: RESOLVED DUPLICATE
Severity: 4 - Minor Issue
Version: 20.02.4
Hardware: Linux
OS: Linux
Site: SNIC
SNIC sites: C3SE

Description Mikael Öhman 2021-03-19 12:24:56 MDT
I use pam_slurm_adopt and cgroups. In slurm.conf:

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

and in cgroup.conf:

CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
AllowedRAMSpace=100
ConstrainSwapSpace=yes
AllowedSwapSpace=0
ConstrainDevices=yes
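
(For context: pam_slurm_adopt can only adopt ssh sessions into a job's "extern" step, which requires that slurm.conf also contains

PrologFlags=contain

That appears to be the case here, since the extern step's PID shows up in a cgroup below.)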

sbatch and srun processes seem to land in the correct cgroups and it all works perfectly, but processes adopted into the extern step when ssh'ing into the node do not.

The cpuset cgroup contains the extern step's PID, as I expected:
/sys/fs/cgroup/cpuset/slurm/uid_xxxxx/job_xxx/cgroup.procs

but it is nowhere to be found in the memory and devices hierarchies:
/sys/fs/cgroup/memory/slurm/uid_xxxxx/job_xxx/cgroup.procs
/sys/fs/cgroup/devices/slurm/uid_xxxxx/job_xxx/cgroup.procs

and running "nvidia-smi" on a shared GPU node I ssh into via pam_slurm_adopt I see all GPUs and not just the ones allocated to the job.
This suggests to me that the ssh shell isn't constrained in terms of memory or gpus.
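
A quick way to confirm this from inside the adopted ssh shell (a sketch; the uid/job values are placeholders, as above):

cat /proc/self/cgroup
# On cgroup v1, each controller line shows the cgroup the shell belongs to;
# here the cpuset line reads .../slurm/uid_xxxxx/job_xxx/... while the
# memory and devices lines do not.

cat /sys/fs/cgroup/devices/slurm/uid_xxxxx/job_xxx/devices.list
# Lists the device rules Slurm set up; the shell is only subject to them if
# its PID appears in the matching cgroup.procs.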

Is this an oversight, a bug, or have I missed a configuration option?
Comment 1 Marcin Stolarek 2021-03-22 05:30:03 MDT
Mikael,

Can you share your pam configuration for sshd?

cheers,
Marcin
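
(For readers finding this ticket: a typical pam_slurm_adopt line in /etc/pam.d/sshd looks roughly like the sketch below; this is illustrative, not the reporter's actual configuration:

account    required    pam_slurm_adopt.so

The module runs in the account stack and moves the incoming sshd process into the cgroups of one of the user's jobs on that node.)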
Comment 2 Mikael Öhman 2021-03-23 03:53:45 MDT
It turns out systemd-logind wasn't disabled and masked on these nodes. The fact that the cpuset cgroup was working threw me off. Sorry for the noise.
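
For reference, the fix amounts to something like this on each compute node (a sketch for systemd-based systems; see the pam_slurm_adopt documentation for details):

systemctl stop systemd-logind
systemctl mask systemd-logind

and ensuring pam_systemd.so is not in the sshd PAM stack, so that logind does not move adopted sessions into its own user-session cgroups.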

Best regards, Mikael

*** This ticket has been marked as a duplicate of ticket 5920 ***