Summary: | slurm_pam_adopt and cgroup devices subsystem | ||
---|---|---|---|
Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
Component: | Configuration | Assignee: | Brian Christiansen <brian> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | da, ryan_cox, sthiell |
Version: | 15.08.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Stanford | ||
Version Fixed: | 15.08.6 16.05.0pre1 | Target Release: | --- |
Description
Kilian Cavalotti
2015-11-10 09:42:05 MST
One more thing: I just noticed that if I manually add the PID of the external SSH shell to the step_0 cgroup instead of the step_extern one, the device access restriction works correctly. So it looks like, despite what the logs seem to indicate, the devices cgroup is not correctly configured for the step_extern cgroup.

Cheers,
Kilian

Kilian, I am working with Ryan right now on making this module work correctly. You can follow along through bug 2097 if you would like. I am not sure if it will fix your issue or not, but currently things aren't correct on many fronts. I would strongly suggest using jobacct_gather/linux; FYI, cgroup doesn't buy you anything (except for slowing things down).

Hi Danny, noted for jobacct_gather/linux. I'll take a look at #2097.

Thanks!
Kilian

This is fixed in the following commits:
https://github.com/SchedMD/slurm/commit/3101754f5074c56408d9a2f62afe42b857b7c296
https://github.com/SchedMD/slurm/commit/7f39ab4f1e4ab182ac65230b292759563e9a56e7

A lot has changed in how pam_slurm_adopt works, but what you were most likely experiencing was that the step_extern cgroup was explicitly denying access to the GPUs. We've changed it so that the devices step_extern cgroup inherits the attributes of the parent job_<jobid> cgroup -- the first commit. Please reopen if you have any issues.

Thanks,
Brian

Hi, just upgraded to 15.08.5. slurm_pam_adopt does indeed correctly set the cpuset and freezer constraints on the step_extern, but we're not able to make it work for devices (nor RAM). We added ConstrainDevices=yes in cgroup.conf. Job 12380 is running, and I connect to the running node using pam_slurm_adopt:

     6464 ?      Ss     0:00  \_ sshd: sthiell [priv]
     6469 ?      S      0:00  |   \_ sshd: sthiell@pts/5
     6470 pts/5  Ss+    0:00  |       \_ -bash

Only the sshd PID running under root is found in /cgroup/devices/slurm/uid_282232/job_12380/step_extern/tasks. It looks like the sshd user process and other child PIDs are not added to the step_extern devices and memory cgroups.

    [root@xs-0060 ~]# cat /proc/6464/cgroup
    4:devices:/slurm/uid_282232/job_12380/step_extern
    3:cpuset:/slurm/uid_282232/job_12380/step_extern
    2:freezer:/slurm/uid_282232/job_12380/step_extern
    1:memory:/slurm/uid_282232/job_12380/step_extern
    [root@xs-0060 ~]# cat /proc/6469/cgroup
    4:devices:/
    3:cpuset:/slurm/uid_282232/job_12380/step_extern
    2:freezer:/slurm/uid_282232/job_12380/step_extern
    1:memory:/
    [root@xs-0060 ~]# cat /proc/6470/cgroup
    4:devices:/
    3:cpuset:/slurm/uid_282232/job_12380/step_extern
    2:freezer:/slurm/uid_282232/job_12380/step_extern
    1:memory:/

Any ideas?

Thanks,
Stephane
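For anyone running the same check by hand, here is a small convenience sketch around the /proc/<pid>/cgroup and tasks-file checks shown above: it asks, for each PID of the adopted SSH session, whether that PID appears in the step_extern tasks file of every subsystem. The PIDs, job ID, UID, and the /cgroup mount point are the example values from this report and will differ on other nodes.

```bash
#!/bin/bash
# Sketch: check whether each PID of the adopted SSH session is present in
# the step_extern tasks file of every cgroup subsystem. The PIDs, job ID,
# UID, and the /cgroup mount point are example values from this report;
# adjust them for your own node.
JOBID=12380
JOB_UID=282232
STEP_DIR="slurm/uid_${JOB_UID}/job_${JOBID}/step_extern"

for pid in 6464 6469 6470; do
    echo "== PID ${pid} =="
    cat "/proc/${pid}/cgroup"
    for subsys in devices cpuset freezer memory; do
        if grep -qx "${pid}" "/cgroup/${subsys}/${STEP_DIR}/tasks" 2>/dev/null; then
            echo "  ${subsys}: PID is in step_extern"
        else
            echo "  ${subsys}: PID is NOT in step_extern"
        fi
    done
done
```

On an affected node this should report the user-owned PIDs as missing from the devices and memory step_extern cgroups, matching the /proc output above.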
I'm able to reproduce a similar situation as well. In my case, it appears to be a race condition where the child processes are forked before the parent process is added to the cgroup. We'll work on a patch and get back to you. It is odd that in your case it only happens for memory and devices. Does this happen every time for you?

Thanks,
Brian

The situation that I found is fixed by this commit:
https://github.com/SchedMD/slurm/commit/c7fa3f8f08695502a0076ac1085797570aaaa525

Will you try this commit, or 15.08.6, and see if you still see the same behavior?

Thanks,
Brian

Hi Brian, that's great: I just upgraded from 15.08.5 to 15.08.6 and the problem is solved! The cgroups for GPU devices and memory are now set correctly, for both job step and step_extern PIDs with pam_slurm_adopt. Thank you for the Christmas gift!

Stephane

Great! Let us know if you see anything else.

Thanks,
Brian
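As a side note, one way to confirm after the upgrade that the devices step_extern cgroup now inherits the parent job cgroup's whitelist (the behavior described with the first commit above) is a comparison along these lines. This is only a minimal sketch: the job ID, UID, and the /cgroup mount point are the example values from this report and may need adjusting.

```bash
#!/bin/bash
# Sketch: compare the devices whitelist of the job cgroup with that of its
# step_extern child. After the fix, step_extern should carry the parent's
# entries instead of explicitly denying the GPU devices. Job ID, UID, and
# the /cgroup mount point are example values from this report.
JOBID=12380
JOB_UID=282232
BASE="/cgroup/devices/slurm/uid_${JOB_UID}/job_${JOBID}"

echo "== job cgroup devices.list =="
cat "${BASE}/devices.list"

echo "== step_extern devices.list =="
cat "${BASE}/step_extern/devices.list"

# An empty diff means step_extern has the same device permissions as the
# parent job cgroup.
if diff -q "${BASE}/devices.list" "${BASE}/step_extern/devices.list" >/dev/null; then
    echo "step_extern inherits the job cgroup's device permissions"
else
    echo "step_extern differs from the job cgroup"
fi
```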