Summary: | slurm_pam_adopt and cgroup devices subsystem | ||
---|---|---|---|
Product: | Slurm | Reporter: | Kilian Cavalotti <kilian> |
Component: | Configuration | Assignee: | Brian Christiansen <brian> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | CC: | da, ryan_cox, sthiell |
Version: | 15.08.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Stanford | ||
Version Fixed: | 15.08.6 16.05.0pre1 | Target Release: | --- |
Description
Kilian Cavalotti
2015-11-10 09:42:05 MST
One more thing: I just noticed that if I manually add the PID of the external SSH shell to the step_0 cgroup instead of the step_extern one, the device access restriction works correctly. So it looks like, despite what the logs seem to indicate, the devices cgroup is not correctly configured for the step_extern cgroup.

Cheers,
Kilian

Kilian, I am working with Ryan right now on making this module work correctly. You can follow along through bug 2097 if you would like. I am not sure if it will fix your issue or not, but currently things aren't correct on many fronts. I would strongly suggest using jobacct_gather/linux; FYI, cgroup doesn't buy you anything (except for slowing things down).

Hi Danny, noted for jobacct_gather/linux. I'll take a look at #2097.

Thanks!
Kilian

This is fixed in the following commits:
https://github.com/SchedMD/slurm/commit/3101754f5074c56408d9a2f62afe42b857b7c296
https://github.com/SchedMD/slurm/commit/7f39ab4f1e4ab182ac65230b292759563e9a56e7

A lot has changed in how pam_slurm_adopt works, but what you were most likely experiencing was that the step_extern cgroup was explicitly denying access to the GPUs. We've changed it so that the devices step_extern cgroup inherits the attributes of the parent job_<jobid> cgroup -- the first commit. Please reopen if you have any issues.

Thanks,
Brian

Hi, just upgraded to 15.08.5. slurm_pam_adopt does indeed correctly set the cpuset and freezer constraints on the step_extern, but we're not able to make it work for devices (nor RAM). We added ConstrainDevices=yes in cgroup.conf. Job 12380 is running, and I connect to the running node using pam_slurm_adopt:

     6464 ?      Ss     0:00  \_ sshd: sthiell [priv]
     6469 ?      S      0:00  |   \_ sshd: sthiell@pts/5
     6470 pts/5  Ss+    0:00  |       \_ -bash

Only the sshd PID running under root is found in /cgroup/devices/slurm/uid_282232/job_12380/step_extern/tasks. It looks like the sshd user process and other child PIDs are not added to the step_extern devices and memory cgroups.

    [root@xs-0060 ~]# cat /proc/6464/cgroup
    4:devices:/slurm/uid_282232/job_12380/step_extern
    3:cpuset:/slurm/uid_282232/job_12380/step_extern
    2:freezer:/slurm/uid_282232/job_12380/step_extern
    1:memory:/slurm/uid_282232/job_12380/step_extern
    [root@xs-0060 ~]# cat /proc/6469/cgroup
    4:devices:/
    3:cpuset:/slurm/uid_282232/job_12380/step_extern
    2:freezer:/slurm/uid_282232/job_12380/step_extern
    1:memory:/
    [root@xs-0060 ~]# cat /proc/6470/cgroup
    4:devices:/
    3:cpuset:/slurm/uid_282232/job_12380/step_extern
    2:freezer:/slurm/uid_282232/job_12380/step_extern
    1:memory:/

Any ideas?

Thanks,
Stephane
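For anyone running the same check by hand, here is a small convenience sketch around the /proc/<pid>/cgroup and tasks-file checks shown above: it asks, for each PID of the adopted SSH session, whether that PID appears in the step_extern tasks file of every subsystem. The PIDs, job ID, UID, and the /cgroup mount point are the example values from this report and will differ on other nodes.

```bash
#!/bin/bash
# Sketch: check whether each PID of the adopted SSH session is present in
# the step_extern tasks file of every cgroup subsystem. The PIDs, job ID,
# UID, and the /cgroup mount point are example values from this report;
# adjust them for your own node.
JOBID=12380
JOB_UID=282232
STEP_DIR="slurm/uid_${JOB_UID}/job_${JOBID}/step_extern"

for pid in 6464 6469 6470; do
    echo "== PID ${pid} =="
    cat "/proc/${pid}/cgroup"
    for subsys in devices cpuset freezer memory; do
        if grep -qx "${pid}" "/cgroup/${subsys}/${STEP_DIR}/tasks" 2>/dev/null; then
            echo "  ${subsys}: PID is in step_extern"
        else
            echo "  ${subsys}: PID is NOT in step_extern"
        fi
    done
done
```

On an affected node this should report the user-owned PIDs as missing from the devices and memory step_extern cgroups, matching the /proc output above.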
I'm able to reproduce a similar situation as well. In my case, it appears to be a race condition where the child processes are forked before the parent process is added to the cgroup. We'll work on a patch and get back to you. It is odd that in your case it only happens for memory and devices. Does this happen every time for you?

Thanks,
Brian

The situation that I found is fixed by this commit:
https://github.com/SchedMD/slurm/commit/c7fa3f8f08695502a0076ac1085797570aaaa525

Will you try this commit, or 15.08.6, and see if you still see the same behavior?

Thanks,
Brian

Hi Brian, that's great: I just upgraded from 15.08.5 to 15.08.6 and the problem is solved! The cgroups for GPU devices and memory are now set correctly, for both job step and step_extern PIDs with pam_slurm_adopt. Thank you for the Christmas gift!

Stephane

Great! Let us know if you see anything else.

Thanks,
Brian
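As a side note, one way to confirm after the upgrade that the devices step_extern cgroup now inherits the parent job cgroup's whitelist (the behavior described with the first commit above) is a comparison along these lines. This is only a minimal sketch: the job ID, UID, and the /cgroup mount point are the example values from this report and may need adjusting.

```bash
#!/bin/bash
# Sketch: compare the devices whitelist of the job cgroup with that of its
# step_extern child. After the fix, step_extern should carry the parent's
# entries instead of explicitly denying the GPU devices. Job ID, UID, and
# the /cgroup mount point are example values from this report.
JOBID=12380
JOB_UID=282232
BASE="/cgroup/devices/slurm/uid_${JOB_UID}/job_${JOBID}"

echo "== job cgroup devices.list =="
cat "${BASE}/devices.list"

echo "== step_extern devices.list =="
cat "${BASE}/step_extern/devices.list"

# An empty diff means step_extern has the same device permissions as the
# parent job cgroup.
if diff -q "${BASE}/devices.list" "${BASE}/step_extern/devices.list" >/dev/null; then
    echo "step_extern inherits the job cgroup's device permissions"
else
    echo "step_extern differs from the job cgroup"
fi
```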