Ticket 11255

Summary: ssh session escapes pam_slurm_adopt after some time passes
Product: Slurm Reporter: John Hanks <john.b.hanks>
Component: Other    Assignee: Marshall Garey <marshall>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: lyeager, marshall
Version: 20.11.4   
Hardware: Linux   
OS: Linux   
Site: GSK

Description John Hanks 2021-03-30 11:06:45 MDT
Hi,

This may be related to 11128 but seems different enough to warrant a new bug.

Summary: when ssh'd in to a node running one of your jobs, the initial adoption into the job cgroup works as expected, but after some time the session gets moved out of the devices cgroup and all GPUs become visible:

Example:

jbh41678@login1 ~$ sbatch --partition=gpu --gpus=1 --time=7-00:00:00 --wrap "sleep 7d"
Submitted batch job 38804088

jbh41678@login1 ~$ squeue --format=%N -j 38804088
NODELIST
gpu-004

jbh41678@login1 ~$ ssh gpu-004
jbh41678@gpu-004 ~$ nvidia-smi -L
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-3e5d3417-a9ef-c850-8cca-f841628e1f10)

jbh41678@gpu-004 ~$ date
Tue Mar 30 12:03:22 EDT 2021

jbh41678@gpu-004 ~$ while [[ $(nvidia-smi -L | wc -l) == "1" ]]; do sleep 5m; done; date
Tue Mar 30 12:18:31 EDT 2021

jbh41678@gpu-004 ~$ nvidia-smi -L
GPU 0: GeForce GTX 1080 Ti (UUID: GPU-9da323a4-ec69-6393-9e0f-85e20e2032a4)
GPU 1: GeForce GTX 1080 Ti (UUID: GPU-3e5d3417-a9ef-c850-8cca-f841628e1f10)
GPU 2: GeForce GTX 1080 Ti (UUID: GPU-e78192b2-2407-1db4-c29e-05c29d76d83f)
GPU 3: GeForce GTX 1080 Ti (UUID: GPU-7e36ddb2-56cb-08de-5841-ece4643956b1)
jbh41678@gpu-004 ~$


View of ssh cgroup before/after:

jbh41678@gpu-004 ~$ cat /proc/$$/cgroup
22:devices:/slurm/uid_32305/job_38804088/step_extern
21:blkio:/system.slice/sshd.service
20:perf_event:/
19:cpuset:/slurm/uid_32305/job_38804088/step_extern
18:hugetlb:/
17:memory:/slurm/uid_32305/job_38804088/step_extern
16:cpuacct,cpu:/system.slice/sshd.service
15:freezer:/slurm/uid_32305/job_38804088/step_extern
14:pids:/system.slice/sshd.service
13:net_prio,net_cls:/
1:name=systemd:/system.slice/sshd.service

jbh41678@gpu-004 ~$ cat /proc/$$/cgroup
22:devices:/system.slice/sshd.service
21:blkio:/system.slice/sshd.service
20:perf_event:/
19:cpuset:/slurm/uid_32305/job_38804088/step_extern
18:hugetlb:/
17:memory:/system.slice/sshd.service
16:cpuacct,cpu:/system.slice/sshd.service
15:freezer:/slurm/uid_32305/job_38804088/step_extern
14:pids:/system.slice/sshd.service
13:net_prio,net_cls:/
1:name=systemd:/system.slice/sshd.service


We lose the devices and memory cgroups after about 15 minutes. 

Please let me know what additional information I should collect and send for this.
Comment 2 Jason Booth 2021-03-30 12:58:15 MDT
Would you please check whether you have all the "pam_systemd.so" lines commented out? For example:

/etc/pam.d/fingerprint-auth:#-session     optional      pam_systemd.so
/etc/pam.d/password-auth:#-session     optional      pam_systemd.so
/etc/pam.d/runuser-l:#-session  optional        pam_systemd.so
/etc/pam.d/smartcard-auth:#-session     optional      pam_systemd.so
/etc/pam.d/system-auth:#-session     optional      pam_systemd.so

We think that systemd might be stealing back the cgroups from Slurm.
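A quick way to audit this across the whole PAM stack (just a sketch; file names and paths vary by distro, so check against your own /etc/pam.d contents):

```shell
# Print every uncommented pam_systemd line under /etc/pam.d.
# Any hits mean systemd can still register the ssh session with logind
# and later reclaim its cgroups from Slurm.
grep -R --line-number "^[^#]*pam_systemd" /etc/pam.d/
```

No output means every pam_systemd reference is commented out (or absent).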
Comment 3 John Hanks 2021-03-30 13:09:09 MDT
I have it commented out in system-auth and password-auth. 

We do not use runuser-l, fingerprint-auth or smartcard-auth, but I can comment it out there for the sake of completeness.

@kilian pointed me to an old thread here: https://lists.schedmd.com/pipermail/slurm-users/2018-August/001804.html which implies I also need to stop/disable/mask systemd-logind on the nodes. Testing this now.
Comment 4 Marshall Garey 2021-03-30 13:36:49 MDT
(In reply to John Hanks from comment #3)
> I have it commented out in system-auth and password-auth. 
> 
> We do not use runuser-l, fingerprint-auth or smartcard-auth, but I can
> comment it out there for the sake of completeness.
> 
> @kilian pointed me to an old thread here:
> https://lists.schedmd.com/pipermail/slurm-users/2018-August/001804.html
> which implies I also need to stop/disable/mask systemd-logind on the nodes.
> Testing this now.

Yes, you definitely need to do that as well. Let me know if it works for you.

For what it's worth, we do warn about this in our pam_slurm_adopt documentation page (it's just a couple of sentences about systemd):

https://slurm.schedmd.com/pam_slurm_adopt.html

"The pam_systemd module will conflict with pam_slurm_adopt, so you need to disable it in all files that are included in sshd or system-auth (e.g. password-auth, common-session, etc.). You should also stop and mask systemd-logind."
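On a systemd-based node, the "stop and mask" step usually amounts to the following (a sketch to run as root; masking matters because logind is D-Bus-activated and can come back even when not enabled):

```shell
# Stop the running logind instance, prevent it from starting at boot,
# and mask it so D-Bus activation cannot restart it behind your back.
systemctl stop systemd-logind
systemctl disable systemd-logind
systemctl mask systemd-logind
```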


systemd wants to own all cgroups everywhere, so unless it is disabled it can steal cgroups back from Slurm. We've talked to the systemd developers about this. They are of the opinion that only systemd should ever modify cgroups, so they won't change. Of course that's not how Slurm works, so we have to tell people to disable systemd's session management. This is also why the patches in bug 5920 haven't been included in Slurm - they are an attempt to work around systemd-logind, but there's no guarantee that systemd won't steal back cgroups somewhere else.
Comment 5 John Hanks 2021-03-31 06:14:45 MDT
Removing `systemd-logind` does solve this on compute nodes: a job left overnight with an ssh session did not have the ssh session consumed by systemd.

We are still left with a slight problem: we use all our nodes to run desktop sessions:

On "pure" compute nodes, via OnDemand and VNC
On shared workstations, via both OnDemand VNC and direct login/VNC/ssh tunnel (where we want adoption to work)
On dedicated workstations, via NoMachine Enterprise Desktop (where we do not want adoption in the NoMachine desktop, but do allow a subset of each node's resources to be used by jobs)

It's not entirely clear to me yet what effect (if any) disabling systemd-logind will have on these desktop sessions, so we have a lot more testing to do. Hopefully there is no impact.

A long term solution for Slurm and systemd living in harmony would be nice. I totally get that it's deeply soul satisfying to blame systemd, I personally scream "POETTERING!!!!" and shake my fist at the sky if I find I've run out of milk for my Cinnamon Toast Crunch in the morning. But that doesn't fill my bowl with milk, unfortunately. 

I've read that page many times and have run into, and subsequently forgotten about, this issue before; I should have made a checklist somewhere. It might also be nice if that short paragraph were expanded into a more complete discussion of the problem and the current solutions/workarounds. I mean, can the evils of systemd really be explained in a paragraph that short? I think the effort to explain those evils starts with Volume 1...

You can probably mark this resolved as INFOGIVEN. If I find any more issues around getting our desktop needs sorted I'll post them here for posterity, but there is no more searching for the guilty to be done as far as I can tell.

Thanks for your help.
Comment 6 Marshall Garey 2021-03-31 10:08:24 MDT
I'm glad things are working now.

As for Slurm cgroups not playing nice with systemd - we aren't planning to change the current cgroup plugins (task/cgroup and proctrack/cgroup) to play nice with systemd, since that would be a total change in how they work. We are currently working on a plugin for cgroups v2, but that plugin will work similarly to cgroups v1 and won't play nice with systemd. After that, we hope to develop a "systemd" cgroups plugin that will play nice with systemd, but don't have a timeline for it yet.

I laughed at your response about it needing volumes to write about the problems with systemd. :) One thing to remember is that Slurm's development started in 2002. Although I don't think the cgroup plugins were developed all the way back in 2002, they certainly were developed before systemd existed and was adopted on most Linux distributions. So in this case, I think I can say with fairness that systemd broke Slurm's behavior, not the other way around. That said, we do hope to develop that systemd cgroup plugin as a long-term solution to this problem.

> if I find any more issues around getting our Desktop needs sorted I'll post them
> here for posterity

Thanks, I'll definitely see them and it will be good information for us.

Per your response I'm closing this as infogiven.
Comment 7 John Hanks 2021-04-02 05:53:19 MDT
I straced this and I can see:

 1. scrontab opening the temp file and writing to it
 2. execve of vi on that temp file
 3. vi writing and closing the temp file
 4. scrontab reading from the file descriptor it still has open

Is it possible that when vi does:

[pid 94890] unlink("/local_scratch/scrontab-EsHnBW~") = -1 ENOENT (No such file or directory)
[pid 94890] rename("/local_scratch/scrontab-EsHnBW", "/local_scratch/scrontab-EsHnBW~") = 0
[pid 94890] open("/local_scratch/scrontab-EsHnBW", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 4
[pid 94890] write(4, "# Welcome to scrontab, Slurm's c"..., 1060) = 1060
[pid 94890] fsync(4)                    = 0

that scrontab's file handle now points at the old, deleted file, so it reads stale contents from it? Earlier I see scrontab open the file and write to it, but not close() it before calling vi:

open("/local_scratch/scrontab-EsHnBW", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
write(3, "# Welcome to scrontab, Slurm's c"..., 1027) = 1027
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2aaaaaafa150) = 94890
strace: Process 94890 attached
[pid 94890] set_robust_list(0x2aaaaaafa160, 24 <unfinished ...>
[pid 94889] wait4(94890,  <unfinished ...>
[pid 94890] <... set_robust_list resumed>) = 0
[pid 94890] execve("/cm/shared/slurm/current/bin/vi", ["vi", "/local_scratch/scrontab-EsHnBW"], 0x607130 /* 51 vars */) = -1 ENOENT (No such file or directory)
[pid 94890] execve("/usr/local/bin/vi", ["vi", "/local_scratch/scrontab-EsHnBW"], 0x607130 /* 51 vars */) = -1 ENOENT (No such file or directory)
[pid 94890] execve("/usr/bin/vi", ["vi", "/local_scratch/scrontab-EsHnBW"], 0x607130 /* 51 vars */) = 0

Then vi writes a new file, unlinking the one scrontab has open:

[pid 94890] unlink("/local_scratch/scrontab-EsHnBW~") = -1 ENOENT (No such file or directory)
[pid 94890] rename("/local_scratch/scrontab-EsHnBW", "/local_scratch/scrontab-EsHnBW~") = 0
[pid 94890] open("/local_scratch/scrontab-EsHnBW", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 4
[pid 94890] write(4, "# Welcome to scrontab, Slurm's c"..., 1060) = 1060
[pid 94890] fsync(4)                    = 0
[pid 94890] stat("/local_scratch/scrontab-EsHnBW", {st_mode=S_IFREG|0600, st_size=1060, ...}) = 0
[pid 94890] stat("/local_scratch/scrontab-EsHnBW", {st_mode=S_IFREG|0600, st_size=1060, ...}) = 0
[pid 94890] close(4)                    = 0
[pid 94890] chmod("/local_scratch/scrontab-EsHnBW", 0100600) = 0
[pid 94890] setxattr("/local_scratch/scrontab-EsHnBW", "system.posix_acl_access", "\2\0\0\0\1\0\6\0\377\377\377\377\4\0\0\0\377\377\377\377 \0\0\0\377\377\377\377", 28, 0) = 0
[pid 94890] write(1, " 35L, 1060C written", 19) = 19
[pid 94890] lseek(5, 0, SEEK_SET)       = 0
[pid 94890] write(5, "b0VIM 7.4\0\0\0\0\20\0\0\234\372f`a|\0\0\252r\1\0jbh4"..., 4096) = 4096
[pid 94890] stat("/local_scratch/scrontab-EsHnBW", {st_mode=S_IFREG|0600, st_size=1060, ...}) = 0
[pid 94890] unlink("/local_scratch/scrontab-EsHnBW~") = 0
[pid 94890] write(1, "\r\r\n\33[?1l\33>", 10) = 10
[pid 94890] write(1, "\33[?12l\33[?25h\33[?1049l", 20) = 20
[pid 94890] close(5)                    = 0
[pid 94890] unlink("/local_scratch/.scrontab-EsHnBW.swp") = 0

And after vi exits, it looks like scrontab reads from its still-open fd pointing at the now-unlinked original scrontab file:

[pid 94890] brk(NULL)                   = 0x738000
[pid 94890] brk(NULL)                   = 0x738000
[pid 94890] brk(0x72b000)               = 0x72b000
[pid 94890] brk(NULL)                   = 0x72b000
[pid 94890] exit_group(0)               = ?
[pid 94890] +++ exited with 0 +++
<... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 94890
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=94890, si_uid=32305, si_status=0, si_utime=0, si_stime=1} ---
lseek(3, 0, SEEK_SET)                   = 0
read(3, "# Welcome to scrontab, Slurm's c"..., 4096) = 1027
read(3, "", 3069)                       = 0
close(3)                                = 0
unlink("/local_scratch/scrontab-EsHnBW") = 0
getgid()                                = 1028
getuid()                                = 32305


Actually, after writing all that, I realized I could test this by just making vi write out the file while I watch. I made a little vi wrapper to give me time to look:

jbh41678@login1 ~$ cat tmpvi
#!/bin/bash
vi "$@"
sleep 120

Then

EDITOR=$PWD/tmpvi scrontab

Start scrontab and the open files are:

jbh41678@login1 $ lsof 2> /dev/null | grep scrontab | grep local
scrontab  102067              jbh41678    3u      REG   8,17      1027     31841 /local_scratch/scrontab-tYBgOJ
tmpvi     102068              jbh41678    3u      REG   8,17      1027     31841 /local_scratch/scrontab-tYBgOJ
vi        102069              jbh41678    3u      REG   8,17      1027     31841 /local_scratch/scrontab-tYBgOJ
vi        102069              jbh41678    5u      REG   8,17     12288     31842 /local_scratch/.scrontab-tYBgOJ.swp

Make vi write, but not quit and they are:

jbh41678@login1 $ lsof 2> /dev/null | grep scrontab | grep local
scrontab  102067              jbh41678    3u      REG   8,17      1027     31841 /local_scratch/scrontab-tYBgOJ~ (deleted)
tmpvi     102068              jbh41678    3u      REG   8,17      1027     31841 /local_scratch/scrontab-tYBgOJ~ (deleted)
vi        102069              jbh41678    3u      REG   8,17      1027     31841 /local_scratch/scrontab-tYBgOJ~ (deleted)
vi        102069              jbh41678    5u      REG   8,17     12288     31842 /local_scratch/.scrontab-tYBgOJ.swp

Cat the still-open fd (`cat /proc/102067/fd/3`, where 102067 is scrontab's PID) and it still shows the old file contents.

:wq in vi and they are:

jbh41678@login1 $ lsof 2> /dev/null | grep scrontab | grep local
scrontab  102067              jbh41678    3u      REG   8,17      1027     31841 /local_scratch/scrontab-tYBgOJ~ (deleted)
tmpvi     102068              jbh41678    3u      REG   8,17      1027     31841 /local_scratch/scrontab-tYBgOJ~ (deleted)
sleep     102370              jbh41678    3u      REG   8,17      1027     31841 /local_scratch/scrontab-tYBgOJ~ (deleted)

with /proc/102067/fd/3 still having the contents of the original scrontab; no changes visible in there.

If I then diff what scrontab has open against the actual temp file, I can see the temp file did get updated:

jbh41678@login1 $ diff /proc/103497/fd/3 /local_scratch/scrontab-1IyfAv
32a33
> # NEW CONTENTS ADDED

It seems like scrontab should be closing that file before handing it to vi, then re-opening it by path after vi exits. Although that seems too simple an explanation, since I can't see why it ever worked before.
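The fd-vs-path mismatch above is easy to reproduce outside scrontab. A minimal sketch (plain POSIX shell, throwaway temp file; the `mv` plus recreate mimics vi's backup-then-write sequence):

```shell
#!/bin/sh
# An open descriptor follows the inode, not the name, so an editor that
# renames the original and creates a fresh file at the same path leaves
# the old contents on any fd that was opened before the edit.
tmp=$(mktemp)
printf 'old contents\n' > "$tmp"
exec 3< "$tmp"                      # scrontab's fd: opened before the editor runs
mv "$tmp" "$tmp~"                   # vi renames the original to a backup...
printf 'new contents\n' > "$tmp"    # ...and creates a new file at the same path
cat <&3                             # prints "old contents" -- the renamed inode
cat "$tmp"                          # prints "new contents" -- re-opened by path
```

Re-opening the file by path after the editor exits (instead of lseek'ing the stale fd back to zero) picks up the edited contents, which matches the fix suggested above.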

griznog
Comment 8 John Hanks 2021-04-02 05:54:52 MDT
Sorry, wrong bug for the above comment. Too many open bugs ...