| Summary: | ssh session escapes pam_slurm_adopt after some time passes | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | John Hanks <john.b.hanks> |
| Component: | Other | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | lyeager, marshall |
| Version: | 20.11.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | GSK | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
|
Description
John Hanks
2021-03-30 11:06:45 MDT
Would you please check whether you have all the "pam_systemd.so" lines commented out, for example:

/etc/pam.d/fingerprint-auth:#-session optional pam_systemd.so
/etc/pam.d/password-auth:#-session optional pam_systemd.so
/etc/pam.d/runuser-l:#-session optional pam_systemd.so
/etc/pam.d/smartcard-auth:#-session optional pam_systemd.so
/etc/pam.d/system-auth:#-session optional pam_systemd.so

We think that systemd might be stealing back the cgroups from Slurm.

I have it commented out in system-auth and password-auth. We do not use runuser-l, fingerprint-auth or smartcard-auth, but I can comment it out there for the sake of completeness. @kilian pointed me to an old thread here: https://lists.schedmd.com/pipermail/slurm-users/2018-August/001804.html which implies I also need to stop/disable/mask systemd-logind on the nodes. Testing this now.

(In reply to John Hanks from comment #3)
> I have it commented out in system-auth and password-auth.
>
> We do not use runuser-l, fingerprint-auth or smartcard-auth, but I can
> comment it out there for the sake of completeness.
>
> @kilian pointed me to an old thread here:
> https://lists.schedmd.com/pipermail/slurm-users/2018-August/001804.html
> which implies I also need to stop/disable/mask systemd-logind on the nodes.
> Testing this now.

Yes, you definitely need to do that as well. Let me know if it works for you. For what it's worth, we do warn about this in our pam_slurm_adopt documentation page (it's just a couple of sentences about systemd): https://slurm.schedmd.com/pam_slurm_adopt.html

"The pam_systemd module will conflict with pam_slurm_adopt, so you need to disable it in all files that are included in sshd or system-auth (e.g. password-auth, common-session, etc.). You should also stop and mask systemd-logind."

systemd wants to own all cgroups everywhere, so unless it's disabled it can steal cgroups back from Slurm. We've talked to systemd about this.
They are of the opinion that only systemd should ever modify cgroups, so they won't change. Of course that's not how Slurm works, so we have to tell people to disable systemd. This is also why the patches in bug 5920 haven't been included in Slurm - they are an attempt to work around systemd-logind, but there's no guarantee that systemd won't steal back cgroups somewhere else.

Removing `systemd-logind` does solve this on compute nodes: a job left overnight with an ssh session did not have the ssh session consumed by systemd. We are still left with a slight problem, though; we use all our nodes to run desktop sessions:

- On "pure" compute nodes, via OnDemand and VNC
- On shared workstations, via both OnDemand VNC and direct login/vnc/ssh tunnel (where we want adoption to work)
- On dedicated workstations, via NoMachine Enterprise Desktop (where we do not want adoption in the NoMachine desktop, but do allow a subset of each node's resources to be used by jobs)

It's not entirely clear to me yet what effect (if any) disabling systemd-logind will have on these desktop sessions, so we have a lot more testing to do. Hopefully there is no impact. A long-term solution for Slurm and systemd living in harmony would be nice.

I totally get that it's deeply soul-satisfying to blame systemd; I personally scream "POETTERING!!!!" and shake my fist at the sky if I find I've run out of milk for my Cinnamon Toast Crunch in the morning. But that doesn't fill my bowl with milk, unfortunately. I've read that page many times, and I've run into (and subsequently forgotten about) this issue before; I should have made a checklist somewhere. It might also be nice if that short paragraph were expanded into a more complete discussion of the problem and the current solutions/workarounds for it? I mean, can the evils of systemd really be explained in a paragraph that short? I think the effort to explain those evils starts with Volume 1...
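For anyone following along, the "stop/disable/mask systemd-logind" step mentioned above boils down to a few systemctl commands on each compute node. This is only a sketch; whether your node image needs further changes (e.g. to PAM stacks, as discussed above) will vary by distro:

```shell
# Stop the running logind instance, prevent it from starting at boot,
# and mask the unit so other units cannot pull it back in.
systemctl stop systemd-logind
systemctl disable systemd-logind
systemctl mask systemd-logind
```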
Can probably mark this resolved as INFOGIVEN. If I find any more issues around getting our desktop needs sorted I'll post them here for posterity, but there is no more searching for the guilty to be done as far as I can tell. Thanks for your help.

I'm glad things are working now.
As for Slurm cgroups not playing nice with systemd - we aren't planning to change the current cgroup plugins (task/cgroup and proctrack/cgroup) to play nice with systemd, since that would be a total change in how they work. We are currently working on a plugin for cgroups v2, but that plugin will work similarly to cgroups v1 and won't play nice with systemd. After that, we hope to develop a "systemd" cgroups plugin that will play nice with systemd, but don't have a timeline for it yet.
I laughed at your comment that explaining the problems with systemd would take volumes. :) One thing to remember is that Slurm's development started in 2002. Although I don't think the cgroup plugins were developed all the way back in 2002, they were certainly developed before systemd existed and was adopted by most Linux distributions. So in this case, I think I can fairly say that systemd broke Slurm's behavior, not the other way around. That said, we do hope to develop that systemd cgroup plugin as a long-term solution to this problem.
> if I find any more issues around getting our Desktop needs sorted I'll post them
> here for posterity
Thanks, I'll definitely see them and it will be good information for us.
Per your response I'm closing this as infogiven.
I straced this and I can see
1. scrontab opening the temp file and writing to it
2. an execve of vi on that temp file
3. vi writing to and closing the temp file
4. scrontab reading from the file it still has open
Is it possible that when vi does:
[pid 94890] unlink("/local_scratch/scrontab-EsHnBW~") = -1 ENOENT (No such file or directory)
[pid 94890] rename("/local_scratch/scrontab-EsHnBW", "/local_scratch/scrontab-EsHnBW~") = 0
[pid 94890] open("/local_scratch/scrontab-EsHnBW", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 4
[pid 94890] write(4, "# Welcome to scrontab, Slurm's c"..., 1060) = 1060
[pid 94890] fsync(4) = 0
that scrontab's file handle now points at the old, deleted file, so that it reads from that? Earlier I see scrontab open the file and write to it, but not close() it before calling vi:
open("/local_scratch/scrontab-EsHnBW", O_RDWR|O_CREAT|O_EXCL, 0600) = 3
write(3, "# Welcome to scrontab, Slurm's c"..., 1027) = 1027
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2aaaaaafa150) = 94890
strace: Process 94890 attached
[pid 94890] set_robust_list(0x2aaaaaafa160, 24 <unfinished ...>
[pid 94889] wait4(94890, <unfinished ...>
[pid 94890] <... set_robust_list resumed>) = 0
[pid 94890] execve("/cm/shared/slurm/current/bin/vi", ["vi", "/local_scratch/scrontab-EsHnBW"], 0x607130 /* 51 vars */) = -1 ENOENT (No such file or directory)
[pid 94890] execve("/usr/local/bin/vi", ["vi", "/local_scratch/scrontab-EsHnBW"], 0x607130 /* 51 vars */) = -1 ENOENT (No such file or directory)
[pid 94890] execve("/usr/bin/vi", ["vi", "/local_scratch/scrontab-EsHnBW"], 0x607130 /* 51 vars */) = 0
Then vi writes a new file, unlinking the one scrontab has open:
[pid 94890] unlink("/local_scratch/scrontab-EsHnBW~") = -1 ENOENT (No such file or directory)
[pid 94890] rename("/local_scratch/scrontab-EsHnBW", "/local_scratch/scrontab-EsHnBW~") = 0
[pid 94890] open("/local_scratch/scrontab-EsHnBW", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 4
[pid 94890] write(4, "# Welcome to scrontab, Slurm's c"..., 1060) = 1060
[pid 94890] fsync(4) = 0
[pid 94890] stat("/local_scratch/scrontab-EsHnBW", {st_mode=S_IFREG|0600, st_size=1060, ...}) = 0
[pid 94890] stat("/local_scratch/scrontab-EsHnBW", {st_mode=S_IFREG|0600, st_size=1060, ...}) = 0
[pid 94890] close(4) = 0
[pid 94890] chmod("/local_scratch/scrontab-EsHnBW", 0100600) = 0
[pid 94890] setxattr("/local_scratch/scrontab-EsHnBW", "system.posix_acl_access", "\2\0\0\0\1\0\6\0\377\377\377\377\4\0\0\0\377\377\377\377 \0\0\0\377\377\377\377", 28, 0) = 0
[pid 94890] write(1, " 35L, 1060C written", 19) = 19
[pid 94890] lseek(5, 0, SEEK_SET) = 0
[pid 94890] write(5, "b0VIM 7.4\0\0\0\0\20\0\0\234\372f`a|\0\0\252r\1\0jbh4"..., 4096) = 4096
[pid 94890] stat("/local_scratch/scrontab-EsHnBW", {st_mode=S_IFREG|0600, st_size=1060, ...}) = 0
[pid 94890] unlink("/local_scratch/scrontab-EsHnBW~") = 0
[pid 94890] write(1, "\r\r\n\33[?1l\33>", 10) = 10
[pid 94890] write(1, "\33[?12l\33[?25h\33[?1049l", 20) = 20
[pid 94890] close(5) = 0
[pid 94890] unlink("/local_scratch/.scrontab-EsHnBW.swp") = 0
And after vi exits, it looks like scrontab reads from its still-open fd pointing at the now-unlinked original scrontab file:
[pid 94890] brk(NULL) = 0x738000
[pid 94890] brk(NULL) = 0x738000
[pid 94890] brk(0x72b000) = 0x72b000
[pid 94890] brk(NULL) = 0x72b000
[pid 94890] exit_group(0) = ?
[pid 94890] +++ exited with 0 +++
<... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 94890
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=94890, si_uid=32305, si_status=0, si_utime=0, si_stime=1} ---
lseek(3, 0, SEEK_SET) = 0
read(3, "# Welcome to scrontab, Slurm's c"..., 4096) = 1027
read(3, "", 3069) = 0
close(3) = 0
unlink("/local_scratch/scrontab-EsHnBW") = 0
getgid() = 1028
getuid() = 32305
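The behavior in the strace above is just POSIX rename/unlink semantics: an open file descriptor keeps referring to its original inode even after the path is renamed away and a new file is created at that path. A minimal Python sketch (hypothetical paths, not scrontab's actual code) reproduces it:

```python
import os
import tempfile

# Create a temp file and keep a file descriptor open, like scrontab does.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "scrontab-demo")
fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
os.write(fd, b"original contents\n")

# Simulate vi's save: rename the original aside as a backup, then
# create a brand-new file (a new inode) at the same path.
os.rename(path, path + "~")
with open(path, "w") as f:
    f.write("edited contents\n")

# The still-open fd refers to the old inode, so rereading through it
# returns the pre-edit contents -- exactly what scrontab sees.
os.lseek(fd, 0, os.SEEK_SET)
stale = os.read(fd, 4096)
with open(path) as f:
    current = f.read()
os.close(fd)
```

Here `stale` holds the original contents while `current` holds the edited file, matching the lsof output below where scrontab's fd 3 shows up as "(deleted)".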
Actually, after writing all that I realized I could test by just making vi write out the file. I made a little vi wrapper to give me time to look:
jbh41678@login1 ~$ cat tmpvi
#!/bin/bash
vi "$@"
sleep 120
Then
EDITOR=$PWD/tmpvi scrontab
Start scrontab and the open files are:
jbh41678@login1 $ lsof 2> /dev/null | grep scrontab | grep local
scrontab 102067 jbh41678 3u REG 8,17 1027 31841 /local_scratch/scrontab-tYBgOJ
tmpvi 102068 jbh41678 3u REG 8,17 1027 31841 /local_scratch/scrontab-tYBgOJ
vi 102069 jbh41678 3u REG 8,17 1027 31841 /local_scratch/scrontab-tYBgOJ
vi 102069 jbh41678 5u REG 8,17 12288 31842 /local_scratch/.scrontab-tYBgOJ.swp
Make vi write but not quit, and the open files are:
jbh41678@login1 $ lsof 2> /dev/null | grep scrontab | grep local
scrontab 102067 jbh41678 3u REG 8,17 1027 31841 /local_scratch/scrontab-tYBgOJ~ (deleted)
tmpvi 102068 jbh41678 3u REG 8,17 1027 31841 /local_scratch/scrontab-tYBgOJ~ (deleted)
vi 102069 jbh41678 3u REG 8,17 1027 31841 /local_scratch/scrontab-tYBgOJ~ (deleted)
vi 102069 jbh41678 5u REG 8,17 12288 31842 /local_scratch/.scrontab-tYBgOJ.swp
Cat the contents of scrontab's open fd (jbh41678@login1 ~$ cat /proc/102067/fd/3, where 102067 is the scrontab PID) and it is still the old file.
:wq in vi and they are:
jbh41678@login1 $ lsof 2> /dev/null | grep scrontab | grep local
scrontab 102067 jbh41678 3u REG 8,17 1027 31841 /local_scratch/scrontab-tYBgOJ~ (deleted)
tmpvi 102068 jbh41678 3u REG 8,17 1027 31841 /local_scratch/scrontab-tYBgOJ~ (deleted)
sleep 102370 jbh41678 3u REG 8,17 1027 31841 /local_scratch/scrontab-tYBgOJ~ (deleted)
with /proc/102067/fd/3 still holding the contents of the original scrontab file; no changes are visible there.
If I then diff what scrontab has open against the actual temp file, I can see the temp file did get updated:
jbh41678@login1 $ diff /proc/103497/fd/3 /local_scratch/scrontab-1IyfAv
32a33
> # NEW CONTENTS ADDED
Seems like scrontab should close that file before handing it to vi, then re-open it by path after vi exits. Although this seems like an overly simplistic explanation, since I can't see why it ever worked before?
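That close-then-reopen fix can be sketched in a few lines of Python (a hypothetical helper, not Slurm's actual code): because the file is reopened by name after the editor exits, a rename-and-recreate save picks up the new inode.

```python
import os
import subprocess
import tempfile

def edit_and_reread(path, editor_cmd):
    """Spawn an editor on path, then reopen the file by name so a
    rename-and-recreate save (which produces a new inode) is seen."""
    subprocess.run(editor_cmd + [path], check=True)
    with open(path) as f:  # a fresh open resolves the current inode
        return f.read()

# Demo with a fake "editor" that saves the way vi does:
# rename the original aside, then write a new file at the same path.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "scrontab-demo")
with open(path, "w") as f:
    f.write("original\n")
fake_editor = ["sh", "-c", 'mv "$1" "$1~"; echo edited > "$1"', "editor"]
contents = edit_and_reread(path, fake_editor)
```

After the call, `contents` holds the edited text rather than the stale original, which is the behavior one would want from scrontab here.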
griznog
Sorry, wrong bug for the above comment. Too many open bugs ...