When using pam_slurm_adopt, if users have SSH keys configured, my ssh session is adopted into the job cgroup: $ ssh p-sc-2404 Last login: Wed Jul 31 14:37:55 2024 from 146.186.150.15 [gbs35@p-sc-2404 ~]$ cat /proc/self/cgroup 12:pids:/system.slice/sshd.service 11:cpuset:/slurm/uid_767041/job_17386189/step_extern 10:blkio:/system.slice/sshd.service 9:perf_event:/ 8:memory:/slurm/uid_767041/job_17386189/step_extern/task_0 7:hugetlb:/ 6:net_cls,net_prio:/ 5:rdma:/ 4:devices:/slurm/uid_767041/job_17386189/step_extern 3:cpu,cpuacct:/slurm/uid_767041/job_17386189/step_extern/task_0 2:freezer:/slurm/uid_767041/job_17386189/step_extern 1:name=systemd:/system.slice/sshd.service However, if I move the authorized_keys file, a password is required and my session isn't part of the job: [gbs35@p-sc-2404 ~]$ mv .ssh/authorized_keys .ssh/authorized_keys.save [gbs35@p-sc-2404 ~]$ exit logout Connection to p-sc-2404 closed. [gbs35@submit05 .ssh]$ ssh p-sc-2404 Password: Last login: Wed Jul 31 14:43:48 2024 from 146.186.150.15 [gbs35@p-sc-2404 ~]$ cat /proc/self/cgroup 12:pids:/system.slice/sshd.service 11:cpuset:/ 10:blkio:/system.slice/sshd.service 9:perf_event:/ 8:memory:/system.slice/sshd.service 7:hugetlb:/ 6:net_cls,net_prio:/ 5:rdma:/ 4:devices:/system.slice/sshd.service 3:cpu,cpuacct:/system.slice/sshd.service 2:freezer:/ 1:name=systemd:/system.slice/sshd.service Looking at /var/log/secure for the ssh logs I see: Jul 31 14:47:40 p-sc-2404 sshd[2207654]: pam_succeed_if(sshd:account): requirement "uid eq 0" not met by user "gbs35" Jul 31 14:47:40 p-sc-2404 sshd[2207654]: pam_slurm_adopt(sshd:account): ignoring unrecognized option 'debug' Jul 31 14:47:40 p-sc-2404 pam_slurm_adopt[2207654]: Connection by user gbs35: user has only one job 17386189 Jul 31 14:47:40 p-sc-2404 pam_slurm_adopt[2207654]: Process 2207654 adopted into job 17386189 Jul 31 14:47:40 p-sc-2404 sshd[2207654]: Accepted publickey for gbs35 from 146.186.150.15 port 37906 ssh2: RSA SHA256:CbliA1NrKqc/o1bSwSQq9uVMisMJB5J/xKMw1lxiIQ0 Jul 31 14:47:40 p-sc-2404 sshd[2207654]: pam_unix(sshd:session): session opened for user gbs35 by (uid=0) Jul 31 14:47:46 p-sc-2404 sshd[2207659]: Received disconnect from 146.186.150.15 port 37906:11: disconnected by user Jul 31 14:47:46 p-sc-2404 sshd[2207659]: Disconnected from user gbs35 146.186.150.15 port 37906 Jul 31 14:47:46 p-sc-2404 sshd[2207654]: pam_unix(sshd:session): session closed for user gbs35 Jul 31 14:47:51 p-sc-2404 sshd[2207742]: pam_sss(sshd:auth): authentication success; logname= uid=0 euid=0 tty=ssh ruser= rhost=146.186.150.15 user=gbs35 Jul 31 14:47:51 p-sc-2404 sshd[2207742]: pam_succeed_if(sshd:account): requirement "uid eq 0" not met by user "gbs35" Jul 31 14:47:51 p-sc-2404 sshd[2207742]: pam_slurm_adopt(sshd:account): ignoring unrecognized option 'debug' Jul 31 14:47:51 p-sc-2404 pam_slurm_adopt[2207742]: Connection by user gbs35: user has only one job 17386189 Jul 31 14:47:51 p-sc-2404 pam_slurm_adopt[2207742]: Process 2207742 adopted into job 17386189 Jul 31 14:47:51 p-sc-2404 sshd[2207740]: Accepted keyboard-interactive/pam for gbs35 from 146.186.150.15 port 57622 ssh2 Jul 31 14:47:51 p-sc-2404 sshd[2207740]: pam_unix(sshd:session): session opened for user gbs35 by (uid=0) Jul 31 14:47:59 p-sc-2404 sshd[2207756]: Received disconnect from 146.186.150.15 port 57622:11: disconnected by user Jul 31 14:47:59 p-sc-2404 sshd[2207756]: Disconnected from user gbs35 146.186.150.15 port 57622 Jul 31 14:47:59 p-sc-2404 sshd[2207740]: pam_unix(sshd:session): session closed for user gbs35 It seems like when I use my password, sshd or pam will spawn a new process that isn't the "adopted" one and the adopted process goes away. In this case my ssh session has full access to the node resources, which I'd like to avoid. I've tried messing with the /etc/pam.d/sshd file to remove most stuff from the "account" section, but couldn't find something that works. Everything seems fine for the ssh key based authentication. For the compute nodes we've turned off the pam_systemd stuff and the systemd-logind service is off and masked. It's odd to me that the adopting seems to work fine for ssh key based logins but not for password logins. In either case the pam_slurm_adopt seems to keep users without jobs on the system from logging in. Any other ideas on what I'm missing?
Hello, I am curious about what sort of authentication setup you may have for your cluster. When you use password sign in, is there some sort of two factor authentication involved? 2FA can cause issues with pam_slurm_adopt https://slurm.schedmd.com/pam_slurm_adopt.html#LIMITATIONS >Alternate authentication methods such as multi-factor authentication may break >process adoption with pam_slurm_adopt. Can you also attach here your pam configuration files? It would help for me to be able to look at those directly. Thanks so much, Michael Norris
We're using sssd for auth on these nodes, so it's pretty simple here. No two factor needed for these nodes. $ cat /etc/pam.d/sshd #%PAM-1.0 auth substack password-auth auth include postlogin account required pam_sepermit.so account required pam_nologin.so account [success=1 default=ignore] pam_succeed_if.so uid eq 0 account sufficient pam_slurm_adopt.so account include password-auth account required pam_access.so password include password-auth # pam_selinux.so close should be the first session rule session required pam_selinux.so close session required pam_loginuid.so # pam_selinux.so open should only be followed by sessions to be executed in the user context session required pam_selinux.so open env_params session required pam_namespace.so session optional pam_keyinit.so force revoke session optional pam_motd.so session include password-auth session include postlogin $ cat /etc/pam.d/password-auth # Generated by authselect on Wed Aug 25 15:51:54 2021 # Do not modify this file manually. auth required pam_env.so auth required pam_faildelay.so delay=2000000 auth [default=1 ignore=ignore success=ok] pam_usertype.so isregular auth [default=1 ignore=ignore success=ok] pam_localuser.so auth sufficient pam_unix.so try_first_pass auth [default=1 ignore=ignore success=ok] pam_usertype.so isregular auth sufficient pam_sss.so forward_pass auth required pam_deny.so account required pam_unix.so account required pam_access.so account sufficient pam_localuser.so account sufficient pam_usertype.so issystem account [default=bad success=ok user_unknown=ignore] pam_sss.so account required pam_permit.so password requisite pam_pwquality.so try_first_pass local_users_only password sufficient pam_unix.so sha512 shadow try_first_pass use_authtok password sufficient pam_sss.so use_authtok password required pam_deny.so session optional pam_keyinit.so revoke session required pam_limits.so #-session optional pam_systemd.so session [success=1 default=ignore] pam_succeed_if.so service in crond quiet use_uid session required pam_unix.so session optional pam_sss.so
I tried stripping out most of the "account" stuff from the sshd config leaving only the pam_slurm_adopt, but didn't have much luck finding something that worked.
Hello, First, would you be able to modify your /etc/pam.d/sshd file line from this: >account sufficient pam_slurm_adopt.so To this: >account sufficient pam_slurm_adopt.so log_level=debug5 And then once you have that up would you be able to attempt to use your interactive password login again and send me the logs? I just want to confirm the behavior. Secondly, unfortunately it appears that interactive password auth causes issues with pam_slurm_adopt due to how ssh and systemd currently work. There is a more detailed explanation on this ticket https://support.schedmd.com/show_bug.cgi?id=5920 >I can still see the benefits of splitting the module in account and session, for >example I know that when using keyboard-interactive autenticacion in ssh, a >subprocess of ssh is launched to manage the account modules, which causes >pam_slurm_adopt to adopt the wrong pid since when account modules actions >complete, the parent process is the one that ends up with the user, so adopting >in session is what must be done instead At this point I would recommend sticking with SSH key authentication for pam due to how interactive keyboard auth breaks process adoption. We will also look into updating the documentation to reflect this. Let me know if you have any further questions Michael Norris
Thanks for the reply. It turns out that sshd spawns a process to deal with the password stuff, so you're correct that the proper process doesn't get adopted in that case. An attempted patch at https://support.schedmd.com/show_bug.cgi?id=5026 shows an option to try to adopt the parent process too. This has mixed results if you also have people using ssh keys to login, as it tryes to include the sshd process in the cgroup which is bad when the job ends, since it kills sshd. I tried a slightly different patch: --- contribs/pam_slurm_adopt/pam_slurm_adopt.c.orig 2024-08-01 14:42:41.909048273 -0400 +++ contribs/pam_slurm_adopt/pam_slurm_adopt.c 2024-08-01 16:37:05.345886568 -0400 @@ -118,7 +118,10 @@ { int fd; uint16_t protocol_version; - int rc; + int rc,rc2; + pid_t ppid; + FILE *pidfile; + long sshd_pid; if (!stepd) return -1; @@ -135,6 +138,13 @@ rc = stepd_add_extern_pid(fd, stepd->protocol_version, pid); + rc2 = SLURM_ERROR; + ppid = getppid(); + pidfile = fopen("/var/run/sshd.pid","r"); + fscanf(pidfile, "%ld", &sshd_pid); + if ((pid_t) sshd_pid != ppid ) + rc2 = stepd_add_extern_pid(fd, stepd->protocol_version, ppid); + if (rc == SLURM_SUCCESS) { char *env; env = xstrdup_printf("SLURM_JOB_ID=%u", stepd->step_id.job_id); @@ -192,6 +202,13 @@ info("Process %d adoption FAILED for job %u", pid, stepd->step_id.job_id); + if (rc2 == SLURM_SUCCESS) + info("Process %d adopted into job %u", + ppid, stepd->step_id.job_id); + else + info("Process %d adoption SKIPPED for sshd server pid for job %u", + ppid, stepd->step_id.job_id); + return rc; } This tries to avoid including the main sshd process by checking /var/run/sshd.pid for the pid of that process. This would be quirky to use across different OS versions, since they don't all necessarily use the same pidfile location. Maybe it sould have better to check for the ppid of the ppid and skip addoption if the parent of the parent is 1.
Slightly different patch that checks the ppid of the ppid to make sure we don't try to adopt the sshd daemon rather than checking /var/run/sshd.pid which may not be standard. I feel like there are still a couple of loopholes in this one, but seems more sane than just not adopting password logins. This could solve the problem for people that require 2FA for all logins. I didn't try this with other pam modules, nor do we use systemd-login for the compute nodes. Thoughts? --- pam_slurm_adopt.c.orig 2024-04-16 15:10:24.000000000 -0400 +++ pam_slurm_adopt.c 2024-08-02 09:38:28.766709172 -0400 @@ -110,6 +110,39 @@ opts.join_container = true; } +#define MAXBUF (BUFSIZ * 2) +static pid_t _pgetppid(pid_t pid) { + int ppid; + char buf[MAXBUF]; + char procname[32]; + FILE *fp; + + snprintf(procname, sizeof(procname), "/proc/%u/status", (int) pid); + fp = fopen(procname, "r"); + + if (fp != NULL) { + size_t ret = fread(buf,sizeof(char), MAXBUF-1,fp); + if (!ret) { + return(0); + } else { + buf[ret++] = '\0'; + } + } else { + return(0); + } + fclose(fp); + char *ppid_loc = strstr(buf, "\nPPid:"); + if (ppid_loc) { + int ret = sscanf(ppid_loc, "\nPPid:%d", &ppid); + if (!ret || ret == EOF) { + return(0); + } + return((pid_t) ppid); + } else { + return(0); + } +} + /* Adopts a process into the given step. Returns SLURM_SUCCESS if * opts.action_adopt_failure == CALLERID_ACTION_ALLOW or if the process was * successfully adopted. @@ -118,7 +151,8 @@ { int fd; uint16_t protocol_version; - int rc; + int rc,rc2; + pid_t ppid,pppid; if (!stepd) return -1; @@ -135,6 +169,16 @@ rc = stepd_add_extern_pid(fd, stepd->protocol_version, pid); + rc2 = SLURM_ERROR; + ppid = getppid(); + + pppid = (pid_t) _pgetppid(ppid); + info("Process %d has ppid of %d", ppid, pppid); + if ((int) pppid != 1) + rc2 = stepd_add_extern_pid(fd, stepd->protocol_version, ppid); + else + info("Not adopting %d since it's the sshd server", ppid); + if (rc == SLURM_SUCCESS) { char *env; env = xstrdup_printf("SLURM_JOB_ID=%u", stepd->step_id.job_id); @@ -192,6 +236,13 @@ info("Process %d adoption FAILED for job %u", pid, stepd->step_id.job_id); + if (rc2 == SLURM_SUCCESS) + info("Process %d adopted into job %u", + ppid, stepd->step_id.job_id); + else + info("Process %d adoption SKIPPED for sshd server pid for job %u", + ppid, stepd->step_id.job_id); + return rc; }
(In reply to Gary Skouson from comment #10) > Slightly different patch that checks the ppid of the ppid to make sure we > don't try to adopt the sshd daemon rather than checking /var/run/sshd.pid > which may not be standard. I feel like there are still a couple of > loopholes in this one, but seems more sane than just not adopting password > logins. This could solve the problem for people that require 2FA for all > logins. I didn't try this with other pam modules, nor do we use > systemd-login for the compute nodes. > > Thoughts? > > --- pam_slurm_adopt.c.orig 2024-04-16 15:10:24.000000000 -0400 > +++ pam_slurm_adopt.c 2024-08-02 09:38:28.766709172 -0400 > @@ -110,6 +110,39 @@ > opts.join_container = true; > } > > +#define MAXBUF (BUFSIZ * 2) > +static pid_t _pgetppid(pid_t pid) { > + int ppid; > + char buf[MAXBUF]; > + char procname[32]; > + FILE *fp; > + > + snprintf(procname, sizeof(procname), "/proc/%u/status", (int) pid); > + fp = fopen(procname, "r"); > + > + if (fp != NULL) { > + size_t ret = fread(buf,sizeof(char), MAXBUF-1,fp); > + if (!ret) { > + return(0); > + } else { > + buf[ret++] = '\0'; > + } > + } else { > + return(0); > + } > + fclose(fp); > + char *ppid_loc = strstr(buf, "\nPPid:"); > + if (ppid_loc) { > + int ret = sscanf(ppid_loc, "\nPPid:%d", &ppid); > + if (!ret || ret == EOF) { > + return(0); > + } > + return((pid_t) ppid); > + } else { > + return(0); > + } > +} > + > /* Adopts a process into the given step. Returns SLURM_SUCCESS if > * opts.action_adopt_failure == CALLERID_ACTION_ALLOW or if the process was > * successfully adopted. > @@ -118,7 +151,8 @@ > { > int fd; > uint16_t protocol_version; > - int rc; > + int rc,rc2; > + pid_t ppid,pppid; > > if (!stepd) > return -1; > @@ -135,6 +169,16 @@ > > rc = stepd_add_extern_pid(fd, stepd->protocol_version, pid); > > + rc2 = SLURM_ERROR; > + ppid = getppid(); > + > + pppid = (pid_t) _pgetppid(ppid); > + info("Process %d has ppid of %d", ppid, pppid); > + if ((int) pppid != 1) > + rc2 = stepd_add_extern_pid(fd, stepd->protocol_version, > ppid); > + else > + info("Not adopting %d since it's the sshd server", ppid); > + > if (rc == SLURM_SUCCESS) { > char *env; > env = xstrdup_printf("SLURM_JOB_ID=%u", > stepd->step_id.job_id); > @@ -192,6 +236,13 @@ > info("Process %d adoption FAILED for job %u", > pid, stepd->step_id.job_id); > > + if (rc2 == SLURM_SUCCESS) > + info("Process %d adopted into job %u", > + ppid, stepd->step_id.job_id); > + else > + info("Process %d adoption SKIPPED for sshd server pid for > job %u", > + ppid, stepd->step_id.job_id); > + > return rc; > } Could you submit this patch file as an attachment to the ticket?
Created attachment 38139 [details] patch for pam_slurm_adopt to deal with additional parent procs in some cases. In some cases ssh sessions using the pam stack will fork and the process that pam_slurm_adopt adopts turns out to be a child process that disappears, leaving the actual ssh session unadopted into the session. This patch tries to adopt the original proc as before, but also checks the ppid to see if the ppid has a parent of 1 (init or systemd). If not it assumes that the parent also needs to be adopted into the job. This resolves some issues we've run into with keyboard-interactive ssh auth. This code requires that we get the ppid of the parent. The _pgetppid function is similar to other functions found online that parse the /proc/<pid>/status file to pull out the ppid. Other options were to use another lib with a function like get_proc_stats or shell escape, neither of which appealed to me at the time. While it worked for me, I didn't do an extensive review to check for bad return codes in all functions.
Created attachment 38319 [details] ticket 20569 24.11 (v1) Hi Gary, Thank you for looking at this. I've converted this into the patch form we usually work with and set you as the author based on the username here. I'm going to be taking a look at this as a potential contribution and will let you know as I make progress on that! Thanks! --Tim