Summary: | pam_slurm_adopt with keyboard-interactive authentication failing | ||
---|---|---|---|
Product: | Slurm | Reporter: | Gary Skouson <gbs35> |
Component: | Other | Assignee: | Tim McMullan <mcmullan> |
Status: | OPEN --- | QA Contact: | |
Severity: | C - Contributions | ||
Priority: | --- | CC: | foufou33, marshall |
Version: | 23.11.4 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | PSU | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | Google sites: | --- |
HPCnow Sites: | --- | HPE Sites: | --- |
IBM Sites: | --- | NOAA SIte: | --- |
NoveTech Sites: | --- | Nvidia HWinf-CS Sites: | --- |
OCF Sites: | --- | Recursion Pharma Sites: | --- |
SFW Sites: | --- | SNIC sites: | --- |
Tzag Elita Sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | Target Release: | --- | |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments: |
patch for pam_slurm_adopt to deal with additional parent procs in some cases.
ticket 20569 24.11 (v1) |
Description
Gary Skouson
2024-07-31 12:57:48 MDT
Hello, I am curious about what sort of authentication setup you may have for your cluster. When you use password sign in, is there some sort of two factor authentication involved? 2FA can cause issues with pam_slurm_adopt https://slurm.schedmd.com/pam_slurm_adopt.html#LIMITATIONS >Alternate authentication methods such as multi-factor authentication may break >process adoption with pam_slurm_adopt. Can you also attach here your pam configuration files? It would help for me to be able to look at those directly. Thanks so much, Michael Norris We're using sssd for auth on these nodes, so it's pretty simple here. No two factor needed for these nodes. $ cat /etc/pam.d/sshd #%PAM-1.0 auth substack password-auth auth include postlogin account required pam_sepermit.so account required pam_nologin.so account [success=1 default=ignore] pam_succeed_if.so uid eq 0 account sufficient pam_slurm_adopt.so account include password-auth account required pam_access.so password include password-auth # pam_selinux.so close should be the first session rule session required pam_selinux.so close session required pam_loginuid.so # pam_selinux.so open should only be followed by sessions to be executed in the user context session required pam_selinux.so open env_params session required pam_namespace.so session optional pam_keyinit.so force revoke session optional pam_motd.so session include password-auth session include postlogin $ cat /etc/pam.d/password-auth # Generated by authselect on Wed Aug 25 15:51:54 2021 # Do not modify this file manually. auth required pam_env.so auth required pam_faildelay.so delay=2000000 auth [default=1 ignore=ignore success=ok] pam_usertype.so isregular auth [default=1 ignore=ignore success=ok] pam_localuser.so auth sufficient pam_unix.so try_first_pass auth [default=1 ignore=ignore success=ok] pam_usertype.so isregular auth sufficient pam_sss.so forward_pass auth required pam_deny.so account required pam_unix.so account required pam_access.so account sufficient pam_localuser.so account sufficient pam_usertype.so issystem account [default=bad success=ok user_unknown=ignore] pam_sss.so account required pam_permit.so password requisite pam_pwquality.so try_first_pass local_users_only password sufficient pam_unix.so sha512 shadow try_first_pass use_authtok password sufficient pam_sss.so use_authtok password required pam_deny.so session optional pam_keyinit.so revoke session required pam_limits.so #-session optional pam_systemd.so session [success=1 default=ignore] pam_succeed_if.so service in crond quiet use_uid session required pam_unix.so session optional pam_sss.so I tried stripping out most of the "account" stuff from the sshd config leaving only the pam_slurm_adopt, but didn't have much luck finding something that worked. Hello, First, would you be able to modify your /etc/pam.d/sshd file line from this: >account sufficient pam_slurm_adopt.so To this: >account sufficient pam_slurm_adopt.so log_level=debug5 And then once you have that up would you be able to attempt to use your interactive password login again and send me the logs? I just want to confirm the behavior. Secondly, unfortunately it appears that interactive password auth causes issues with pam_slurm_adopt due to how ssh and systemd currently work. There is a more detailed explanation on this ticket https://support.schedmd.com/show_bug.cgi?id=5920 >I can still see the benefits of splitting the module in account and session, for >example I know that when using keyboard-interactive autenticacion in ssh, a >subprocess of ssh is launched to manage the account modules, which causes >pam_slurm_adopt to adopt the wrong pid since when account modules actions >complete, the parent process is the one that ends up with the user, so adopting >in session is what must be done instead At this point I would recommend sticking with SSH key authentication for pam due to how interactive keyboard auth breaks process adoption. We will also look into updating the documentation to reflect this. Let me know if you have any further questions Michael Norris Thanks for the reply. It turns out that sshd spawns a process to deal with the password stuff, so you're correct that the proper process doesn't get adopted in that case. An attempted patch at https://support.schedmd.com/show_bug.cgi?id=5026 shows an option to try to adopt the parent process too. This has mixed results if you also have people using ssh keys to login, as it tryes to include the sshd process in the cgroup which is bad when the job ends, since it kills sshd. I tried a slightly different patch: --- contribs/pam_slurm_adopt/pam_slurm_adopt.c.orig 2024-08-01 14:42:41.909048273 -0400 +++ contribs/pam_slurm_adopt/pam_slurm_adopt.c 2024-08-01 16:37:05.345886568 -0400 @@ -118,7 +118,10 @@ { int fd; uint16_t protocol_version; - int rc; + int rc,rc2; + pid_t ppid; + FILE *pidfile; + long sshd_pid; if (!stepd) return -1; @@ -135,6 +138,13 @@ rc = stepd_add_extern_pid(fd, stepd->protocol_version, pid); + rc2 = SLURM_ERROR; + ppid = getppid(); + pidfile = fopen("/var/run/sshd.pid","r"); + fscanf(pidfile, "%ld", &sshd_pid); + if ((pid_t) sshd_pid != ppid ) + rc2 = stepd_add_extern_pid(fd, stepd->protocol_version, ppid); + if (rc == SLURM_SUCCESS) { char *env; env = xstrdup_printf("SLURM_JOB_ID=%u", stepd->step_id.job_id); @@ -192,6 +202,13 @@ info("Process %d adoption FAILED for job %u", pid, stepd->step_id.job_id); + if (rc2 == SLURM_SUCCESS) + info("Process %d adopted into job %u", + ppid, stepd->step_id.job_id); + else + info("Process %d adoption SKIPPED for sshd server pid for job %u", + ppid, stepd->step_id.job_id); + return rc; } This tries to avoid including the main sshd process by checking /var/run/sshd.pid for the pid of that process. This would be quirky to use across different OS versions, since they don't all necessarily use the same pidfile location. Maybe it sould have better to check for the ppid of the ppid and skip addoption if the parent of the parent is 1. Slightly different patch that checks the ppid of the ppid to make sure we don't try to adopt the sshd daemon rather than checking /var/run/sshd.pid which may not be standard. I feel like there are still a couple of loopholes in this one, but seems more sane than just not adopting password logins. This could solve the problem for people that require 2FA for all logins. I didn't try this with other pam modules, nor do we use systemd-login for the compute nodes. Thoughts? --- pam_slurm_adopt.c.orig 2024-04-16 15:10:24.000000000 -0400 +++ pam_slurm_adopt.c 2024-08-02 09:38:28.766709172 -0400 @@ -110,6 +110,39 @@ opts.join_container = true; } +#define MAXBUF (BUFSIZ * 2) +static pid_t _pgetppid(pid_t pid) { + int ppid; + char buf[MAXBUF]; + char procname[32]; + FILE *fp; + + snprintf(procname, sizeof(procname), "/proc/%u/status", (int) pid); + fp = fopen(procname, "r"); + + if (fp != NULL) { + size_t ret = fread(buf,sizeof(char), MAXBUF-1,fp); + if (!ret) { + return(0); + } else { + buf[ret++] = '\0'; + } + } else { + return(0); + } + fclose(fp); + char *ppid_loc = strstr(buf, "\nPPid:"); + if (ppid_loc) { + int ret = sscanf(ppid_loc, "\nPPid:%d", &ppid); + if (!ret || ret == EOF) { + return(0); + } + return((pid_t) ppid); + } else { + return(0); + } +} + /* Adopts a process into the given step. Returns SLURM_SUCCESS if * opts.action_adopt_failure == CALLERID_ACTION_ALLOW or if the process was * successfully adopted. @@ -118,7 +151,8 @@ { int fd; uint16_t protocol_version; - int rc; + int rc,rc2; + pid_t ppid,pppid; if (!stepd) return -1; @@ -135,6 +169,16 @@ rc = stepd_add_extern_pid(fd, stepd->protocol_version, pid); + rc2 = SLURM_ERROR; + ppid = getppid(); + + pppid = (pid_t) _pgetppid(ppid); + info("Process %d has ppid of %d", ppid, pppid); + if ((int) pppid != 1) + rc2 = stepd_add_extern_pid(fd, stepd->protocol_version, ppid); + else + info("Not adopting %d since it's the sshd server", ppid); + if (rc == SLURM_SUCCESS) { char *env; env = xstrdup_printf("SLURM_JOB_ID=%u", stepd->step_id.job_id); @@ -192,6 +236,13 @@ info("Process %d adoption FAILED for job %u", pid, stepd->step_id.job_id); + if (rc2 == SLURM_SUCCESS) + info("Process %d adopted into job %u", + ppid, stepd->step_id.job_id); + else + info("Process %d adoption SKIPPED for sshd server pid for job %u", + ppid, stepd->step_id.job_id); + return rc; } (In reply to Gary Skouson from comment #10) > Slightly different patch that checks the ppid of the ppid to make sure we > don't try to adopt the sshd daemon rather than checking /var/run/sshd.pid > which may not be standard. I feel like there are still a couple of > loopholes in this one, but seems more sane than just not adopting password > logins. This could solve the problem for people that require 2FA for all > logins. I didn't try this with other pam modules, nor do we use > systemd-login for the compute nodes. > > Thoughts? > > --- pam_slurm_adopt.c.orig 2024-04-16 15:10:24.000000000 -0400 > +++ pam_slurm_adopt.c 2024-08-02 09:38:28.766709172 -0400 > @@ -110,6 +110,39 @@ > opts.join_container = true; > } > > +#define MAXBUF (BUFSIZ * 2) > +static pid_t _pgetppid(pid_t pid) { > + int ppid; > + char buf[MAXBUF]; > + char procname[32]; > + FILE *fp; > + > + snprintf(procname, sizeof(procname), "/proc/%u/status", (int) pid); > + fp = fopen(procname, "r"); > + > + if (fp != NULL) { > + size_t ret = fread(buf,sizeof(char), MAXBUF-1,fp); > + if (!ret) { > + return(0); > + } else { > + buf[ret++] = '\0'; > + } > + } else { > + return(0); > + } > + fclose(fp); > + char *ppid_loc = strstr(buf, "\nPPid:"); > + if (ppid_loc) { > + int ret = sscanf(ppid_loc, "\nPPid:%d", &ppid); > + if (!ret || ret == EOF) { > + return(0); > + } > + return((pid_t) ppid); > + } else { > + return(0); > + } > +} > + > /* Adopts a process into the given step. Returns SLURM_SUCCESS if > * opts.action_adopt_failure == CALLERID_ACTION_ALLOW or if the process was > * successfully adopted. > @@ -118,7 +151,8 @@ > { > int fd; > uint16_t protocol_version; > - int rc; > + int rc,rc2; > + pid_t ppid,pppid; > > if (!stepd) > return -1; > @@ -135,6 +169,16 @@ > > rc = stepd_add_extern_pid(fd, stepd->protocol_version, pid); > > + rc2 = SLURM_ERROR; > + ppid = getppid(); > + > + pppid = (pid_t) _pgetppid(ppid); > + info("Process %d has ppid of %d", ppid, pppid); > + if ((int) pppid != 1) > + rc2 = stepd_add_extern_pid(fd, stepd->protocol_version, > ppid); > + else > + info("Not adopting %d since it's the sshd server", ppid); > + > if (rc == SLURM_SUCCESS) { > char *env; > env = xstrdup_printf("SLURM_JOB_ID=%u", > stepd->step_id.job_id); > @@ -192,6 +236,13 @@ > info("Process %d adoption FAILED for job %u", > pid, stepd->step_id.job_id); > > + if (rc2 == SLURM_SUCCESS) > + info("Process %d adopted into job %u", > + ppid, stepd->step_id.job_id); > + else > + info("Process %d adoption SKIPPED for sshd server pid for > job %u", > + ppid, stepd->step_id.job_id); > + > return rc; > } Could you submit this patch file as an attachment to the ticket? Created attachment 38139 [details]
patch for pam_slurm_adopt to deal with additional parent procs in some cases.
In some cases ssh sessions using the pam stack will fork and the process that pam_slurm_adopt adopts turns out to be a child process that disappears, leaving the actual ssh session unadopted into the session.
This patch tries to adopt the original proc as before, but also checks the ppid to see if the ppid has a parent of 1 (init or systemd). If not it assumes that the parent also needs to be adopted into the job.
This resolves some issues we've run into with keyboard-interactive ssh auth.
This code requires that we get the ppid of the parent. The _pgetppid function is similar to other functions found online that parse the /proc/<pid>/status file to pull out the ppid. Other options were to use another lib with a function like get_proc_stats or shell escape, neither of which appealed to me at the time.
While it worked for me, I didn't do an extensive review to check for bad return codes in all functions.
Created attachment 38319 [details] ticket 20569 24.11 (v1) Hi Gary, Thank you for looking at this. I've converted this into the patch form we usually work with and set you as the author based on the username here. I'm going to be taking a look at this as a potential contribution and will let you know as I make progress on that! Thanks! --Tim |