Ticket 20569 - pam_slurm_adopt with keyboard-interactive authentication failing
Summary: pam_slurm_adopt with keyboard-interactive authentication failing
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other (show other tickets)
Version: 23.11.4
Hardware: Linux Linux
: C - Contributions
Assignee: Tim McMullan
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2024-07-31 12:57 MDT by Gary Skouson
Modified: 2024-10-31 17:30 MDT (History)
2 users (show)

See Also:
Site: PSU
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
patch for pam_slurm_adopt to deal with additional parent procs in some cases. (1.92 KB, patch)
2024-08-02 09:43 MDT, Gary Skouson
Details | Diff
ticket 20569 24.11 (v1) (2.65 KB, patch)
2024-08-12 09:55 MDT, Tim McMullan
Details | Diff

Note You need to log in before you can comment on or make changes to this ticket.
Description Gary Skouson 2024-07-31 12:57:48 MDT
When using pam_slurm_adopt, if users have SSH keys configured, my ssh session is adopted into the job cgroup:
$ ssh p-sc-2404
Last login: Wed Jul 31 14:37:55 2024 from 146.186.150.15
[gbs35@p-sc-2404 ~]$ cat /proc/self/cgroup 
12:pids:/system.slice/sshd.service
11:cpuset:/slurm/uid_767041/job_17386189/step_extern
10:blkio:/system.slice/sshd.service
9:perf_event:/
8:memory:/slurm/uid_767041/job_17386189/step_extern/task_0
7:hugetlb:/
6:net_cls,net_prio:/
5:rdma:/
4:devices:/slurm/uid_767041/job_17386189/step_extern
3:cpu,cpuacct:/slurm/uid_767041/job_17386189/step_extern/task_0
2:freezer:/slurm/uid_767041/job_17386189/step_extern
1:name=systemd:/system.slice/sshd.service

However, if I move the authorized_keys file, a password is required and my session isn't part of the job:

[gbs35@p-sc-2404 ~]$ mv .ssh/authorized_keys .ssh/authorized_keys.save
[gbs35@p-sc-2404 ~]$ exit
logout
Connection to p-sc-2404 closed.
[gbs35@submit05 .ssh]$ ssh p-sc-2404
Password: 
Last login: Wed Jul 31 14:43:48 2024 from 146.186.150.15
[gbs35@p-sc-2404 ~]$ cat /proc/self/cgroup 
12:pids:/system.slice/sshd.service
11:cpuset:/
10:blkio:/system.slice/sshd.service
9:perf_event:/
8:memory:/system.slice/sshd.service
7:hugetlb:/
6:net_cls,net_prio:/
5:rdma:/
4:devices:/system.slice/sshd.service
3:cpu,cpuacct:/system.slice/sshd.service
2:freezer:/
1:name=systemd:/system.slice/sshd.service

Looking at /var/log/secure for the ssh logs I see:

Jul 31 14:47:40 p-sc-2404 sshd[2207654]: pam_succeed_if(sshd:account): requirement "uid eq 0" not met by user "gbs35"
Jul 31 14:47:40 p-sc-2404 sshd[2207654]: pam_slurm_adopt(sshd:account): ignoring unrecognized option 'debug'
Jul 31 14:47:40 p-sc-2404 pam_slurm_adopt[2207654]: Connection by user gbs35: user has only one job 17386189
Jul 31 14:47:40 p-sc-2404 pam_slurm_adopt[2207654]: Process 2207654 adopted into job 17386189
Jul 31 14:47:40 p-sc-2404 sshd[2207654]: Accepted publickey for gbs35 from 146.186.150.15 port 37906 ssh2: RSA SHA256:CbliA1NrKqc/o1bSwSQq9uVMisMJB5J/xKMw1lxiIQ0
Jul 31 14:47:40 p-sc-2404 sshd[2207654]: pam_unix(sshd:session): session opened for user gbs35 by (uid=0)
Jul 31 14:47:46 p-sc-2404 sshd[2207659]: Received disconnect from 146.186.150.15 port 37906:11: disconnected by user
Jul 31 14:47:46 p-sc-2404 sshd[2207659]: Disconnected from user gbs35 146.186.150.15 port 37906
Jul 31 14:47:46 p-sc-2404 sshd[2207654]: pam_unix(sshd:session): session closed for user gbs35
Jul 31 14:47:51 p-sc-2404 sshd[2207742]: pam_sss(sshd:auth): authentication success; logname= uid=0 euid=0 tty=ssh ruser= rhost=146.186.150.15 user=gbs35
Jul 31 14:47:51 p-sc-2404 sshd[2207742]: pam_succeed_if(sshd:account): requirement "uid eq 0" not met by user "gbs35"
Jul 31 14:47:51 p-sc-2404 sshd[2207742]: pam_slurm_adopt(sshd:account): ignoring unrecognized option 'debug'
Jul 31 14:47:51 p-sc-2404 pam_slurm_adopt[2207742]: Connection by user gbs35: user has only one job 17386189
Jul 31 14:47:51 p-sc-2404 pam_slurm_adopt[2207742]: Process 2207742 adopted into job 17386189
Jul 31 14:47:51 p-sc-2404 sshd[2207740]: Accepted keyboard-interactive/pam for gbs35 from 146.186.150.15 port 57622 ssh2
Jul 31 14:47:51 p-sc-2404 sshd[2207740]: pam_unix(sshd:session): session opened for user gbs35 by (uid=0)
Jul 31 14:47:59 p-sc-2404 sshd[2207756]: Received disconnect from 146.186.150.15 port 57622:11: disconnected by user
Jul 31 14:47:59 p-sc-2404 sshd[2207756]: Disconnected from user gbs35 146.186.150.15 port 57622
Jul 31 14:47:59 p-sc-2404 sshd[2207740]: pam_unix(sshd:session): session closed for user gbs35


It seems like when I use my password, sshd or pam will spawn a new process that isn't the "adopted" one and the adopted process goes away.  In this case my ssh session has full access to the node resources, which I'd like to avoid.

I've tried messing with the /etc/pam.d/sshd file to remove most stuff from the "account" section, but couldn't find something that works.  Everything seems fine for the ssh key based authentication.

For the compute nodes we've turned off the pam_systemd stuff and the systemd-logind service is off and masked.

It's odd to me that the adopting seems to work fine for ssh key based logins but not for password logins. In either case the pam_slurm_adopt seems to keep users without jobs on the system from logging in.  

Any other ideas on what I'm missing?
Comment 5 Michael Norris 2024-08-01 10:48:23 MDT
Hello,

I am curious about what sort of authentication setup you may have for your cluster. When you use password sign in, is there some sort of two factor authentication involved? 2FA can cause issues with pam_slurm_adopt
https://slurm.schedmd.com/pam_slurm_adopt.html#LIMITATIONS
>Alternate authentication methods such as multi-factor authentication may break >process adoption with pam_slurm_adopt.

Can you also attach here your pam configuration files? It would help for me to be able to look at those directly.

Thanks so much,
Michael Norris
Comment 6 Gary Skouson 2024-08-01 11:10:34 MDT
We're using sssd for auth on these nodes, so it's pretty simple here. No two factor needed for these nodes.

$ cat /etc/pam.d/sshd 
#%PAM-1.0
auth       substack     password-auth
auth       include      postlogin
account    required     pam_sepermit.so
account    required     pam_nologin.so
account  [success=1 default=ignore] pam_succeed_if.so uid eq 0
account    sufficient   pam_slurm_adopt.so
account    include      password-auth
account    required     pam_access.so
password   include      password-auth
# pam_selinux.so close should be the first session rule
session    required     pam_selinux.so close
session    required     pam_loginuid.so
# pam_selinux.so open should only be followed by sessions to be executed in the user context
session    required     pam_selinux.so open env_params
session    required     pam_namespace.so
session    optional     pam_keyinit.so force revoke
session    optional     pam_motd.so
session    include      password-auth
session    include      postlogin

$ cat /etc/pam.d/password-auth 
# Generated by authselect on Wed Aug 25 15:51:54 2021
# Do not modify this file manually.

auth        required                                     pam_env.so
auth        required                                     pam_faildelay.so delay=2000000
auth        [default=1 ignore=ignore success=ok]         pam_usertype.so isregular
auth        [default=1 ignore=ignore success=ok]         pam_localuser.so
auth        sufficient                                   pam_unix.so try_first_pass
auth        [default=1 ignore=ignore success=ok]         pam_usertype.so isregular
auth        sufficient                                   pam_sss.so forward_pass
auth        required                                     pam_deny.so

account     required                                     pam_unix.so
account     required                                     pam_access.so
account     sufficient                                   pam_localuser.so
account     sufficient                                   pam_usertype.so issystem
account     [default=bad success=ok user_unknown=ignore] pam_sss.so
account     required                                     pam_permit.so

password    requisite                                    pam_pwquality.so try_first_pass local_users_only
password    sufficient                                   pam_unix.so sha512 shadow try_first_pass use_authtok
password    sufficient                                   pam_sss.so use_authtok
password    required                                     pam_deny.so

session     optional                                     pam_keyinit.so revoke
session     required                                     pam_limits.so
#-session    optional                                     pam_systemd.so
session     [success=1 default=ignore]                   pam_succeed_if.so service in crond quiet use_uid
session     required                                     pam_unix.so
session     optional                                     pam_sss.so
Comment 7 Gary Skouson 2024-08-01 11:13:18 MDT
I tried stripping out most of the "account" stuff from the sshd config leaving only the pam_slurm_adopt, but didn't have much luck finding something that worked.
Comment 8 Michael Norris 2024-08-01 14:07:34 MDT
Hello,

First, would you be able to modify your /etc/pam.d/sshd file line from this:
>account    sufficient   pam_slurm_adopt.so
To this:
>account    sufficient   pam_slurm_adopt.so log_level=debug5

And then once you have that up would you be able to attempt to use your interactive password login again and send me the logs? I just want to confirm the behavior.

Secondly, unfortunately it appears that interactive password auth causes issues with pam_slurm_adopt due to how ssh and systemd currently work. There is a more detailed explanation on this ticket
https://support.schedmd.com/show_bug.cgi?id=5920
>I can still see the benefits of splitting the module in account and session, for >example I know that when using keyboard-interactive autenticacion in ssh, a >subprocess of ssh is launched to manage the account modules, which causes >pam_slurm_adopt to adopt the wrong pid since when account modules actions >complete, the parent process is the one that ends up with the user, so adopting >in session is what must be done instead

At this point I would recommend sticking with SSH key authentication for pam due to how interactive keyboard auth breaks process adoption. We will also look into updating the documentation to reflect this. 

Let me know if you have any further questions
Michael Norris
Comment 9 Gary Skouson 2024-08-01 14:56:28 MDT
Thanks for the reply.  It turns out that sshd spawns a process to deal with the password stuff, so you're correct that the proper process doesn't get adopted in that case.  An attempted patch at https://support.schedmd.com/show_bug.cgi?id=5026  shows an option to try to adopt the parent process too.  This has mixed results if you also have people using ssh keys to login, as it tryes to include the sshd process in the cgroup which is bad when the job ends, since it kills sshd.

I tried a slightly different patch:

--- contribs/pam_slurm_adopt/pam_slurm_adopt.c.orig     2024-08-01 14:42:41.909048273 -0400
+++ contribs/pam_slurm_adopt/pam_slurm_adopt.c  2024-08-01 16:37:05.345886568 -0400
@@ -118,7 +118,10 @@
 {
        int fd;
        uint16_t protocol_version;
-       int rc;
+       int rc,rc2;
+       pid_t ppid;
+       FILE *pidfile;
+       long sshd_pid;
 
        if (!stepd)
                return -1;
@@ -135,6 +138,13 @@
 
        rc = stepd_add_extern_pid(fd, stepd->protocol_version, pid);
 
+       rc2 = SLURM_ERROR;
+       ppid = getppid();
+       pidfile = fopen("/var/run/sshd.pid","r");
+       fscanf(pidfile, "%ld", &sshd_pid);
+       if ((pid_t) sshd_pid != ppid )
+               rc2 = stepd_add_extern_pid(fd, stepd->protocol_version, ppid);
+
        if (rc == SLURM_SUCCESS) {
                char *env;
                env = xstrdup_printf("SLURM_JOB_ID=%u", stepd->step_id.job_id);
@@ -192,6 +202,13 @@
                info("Process %d adoption FAILED for job %u",
                     pid, stepd->step_id.job_id);
 
+       if (rc2 == SLURM_SUCCESS)
+               info("Process %d adopted into job %u",
+                    ppid, stepd->step_id.job_id);
+       else
+               info("Process %d adoption SKIPPED for sshd server pid for job %u",
+                    ppid, stepd->step_id.job_id);
+
        return rc;
 }
 
This tries to avoid including the main sshd process by checking /var/run/sshd.pid for the pid of that process.  This would be quirky to use across different OS versions, since they don't all necessarily use the same pidfile location.  Maybe it sould have better to check for the ppid of the ppid and skip addoption if the parent of the parent is 1.
Comment 10 Gary Skouson 2024-08-02 07:42:27 MDT
Slightly different patch that checks the ppid of the ppid to make sure we don't try to adopt the sshd daemon rather than checking /var/run/sshd.pid which may not be standard.  I feel like there are still a couple of loopholes in this one, but seems more sane than just not adopting password logins.  This could solve the problem for people that require 2FA for all logins.  I didn't try this with other pam modules, nor do we use systemd-login for the compute nodes.

Thoughts?

--- pam_slurm_adopt.c.orig      2024-04-16 15:10:24.000000000 -0400
+++ pam_slurm_adopt.c   2024-08-02 09:38:28.766709172 -0400
@@ -110,6 +110,39 @@
        opts.join_container = true;
 }
 
+#define MAXBUF (BUFSIZ * 2)
+static pid_t _pgetppid(pid_t pid) {
+       int ppid;
+       char buf[MAXBUF];
+       char procname[32];
+       FILE *fp;
+
+       snprintf(procname, sizeof(procname), "/proc/%u/status", (int) pid);
+       fp = fopen(procname, "r");
+
+       if (fp != NULL) {
+               size_t ret = fread(buf,sizeof(char), MAXBUF-1,fp);
+               if (!ret) {
+                       return(0);
+               } else {
+                       buf[ret++] = '\0';
+               }
+       } else {
+               return(0);
+       }
+       fclose(fp);
+       char *ppid_loc = strstr(buf, "\nPPid:");
+       if (ppid_loc) {
+               int ret = sscanf(ppid_loc, "\nPPid:%d", &ppid);
+               if (!ret || ret == EOF) {
+                       return(0);
+               }
+               return((pid_t) ppid);
+       } else {
+               return(0);
+       }
+}
+
 /* Adopts a process into the given step. Returns SLURM_SUCCESS if
  * opts.action_adopt_failure == CALLERID_ACTION_ALLOW or if the process was
  * successfully adopted.
@@ -118,7 +151,8 @@
 {
        int fd;
        uint16_t protocol_version;
-       int rc;
+       int rc,rc2;
+       pid_t ppid,pppid;
 
        if (!stepd)
                return -1;
@@ -135,6 +169,16 @@
 
        rc = stepd_add_extern_pid(fd, stepd->protocol_version, pid);
 
+       rc2 = SLURM_ERROR;
+       ppid = getppid();
+
+       pppid = (pid_t) _pgetppid(ppid);
+       info("Process %d has ppid of %d", ppid, pppid);
+       if ((int) pppid != 1)
+               rc2 = stepd_add_extern_pid(fd, stepd->protocol_version, ppid);
+       else 
+               info("Not adopting %d since it's the sshd server", ppid);
+
        if (rc == SLURM_SUCCESS) {
                char *env;
                env = xstrdup_printf("SLURM_JOB_ID=%u", stepd->step_id.job_id);
@@ -192,6 +236,13 @@
                info("Process %d adoption FAILED for job %u",
                     pid, stepd->step_id.job_id);
 
+       if (rc2 == SLURM_SUCCESS)
+               info("Process %d adopted into job %u",
+                    ppid, stepd->step_id.job_id);
+       else
+               info("Process %d adoption SKIPPED for sshd server pid for job %u",
+                    ppid, stepd->step_id.job_id);
+
        return rc;
 }
Comment 11 Michael Norris 2024-08-02 09:03:39 MDT
(In reply to Gary Skouson from comment #10)
> Slightly different patch that checks the ppid of the ppid to make sure we
> don't try to adopt the sshd daemon rather than checking /var/run/sshd.pid
> which may not be standard.  I feel like there are still a couple of
> loopholes in this one, but seems more sane than just not adopting password
> logins.  This could solve the problem for people that require 2FA for all
> logins.  I didn't try this with other pam modules, nor do we use
> systemd-login for the compute nodes.
> 
> Thoughts?
> 
> --- pam_slurm_adopt.c.orig      2024-04-16 15:10:24.000000000 -0400
> +++ pam_slurm_adopt.c   2024-08-02 09:38:28.766709172 -0400
> @@ -110,6 +110,39 @@
>         opts.join_container = true;
>  }
>  
> +#define MAXBUF (BUFSIZ * 2)
> +static pid_t _pgetppid(pid_t pid) {
> +       int ppid;
> +       char buf[MAXBUF];
> +       char procname[32];
> +       FILE *fp;
> +
> +       snprintf(procname, sizeof(procname), "/proc/%u/status", (int) pid);
> +       fp = fopen(procname, "r");
> +
> +       if (fp != NULL) {
> +               size_t ret = fread(buf,sizeof(char), MAXBUF-1,fp);
> +               if (!ret) {
> +                       return(0);
> +               } else {
> +                       buf[ret++] = '\0';
> +               }
> +       } else {
> +               return(0);
> +       }
> +       fclose(fp);
> +       char *ppid_loc = strstr(buf, "\nPPid:");
> +       if (ppid_loc) {
> +               int ret = sscanf(ppid_loc, "\nPPid:%d", &ppid);
> +               if (!ret || ret == EOF) {
> +                       return(0);
> +               }
> +               return((pid_t) ppid);
> +       } else {
> +               return(0);
> +       }
> +}
> +
>  /* Adopts a process into the given step. Returns SLURM_SUCCESS if
>   * opts.action_adopt_failure == CALLERID_ACTION_ALLOW or if the process was
>   * successfully adopted.
> @@ -118,7 +151,8 @@
>  {
>         int fd;
>         uint16_t protocol_version;
> -       int rc;
> +       int rc,rc2;
> +       pid_t ppid,pppid;
>  
>         if (!stepd)
>                 return -1;
> @@ -135,6 +169,16 @@
>  
>         rc = stepd_add_extern_pid(fd, stepd->protocol_version, pid);
>  
> +       rc2 = SLURM_ERROR;
> +       ppid = getppid();
> +
> +       pppid = (pid_t) _pgetppid(ppid);
> +       info("Process %d has ppid of %d", ppid, pppid);
> +       if ((int) pppid != 1)
> +               rc2 = stepd_add_extern_pid(fd, stepd->protocol_version,
> ppid);
> +       else 
> +               info("Not adopting %d since it's the sshd server", ppid);
> +
>         if (rc == SLURM_SUCCESS) {
>                 char *env;
>                 env = xstrdup_printf("SLURM_JOB_ID=%u",
> stepd->step_id.job_id);
> @@ -192,6 +236,13 @@
>                 info("Process %d adoption FAILED for job %u",
>                      pid, stepd->step_id.job_id);
>  
> +       if (rc2 == SLURM_SUCCESS)
> +               info("Process %d adopted into job %u",
> +                    ppid, stepd->step_id.job_id);
> +       else
> +               info("Process %d adoption SKIPPED for sshd server pid for
> job %u",
> +                    ppid, stepd->step_id.job_id);
> +
>         return rc;
>  }

Could you submit this patch file as an attachment to the ticket?
Comment 12 Gary Skouson 2024-08-02 09:43:17 MDT
Created attachment 38139 [details]
patch for pam_slurm_adopt to deal with additional parent procs in some cases.

In some cases ssh sessions using the pam stack will fork and the process that pam_slurm_adopt adopts turns out to be a child process that disappears, leaving the actual ssh session unadopted into the session.

This patch tries to adopt the original proc as before, but also checks the ppid to see if the ppid has a parent of 1 (init or systemd).  If not it assumes that the parent also needs to be adopted into the job.

This resolves some issues we've run into with keyboard-interactive ssh auth.

This code requires that we get the ppid of the parent.  The _pgetppid function is similar to other functions found online that parse the /proc/<pid>/status file to pull out the ppid.  Other options were to use another lib with a function like get_proc_stats or shell escape, neither of which appealed to me at the time.

While it worked for me, I didn't do an extensive review to check for bad return codes in all functions.
Comment 13 Tim McMullan 2024-08-12 09:55:43 MDT
Created attachment 38319 [details]
ticket 20569 24.11 (v1)

Hi Gary,

Thank you for looking at this.  I've converted this into the patch form we usually work with and set you as the author based on the username here.

I'm going to be taking a look at this as a potential contribution and will let you know as I make progress on that!

Thanks!
--Tim