A couple of times now, people have (inadvertently) used --enable-multiple-slurmd and then pam_slurm_adopt didn't work because they didn't have the nodename parameter. nodename was never documented, even in the README, but it's time to document it in the pam_slurm_adopt.shtml page.
See also bug 5380, bug 5026.
Created attachment 7299 [details] Document the pam_slurm_adopt nodename parameter. In my testing, this is how it is, whether or not you're using multiple-slurmd - you need to use nodename if the node hostname doesn't match the NodeName in slurm.conf. NEWS: does there need to be a NEWS entry for a doc patch? Commit message: Document the pam_slurm_adopt nodename parameter.
Brian, can you review this patch? Alex, I'm also adding you to CC since on bug 5380 you asked me to test out nodename some more. I found that even when I wasn't using multiple-slurmd, if the hostname didn't match the slurm.conf NodeName, I needed to use nodename in pam_slurm_adopt. Can you verify? I tested that by using a virtual machine as a node, with a different hostname than the host OS (host OS hostname is voyager, VM hostname is voyager2), and the NodeName in slurm.conf different than that. It works correctly if I use nodename in pam_slurm_adopt.
Created attachment 7300 [details] Send an error message when pam_slurm_adopt can't find nodename. Brian, can you also review this one? Commit message: When nodename can't be determined, pam_slurm_adopt prints and error to the syslog, then exits with whatever action_generic_failure says ("ignore" by default). No error messages are sent to the shell trying to ssh in. For example, if you get denied (action_generic_failure=deny), then all you see is "Authentication failed." Or if you succeed, you get in without being part of a cgroup, but don't know that anything went wrong. This patch sends an error message to the ssh shell so that the user knows if something went wrong, and thus can investigate the cause by looking at syslog (assuming they've enabled PAM debugging in syslog or elsewhere). It also adds an error message for when nodename isn't found.
I actually have yet another question on this: If the slurmdspooldir can't be read for some reason, it returns a non-NULL but empty list of steps. Then, the user would get denied access with the message, "Access denied by pam_slurm_adopt: you have no active jobs on this node." This is the wrong message. I almost included a patch that calls FREE_NULL_LIST(l) in each spot before returning (in stepd_available() (step_api.c). However, there are a number of places that don't handle stepd_available() returning NULL. That's a bug even currently, in my opinion. I'd like to fix all these places, then add in the FREE_NULL_LIST(l) at each error in stepd_available(). What are your thoughts?
Removing Brian from CC, adding Tim.
Ping. See also bug 5490 - make everywhere check return code of stepd_available(). (It's not done, I'm just linking it.)
Comment on attachment 7299 [details] Document the pam_slurm_adopt nodename parameter. Thanks Marshall, this one is in: commit 6d6ecacb7ccec2d40df1b643ef106f4c079bcf0e Author: Marshall Garey <marshall@schedmd.com> Date: Wed Oct 3 14:32:23 2018 -0600 Docs - document nodename option to pam_slurm_adopt PAM module. Fix formatting on the log_level documentation block as well. Bug 5399.
Closing as resolved/fixed in commit 6d6ecacb7ccec2d40df1b643ef106f4c079bcf0e Thanks, Tim
Un-resolving, I still have one patch to review here...
Thanks Marshall. I made a few minor tweaks to the error message - keep in mind that the most likely explanation for the failure is that no jobs are running, so I removed some of the elaboration to the comment block instead: commit 9fb15b4a899333c4c437309a569099b8f1143152 Author: Marshall Garey <marshall@schedmd.com> Date: Wed Dec 5 15:19:58 2018 -0700 pam_slurm_adopt - send an error message to the user if no Jobs found. Also throw an error message within stepd_available() if the nodename is not set or cannot be inferred correctly. Bug 5399.