Ticket 5399

Summary: Document pam_slurm_adopt nodename parameter
Product: Slurm Reporter: Marshall Garey <marshall>
Component: ContributionsAssignee: Marshall Garey <marshall>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: alex, tim
Version: 17.11.7   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=5490
Site: SchedMD Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed: 18.08.1 / 18.08.4, 19.05.0pre2
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: Document the pam_slurm_adopt nodename parameter.
Send an error message when pam_slurm_adopt can't find nodename.

Description Marshall Garey 2018-07-09 15:20:39 MDT
A couple of times now, people have (inadvertently) used --enable-multiple-slurmd and then pam_slurm_adopt didn't work because they didn't have the nodename parameter. nodename was never documented, even in the README, but it's time to document it in the pam_slurm_adopt.shtml page.
Comment 1 Marshall Garey 2018-07-09 15:21:45 MDT
See also bug 5380, bug 5026.
Comment 2 Marshall Garey 2018-07-13 13:38:36 MDT
Created attachment 7299 [details]
Document the pam_slurm_adopt nodename parameter.

In my testing, this is how it is, whether or not you're using multiple-slurmd - you need to use nodename if the node hostname doesn't match the NodeName in slurm.conf.


NEWS: does there need to be a NEWS entry for a doc patch?

Commit message:

Document the pam_slurm_adopt nodename parameter.
Comment 3 Marshall Garey 2018-07-13 13:42:58 MDT
Brian, can you review this patch?

Alex, I'm also adding you to CC since on bug 5380 you asked me to test out nodename some more. I found that even when I wasn't using multiple-slurmd, if the hostname didn't match the slurm.conf NodeName, I needed to use nodename in pam_slurm_adopt. Can you verify? I tested that by using a virtual machine as a node, with a different hostname than the host OS (host OS hostname is voyager, VM hostname is voyager2), and the NodeName in slurm.conf different than that. It works correctly if I use nodename in pam_slurm_adopt.
Comment 4 Marshall Garey 2018-07-13 15:04:12 MDT
Created attachment 7300 [details]
Send an error message when pam_slurm_adopt can't find nodename.

Brian, can you also review this one?

Commit message:

When nodename can't be determined, pam_slurm_adopt prints and error to the syslog, then exits with whatever action_generic_failure says ("ignore" by default). No error messages are sent to the shell trying to ssh in. For example, if you get denied (action_generic_failure=deny), then all you see is "Authentication failed." Or if you succeed, you get in without being part of a cgroup, but don't know that anything went wrong.

This patch sends an error message to the ssh shell so that the user knows if something went wrong, and thus can investigate the cause by looking at syslog (assuming they've enabled PAM debugging in syslog or elsewhere). It also adds an error message for when nodename isn't found.
Comment 5 Marshall Garey 2018-07-13 15:15:12 MDT
I actually have yet another question on this:

If the slurmdspooldir can't be read for some reason, it returns a non-NULL but empty list of steps. Then, the user would get denied access with the message, "Access denied by pam_slurm_adopt: you have no active jobs on this node." This is the wrong message.

I almost included a patch that calls FREE_NULL_LIST(l) in each spot before returning (in stepd_available() (step_api.c). However, there are a number of places that don't handle stepd_available() returning NULL. That's a bug even currently, in my opinion.

I'd like to fix all these places, then add in the FREE_NULL_LIST(l) at each error in stepd_available(). What are your thoughts?
Comment 6 Marshall Garey 2018-07-17 10:04:49 MDT
Removing Brian from CC, adding Tim.
Comment 7 Marshall Garey 2018-09-26 14:48:46 MDT
Ping.

See also bug 5490 - make everywhere check return code of stepd_available(). (It's not done, I'm just linking it.)
Comment 8 Tim Wickberg 2018-10-03 14:33:29 MDT
Comment on attachment 7299 [details]
Document the pam_slurm_adopt nodename parameter.

Thanks Marshall, this one is in:

commit 6d6ecacb7ccec2d40df1b643ef106f4c079bcf0e
Author: Marshall Garey <marshall@schedmd.com>
Date:   Wed Oct 3 14:32:23 2018 -0600

    Docs - document nodename option to pam_slurm_adopt PAM module.
    
    Fix formatting on the log_level documentation block as well.
    
    Bug 5399.
Comment 9 Marshall Garey 2018-10-03 14:39:37 MDT
Closing as resolved/fixed in commit 6d6ecacb7ccec2d40df1b643ef106f4c079bcf0e

Thanks, Tim
Comment 10 Tim Wickberg 2018-10-03 14:46:54 MDT
Un-resolving, I still have one patch to review here...
Comment 11 Tim Wickberg 2018-12-05 15:22:47 MST
Thanks Marshall. I made a few minor tweaks to the error message - keep in mind that the most likely explanation for the failure is that no jobs are running, so I removed some of the elaboration to the comment block instead:

commit 9fb15b4a899333c4c437309a569099b8f1143152
Author: Marshall Garey <marshall@schedmd.com>
Date:   Wed Dec 5 15:19:58 2018 -0700

    pam_slurm_adopt - send an error message to the user if no Jobs found.
    
    Also throw an error message within stepd_available() if the nodename
    is not set or cannot be inferred correctly.
    
    Bug 5399.