Ticket 9877 - How to continue use of pam_slurm_adopt without the new config-less feature
Summary: How to continue use of pam_slurm_adopt without the new config-less feature
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 20.02.5
Hardware: Linux Linux
Severity: 2 - High Impact
Assignee: Tim McMullan
Reported: 2020-09-22 13:38 MDT by rl303f
Modified: 2020-09-25 10:05 MDT

Site: NIH
Linux Distro: CentOS


Attachments
root logins with no config (762 bytes, patch)
2020-09-22 15:42 MDT, Tim McMullan

Description rl303f 2020-09-22 13:38:21 MDT
We are currently on Slurm 19.05.6 and have been trying, without success,
to upgrade to 20.02.5.

The issue seems to revolve around our use of pam_slurm_adopt and our
decision NOT to use the new "config-less" feature.

We find that when the new pam_slurm_adopt module is placed on the compute
node, ssh to the node is no longer allowed (whereas this was not an issue
previously using pam_slurm_adopt.so from version 19.05.6):

$ ssh node01
Authentication failed.

The node log file contains the following errors:

pam_slurm_adopt[31245]: error: resolve_ctls_from_dns_srv: res_nsearch error: Connection refused
pam_slurm_adopt[31245]: error: fetch_config: DNS SRV lookup failed
pam_slurm_adopt[31245]: error: _establish_config_source: failed to fetch config
pam_slurm_adopt[31245]: fatal: Could not establish a configuration source

We see the addition of an extra line in the new pam_slurm_adopt.c file:

>
>       slurm_conf_init(NULL);

The previous 19.05.6 pam_slurm_adopt.c file does not contain this line and
works fine allowing ssh connections.

Can you tell us how we may continue using pam_slurm_adopt without the new
"config-less" feature?

Thank you and stay safe!
Comment 1 Nate Rini 2020-09-22 13:43:26 MDT
As a workaround, please try setting the path to your slurm.conf via the SLURM_CONF environment variable in /etc/profile.d/slurm.[cs]h.
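
A minimal sketch of the sh variant of that file, assuming slurm.conf lives at /etc/slurm/slurm.conf (a placeholder path, not taken from this ticket). The csh variant would use setenv instead of export.

```shell
# /etc/profile.d/slurm.sh (sketch only)
# ASSUMPTION: the path below is a placeholder; substitute the real
# location of slurm.conf on the node.
SLURM_CONF=/etc/slurm/slurm.conf
export SLURM_CONF
```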
Comment 2 Tim McMullan 2020-09-22 14:01:01 MDT
When you encounter this, is the config file available and the node up?  Are you trying to log in as root or a different user?
Comment 3 rl303f 2020-09-22 14:32:26 MDT
(In reply to Tim McMullan from comment #2)
> When you encounter this, is the config file available and the node up?  Are
> you trying to log in as root or a different user?

Yes, config file is available (as normal).
Yes, the node is up but slurmd is not (as normal).
Attempting to log in as root (as normal).

Nothing has changed, except that pam_slurm_adopt.so appears broken on 20.02.5.

Thanks!
Comment 4 Tim McMullan 2020-09-22 15:42:39 MDT
Created attachment 15992 [details]
root logins with no config

From the logs you have provided, it looks like pam_slurm_adopt is likely failing to stat the config file.

Would you be able to run "strace sinfo --version" and see if sinfo is able to stat the conf file from a node that is having issues?  If you can attach the output too that would be great.

If it is helpful for you, I've also attached a patch to pam_slurm_adopt that should permit root to log in even when the config file can't be found.

Thanks!
Comment 5 Tim McMullan 2020-09-22 17:14:27 MDT
I just wanted to check in and see whether you were able to get the output from the strace.

I also wanted to expand a little on my thought process here.  The slurm_conf_init that we added in pam_slurm_adopt will in effect be called again later in the module, so if the new one isn't working there will likely be a similar problem just further along in the process.  (Also FYI: the patch I attached earlier is slated for 20.02.6).

Thanks and let me know!
--Tim
Comment 6 rl303f 2020-09-23 13:48:25 MDT
Thank you for those suggestions, Tim.

We are doing some testing and will update you soon.

Thanks again and be safe!
Comment 7 Tim McMullan 2020-09-23 13:53:03 MDT
Sounds good. Keep me posted on how the testing goes and let me know how I can help!

Thanks, and stay safe!
--Tim
Comment 8 Jason Booth 2020-09-24 10:16:26 MDT
Bumping this down to a severity 2.
Comment 9 rl303f 2020-09-25 07:36:37 MDT
Thank you, Tim.

I need to clarify that our upgrade was incomplete. We had not finished
all of the required steps and that is why pam_slurm_adopt was unable
to find the slurm.conf.  Also, it was the ordering of the upgrade steps
that created the problem.  This is explained more below.

Your idea of doing an strace led us to the solution.  We performed an
strace of the sshd on the node and then tried to establish an ssh
connection to the node.  We found this:

$ grep slurm.conf /scratch/strace_sshd.out
14704 stat("/usr/local/slurm-20.02/slurm-20.02.5-dev/etc/slurm.conf", 0x7fffffffdf70) = -1 ENOENT (No such file or directory)
14704 stat("/run/slurm/conf/slurm.conf", 0x7fffffffdf70) = -1 ENOENT (No such file or directory)

At that point we realized that pam_slurm_adopt was now trying to access
slurm.conf.  However, we had not yet created the symlink for slurm/etc
that points to the config dir.  After completing all of the upgrade
steps including creating the symlink pointing to slurm/etc, pam_slurm_adopt
was able to find the slurm.conf just fine and ssh access was restored.
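
The grep above can be narrowed to the failed lookups only. This is a minimal sketch, assuming strace output in the format shown in this comment. The function name missing_slurm_conf is hypothetical, and the sample input is the two trace lines quoted above.

```shell
#!/bin/sh
# Print each slurm.conf path that the traced process failed to stat
# with ENOENT. Hypothetical helper; assumes strace lines shaped like
# the ones quoted in this ticket.
missing_slurm_conf() {
    # $1: path to the strace output file
    grep 'slurm\.conf' "$1" | grep 'ENOENT' \
        | sed -n 's/.*stat("\([^"]*\)".*/\1/p'
}

# Sample input copied from the trace in this comment.
trace=$(mktemp)
cat > "$trace" <<'EOF'
14704 stat("/usr/local/slurm-20.02/slurm-20.02.5-dev/etc/slurm.conf", 0x7fffffffdf70) = -1 ENOENT (No such file or directory)
14704 stat("/run/slurm/conf/slurm.conf", 0x7fffffffdf70) = -1 ENOENT (No such file or directory)
EOF

result=$(missing_slurm_conf "$trace")
printf '%s\n' "$result"
rm -f "$trace"
```

Each printed path is a location where the module looked for slurm.conf and did not find it.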

This threw me off because, until now, ssh connections kept working
after pam_slurm_adopt.so was updated.  Things are different now that
the config-less feature has been introduced.

So, we're going to reorder our upgrade procedure to push out the
new pam_slurm_adopt.so _after_ the install tree is fully configured.
That should avoid any further ssh disruptions.
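
The reordered procedure could be guarded by a pre-flight check along these lines. This is a sketch under assumptions: slurm_conf_ready is a hypothetical helper, and the <prefix>/etc/slurm.conf layout is taken from the first stat() path in the trace above. The demonstration runs against a throwaway directory rather than a real install tree.

```shell
#!/bin/sh
# Return 0 if slurm.conf is readable under the given install prefix.
# Hypothetical helper; assumes the <prefix>/etc/slurm.conf layout.
slurm_conf_ready() {
    [ -r "$1/etc/slurm.conf" ]
}

# Demonstration against a temporary directory tree.
prefix=$(mktemp -d)
if slurm_conf_ready "$prefix"; then before=ready; else before=missing; fi

# Simulate the missing upgrade step: a config dir plus the etc symlink.
mkdir -p "$prefix/config"
: > "$prefix/config/slurm.conf"
ln -s "$prefix/config" "$prefix/etc"
if slurm_conf_ready "$prefix"; then after=ready; else after=missing; fi

echo "before symlink: $before"
echo "after symlink:  $after"
rm -rf "$prefix"
```

In a deployment script, the step that copies the new pam_slurm_adopt.so would run only once the check succeeds for the real install prefix.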

Sorry for the false alarm.  This was more of a PEBCAK situation than
any kind of bug.

Thanks again and STAY SAFE!
Comment 10 Tim McMullan 2020-09-25 10:05:18 MDT
Thank you for the update! I had been thinking it might be something like that.

I just wanted to add that the patch I attached (which will be included in 20.02.6) should permit your original upgrade procedure to work (though changing it shouldn't cause an issue either).

Data from the Slurm config isn't actually needed to permit root logins, but the new slurm_conf_init() call you noticed was being made too early and would block the root login.  We've now moved it to right after the check for the root user to avoid situations like the one you encountered.  Previously this call was well hidden in the middle of the user login path, which is likely why you didn't see this in the past.

Thank you for the report and I'm glad I could help!
Stay safe!
--Tim