| Summary: | How to continue use of pam_slurm_adopt without the new config-less feature | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | rl303f |
| Component: | Other | Assignee: | Tim McMullan <mcmullan> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | mcmullan, sfellini, susanc |
| Version: | 20.02.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NIH | Linux Distro: | CentOS |
| Attachments: | root logins with no config | ||
As a workaround, please try setting the path to your slurm.conf via the SLURM_CONF environment variable in /etc/profile.d/slurm.[cs]h.

When you encounter this, is the config file available and the node up? Are you trying to log in as root or a different user?

(In reply to Tim McMullan from comment #2)
> When you encounter this, is the config file available and the node up? Are
> you trying to log in as root or a different user?

Yes, the config file is available (as normal). Yes, the node is up but slurmd is not (as normal). Attempting to log in as root (as normal). Nothing has changed except that pam_slurm_adopt.so appears broken on 20.02.5.

Thanks!

Created attachment 15992
root logins with no config

From the logs you have provided, it looks like pam_slurm_adopt is likely failing to stat the config file.
Would you be able to run "strace sinfo --version" and see if sinfo is able to stat the conf file from a node that is having issues? If you can attach the output too that would be great.
If it is helpful for you, I've also attached a patch to pam_slurm_adopt that should permit root to log in even when the config file can't be found.
Thanks!
I just wanted to check in and see if you were able to get the output from the strace? I also wanted to expand a little on my thought process here. The slurm_conf_init that we added in pam_slurm_adopt will in effect be called again later in the module, so if the new one isn't working there will likely be a similar problem just further along in the process. (Also FYI: the patch I attached earlier is slated for 20.02.6.)

Thanks and let me know!
--Tim

Thank you for those suggestions, Tim. We are doing some testing and will update you soon. Thanks again and be safe!

Sounds good, keep me posted on how the testing goes and let me know how I can help!

Thanks, and stay safe!
--Tim

Bumping this down to a severity 2.

Thank you, Tim.
I need to clarify that our upgrade was incomplete. We had not finished
all of the required steps and that is why pam_slurm_adopt was unable
to find the slurm.conf. Also, it was the ordering of the upgrade steps
that created the problem. This is explained more below.
Your idea of doing an strace led us to the solution. We performed an
strace of the sshd on the node and then tried to establish an ssh
connection to the node. We found this:
$ grep slurm.conf /scratch/strace_sshd.out
14704 stat("/usr/local/slurm-20.02/slurm-20.02.5-dev/etc/slurm.conf", 0x7fffffffdf70) = -1 ENOENT (No such file or directory)
14704 stat("/run/slurm/conf/slurm.conf", 0x7fffffffdf70) = -1 ENOENT (No such file or directory)
At that point we realized that pam_slurm_adopt was now trying to access
slurm.conf. However, we had not yet created the symlink for slurm/etc
that points to the config dir. After completing all of the upgrade
steps including creating the symlink pointing to slurm/etc, pam_slurm_adopt
was able to find the slurm.conf just fine and ssh access was restored.
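
The lookup exposed by the strace can be checked directly from a shell on the node. The following is a minimal sketch, not anything shipped with Slurm: the function name find_slurm_conf is made up for illustration, the two hard-coded paths are simply the ones from the strace output above, and putting SLURM_CONF first is an assumption about lookup order rather than something shown in this thread.

```shell
#!/bin/sh
# Print the first existing slurm.conf among the candidate paths, in order.
# Returns non-zero if none exists (the ENOENT case seen in the strace).
find_slurm_conf() {
    for candidate in "$@"; do
        # -e follows symlinks, so a missing or dangling etc symlink fails here
        if [ -n "$candidate" ] && [ -e "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example call using the two paths from the strace output.
# SLURM_CONF is tried first only as an assumption about lookup order.
find_slurm_conf \
    "${SLURM_CONF:-}" \
    "/usr/local/slurm-20.02/slurm-20.02.5-dev/etc/slurm.conf" \
    "/run/slurm/conf/slurm.conf" \
    || echo "no slurm.conf found at any candidate path" >&2
```

If nothing is printed on stdout, the node is in the same state as the failing one: neither path resolves, which matches the point at which the log shows pam_slurm_adopt attempting a DNS SRV lookup instead.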
This threw me off because up to now this had not been an issue and
ssh connections worked fine after updating pam_slurm_adopt.so but
things are different now since introducing the config-less feature.
So, we're going to reorder our upgrade procedure to push out the
new pam_slurm_adopt.so _after_ the install tree is fully configured.
That should avoid any further ssh disruptions.
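
One way to enforce that ordering is to make the module push refuse to run until slurm.conf is reachable under the install tree. This is only a sketch under assumptions: deploy_pam_module and its three arguments are hypothetical names, and the actual install prefix, build location, and PAM module directory depend on the site.

```shell
#!/bin/sh
# Copy the new PAM module into place only if slurm.conf is already reachable
# under the install prefix. Function name and arguments are hypothetical.
deploy_pam_module() {
    prefix=$1      # Slurm install tree whose etc/ should hold slurm.conf
    module_src=$2  # freshly built pam_slurm_adopt.so
    pam_dir=$3     # directory the PAM stack loads modules from

    # -e follows the etc symlink, so this fails until the symlink exists
    if [ ! -e "$prefix/etc/slurm.conf" ]; then
        echo "refusing to deploy: $prefix/etc/slurm.conf not found" >&2
        return 1
    fi
    cp -p "$module_src" "$pam_dir/pam_slurm_adopt.so"
}
```

Calling it before the etc symlink has been created returns non-zero and leaves the existing module untouched, so ssh access keeps working with the old pam_slurm_adopt.so until the tree is complete.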
Sorry for the false alarm. This was more of a PEBCAK situation than
any kind of bug.
Thanks again and STAY SAFE!
Thank you for the update! I had been thinking it might be something like that.

I just wanted to add that the patch I attached (which will be included in 20.02.6) should permit your original upgrade procedure to work (though changing it shouldn't cause an issue either). Data from the slurm config isn't actually needed to permit root logins, but the new slurm_conf_init() call you noticed was being done too soon and would block the root login. We've now moved it to right after the root user check to avoid situations like the one you encountered. Previously this call was well hidden in the middle of the user logins, which is likely why you didn't see this in the past.

Thank you for the report and I'm glad I could help! Stay safe!
--Tim

We are currently at slurm version 19.05.6 and are trying to upgrade to 20.02.5 without success. The issue seems to revolve around our use of pam_slurm_adopt and election to NOT use the new "config-less" feature.

We find that when the new pam_slurm_adopt module is placed on the compute node, ssh to the node is no longer allowed (whereas this was not an issue previously using pam_slurm_adopt.so from version 19.05.6):

$ ssh node01
Authentication failed.

The node log file contains the following errors:

pam_slurm_adopt[31245]: error: resolve_ctls_from_dns_srv: res_nsearch error: Connection refused
pam_slurm_adopt[31245]: error: fetch_config: DNS SRV lookup failed
pam_slurm_adopt[31245]: error: _establish_config_source: failed to fetch config
pam_slurm_adopt[31245]: fatal: Could not establish a configuration source

We see the addition of an extra line in the new pam_slurm_adopt.c file:

> slurm_conf_init(NULL);

The previous 19.05.6 pam_slurm_adopt.c file does not contain this line and works fine allowing ssh connections.

Can you tell us how we may continue using pam_slurm_adopt without the new "config-less" feature?

Thank you and stay safe!