SchedMD, I am encountering issues with pam_slurm.so and pam_slurm_adopt.so after upgrading from 20.02.6 to 21.08.4/21.08.5. With pam_slurm.so active, normal sbatch jobs run to completion just fine; however, interactive sessions (e.g. srun --pty bash) get stuck and do not complete or cancel cleanly. With pam_slurm_adopt.so active and all of the configuration adjustments it requires, any job submitted gets cancelled immediately. Here is an example of the journalctl output on a compute node targeted for job submission:

Feb 09 09:34:00 slurmd[21559]: slurmd: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
Feb 09 09:34:00 slurmd[21559]: slurmd: Could not launch job 48 and not able to requeue it, cancelling job

Another symptom is that no StdOut or StdErr files are written. I saw these symptoms when upgrading to both 21.08.4 and 21.08.5, so it feels like a broader issue. I've followed the instructions in the Quick Start#Upgrade and pam_slurm_adopt documentation.

Thanks,
-Jason Kim
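For reference, here is a rough sketch of how I'm reproducing this; the PAM lines are approximate rather than verbatim copies of my files, and I can attach the exact configs if needed:

# /etc/pam.d/sshd, account stack (approximate):
#   account    required     pam_slurm.so          <- when testing pam_slurm
#   account    sufficient   pam_slurm_adopt.so    <- when testing pam_slurm_adopt

# A normal batch job runs to completion:
sbatch --wrap="hostname; sleep 30"

# An interactive job hangs with pam_slurm.so active; with pam_slurm_adopt.so
# active, the job is cancelled immediately:
srun --pty bash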
Hi Jason,

Just for some added clarity on this: if you don't require pam_slurm/pam_slurm_adopt, does everything work fine? Would you please provide your current slurm.conf, the file(s) that you modified under /etc/pam.d/, the output of ldd /lib/security/pam_slurm_adopt.so, the OS you are running, and a more complete log from slurmd?

Thanks!
--Tim
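In case it helps, something like the following should gather most of that; this is just a sketch assuming slurmd runs as a systemd unit and currently logs to the journal:

# Logging-related settings as currently configured:
scontrol show config | grep -i -E 'SlurmdLog|SlurmdDebug|SlurmdSyslogDebug'

# Library resolution for the PAM module:
ldd /lib/security/pam_slurm_adopt.so

# slurmd output from the journal on the affected compute node:
journalctl -u slurmd --since today

# OS release:
cat /etc/os-release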
Created attachment 23466 [details] slurm.conf used during test jobid 92
Tim,

Unfortunately, I just tested after disabling both PAM modules and the error persists.

OS: Red Hat Enterprise Linux Server release 7.6 (Maipo)

Attachments for all of the other requested files are incoming.

Thanks,
-Jason Kim
Created attachment 23467 [details] /etc/pam.d/sshd used during test jobid 92
Created attachment 23468 [details] /etc/pam.d/password-auth used during test jobid 92
Created attachment 23469 [details] ldd /lib/security/pam_slurm_adopt.so output
Created attachment 23470 [details] slurmd output (level debug5) used during test jobid 92
Created attachment 23471 [details] slurmctld output (level debug5) during test jobid 92
Thank you for all the logs/information!

There seems to be some information missing from the slurmd (no stepd logs?) but... I did notice in the ctld logs the path "/weka/users/...". Are you using WekaIO?
(In reply to Tim McMullan from comment #11)
> Thank you for all the logs/information!
>
> There seems to be some information missing from the slurmd (no stepd logs?)
> but... I did notice in the ctld logs the path "/weka/users/...". Are you
> using WekaIO?

For the slurmd logs, I'm pulling them from /var/log/messages directly off the target compute node that was requested for the job. If there is a way to get the stepd logs, let me know, but I'm not even sure the job gets to that point before cancelling itself and draining the node.

Yes, we are running WekaIO; hopefully this doesn't complicate the troubleshooting.

Thanks,
-Jason
(In reply to Jason Kim from comment #12)
> (In reply to Tim McMullan from comment #11)
> > Thank you for all the logs/information!
> >
> > There seems to be some information missing from the slurmd (no stepd logs?)
> > but... I did notice in the ctld logs the path "/weka/users/...". Are you
> > using WekaIO?
>
> For the slurmd logs, I'm pulling them from /var/log/messages directly off
> the target compute node that was requested for the job. If there is a way to
> get the stepd logs, let me know, but I'm not even sure the job gets to that
> point before cancelling itself and draining the node.

It feels like a familiar issue to me, but at least for now it might be worth setting "SlurmdLogFile" to something, since that generally works.

> Yes, we are running WekaIO; hopefully this doesn't complicate the
> troubleshooting.

It doesn't make troubleshooting harder; I actually ask because there are known issues with Slurm and WekaIO that manifest as job launch failures, so this might be the real culprit.

There are a couple of options for confirming it. One is to use the "dedicated_mode=none" option for Weka (noted here: https://bugs.schedmd.com/show_bug.cgi?id=12393#c102). The other is to test a Slurm build with the patch from bug12393 or the slurm-21.08 branch from git. Since you are running this on a test system I'm comfortable recommending either, but I don't recommend running that branch in production. Note that the bug12393 patch should be in 21.08.6, which has not been released yet.

Thanks!
--Tim
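For the logging piece, a minimal slurm.conf sketch would look something like the lines below; the path and debug level are only examples, and slurmd on the node needs a restart after the change:

# slurm.conf on the compute node (example values, not a production recommendation):
SlurmdLogFile=/var/log/slurmd.log
SlurmdDebug=debug5    # verbose only while troubleshooting; drop back to info afterwards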
(In reply to Tim McMullan from comment #13)
> It doesn't make troubleshooting harder; I actually ask because there are
> known issues with Slurm and WekaIO that manifest as job launch failures, so
> this might be the real culprit.
>
> There are a couple of options for confirming it. One is to use the
> "dedicated_mode=none" option for Weka (noted here:
> https://bugs.schedmd.com/show_bug.cgi?id=12393#c102). The other is to test a
> Slurm build with the patch from bug12393 or the slurm-21.08 branch from git.
> Since you are running this on a test system I'm comfortable recommending
> either, but I don't recommend running that branch in production. Note that
> the bug12393 patch should be in 21.08.6, which has not been released yet.
>
> Thanks!
> --Tim

Tim,

I'm in the middle of testing the "dedicated_mode=none" Weka mount option workaround right now, and so far the results are promising and it is working. There are still a number of test configuration changes that I need to undo, but I will update again when testing is complete.

Thanks,
-Jason
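For the record, this is roughly how I'm applying the workaround on a compute node; the backend host and filesystem name below are placeholders for our actual Weka mount, so treat it as a sketch rather than the exact commands:

# Remount the Weka filesystem with the workaround option (placeholders for our setup):
umount /weka
mount -t wekafs -o dedicated_mode=none backend-host/our_fs /weka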
Thanks for the update, Jason! I'm glad the initial results look good; let me know how everything goes!

--Tim
Tim,

While the Weka workaround does function, we have decided to wait for the 21.08.6 release, which contains the fix, due to other complications.

Thank you for the help and for the reference to the other bug entry; you may close this bug.

Thanks,
-Jason
(In reply to Jason Kim from comment #18)
> Tim,
>
> While the Weka workaround does function, we have decided to wait for the
> 21.08.6 release, which contains the fix, due to other complications.
>
> Thank you for the help and for the reference to the other bug entry; you may
> close this bug.
>
> Thanks,
> -Jason

Thanks Jason, sounds good. I'm glad we were able to confirm that this was ultimately the issue! I'll close this now.

Thank you!
--Tim