Ticket 13388 - Jobs cancelling after upgrading from 20.02 to 21.08, weka.io conflict
Summary: Jobs cancelling after upgrading from 20.02 to 21.08, weka.io conflict
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 21.08.5
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Tim McMullan
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-02-09 08:15 MST by Jason Kim
Modified: 2022-02-15 09:25 MST
CC List: 1 user

See Also:
Site: EM
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---


Attachments
slurm.conf used during jobid test 92 (4.51 KB, text/plain), 2022-02-14 10:36 MST, Jason Kim
/etc/pam.d/sshd used during test jobid 92 (1.10 KB, text/plain), 2022-02-14 10:39 MST, Jason Kim
/etc/pam.d/password-auth used during test jobid 92 (1.45 KB, text/plain), 2022-02-14 10:39 MST, Jason Kim
ldd /lib/security/pam_slurm_adopt.so output (660 bytes, text/plain), 2022-02-14 10:40 MST, Jason Kim
slurmd output (level debug5) used during test jobid 92 (2.60 KB, text/plain), 2022-02-14 10:41 MST, Jason Kim
slurmctld output (level debug5) during test jobid 92 (10.90 KB, text/plain), 2022-02-14 10:42 MST, Jason Kim

Description Jason Kim 2022-02-09 08:15:55 MST
SchedMD,

I am encountering issues with pam_slurm.so and pam_slurm_adopt.so after upgrading from 20.02.6 to 21.08.4/21.08.5.

With pam_slurm.so active, normal sbatch jobs run to completion just fine; however, interactive sessions (e.g. srun --pty bash) get stuck and do not complete or cancel cleanly.

With pam_slurm_adopt.so active and all the configuration adjustments for it, any job submitted gets cancelled immediately. Here is an example of the journalctl output on a compute node targeted for job submission:

Feb 09 09:34:00 slurmd[21559]: slurmd: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
Feb 09 09:34:00 slurmd[21559]: slurmd: Could not launch job 48 and not able to requeue it, cancelling job

Another symptom is that no StdOut or StdErr files are written.
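
For reference, a minimal way to see this symptom (the paths here are just an example, not our real job script):

# trivial test job; it is cancelled immediately and the output file never appears
sbatch --wrap="hostname" -o /tmp/test_%j.out
ls -l /tmp/test_*.out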

I saw these symptoms when upgrading to both 21.08.4 and 21.08.5, so it feels like a broader issue. I've followed the instructions in the Quick Start#Upgrade and pam_slurm_adopt documentation.
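
For context, the pam_slurm_adopt changes follow the documented pattern; a minimal sketch of the relevant account rule, assuming a typical RHEL-style PAM stack (the exact files we used are attached to this ticket):

# /etc/pam.d/sshd -- pam_slurm_adopt as the last rule in the account stack
account    required     pam_slurm_adopt.so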

Thanks,
-Jason Kim
Comment 2 Tim McMullan 2022-02-14 09:55:50 MST
Hi Jason,

Just for some added clarity on this - if you don't require pam_slurm/pam_slurm_adopt, everything works fine?

Would you please provide your current slurm.conf, the file(s) that you modified from /etc/pam.d/, the output of ldd /lib/security/pam_slurm_adopt.so, what OS you are running on, and a more complete log from the slurmd?
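
Something like the following should capture most of that (the paths are assumptions based on a default install; adjust to your layout):

# gather the requested details; paths are assumptions, adjust as needed
cp /etc/slurm/slurm.conf .
cp /etc/pam.d/sshd /etc/pam.d/password-auth .
ldd /lib/security/pam_slurm_adopt.so > pam_slurm_adopt_ldd.txt
cat /etc/os-release
journalctl -u slurmd --since today > slurmd_journal.txt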

Thanks!
--Tim
Comment 4 Jason Kim 2022-02-14 10:36:32 MST
Created attachment 23466
slurm.conf used during jobid test 92
Comment 5 Jason Kim 2022-02-14 10:37:49 MST
Tim,

Unfortunately, I just tested after disabling both pam options and the error persists.

OS: Red Hat Enterprise Linux Server release 7.6 (Maipo)

Attachments for all the other requested files are incoming.

Thanks,
-Jason Kim
Comment 6 Jason Kim 2022-02-14 10:39:13 MST
Created attachment 23467
/etc/pam.d/sshd used during test jobid 92
Comment 7 Jason Kim 2022-02-14 10:39:44 MST
Created attachment 23468
/etc/pam.d/password-auth used during test jobid 92
Comment 8 Jason Kim 2022-02-14 10:40:29 MST
Created attachment 23469
ldd /lib/security/pam_slurm_adopt.so output
Comment 9 Jason Kim 2022-02-14 10:41:08 MST
Created attachment 23470
slurmd output (level debug5) used during test jobid 92
Comment 10 Jason Kim 2022-02-14 10:42:23 MST
Created attachment 23471
slurmctld output (level debug5) during test jobid 92
Comment 11 Tim McMullan 2022-02-14 10:56:59 MST
Thank you for all the logs/information!

There seems to be some information missing from the slurmd (no stepd logs?) but... I did notice in the ctld logs the path "/weka/users/...".  Are you using WekaIO?
Comment 12 Jason Kim 2022-02-14 11:13:48 MST
(In reply to Tim McMullan from comment #11)
> Thank you for all the logs/information!
> 
> There seems to be some information missing from the slurmd (no stepd logs?)
> but... I did notice in the ctld logs the path "/weka/users/...".  Are you
> using WekaIO?

For the slurmd logs, I'm pulling them from /var/log/messages directly off the target compute node that was requested for the job. If there is a way to get the stepd logs let me know, but I'm not even sure the job gets to that point before cancelling itself and draining the node.

Yes, we are running WekaIO, hopefully this doesn't complicate the troubleshooting.

Thanks,
-Jason
Comment 13 Tim McMullan 2022-02-14 11:31:10 MST
(In reply to Jason Kim from comment #12)
> (In reply to Tim McMullan from comment #11)
> > Thank you for all the logs/information!
> > 
> > There seems to be some information missing from the slurmd (no stepd logs?)
> > but... I did notice in the ctld logs the path "/weka/users/...".  Are you
> > using WekaIO?
> 
> For the slurmd logs, I'm pulling them from /var/log/messages directly off
> the target compute node that was requested for the job. If there is a way to
> get the stepd logs let me know, but I'm not even sure the job gets to that
> point before cancelling itself and draining the node.

It feels like a familiar issue to me, but at least for now it might be worth setting "SlurmdLogFile" to something, since that generally works.
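
A minimal sketch of what I mean, with an example path (make sure the directory exists and is writable by slurmd on each compute node, and restart slurmd after the change):

# slurm.conf additions -- the path is just an example
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdDebug=debug5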

> Yes, we are running WekaIO, hopefully this doesn't complicate the
> troubleshooting.

It doesn't make troubleshooting harder; I actually asked because there are known issues with Slurm and WekaIO that manifest as job launch failures, so this might be the real culprit.

There are a couple of options for confirming it.  One is to use the "dedicated_mode=none" option for Weka (noted here: https://bugs.schedmd.com/show_bug.cgi?id=12393#c102).  The other is to test a Slurm build with the patch from bug 12393, or the slurm-21.08 branch from git.  Since you are running on a test system, I'm comfortable recommending either, but I would not recommend running that branch in production.  Note that the bug 12393 patch should be in 21.08.6, which has not been released yet.
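
For reference, a rough sketch of what applying that option could look like on a client node; the backend host, filesystem name, and mount point below are hypothetical, so check the Weka documentation for your actual setup:

# remount the wekafs client with dedicated_mode=none (names are hypothetical)
umount /weka
mount -t wekafs -o dedicated_mode=none backend-host/users /weka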

Thanks!
--Tim
Comment 16 Jason Kim 2022-02-14 13:02:28 MST
(In reply to Tim McMullan from comment #13)
> 
> It doesn't make troubleshooting harder; I actually asked because there are
> known issues with Slurm and WekaIO that manifest as job launch failures, so
> this might be the real culprit.
> 
> There are a couple of options for confirming it.  One is to use the
> "dedicated_mode=none" option for Weka (noted here:
> https://bugs.schedmd.com/show_bug.cgi?id=12393#c102).  The other is to test
> a Slurm build with the patch from bug 12393, or the slurm-21.08 branch from
> git.  Since you are running on a test system, I'm comfortable recommending
> either, but I would not recommend running that branch in production.  Note
> that the bug 12393 patch should be in 21.08.6, which has not been released
> yet.
> 
> Thanks!
> --Tim

Tim,

I'm in the middle of testing the "dedicated_mode=none" Weka mount option workaround right now; so far the results are promising and it is working. There are still a number of test configuration changes I need to undo, but I will update again once testing is complete.

Thanks,
-Jason
Comment 17 Tim McMullan 2022-02-14 13:05:05 MST
Thanks for the update, Jason! I'm glad the initial results look good; let me know how everything goes!

--Tim
Comment 18 Jason Kim 2022-02-14 15:19:28 MST
Tim,

While the weka workaround does function, we have decided to wait until the 21.08.6 release with the fix due to other complications.

Thank you for the help and for the reference to the other bug entry; you may close this bug.

Thanks,
-Jason
Comment 19 Tim McMullan 2022-02-15 09:25:20 MST
(In reply to Jason Kim from comment #18)
> Tim,
> 
> While the weka workaround does function, we have decided to wait until the
> 21.08.6 release with the fix due to other complications.
> 
> Thank you for the help and for the reference to the other bug entry; you may
> close this bug.
> 
> Thanks,
> -Jason

Thanks Jason, sounds good.  I'm glad we were able to confirm that this was ultimately the issue!  I'll close this now.

Thank you!
--Tim