Ticket 13388 - Jobs cancelling after upgrading from 20.02 to 21.08, weka.io conflict
Summary: Jobs cancelling after upgrading from 20.02 to 21.08, weka.io conflict
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 21.08.5
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Tim McMullan
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-02-09 08:15 MST by Jason Kim
Modified: 2022-02-15 09:25 MST
CC List: 1 user

See Also:
Site: EM
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---


Attachments
slurm.conf used during jobid test 92 (4.51 KB, text/plain), 2022-02-14 10:36 MST, Jason Kim
/etc/pam.d/sshd used during test jobid 92 (1.10 KB, text/plain), 2022-02-14 10:39 MST, Jason Kim
/etc/pam.d/password-auth used during test jobid 92 (1.45 KB, text/plain), 2022-02-14 10:39 MST, Jason Kim
ldd /lib/security/pam_slurm_adopt.so output (660 bytes, text/plain), 2022-02-14 10:40 MST, Jason Kim
slurmd output (level debug5) used during test jobid 92 (2.60 KB, text/plain), 2022-02-14 10:41 MST, Jason Kim
slurmctld output (level debug5) during test jobid 92 (10.90 KB, text/plain), 2022-02-14 10:42 MST, Jason Kim

Description Jason Kim 2022-02-09 08:15:55 MST
SchedMD,

I am encountering issues with pam_slurm.so and pam_slurm_adopt.so after upgrading from 20.02.6 to 21.08.4/21.08.5.

With pam_slurm.so active, normal sbatch jobs run to completion just fine; however, interactive sessions (e.g. srun --pty bash) get stuck and do not complete or cancel cleanly.

With pam_slurm_adopt.so active and all the configuration adjustments for it, any job submitted gets cancelled immediately. Here is an example of the journalctl output on a compute node targeted for job submission:

Feb 09 09:34:00 slurmd[21559]: slurmd: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
Feb 09 09:34:00 slurmd[21559]: slurmd: Could not launch job 48 and not able to requeue it, cancelling job

Another symptom is that no StdOut or StdErr files are written.
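
For reference, a minimal way to see this symptom (the paths here are just an example, not our real job script):

# trivial test job; it is cancelled immediately and the output file never appears
sbatch --wrap="hostname" -o /tmp/test_%j.out
ls -l /tmp/test_*.out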

I saw these symptoms when upgrading to both 21.08.4 and 21.08.5, so it feels like a broader issue. I've followed the instructions in the Quick Start#Upgrade and pam_slurm_adopt documentation.
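
For context, the pam_slurm_adopt changes follow the documented pattern; a minimal sketch of the relevant account rule, assuming a typical RHEL-style PAM stack (the exact files we used are attached to this ticket):

# /etc/pam.d/sshd -- pam_slurm_adopt as the last rule in the account stack
account    required     pam_slurm_adopt.so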

Thanks,
-Jason Kim
Comment 2 Tim McMullan 2022-02-14 09:55:50 MST
Hi Jason,

Just for some added clarity on this - if you don't require pam_slurm/pam_slurm_adopt, everything works fine?

Would you please provide your current slurm.conf, the file(s) that you modified from /etc/pam.d/, the output of ldd /lib/security/pam_slurm_adopt.so, what OS you are running on, and a more complete log from the slurmd?
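
Something like the following should capture most of that (the paths are assumptions based on a default install; adjust to your layout):

# gather the requested details; paths are assumptions, adjust as needed
cp /etc/slurm/slurm.conf .
cp /etc/pam.d/sshd /etc/pam.d/password-auth .
ldd /lib/security/pam_slurm_adopt.so > pam_slurm_adopt_ldd.txt
cat /etc/os-release
journalctl -u slurmd --since today > slurmd_journal.txt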

Thanks!
--Tim
Comment 4 Jason Kim 2022-02-14 10:36:32 MST
Created attachment 23466
slurm.conf used during jobid test 92
Comment 5 Jason Kim 2022-02-14 10:37:49 MST
Tim,

Unfortunately, I just tested after disabling both pam options and the error persists.

OS: Red Hat Enterprise Linux Server release 7.6 (Maipo)

Attachments for all the other requested files are incoming.

Thanks,
-Jason Kim
Comment 6 Jason Kim 2022-02-14 10:39:13 MST
Created attachment 23467
/etc/pam.d/sshd used during test jobid 92
Comment 7 Jason Kim 2022-02-14 10:39:44 MST
Created attachment 23468
/etc/pam.d/password-auth used during test jobid 92
Comment 8 Jason Kim 2022-02-14 10:40:29 MST
Created attachment 23469
ldd /lib/security/pam_slurm_adopt.so output
Comment 9 Jason Kim 2022-02-14 10:41:08 MST
Created attachment 23470
slurmd output (level debug5) used during test jobid 92
Comment 10 Jason Kim 2022-02-14 10:42:23 MST
Created attachment 23471
slurmctld output (level debug5) during test jobid 92
Comment 11 Tim McMullan 2022-02-14 10:56:59 MST
Thank you for all the logs/information!

There seems to be some information missing from the slurmd (no stepd logs?) but... I did notice in the ctld logs the path "/weka/users/...".  Are you using WekaIO?
Comment 12 Jason Kim 2022-02-14 11:13:48 MST
(In reply to Tim McMullan from comment #11)
> Thank you for all the logs/information!
> 
> There seems to be some information missing from the slurmd (no stepd logs?)
> but... I did notice in the ctld logs the path "/weka/users/...".  Are you
> using WekaIO?

For the slurmd logs, I'm pulling them from /var/log/messages directly off the target compute node that was requested for the job. If there is a way to get the stepd logs let me know, but I'm not even sure the job gets to that point before cancelling itself and draining the node.

Yes, we are running WekaIO, hopefully this doesn't complicate the troubleshooting.

Thanks,
-Jason
Comment 13 Tim McMullan 2022-02-14 11:31:10 MST
(In reply to Jason Kim from comment #12)
> (In reply to Tim McMullan from comment #11)
> > Thank you for all the logs/information!
> > 
> > There seems to be some information missing from the slurmd (no stepd logs?)
> > but... I did notice in the ctld logs the path "/weka/users/...".  Are you
> > using WekaIO?
> 
> For the slurmd logs, I'm pulling them from /var/log/messages directly off
> the target compute node that was requested for the job. If there is a way to
> get the stepd logs let me know, but I'm not even sure the job gets to that
> point before cancelling itself and draining the node.

It feels like a familiar issue to me, but at least for now it might be worth setting "SlurmdLogFile" to something, since that generally works.
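
A minimal sketch of what I mean, with an example path (make sure the directory exists and is writable by slurmd on each compute node, and restart slurmd after the change):

# slurm.conf additions -- the path is just an example
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdDebug=debug5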

> Yes, we are running WekaIO, hopefully this doesn't complicate the
> troubleshooting.

It doesn't make troubleshooting harder; I actually asked because there are known issues with Slurm and WekaIO that manifest as job launch failures, so this might be the real culprit.

There are a couple of options for confirming it.  One is to use the "dedicated_mode=none" option for Weka (noted here: https://bugs.schedmd.com/show_bug.cgi?id=12393#c102).  The other is to test a Slurm build with the patch from bug 12393, or the slurm-21.08 branch from git.  Since you are running on a test system, I'm comfortable recommending either, but I would not recommend running that branch in production.  Note that the bug 12393 patch should be in 21.08.6, which has not been released yet.
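
For reference, a rough sketch of what applying that option could look like on a client node; the backend host, filesystem name, and mount point below are hypothetical, so check the Weka documentation for your actual setup:

# remount the wekafs client with dedicated_mode=none (names are hypothetical)
umount /weka
mount -t wekafs -o dedicated_mode=none backend-host/users /weka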

Thanks!
--Tim
Comment 16 Jason Kim 2022-02-14 13:02:28 MST
(In reply to Tim McMullan from comment #13)
> 
> It doesn't make troubleshooting harder; I actually asked because there are
> known issues with Slurm and WekaIO that manifest as job launch failures, so
> this might be the real culprit.
> 
> There are a couple of options for confirming it.  One is to use the
> "dedicated_mode=none" option for Weka (noted here:
> https://bugs.schedmd.com/show_bug.cgi?id=12393#c102).  The other is to test
> a Slurm build with the patch from bug 12393, or the slurm-21.08 branch from
> git.  Since you are running on a test system, I'm comfortable recommending
> either, but I would not recommend running that branch in production.  Note
> that the bug 12393 patch should be in 21.08.6, which has not been released
> yet.
> 
> Thanks!
> --Tim

Tim,

I'm in the middle of testing the "dedicated_mode=none" Weka mount option workaround right now; so far the results are promising and it is working. There are still a number of test configuration changes I need to undo, but I will update again once testing is complete.

Thanks,
-Jason
Comment 17 Tim McMullan 2022-02-14 13:05:05 MST
Thanks for the update, Jason! I'm glad the initial results look good; let me know how everything goes!

--Tim
Comment 18 Jason Kim 2022-02-14 15:19:28 MST
Tim,

While the weka workaround does function, we have decided to wait until the 21.08.6 release with the fix due to other complications.

Thank you for the help and for the reference to the other bug entry; you may close this bug.

Thanks,
-Jason
Comment 19 Tim McMullan 2022-02-15 09:25:20 MST
(In reply to Jason Kim from comment #18)
> Tim,
> 
> While the weka workaround does function, we have decided to wait until the
> 21.08.6 release with the fix due to other complications.
> 
> Thank you for the help and for the reference to the other bug entry; you may
> close this bug.
> 
> Thanks,
> -Jason

Thanks Jason, sounds good.  I'm glad we were able to confirm that this was ultimately the issue!  I'll close this now.

Thank you!
--Tim