Ticket 3660

Summary:	NYU pam_slurm_adopt problem
Product:	Slurm	Reporter:	Tim Wickberg <tim>
Component:	Other	Assignee:	Tim Wickberg <tim>
Status:	RESOLVED DUPLICATE	QA Contact:
Severity:	4 - Minor Issue
Priority:	---	CC:	hpc-staff
Version:	17.02.1
Hardware:	Linux
OS:	Linux
Site:	NYU	Slinky Site:	---
Attachments:	slurmd log

Description Tim Wickberg 2017-04-04 10:33:50 MDT

From bug 3550 comment 18:

pam_slurm_adopt failed for us too. I also saw a message like the above. I added a debug statement in the pam_slurm_adopt function _get_job_uid(...), right after the call to stepd_get_uid(...), and saw that the uid is that number too:
uid 4294967295

Is it somehow -1 interpreted as 4294967295?
2^32 = 4294967296

We are now running on 17.02.1-2.


Thanks!

Comment 1 Tim Wickberg 2017-04-04 10:41:18 MDT

Hey folks -

Please file separate issues in cases like this in the future; bug 3550 is still marked closed, and we might overlook responding to it. (We don't monitor the resolved issues as closely, and they don't show up as outstanding in our workflows.)

It's also unrelated to the original NERSC problem as best I can tell.

On to my response:

-1, when viewed as an unsigned 32-bit integer, is 4294967295. That special value shows up in a few different ways in Slurm.
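
As a quick illustration of that arithmetic, here is a minimal sketch in Python (not Slurm code; the helper names are made up for this example) showing how -1 turns into 4294967295 when its bits are read as an unsigned 32-bit value:

```python
import struct


def as_uint32(value: int) -> int:
    """Wrap an integer into the unsigned 32-bit range (modulo 2**32)."""
    return value & 0xFFFFFFFF


def reinterpret_int32_as_uint32(value: int) -> int:
    """Pack as a signed 32-bit int, then read the same bytes back as unsigned."""
    return struct.unpack("<I", struct.pack("<i", value))[0]


if __name__ == "__main__":
    print(as_uint32(-1))                    # 4294967295
    print(reinterpret_int32_as_uint32(-1))  # 4294967295
    print(as_uint32(-1) == 2**32 - 1)       # True
```

Both helpers give the same result, which matches the 2^32 - 1 value seen in the debug output.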

The most common use is as a special case of NO_VAL, used to indicate that a field is unset. It is also used as a "step" number to indicate the resources assigned to the "external" step. That "external" step is used to manage the cgroups that pam_slurm_adopt needs to have access to, and is set up when PrologFlags=contain is set in slurm.conf.

In your particular case, it looks like the job that pam_slurm_adopt was attempting to query was not responding or had just exited.

Are you seeing any failures directly related to users not being able to connect? Logs from slurmd on the node from that time may also shed some light on what is happening.

- Tim

Comment 2 NYU HPC Team 2017-04-04 11:19:23 MDT

Created attachment 4289
slurmd log

Thank you, Tim. The slurmd log is attached. Yes, there is an error, but srun still puts us on the compute node:
$ srun -w c25-03 --x11 --pty /bin/bash
srun: error: x11: unable to connect node c25-03
$ hostname
c25-03

The related rsyslog messages are below:
Apr  4 13:09:33 c25-03 pam_slurm_adopt[149594]: Unable to dlopen libslurm.so.31.0.0: libslurm.so.31.0.0: cannot open shared object file: No such file or directory
Apr  4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug:  Reading cgroup.conf file /opt/slurm/etc/cgroup.conf
Apr  4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug:  Reading slurm.conf file: /opt/slurm/etc/slurm.conf
Apr  4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug4: found jobid = 243977, stepid = 4294967294
Apr  4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug4: found jobid = 243995, stepid = 4294967295
Apr  4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug4: found jobid = 243977, stepid = 4294967295
Apr  4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug3: Trying to load plugin /opt/slurm/lib64/slurm/auth_munge.so
Apr  4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug:  Munge authentication plugin loaded
Apr  4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug3: Success.
Apr  4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug:  ME=== uid 4294967295
Apr  4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug3: unable to determine uid of step 243995.4294967295 on c25-03
Apr  4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug:  ME=== uid 4294967295
Apr  4 13:09:34 c25-03 pam_slurm_adopt[149594]: debug3: unable to determine uid of step 243977.4294967295 on c25-03
Apr  4 13:09:34 c25-03 pam_slurm_adopt[149594]: send_user_msg: Access denied by pam_slurm_adopt: you have no active jobs on this node
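
To make the large step IDs in the log above easier to read, here is a rough Python sketch that pulls the "found jobid" lines out of the log and labels the special step numbers. The mapping is an assumption on my part: I believe the 17.02 slurm.h defines SLURM_BATCH_SCRIPT as 0xfffffffe and SLURM_EXTERN_CONT as 0xffffffff, but please check the header for your release. The function and variable names are made up for this example.

```python
import re

# Assumed sentinel step IDs (believed to match SLURM_BATCH_SCRIPT and
# SLURM_EXTERN_CONT in slurm.h for 17.02; verify against your headers).
SPECIAL_STEPS = {
    0xFFFFFFFE: "batch",
    0xFFFFFFFF: "extern",
}

# Matches the debug4 lines emitted by pam_slurm_adopt in the log above.
STEP_RE = re.compile(r"found jobid = (\d+), stepid = (\d+)")


def describe_steps(log_text: str) -> list[str]:
    """Return 'jobid.step' strings, replacing sentinel step IDs with labels."""
    results = []
    for match in STEP_RE.finditer(log_text):
        jobid, stepid = int(match.group(1)), int(match.group(2))
        label = SPECIAL_STEPS.get(stepid, str(stepid))
        results.append(f"{jobid}.{label}")
    return results


SAMPLE_LOG = """\
pam_slurm_adopt[149594]: debug4: found jobid = 243977, stepid = 4294967294
pam_slurm_adopt[149594]: debug4: found jobid = 243995, stepid = 4294967295
pam_slurm_adopt[149594]: debug4: found jobid = 243977, stepid = 4294967295
"""

if __name__ == "__main__":
    print(describe_steps(SAMPLE_LOG))
    # ['243977.batch', '243995.extern', '243977.extern']
```

Read this way, the two "unable to determine uid" lines both refer to extern steps (243995 and 243977), which is where pam_slurm_adopt would try to adopt the incoming connection.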

Comment 3 NYU HPC Team 2017-04-04 12:04:51 MDT

In the /etc/pam.d configuration, when we replace pam_slurm_adopt.so with pam_slurm.so, that srun command runs without an error, and the "error: x11: unable to read DISPLAY value" message no longer appears in the slurmd log.

Comment 4 NYU HPC Team 2017-04-05 07:43:38 MDT

(In reply to NYU HPC Team from comment #3)
> In /etc pam.d configuration, when replacing pam_slurm_adopt.so with
> pam_slurm.so, that srun command runs without throwing error, and also there
> is no error as "error: x11: unable to read DISPLAY value" in slurmd log.

Tim, after adding the second patch suggested in
https://bugs.schedmd.com/show_bug.cgi?id=3550,
the problem is now fixed.

Thank you for your help!