Ticket 3550 - pam_slurm_adopt can not access s_p_get_uint64
Summary: pam_slurm_adopt can not access s_p_get_uint64
Status: RESOLVED FIXED
Product: Slurm
Component: Other
Version: 17.02.1
Hardware: Cray XC Linux
Severity: 3 - Medium Impact
Assignee: Tim Wickberg
Duplicates: 3565 3660
Reported: 2017-03-08 01:28 MST by Doug Jacobsen
Modified: 2017-04-18 20:31 MDT (History)
Site: NERSC
Version Fixed: 17.02.2

Attachments
Fix that makes pam_slurm_adopt.so working again (476 bytes, patch)
2017-03-08 05:46 MST, Thomas Opfer

Description Doug Jacobsen 2017-03-08 01:28:57 MST
Hello,

pam_slurm_adopt stopped functioning on Cori after the upgrade to Slurm 17.02. After much head-scratching (Cray compute nodes don't run syslog, which rules out the typical pam_debug methods), it finally turned out that journalctl can still expose the error:

Mar 08 00:02:55 nid00116 sshd[10671]: PAM adding faulty module: /lib64/security/pam_slurm_adopt.so
Mar 08 00:04:16 nid00116 sshd[10807]: PAM unable to dlopen(/lib64/security/pam_slurm_adopt.so): /lib64/security/pam_slurm_adopt.so: undefined symbol: s_p_get_uint64
Mar 08 00:04:16 nid00116 sshd[10807]: PAM adding faulty module: /lib64/security/pam_slurm_adopt.so
Mar 08 00:15:01 nid00116 cron[11552]: PAM unable to dlopen(/lib64/security/pam_slurm_adopt.so): /lib64/security/pam_slurm_adopt.so: undefined symbol: s_p_get_uint64
Mar 08 00:15:01 nid00116 cron[11552]: PAM adding faulty module: /lib64/security/pam_slurm_adopt.so


nid00116:~ # nm /usr/lib64/libslurm.so.31.0.0  | grep s_p_get_uint64
00000000000b057b t s_p_get_uint64
00000000000b057b T slurm_s_p_get_uint64
nid00116:~ # 


I suppose s_p_get_uint64 was exported in 16.05 but no longer is in Slurm 17.02. My guess is that this is one of many symbols impacted.
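The nm output above can be read mechanically: in default nm output the second column is a type letter, where lowercase usually means a local symbol and uppercase a global (exported) one, and `t`/`T` both refer to the text (code) section. The following is a minimal sketch of that reading, not part of Slurm. The function name `nm_symbol_is_global` is hypothetical, and the parser assumes the three-column "address type name" layout, so it does not handle lines for undefined symbols or the special-case letters such as `u`, `v`, and `w`.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of default nm output ("<address> <type> <name>") and
 * report how the named symbol is listed.
 * Returns 1 for an uppercase type letter (global), 0 for a lowercase
 * one (usually local), and -1 if the line does not parse or names a
 * different symbol.
 */
int nm_symbol_is_global(const char *line, const char *symbol)
{
	char addr[32];
	char type;
	char name[256];

	if (sscanf(line, "%31s %c %255s", addr, &type, name) != 3)
		return -1;
	if (strcmp(name, symbol) != 0)
		return -1;
	return isupper((unsigned char) type) ? 1 : 0;
}
```

Applied to the two lines above, this returns 0 for `s_p_get_uint64` (type `t`) and 1 for `slurm_s_p_get_uint64` (type `T`), which matches the dlopen failure: only the prefixed name is visible to the dynamic linker.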

nid00116:~ # ldd /lib64/security/pam_slurm_adopt.so 
	linux-vdso.so.1 (0x00007ffff7ffe000)
	libslurm.so.31 => /usr/lib64/libslurm.so.31 (0x00007ffff79ad000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007ffff77a9000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ffff758b000)
	libc.so.6 => /lib64/libc.so.6 (0x00007ffff71e3000)
	/lib64/ld-linux-x86-64.so.2 (0x0000555555554000)
nid00116:~ # 


-Doug
Comment 1 Thomas Opfer 2017-03-08 05:46:37 MST
Created attachment 4169 [details]
Fix that makes pam_slurm_adopt.so working again

Same problem here. Fix is attached.
Comment 2 Tim Wickberg 2017-03-08 07:42:42 MST
Thanks Thomas - fix committed as a7699ba462e, and will be in 17.02.2 when released.

The dependency on s_p_get_uint64 is new in 17.02 as a result of converting all the memory types to 64 bit; and that function didn't exist previously. It looks like I overlooked the addition to slurm_xlator.h when adding that back in commit d487c16e2d0.

- Tim
Comment 3 Doug Jacobsen 2017-03-08 07:47:50 MST
Is there any complication to updating functions redefined in slurm_xlator?
Upon recompiling, can I update just the pam_slurm RPM, or do I need to update all Slurm RPMs?

Comment 4 Tim Wickberg 2017-03-08 07:52:40 MST
Good question, that. I believe just the pam_slurm RPM needs to be rebuilt.

libslurm.so should already be exporting the symbol for slurm_s_p_get_uint64; you just need pam_slurm_adopt.c to pick up the translation to use it from slurm_xlator.h.
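To illustrate the translation mechanism described here, the following is a minimal, self-contained sketch and not Slurm's actual code. The stand-in `slurm_s_p_get_uint64` uses a simplified signature (an assumption for this example), and `read_setting` is a hypothetical caller. Only the `#define` line mirrors the kind of entry that slurm_xlator.h carries.

```c
#include <stdint.h>

/*
 * Stand-in for the prefixed function that the shared library exports.
 * The signature is simplified for this sketch; it is an assumption,
 * not the real Slurm prototype.
 */
int slurm_s_p_get_uint64(uint64_t *value, const char *key)
{
	(void) key;
	*value = UINT64_C(4294967296);	/* arbitrary value that needs 64 bits */
	return 1;
}

/* The kind of translation line that slurm_xlator.h provides. */
#define s_p_get_uint64 slurm_s_p_get_uint64

/*
 * Plugin-style caller (hypothetical): the source says s_p_get_uint64,
 * but after preprocessing the object file references
 * slurm_s_p_get_uint64, which is the name the library exports.
 */
uint64_t read_setting(const char *key)
{
	uint64_t value = 0;

	if (!s_p_get_uint64(&value, key))
		return 0;
	return value;
}
```

Without the `#define`, the caller would be compiled with an undefined reference to the unprefixed name, which is the situation the dlopen error at the top of this ticket reports.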
Comment 5 Doug Jacobsen 2017-03-08 11:35:17 MST
Hmm, on my test system pam_slurm_adopt.so now loads, but does not appear to function correctly:

dmj@gerty:~> salloc  -C haswell
salloc: Granted job allocation 19212
salloc: Waiting for resource configuration
salloc: Nodes nid00022 are ready for job
dmj@nid00022:~> ssh nid00022
...
...
Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed by 10.128.0.23
dmj@nid00022:~>


----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

Comment 6 Tim Wickberg 2017-03-08 11:42:21 MST
Well... some progress at least.

I'll test it out here shortly and see if I can narrow down the problem. Unfortunately, contribs/ isn't covered by our regression tests, and we don't usually have pam_slurm/pam_slurm_adopt themselves enabled on our test systems (as we inevitably lock ourselves out).

I'm making a note in our release testing docs that these need to be verified. (commit 89816cefee in addition to some internal documentation).
Comment 7 Thomas Opfer 2017-03-08 13:03:57 MST
That explains why I could not get it running yet (we had not used this before). I thought it was a misconfiguration at our site, but yes, it seems broken.

Comment 8 Thomas Opfer 2017-03-08 13:16:27 MST
After having found out how to change the log level, here is the error that I get:

pam_slurm_adopt[1592]: debug3: unable to determine uid of step 24600.4294967295 on tpb0002
Comment 9 Tim Wickberg 2017-03-08 13:53:10 MST
Fixed in commit 68e64e699696.

You should only need to rebuild and reinstall slurm-pam; this second issue was entirely within the pam_slurm_adopt plugin. Thanks for your patience. As mentioned, pam_slurm and pam_slurm_adopt had been left off our release testing plans; they're on there now.

- Tim
Comment 10 Thomas Opfer 2017-03-08 14:01:24 MST
(In reply to Tim Wickberg from comment #9)

Thanks Tim, this seems to work.
Comment 11 Thomas Opfer 2017-03-09 02:53:42 MST
There is still something wrong. The memory cgroup is not set.

Here is what happens inside salloc:

[to86cola@tlb0001 etc]# srun hostname
tai0001
[to86cola@tlb0001 etc]# srun cat /proc/self/cgroup
11:hugetlb:/
10:devices:/
9:blkio:/
8:net_prio,net_cls:/
7:pids:/
6:memory:/slurm/uid_256360334/job_24615/step_3
5:cpuset:/slurm/uid_256360334/job_24615/step_3
4:cpuacct,cpu:/
3:freezer:/slurm/uid_256360334/job_24615/step_3
2:perf_event:/
1:name=systemd:/system.slice/slurmd.service
[to86cola@tlb0001 etc]#

And here is what happens inside ssh:

[to86cola@tlb0001 ~]# ssh tai0001
Last login: Thu Mar  9 10:44:38 2017 from tlb0001.tcluster
[to86cola@tai0001 ~]$ cat /proc/self/cgroup
11:hugetlb:/
10:devices:/
9:blkio:/
8:net_prio,net_cls:/
7:pids:/
6:memory:/
5:cpuset:/slurm/uid_256360334/job_24615/step_extern
4:cpuacct,cpu:/
3:freezer:/slurm/uid_256360334/job_24615/step_extern
2:perf_event:/
1:name=systemd:/user.slice/user-256360334.slice/session-19.scope
[to86cola@tai0001 ~]$

The restriction to cpu cores works as intended, but I can use as much memory as I want to.
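The comparison between the two listings can be automated. The following is a minimal sketch, assuming the cgroup v1 line format shown above ("id:controllers:path"); the function names `cgroup_path_for` and `in_slurm_cgroup` are hypothetical and not part of Slurm.

```c
#include <string.h>

/*
 * Given one line of /proc/<pid>/cgroup in the cgroup v1 form
 * "<id>:<controllers>:<path>", return a pointer to <path> if the
 * comma-separated controller list contains `controller`, else NULL.
 */
const char *cgroup_path_for(const char *line, const char *controller)
{
	const char *list = strchr(line, ':');
	const char *path;
	size_t want = strlen(controller);

	if (!list)
		return NULL;
	list++;				/* skip past "<id>:" */
	path = strchr(list, ':');
	if (!path)
		return NULL;

	/* Walk the comma-separated controller names before the second ':' */
	while (list < path) {
		const char *end = memchr(list, ',', (size_t) (path - list));
		size_t len;

		if (!end)
			end = path;
		len = (size_t) (end - list);
		if (len == want && strncmp(list, controller, len) == 0)
			return path + 1;
		list = end + 1;
	}
	return NULL;
}

/* Return 1 if the controller's path lies under the "/slurm/" hierarchy. */
int in_slurm_cgroup(const char *line, const char *controller)
{
	const char *path = cgroup_path_for(line, controller);

	return path && strncmp(path, "/slurm/", 7) == 0;
}
```

For the srun listing, the `memory` line is reported as inside the Slurm hierarchy; for the ssh listing, `6:memory:/` is not, while `cpuset` and `freezer` are. Lines read from the real file end in a newline, which the prefix check ignores.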

I will re-check our configuration, but there might be something wrong with the "extern" step.
Comment 12 Thomas Opfer 2017-03-09 04:23:36 MST
I double-checked our configuration and don't see any obvious mistake. On the other hand, I don't see any obvious reason in the Slurm source code why it is not working. Could somebody please check whether it works correctly for them?
Comment 13 Thomas Opfer 2017-03-09 08:18:44 MST
Another update: The process ID of the sshd is written to .../step_extern/cgroup.procs, but is removed again in src/slurmd/slurmstepd/req.c during this call:

	/* Send the return code */
	safe_write(fd, &rc, sizeof(int));

I don't really understand what is going on there. Any ideas?
Comment 15 Thomas Opfer 2017-03-09 08:44:16 MST
I just realized that this is the return code to the caller.

Maybe this is a systemd issue? I'll dig into this next week.
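For readers unfamiliar with the pattern, here is a rough sketch of what "sending the return code to the caller" over a file descriptor typically looks like. It does not reproduce Slurm's safe_write macro; `write_all`, `read_all`, `send_rc`, and `recv_rc` are hypothetical helpers written for this example. The point is that such a write only transfers an integer to the requesting process and does not itself touch cgroup membership.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Write exactly `len` bytes to `fd`, retrying on short writes and EINTR.
 * Returns 0 on success and -1 on error.
 */
int write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += n;
		len -= (size_t) n;
	}
	return 0;
}

/* Same idea for the reading side; a premature EOF counts as an error. */
int read_all(int fd, void *buf, size_t len)
{
	char *p = buf;

	while (len > 0) {
		ssize_t n = read(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			return -1;
		p += n;
		len -= (size_t) n;
	}
	return 0;
}

/* Hypothetical helper: hand an int result code back to the requester. */
int send_rc(int fd, int rc)
{
	return write_all(fd, &rc, sizeof(int));
}

/* Hypothetical helper: receive the result code on the other end. */
int recv_rc(int fd, int *rc)
{
	return read_all(fd, rc, sizeof(int));
}
```

One side calls `send_rc(fd, rc)` once it has finished handling the request, and the requester blocks in `recv_rc(fd, &rc)` until the integer arrives.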
Comment 16 Tim Wickberg 2017-03-09 08:50:08 MST
Re-marking as resolved.

Thomas - it looks like you're seeing a separate issue; you're welcome to file a different bug on that if you notice a problem, but I suspect it's a configuration issue. In the future, please do not re-open bugs filed by other customers.

- Tim
Comment 17 Tim Wickberg 2017-03-09 19:42:14 MST
*** Ticket 3565 has been marked as a duplicate of this ticket. ***
Comment 18 NYU HPC Team 2017-04-04 10:19:19 MDT
(In reply to Thomas Opfer from comment #8)
> After having found out how to change the log level, here is the error that I
> get:
> 
> pam_slurm_adopt[1592]: debug3: unable to determine uid of step
> 24600.4294967295 on tpb0002

pam_slurm_adopt failed for us too. I also saw a message like the one above. I added a debug statement in the pam_slurm_adopt function _get_job_uid(...), after the call to stepd_get_uid(...), and saw that the uid is that number too:
uid 4294967295

Is it somehow -1 interpreted as 4294967295?
2^32 = 4294967296, so 4294967295 = 2^32 - 1.
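The arithmetic behind that question can be checked directly. In C, converting -1 to a 32-bit unsigned type reduces it modulo 2^32, which gives 4294967295. The sketch below shows only that conversion; the function names are hypothetical, and whether an error value of -1 is actually what produced the number in this ticket is an assumption, not something the log output proves.

```c
#include <stdint.h>

/* Convert a signed 32-bit value to unsigned 32-bit (reduces modulo 2^32). */
uint32_t as_unsigned32(int32_t value)
{
	return (uint32_t) value;
}

/* Return 1 if `id` is the all-ones pattern that (uint32_t) -1 produces. */
int is_all_ones32(uint32_t id)
{
	return id == UINT32_MAX;
}
```

Here `as_unsigned32(-1)` equals `UINT32_MAX`, and adding 1 to it in 64-bit arithmetic gives exactly 2^32. On typical Linux systems uid_t is a 32-bit unsigned type, so a uid printed as 4294967295 is consistent with a failure value of -1 having been stored in it.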

We are now running on 17.02.1-2.


Thanks!
Comment 19 NYU HPC Team 2017-04-04 10:25:40 MDT
(In reply to NYU HPC Team from comment #18)

Sorry, this is after adding the following line to src/common/slurm_xlator.h:
#define s_p_get_uint64          slurm_s_p_get_uint64

and then recompiling and redeploying as shown in earlier comments in this ticket.
Comment 20 Thomas Opfer 2017-04-04 13:43:03 MDT
(In reply to NYU HPC Team from comment #18)

Did you set

PrologFlags = Contain

in slurm.conf?
Comment 21 NYU HPC Team 2017-04-04 13:48:13 MDT
(In reply to Thomas Opfer from comment #20)

Yes, we do set PrologFlags = Contain. Thanks!
Comment 22 Thomas Opfer 2017-04-04 14:22:32 MDT
(In reply to NYU HPC Team from comment #21)

I'm sorry. Forget about my above post. Did you apply this patch:

https://github.com/SchedMD/slurm/commit/68e64e699696

?
Comment 23 NYU HPC Team 2017-04-04 17:23:45 MDT
(In reply to Thomas Opfer from comment #22)

This fixes the problem, very nice! The command now works:
$ srun -w <nodename> --x11 --pty /bin/bash
Comment 24 Tim Wickberg 2017-04-18 20:31:33 MDT
*** Ticket 3660 has been marked as a duplicate of this ticket. ***