Hello,

pam_slurm_adopt stopped functioning on cori after the upgrade to slurm 17.02. After much headscratching, since cray compute nodes don't run syslog (thus disabling typical pam_debug methods), it finally turned out that journalctl can still expose it. Anyway:

Mar 08 00:02:55 nid00116 sshd[10671]: PAM adding faulty module: /lib64/security/pam_slurm_adopt.so
Mar 08 00:04:16 nid00116 sshd[10807]: PAM unable to dlopen(/lib64/security/pam_slurm_adopt.so): /lib64/security/pam_slurm_adopt.so: undefined symbol: s_p_get_uint64
Mar 08 00:04:16 nid00116 sshd[10807]: PAM adding faulty module: /lib64/security/pam_slurm_adopt.so
Mar 08 00:15:01 nid00116 cron[11552]: PAM unable to dlopen(/lib64/security/pam_slurm_adopt.so): /lib64/security/pam_slurm_adopt.so: undefined symbol: s_p_get_uint64
Mar 08 00:15:01 nid00116 cron[11552]: PAM adding faulty module: /lib64/security/pam_slurm_adopt.so

nid00116:~ # nm /usr/lib64/libslurm.so.31.0.0 | grep s_p_get_uint64
00000000000b057b t s_p_get_uint64
00000000000b057b T slurm_s_p_get_uint64
nid00116:~ #

I suppose in 16.05 s_p_get_uint64 was exported, but no longer in slurm 17.02. My guess is that this is one of many symbols impacted.

nid00116:~ # ldd /lib64/security/pam_slurm_adopt.so
        linux-vdso.so.1 (0x00007ffff7ffe000)
        libslurm.so.31 => /usr/lib64/libslurm.so.31 (0x00007ffff79ad000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007ffff77a9000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ffff758b000)
        libc.so.6 => /lib64/libc.so.6 (0x00007ffff71e3000)
        /lib64/ld-linux-x86-64.so.2 (0x0000555555554000)
nid00116:~ #

-Doug
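A note for readers less familiar with nm: in its output, an uppercase type letter generally marks a global (external) symbol and a lowercase one a symbol local to the library, which is why the lowercase `t` entry above cannot satisfy the plugin's reference at dlopen time. The following is only a rough sketch of that reading, using the two lines of nm output from this report; the helper name `symbol_visibility` is made up for illustration, and special lowercase types such as `w` or `u` are not handled.

```python
# Sketch: classify nm output lines by symbol visibility.
# The sample text is copied from the nm output in this report;
# the helper name `symbol_visibility` is hypothetical.

NM_OUTPUT = """\
00000000000b057b t s_p_get_uint64
00000000000b057b T slurm_s_p_get_uint64
"""


def symbol_visibility(nm_text):
    """Map each symbol name to 'global' or 'local' from its nm type letter."""
    result = {}
    for line in nm_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines and entries without an address
        _addr, sym_type, name = parts
        # nm convention: uppercase type letter = global (external),
        # lowercase = usually local to the library.
        result[name] = "global" if sym_type.isupper() else "local"
    return result


visibility = symbol_visibility(NM_OUTPUT)
for name, vis in visibility.items():
    print(f"{name}: {vis}")
```

On a real node the same check could be fed from `nm` directly; here the input is hard-coded so the sketch runs anywhere.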
Created attachment 4169
Fix that makes pam_slurm_adopt.so working again

Same problem here. Fix is attached.
Thanks Thomas - fix committed as a7699ba462e, and will be in 17.02.2 when released. The dependency on s_p_get_uint64 is new in 17.02 as a result of converting all the memory types to 64 bit; and that function didn't exist previously. It looks like I overlooked the addition to slurm_xlator.h when adding that back in commit d487c16e2d0. - Tim
Is there any complication to updating functions redefined in slurm_xlator? Upon recompiling can I just update the pam_slurm rpm or all slurm rpms?
Good question, that. I believe just the pam_slurm RPM needs to be rebuilt. libslurm.so should already be exporting the symbol for slurm_s_p_get_uint64; you just need pam_slurm_adopt.c to pick up the translation to use it from slurm_xlator.h.
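To make the reasoning above concrete, here is a toy model of why only the plugin needs rebuilding. It is not how the toolchain works internally (the real renaming is done by the C preprocessor via slurm_xlator.h and the lookup by the dynamic linker); it only mimics the effect. The two symbol names come from this ticket; the dictionaries and the function `unresolved_symbols` are assumptions made up for illustration.

```python
# Toy model only: real resolution is done by the C preprocessor and the
# dynamic linker, not by Python. Everything except the two symbol names
# quoted in this ticket is hypothetical.

# What libslurm.so exports according to the nm output (the 'T' entry).
EXPORTED_BY_LIBSLURM = {"slurm_s_p_get_uint64"}

# Stand-ins for the header with and without the missing translation.
XLATOR_WITH_FIX = {"s_p_get_uint64": "slurm_s_p_get_uint64"}
XLATOR_WITHOUT_FIX = {}


def unresolved_symbols(called, xlator, exported):
    """Return the names a plugin would fail to resolve at dlopen time."""
    # The translation header rewrites call names at plugin compile time.
    referenced = {xlator.get(name, name) for name in called}
    return sorted(referenced - exported)


plugin_calls = ["s_p_get_uint64"]
before_fix = unresolved_symbols(plugin_calls, XLATOR_WITHOUT_FIX, EXPORTED_BY_LIBSLURM)
after_fix = unresolved_symbols(plugin_calls, XLATOR_WITH_FIX, EXPORTED_BY_LIBSLURM)
print("without translation:", before_fix)
print("with translation:   ", after_fix)
```

In this model the library's export set never changes between the two runs; only the plugin's view of the name does, which matches the advice that rebuilding the pam_slurm RPM should be enough.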
hmmm, on my test system pam_slurm_adopt.so now loads, but does not appear to function correctly:

dmj@gerty:~> salloc -C haswell
salloc: Granted job allocation 19212
salloc: Waiting for resource configuration
salloc: Nodes nid00022 are ready for job
dmj@nid00022:~> ssh nid00022
...
...
Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed by 10.128.0.23
dmj@nid00022:~>

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov
Well... some progress at least. I'll test it out here shortly and see if I can narrow down the problem. Unfortunately, the contribs/ aren't covered in our regression tests, and pam_slurm/pam_slurm_adopt themselves we don't usually have enabled on our test systems (as we inevitably lock ourselves out). I'm making a note in our release testing docs that these need to be verified. (commit 89816cefee in addition to some internal documentation).
(In reply to Doug Jacobsen from comment #5)
> hmmm, on my test system pam_slurm_adopt.so now loads, but does not appear
> to function correctly:
>
> dmj@gerty:~> salloc -C haswell
> salloc: Granted job allocation 19212
> salloc: Waiting for resource configuration
> salloc: Nodes nid00022 are ready for job
> dmj@nid00022:~> ssh nid00022
> ...
> ...
> Access denied by pam_slurm_adopt: you have no active jobs on this node
> Connection closed by 10.128.0.23
> dmj@nid00022:~>

That explains why I could not get it running yet (we did not use this before). I thought it was a misconfiguration at our site, but yes, it seems broken.
After having found out how to change the log level, here is the error that I get:

pam_slurm_adopt[1592]: debug3: unable to determine uid of step 24600.4294967295 on tpb0002
Fixed in commit 68e64e699696.

You should only need to rebuild and reinstall slurm-pam; this second issue was entirely within the pam_slurm_adopt plugin. Thanks for your patience. As mentioned, pam_slurm and pam_slurm_adopt had been left off our release testing plans; they're on there now.

- Tim
(In reply to Tim Wickberg from comment #9)
> Fixed in commit 68e64e699696.
>
> You should only need to rebuild and reinstall slurm-pam, this second issue
> was entirely within the pam_slurm_adopt plugin. Thanks for your patience. As
> mentioned, pam_slurm and pam_slurm_adopt had been left off our release
> testing plans; they're on there now.
>
> - Tim

Thanks Tim, this seems to work.
There is still something wrong. The memory cgroup is not set. Here is what happens inside salloc:

[to86cola@tlb0001 etc]# srun hostname
tai0001
[to86cola@tlb0001 etc]# srun cat /proc/self/cgroup
11:hugetlb:/
10:devices:/
9:blkio:/
8:net_prio,net_cls:/
7:pids:/
6:memory:/slurm/uid_256360334/job_24615/step_3
5:cpuset:/slurm/uid_256360334/job_24615/step_3
4:cpuacct,cpu:/
3:freezer:/slurm/uid_256360334/job_24615/step_3
2:perf_event:/
1:name=systemd:/system.slice/slurmd.service
[to86cola@tlb0001 etc]#

And here is what happens inside ssh:

[to86cola@tlb0001 ~]# ssh tai0001
Last login: Thu Mar 9 10:44:38 2017 from tlb0001.tcluster
[to86cola@tai0001 ~]$ cat /proc/self/cgroup
11:hugetlb:/
10:devices:/
9:blkio:/
8:net_prio,net_cls:/
7:pids:/
6:memory:/
5:cpuset:/slurm/uid_256360334/job_24615/step_extern
4:cpuacct,cpu:/
3:freezer:/slurm/uid_256360334/job_24615/step_extern
2:perf_event:/
1:name=systemd:/user.slice/user-256360334.slice/session-19.scope
[to86cola@tai0001 ~]$

The restriction to cpu cores works as intended, but I can use as much memory as I want to. I will re-check our configuration, but there might be something wrong with the "extern" step.
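For anyone wanting to repeat this comparison on their own nodes, a small sketch along the following lines may help spot which controllers were left out of the job's cgroup. It parses two /proc/self/cgroup dumps (the ones posted in this comment, hard-coded here) and lists the controllers that sit under /slurm/ in the srun case but not in the ssh case. The function names are hypothetical, and the "/slurm/" prefix is simply taken from the paths shown in this comment.

```python
# Sketch: compare two /proc/self/cgroup dumps from this comment.
# Function names are hypothetical; the data is copied from the thread.

SRUN_CGROUP = """\
11:hugetlb:/
10:devices:/
9:blkio:/
8:net_prio,net_cls:/
7:pids:/
6:memory:/slurm/uid_256360334/job_24615/step_3
5:cpuset:/slurm/uid_256360334/job_24615/step_3
4:cpuacct,cpu:/
3:freezer:/slurm/uid_256360334/job_24615/step_3
2:perf_event:/
1:name=systemd:/system.slice/slurmd.service
"""

SSH_CGROUP = """\
11:hugetlb:/
10:devices:/
9:blkio:/
8:net_prio,net_cls:/
7:pids:/
6:memory:/
5:cpuset:/slurm/uid_256360334/job_24615/step_extern
4:cpuacct,cpu:/
3:freezer:/slurm/uid_256360334/job_24615/step_extern
2:perf_event:/
1:name=systemd:/user.slice/user-256360334.slice/session-19.scope
"""


def parse_cgroup(text):
    """Parse 'id:controllers:path' lines into {controllers: path}."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        _hier_id, controllers, path = line.split(":", 2)
        table[controllers] = path
    return table


def not_adopted(srun_text, ssh_text, prefix="/slurm/"):
    """Controllers confined under `prefix` for srun but not for ssh."""
    srun, ssh = parse_cgroup(srun_text), parse_cgroup(ssh_text)
    return [c for c, path in srun.items()
            if path.startswith(prefix) and not ssh.get(c, "").startswith(prefix)]


missing = not_adopted(SRUN_CGROUP, SSH_CGROUP)
print(missing)
```

With the two dumps above, only the memory controller shows up as missing, which agrees with the observation that cpu confinement works while memory is unrestricted.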
I double-checked our configuration and don't see any obvious mistake. On the other hand, I don't see any obvious reason in the Slurm source code why it is not working. Could somebody please check whether it works correctly for him?
Another update: The process ID of the sshd is written to .../step_extern/cgroup.procs, but removed in src/slurmd/slurmstepd/req.c during the call of

        /* Send the return code */
        safe_write(fd, &rc, sizeof(int));

I don't really understand what is going on there. Any ideas?
I just realized that this is the return code to the caller. Maybe this is a systemd issue? I'll dig into this next week.
Re-marking as resolved. Thomas - it looks like you're seeing a separate issue; you're welcome to file a different bug on that if you notice a problem, but I suspect it's a configuration issue. In the future, please do not re-open bugs filed by other customers. - Tim
*** Ticket 3565 has been marked as a duplicate of this ticket. ***
(In reply to Thomas Opfer from comment #8)
> After having found out how to change the log level, here is the error that I
> get:
>
> pam_slurm_adopt[1592]: debug3: unable to determine uid of step
> 24600.4294967295 on tpb0002

pam_slurm_adopt failed for us too. I also saw a message like the above. I added a debug statement in the pam_slurm_adopt function _get_job_uid(...), after the call to stepd_get_uid(...), and saw that the uid is that number too:

uid 4294967295

Is it somehow -1 interpreted as 4294967295?
2^32 = 4294967296

We are now running on 17.02.1-2.

Thanks!
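The arithmetic in the question above does check out: storing -1 in a 32-bit unsigned integer yields 2^32 - 1 = 4294967295. The short sketch below shows just that conversion. It assumes uid_t is a 32-bit unsigned type (as it is on Linux) and that the lookup signals failure with -1; neither the Slurm source nor stepd_get_uid itself is reproduced here.

```python
import ctypes

# Assumption: uid_t is a 32-bit unsigned integer (true on Linux), and the
# failing lookup reports its error as -1. ctypes wraps the value the same
# way a C assignment to an unsigned 32-bit variable would.
uid_from_error = ctypes.c_uint32(-1).value

print(uid_from_error)
print(uid_from_error == 2**32 - 1)
```

So a uid of 4294967295 in a debug print is more plausibly an error sentinel than a real account.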
(In reply to NYU HPC Team from comment #18)

Sorry, this is after adding the following line to src/common/slurm_xlator.h:

#define s_p_get_uint64 slurm_s_p_get_uint64

and then recompiling and redeploying as shown in earlier comments in this ticket.
(In reply to NYU HPC Team from comment #18)

Did you set

PrologFlags = Contain

in slurm.conf?
(In reply to Thomas Opfer from comment #20)
> Did you set
>
> PrologFlags = Contain
>
> in slurm.conf?

Yes we do set PrologFlags = Contain. thanks!
(In reply to NYU HPC Team from comment #21)
> Yes we do set PrologFlags = Contain. thanks!

I'm sorry. Forget about my above post. Did you apply this patch:

https://github.com/SchedMD/slurm/commit/68e64e699696

?
(In reply to Thomas Opfer from comment #22)
> I'm sorry. Forget about my above post. Did you apply this patch:
>
> https://github.com/SchedMD/slurm/commit/68e64e699696
>
> ?

This fixes the problem, very nice! The command now works:

$ srun -w <nodename> --x11 --pty /bin/bash
*** Ticket 3660 has been marked as a duplicate of this ticket. ***