| Summary: | slurmrestd unable to authenticate | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Jeff Avila <geoffrey_avila> |
| Component: | slurmrestd | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | cinek, nate |
| Version: | 20.02.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Brown Univ | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | config.log from build; nm output; ./configure output; latest config.log; latest config.log output; config.log build attempt; slurm.conf file | ||
Description
Jeff Avila
2021-03-18 12:56:30 MDT
(In reply to Jeff Avila from comment #0)
> slurmrestd: error: cannot find auth plugin for auth/munge

Is slurmrestd installed along with the full Slurm stack, including Munge?

Please also provide:
> systemctl status slurmrestd
Hi Folks,
This is only a submit host; the munge binaries live in an nfs-mounted
/usr/local/sbin, as does slurmrestd.
munge is running; viz:
[root@pslurmctlapicit sbin]# systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/local/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-03-16 18:03:31 EDT; 1 day 22h ago
     Docs: man:munged(8)
  Process: 112180 ExecStart=/usr/local/sbin/munged (code=exited, status=0/SUCCESS)
 Main PID: 112182 (munged)
   CGroup: /system.slice/munge.service
           └─112182 /usr/local/sbin/munged

Mar 16 18:03:31 pslurmctlapicit systemd[1]: Starting MUNGE authentication service...
Mar 16 18:03:31 pslurmctlapicit systemd[1]: Started MUNGE authentication service.
slurmrestd, otoh, is being run directly from the command-line for testing;
I don't have a systemd unit for it.
Thanks,
-Jeff
On Thu, Mar 18, 2021 at 3:00 PM <bugs@schedmd.com> wrote:
> Comment #2 <https://bugs.schedmd.com/show_bug.cgi?id=11134#c2> on bug 11134 <https://bugs.schedmd.com/show_bug.cgi?id=11134> from Nate Rini <nate@schedmd.com>
(In reply to Jeff Avila from comment #3)
> This is only a submit host; the munge binaries live in an nfs-mounted
> /usr/local/sbin, as does slurmrestd.

I generally advise against sites running Slurm (binaries and libraries) from NFS, as NFS issues can cause the site to appear down.

> slurmrestd, otoh, is being run directly from the command-line for testing;

Please call this and attach the output:
> echo -e 'GET invalid\r\n\r\n'| LD_DEBUG=all slurmrestd -vvvvv

Using screen/tmux/script is suggested since it might get very verbose.

> I don't have a systemd unit for it.

Okay, the example one was added in slurm-20.11.

Created attachment 18538 [details]
slurmrestd.txt

Here you go:

On Thu, Mar 18, 2021 at 4:39 PM <bugs@schedmd.com> wrote:
> Comment #4 <https://bugs.schedmd.com/show_bug.cgi?id=11134#c4> on bug 11134 from Nate Rini <nate@schedmd.com>

(In reply to Jeff Avila from comment #5)
> Created attachment 18538 [details]
> slurmrestd.txt

It doesn't even attempt to load munge, which is unexpected. Are you setting SLURM_JWT in the environment before calling slurmrestd?

Is it possible to get the config.log from the slurm build? Since this is from an RPM, you will need to pass this to rpmbuild to avoid it deleting the build directory:
> rpmbuild -D 'noclean 1' -D 'rel 1' $@

I am not setting SLURM_JWT to anything before calling slurmrestd...
As I was trying to say; I didn't know which slurm package srcrpm contained slurmrestd, so I found a prebuilt slurmrestd rpm (for version 20.02-6) on the Scientific Linux repository, and installed it that way. I don't have access to the config.log.
Thanks,
-Jeff

On Thu, Mar 18, 2021 at 5:22 PM <bugs@schedmd.com> wrote:
> Comment #6 <https://bugs.schedmd.com/show_bug.cgi?id=11134#c6> on bug 11134 from Nate Rini <nate@schedmd.com>

(In reply to Jeff Avila from comment #7)
> As I was trying to say; I didn't know which slurm package srcrpm contained
> slurmrestd, so I found a prebuilt slurmrestd rpm (for version 20.02-6) on
> the Scientific Linux repository, and installed it that way. I don't have
> access to the config.log.

Please attach the slurm.conf you're using with the Slurm cluster and with slurmrestd.

(In reply to Jeff Avila from comment #9)
> Here you go:
> AuthType=auth/munge

Looks like it should be trying to load munge.

(In reply to Jeff Avila from comment #7)
> As I was trying to say; I didn't know which slurm package srcrpm contained
> slurmrestd, so I found a prebuilt slurmrestd rpm (for version 20.02-6) on
> the Scientific Linux repository, and installed it that way. I don't have
> access to the config.log.

Based on what has been provided: the Slurm provided by the repo was not built correctly. We have zero control over the Slurm packages included in EPEL or in Scientific Linux and strongly advise supported sites against using them.

I would be happy to assist with instructions on how to compile Slurm for your cluster. I would first suggest trying our general 'building and installing slurm' instructions here:
> https://slurm.schedmd.com/quickstart_admin.html

Please note that the instructions also include how to build the RPMs for RHEL clones.

Hi Nate,
So, in light of your advice, I went back to our original source tarball and tried to rebuild the whole thing in order to get slurmrestd/libslurmfull

libtool: link: gcc -DNUMA_VERSION1_COMPATIBILITY -g -O2 -std=gnu99 -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -Wl,-rpath -Wl,/usr/lib64 -Wl,--no-as-needed -o .libs/slurmd slurmd.o req.o get_mach_stat.o ../common/libslurmd_common.o -Wl,-rpath=/home/gba/slurm20/lib/slurm -Wl,--export-dynamic -L/usr/lib64 ../../../src/common/.libs/libdaemonize.a ../../../src/bcast/.libs/libfile_bcast.a -L/usr/lib -lz -llz4 ../common/.libs/libslurmd_reverse_tree_math.a -L../../../src/api/.libs /home/gba/slurm-20.02.6/src/api/.libs/libslurmfull.so -ldl -lnuma -lhwloc -lpam -lpam_misc -lutil -lresolv -pthread -Wl,-rpath -Wl,/home/gba/slurm20/lib/slurm
../common/libslurmd_common.o: In function `xcpuinfo_hwloc_topo_load':
/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:224: undefined reference to `hwloc_topology_set_type_filter'
/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:226: undefined reference to `hwloc_topology_set_type_filter'
/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:228: undefined reference to `hwloc_topology_set_type_filter'
/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:230: undefined reference to `hwloc_topology_set_type_filter'
/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:232: undefined reference to `hwloc_topology_set_type_filter'
../common/libslurmd_common.o:/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:234: more undefined references to `hwloc_topology_set_type_filter' follow
collect2: error: ld returned 1 exit status
make[4]: *** [slurmd] Error 1
make[4]: Leaving directory `/home/gba/slurm-20.02.6/src/slurmd/slurmd'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/home/gba/slurm-20.02.6/src/slurmd'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/gba/slurm-20.02.6/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/gba/slurm-20.02.6'
make: *** [all] Error 2
[root@slurmctld slurm-20.02.6]#

Any ideas?
Thanks,
-Jeff

(In reply to Jeff Avila from comment #11)
> So, in light of your advice, I went back to our original source tarball and
> tried to rebuild the whole thing in order to get slurmrestd/libslurmfull

Yes, in general it is not possible to build a single component of Slurm (but it is possible to exclude components).

> xcpuinfo.c:234: more undefined references to
> `hwloc_topology_set_type_filter' follow

The hwloc devel package needs to be installed.

Here is an example of how to install it from source (and all of Slurm as an example):
https://gitlab.com/nate20/slurm-docker-scaleout/-/blob/master/scaleout/Dockerfile#L46-48

If you do install hwloc from source, make sure to pass this to configure to tell Slurm where it is:
> --with-hwloc=/usr/local/

Hi Nate,
hwloc-devel is indeed installed:

[root@slurmctld ~]# rpm -ql hwloc-devel-1.11.2-1.el7.x86_64
/usr/include/hwloc
/usr/include/hwloc.h
/usr/include/hwloc/autogen
/usr/include/hwloc/autogen/config.h
/usr/include/hwloc/bitmap.h
/usr/include/hwloc/cuda.h
/usr/include/hwloc/cudart.h
/usr/include/hwloc/deprecated.h
/usr/include/hwloc/diff.h
/usr/include/hwloc/gl.h
/usr/include/hwloc/glibc-sched.h
/usr/include/hwloc/helper.h
/usr/include/hwloc/inlines.h
/usr/include/hwloc/intel-mic.h
/usr/include/hwloc/linux-libnuma.h
/usr/include/hwloc/linux.h
/usr/include/hwloc/myriexpress.h
/usr/include/hwloc/nvml.h
/usr/include/hwloc/opencl.h
/usr/include/hwloc/openfabrics-verbs.h
/usr/include/hwloc/plugins.h
/usr/include/hwloc/rename.h
/usr/lib64/libhwloc.so
/usr/lib64/pkgconfig/hwloc.pc

..do I need to pass a specific path to the include dir to the configure script?
Thanks,
-Jeff

On Fri, Mar 19, 2021 at 3:31 PM <bugs@schedmd.com> wrote:
> Comment #12 <https://bugs.schedmd.com/show_bug.cgi?id=11134#c12> on bug 11134 from Nate Rini <nate@schedmd.com>

...just a followup, if I look in the config.log for this build, hwloc is found just fine, i.e.

# grep hwloc config.log
configure:21271: checking for hwloc installation
configure:21305: gcc -o conftest -DNUMA_VERSION1_COMPATIBILITY -g -O2 -std=gnu99 -pthread -I/usr/include conftest.c -L/usr/lib64 -lhwloc -lresolv >&5
x_ac_cv_hwloc_dir=/usr
HWLOC_LIBS='-lhwloc'
#

Please attach your config.log.

Created attachment 18635 [details]
config.log from build
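
Since the thread keeps returning to what configure recorded for hwloc, here is a minimal sketch for pulling only those lines out of a config.log. The function name `show_hwloc_config` is hypothetical (it is not part of Slurm); the variable names it matches (`HWLOC_*`, `x_ac_cv_hwloc_dir`, `HAVE_HWLOC`) are the ones quoted in this bug.

```shell
# Hypothetical helper, not part of Slurm: print the hwloc-related results
# that ./configure recorded in a config.log.
# Usage: show_hwloc_config [path/to/config.log]
show_hwloc_config() {
    hwloc_log="${1:-config.log}"
    if [ ! -r "$hwloc_log" ]; then
        echo "cannot read $hwloc_log" >&2
        return 1
    fi
    # Match assignments such as HWLOC_LIBS='-lhwloc', the cached hwloc
    # directory, and the HAVE_HWLOC preprocessor define.
    grep -E "^(HWLOC_[A-Z]+=|x_ac_cv_hwloc_dir=|#define HAVE_HWLOC )" "$hwloc_log"
}
```

Note that these lines only show which hwloc configure selected; they say nothing about which headers or libraries the compiler and linker actually pick up later in the build.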
(In reply to Jeff Avila from comment #16)
> Created attachment 18635 [details]
> config.log from build

Yes, it looks like it found it correctly:
> HWLOC_CPPFLAGS='-I/usr/include'
> HWLOC_LDFLAGS='-Wl,-rpath -Wl,/usr/lib64 -L/usr/lib64'
> HWLOC_LIBS='-lhwloc'
> #define HAVE_HWLOC 1

Is the new compile working?

No, same error:

../common/libslurmd_common.o: In function `xcpuinfo_hwloc_topo_load':
/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:224: undefined reference to `hwloc_topology_set_type_filter'
/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:226: undefined reference to `hwloc_topology_set_type_filter'
/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:228: undefined reference to `hwloc_topology_set_type_filter'
/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:230: undefined reference to `hwloc_topology_set_type_filter'
/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:232: undefined reference to `hwloc_topology_set_type_filter'
../common/libslurmd_common.o:/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:234: more undefined references to `hwloc_topology_set_type_filter' follow
collect2: error: ld returned 1 exit status
gmake[4]: *** [slurmd] Error 1
gmake[4]: Leaving directory `/home/gba/slurm-20.02.6/src/slurmd/slurmd'
gmake[3]: *** [all-recursive] Error 1
gmake[3]: Leaving directory `/home/gba/slurm-20.02.6/src/slurmd'
gmake[2]: *** [all-recursive] Error 1
gmake[2]: Leaving directory `/home/gba/slurm-20.02.6/src'
gmake[1]: *** [all-recursive] Error 1
gmake[1]: Leaving directory `/home/gba/slurm-20.02.6'
gmake: *** [all] Error 2

On Wed, Mar 24, 2021 at 3:44 PM <bugs@schedmd.com> wrote:
> Comment #17 <https://bugs.schedmd.com/show_bug.cgi?id=11134#c17> on bug 11134 from Nate Rini <nate@schedmd.com>

Looks like hwloc may be too old:
Please call this:
> hwloc-info --version
> hwloc-info
Was EPEL the source for hwloc-devel-1.11.2-1.el7.x86_64 ?
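
The question behind these commands is whether the hwloc tools on the PATH and the hwloc development files come from the same release line. `hwloc_topology_set_type_filter` belongs to the hwloc 2.x API, so a build that compiles against 2.x headers but links a 1.11 library would fail the way the log above does. A minimal sketch of that comparison follows; the helper names `major_of` and `same_hwloc_major` are hypothetical, and the live check assumes both `hwloc-info` and `pkg-config` are installed.

```shell
# Hypothetical helpers: extract the major version from a string such as
# "hwloc-info 2.3.0" or "1.11.2".
major_of() {
    printf '%s\n' "$1" | sed -E 's/^[^0-9]*([0-9]+).*/\1/'
}

# Succeeds when two hwloc version strings share a major version.
same_hwloc_major() {
    [ "$(major_of "$1")" = "$(major_of "$2")" ]
}

# Only run the live comparison when both tools are available.
if command -v hwloc-info >/dev/null 2>&1 && command -v pkg-config >/dev/null 2>&1; then
    runtime_ver="$(hwloc-info --version)"
    devel_ver="$(pkg-config --modversion hwloc 2>/dev/null || true)"
    if [ -n "$devel_ver" ] && ! same_hwloc_major "$runtime_ver" "$devel_ver"; then
        echo "hwloc mismatch: runtime '$runtime_ver' vs devel '$devel_ver'" >&2
    fi
fi
```

`pkg-config` reports whichever `hwloc.pc` it finds first, so the result depends on `PKG_CONFIG_PATH`; treat a mismatch as a hint to investigate rather than as proof.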
Looks like RH7 extras:

[root@slurmctld slurm-20.02.6]# yum info hwloc-devel.x86_64
Loaded plugins: enabled_repos_upload, langpacks, package_upload, product-id, search-disabled-repos, subscription-manager
Installed Packages
Name        : hwloc-devel
Arch        : x86_64
Version     : 1.11.2
Release     : 1.el7
Size        : 470 k
Repo        : installed
Summary     : Headers and shared development libraries for hwloc
URL         : http://www.open-mpi.org/projects/hwloc/
License     : BSD
Description : Headers and shared object symbolic links for the hwloc.

Available Packages
Name        : hwloc-devel
Arch        : x86_64
Version     : 1.11.8
Release     : 4.el7
Size        : 208 k
Repo        : rhel-7-server-optional-rpms/7Server/x86_64
Summary     : Headers and shared development libraries for hwloc
URL         : http://www.open-mpi.org/projects/hwloc/
License     : BSD
Description : Headers and shared object symbolic links for the hwloc.

Uploading Enabled Repositories Report
Loaded plugins: langpacks, product-id
[root@slurmctld slurm-20.02.6]#

[root@slurmctld slurm-20.02.6]# hwloc-info --version
hwloc-info 2.3.0
[root@slurmctld slurm-20.02.6]# hwloc-info
depth 0: 1 Machine (type #0)
depth 1: 8 Package (type #1)
depth 2: 8 L2Cache (type #5)
depth 3: 8 L1dCache (type #4)
depth 4: 8 L1iCache (type #9)
depth 5: 8 Core (type #2)
depth 6: 8 PU (type #3)
Special depth -3: 1 NUMANode (type #13)
Special depth -4: 1 Bridge (type #14)
Special depth -5: 4 PCIDev (type #15)
Special depth -6: 3 OSDev (type #16)
[root@slurmctld slurm-20.02.6]#

On Wed, Mar 24, 2021 at 4:26 PM <bugs@schedmd.com> wrote:
> Comment #19 <https://bugs.schedmd.com/show_bug.cgi?id=11134#c19> on bug 11134 from Nate Rini <nate@schedmd.com>

(In reply to Jeff Avila from comment #20)
> [root@slurmctld slurm-20.02.6]# yum info hwloc-devel.x86_64
> Version : 1.11.2
>
> [root@slurmctld slurm-20.02.6]# hwloc-info --version
> hwloc-info 2.3.0

Looks like there are 2 different hwloc installs (2.3.0 and 1.11.2) that are competing with each other.

Please call this:
> ldd $(which slurmd)

Please make sure it points to the newly compiled slurmd.

Here's the ldd for the latest-successful compilation of slurmd, the one we have in production:

[root@slurmctld slurm-20.02.6]# ldd /usr/local/sbin/slurmd
linux-vdso.so.1 => (0x00007ffc14df5000)
libz.so.1 => /lib64/libz.so.1 (0x00007fcacca28000)
liblz4.so.1 => /lib64/liblz4.so.1 (0x00007fcacc813000)
libslurmfull.so => /usr/local/lib64/libslurmfull.so (0x00007fcacc404000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fcacc200000)
libnuma.so.1 => /lib64/libnuma.so.1 (0x00007fcacbff4000)
libhwloc.so.15 => /usr/local/lib64/libhwloc.so.15 (0x00007fcacbda1000)
libpam.so.0 => /lib64/libpam.so.0 (0x00007fcacbb92000)
libpam_misc.so.0 => /lib64/libpam_misc.so.0 (0x00007fcacb98e000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007fcacb78b000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00007fcacb571000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fcacb355000)
libc.so.6 => /lib64/libc.so.6 (0x00007fcacaf94000)
/lib64/ld-linux-x86-64.so.2 (0x00007fcaccc3e000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fcacad7e000)
libm.so.6 => /lib64/libm.so.6 (0x00007fcacaa7c000)
libaudit.so.1 => /lib64/libaudit.so.1 (0x00007fcaca854000)
libcap-ng.so.0 => /lib64/libcap-ng.so.0 (0x00007fcac

Please call the following:
> nm /usr/local/lib64/libhwloc.so.15
> rpm -q --whatprovides /usr/local/lib64/libhwloc.so.15
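
The `nm` request is a way to see whether a given libhwloc actually defines the symbols the linker complained about. A minimal sketch of that check is below; `defines_symbol` is a hypothetical helper that reads an `nm` listing on stdin, and it assumes plain (unversioned) symbol names as they appear in the link errors quoted in this bug.

```shell
# Hypothetical helper: succeed when the nm listing on stdin defines the
# given symbol. Defined symbols have three fields (address, type, name);
# undefined ones ("U name") have only two and are ignored.
defines_symbol() {
    awk -v s="$1" 'NF == 3 && $3 == s { found = 1 } END { exit !found }'
}

# Example use (paths taken from this thread; adjust to your system):
#   nm -D /usr/lib64/libhwloc.so | defines_symbol hwloc_topology_set_type_filter \
#       || echo "this libhwloc does not export the 2.x filter API"
```

Running it against each libhwloc on the system shows which copy, if any, can satisfy the references from xcpuinfo.c.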
Interesting....

# rpm -q --whatprovides /usr/local/lib64/libhwloc.so.15
file /usr/local/lib64/libhwloc.so.15 is not owned by any package

(nm output is attached)

Created attachment 18651 [details]
nm output
(In reply to Jeff Avila from comment #24)
> Interesting....
>
> # rpm -q --whatprovides /usr/local/lib64/libhwloc.so.15
> file /usr/local/lib64/libhwloc.so.15 is not owned by any package

Please rename that file and try the ldd test from comment#22 again:
> mv /usr/local/lib64/libhwloc.so.15 /usr/local/lib64/.DISABLED.libhwloc.so.15
> ldd /usr/local/sbin/slurmd

Here we go:

[root@slurmctld slurm-20.02.6]# ldd /usr/local/sbin/slurmd
linux-vdso.so.1 => (0x00007ffdc75f8000)
libz.so.1 => /usr/lib64/libz.so.1 (0x00007f5695d80000)
liblz4.so.1 => /usr/lib64/liblz4.so.1 (0x00007f5695b6b000)
libslurmfull.so => /usr/local/lib64/libslurmfull.so (0x00007f569575c000)
libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f5695558000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007f569534c000)
libhwloc.so.15 => not found
libpam.so.0 => /usr/lib64/libpam.so.0 (0x00007f569513d000)
libpam_misc.so.0 => /usr/lib64/libpam_misc.so.0 (0x00007f5694f39000)
libutil.so.1 => /usr/lib64/libutil.so.1 (0x00007f5694d36000)
libresolv.so.2 => /usr/lib64/libresolv.so.2 (0x00007f5694b1c000)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f5694900000)
libc.so.6 => /usr/lib64/libc.so.6 (0x00007f569453f000)
/lib64/ld-linux-x86-64.so.2 (0x00007f5695f96000)
libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f5694329000)
libaudit.so.1 => /usr/lib64/libaudit.so.1 (0x00007f5694101000)
libcap-ng.so.0 => /usr/lib64/libcap-ng.so.0 (0x00007f5693efb000)
[root@slurmctld slurm-20.02.6]#

On Thu, Mar 25, 2021 at 11:43 AM <bugs@schedmd.com> wrote:
> Comment #26 <https://bugs.schedmd.com/show_bug.cgi?id=11134#c26> on bug 11134 from Nate Rini <nate@schedmd.com>

(In reply to Jeff Avila from comment #27)
> libhwloc.so.15 => not found

Please reconfigure and 'make install' Slurm (from source) now and try again.

./configure now fails, have attached the output....

Created attachment 18655 [details]
./configure output
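
The two `ldd` listings earlier in the thread are long, and the interesting entries are easy to miss: libraries resolved from under /usr/local, and libraries reported as "not found". A minimal sketch for filtering `ldd` output follows; the function names `unresolved_libs` and `libs_under` are hypothetical, and both read `ldd` output on stdin.

```shell
# Hypothetical helper: print the names of libraries that ldd could not
# resolve (lines of the form "libfoo.so.N => not found").
unresolved_libs() {
    awk '/=> not found/ { print $1 }'
}

# Hypothetical helper: print "name path" for every library resolved from
# a path that starts with the given prefix.
libs_under() {
    awk -v p="$1" '$2 == "=>" && index($3, p) == 1 { print $1 " " $3 }'
}

# Example use (binary path taken from this thread):
#   ldd /usr/local/sbin/slurmd | unresolved_libs
#   ldd /usr/local/sbin/slurmd | libs_under /usr/local/
```

A library showing up under `libs_under /usr/local/` that no package owns is a likely candidate for a leftover source install, which is the situation `rpm -q --whatprovides` exposed here.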
Please attach the config.log that is generated too.

config.log for latest attempt is attached.

Nate, if you think it would help, I can easily set up a Zoom call so we can work through this interactively and expedite a solution.

Created attachment 18667 [details]
latest config.log
(In reply to Jeff Avila from comment #33)
> Created attachment 18667 [details]
> latest config.log

> /usr/bin/ld: warning: libhwloc.so.15, needed by /usr/local/lib/libpmix.so, not found (try using -rpath or -rpath-link)
> /usr/local/lib/libpmix.so: undefined reference to `hwloc_topology_set_flags'
> /usr/local/lib/libpmix.so: undefined reference to `hwloc_topology_set_xml'
> /usr/local/lib/libpmix.so: undefined reference to `hwloc_topology_export_xmlbuffer'
> /usr/local/lib/libpmix.so: undefined reference to `hwloc_topology_set_xmlbuffer'
> /usr/local/lib/libpmix.so: undefined reference to `hwloc_shmem_topology_write'
> /usr/local/lib/libpmix.so: undefined reference to `hwloc_topology_load'
> /usr/local/lib/libpmix.so: undefined reference to `hwloc_topology_destroy'
> /usr/local/lib/libpmix.so: undefined reference to `hwloc_topology_set_io_types_filter'
> /usr/local/lib/libpmix.so: undefined reference to `hwloc_topology_init'
> /usr/local/lib/libpmix.so: undefined reference to `hwloc_shmem_topology_get_length'
> /usr/local/lib/libpmix.so: undefined reference to `hwloc_free_xmlbuffer'

It is still looking for the wrong one. Let's see if we are missing something:
> ls -la /usr/local/lib64/libhwloc*

[root@slurmctld slurm-20.02.6]# ls -la /usr/local/lib64/libhwloc*
-rwxr-xr-x 1 root root 921 Nov 16 10:27 /usr/local/lib64/libhwloc.la
lrwxrwxrwx 1 root root 18 Nov 16 10:27 /usr/local/lib64/libhwloc.so -> libhwloc.so.15.3.0
-rwxr-xr-x 1 root root 1589480 Nov 16 10:27 /usr/local/lib64/libhwloc.so.15.3.0

On Thu, Mar 25, 2021 at 2:52 PM <bugs@schedmd.com> wrote:
> Comment #34 <https://bugs.schedmd.com/show_bug.cgi?id=11134#c34> on bug 11134 from Nate Rini <nate@schedmd.com>

(In reply to Jeff Avila from comment #35)
> [root@slurmctld slurm-20.02.6]# ls -la /usr/local/lib64/libhwloc*
> -rwxr-xr-x 1 root root 921 Nov 16 10:27 /usr/local/lib64/libhwloc.la
> lrwxrwxrwx 1 root root 18 Nov 16 10:27 /usr/local/lib64/libhwloc.so ->
> libhwloc.so.15.3.0
> -rwxr-xr-x 1 root root 1589480 Nov 16 10:27
> /usr/local/lib64/libhwloc.so.15.3.0

Please move all of those out of the way and recompile again.

Created attachment 18668 [details]
latest config.log output
./configure didn't complete; config.log.latest2 is attached.

Please note that the severity levels are strictly defined here:
> https://www.schedmd.com/support.php

I'm going to change this to SEV4 as this is a question about installing a new feature and not an existing service that is degraded. Please note that increasing the SEV levels will not automatically result in a faster response time.

> Severity 4 - Minor Issues
> A Severity 4 issue is a minor issue with limited or no loss in functionality within the customer environment. Severity 4 issues may also be used for recommendations for future product enhancements or modifications.

Also, I got your email and if your site has consulting time, I would be happy to work with Jess to get a call set up.

(In reply to Jeff Avila from comment #38)
> ./configure didn't complete; config.log.latest2 is attached.

> /usr/bin/ld: warning: libhwloc.so.15, needed by /usr/local/lib/libpmix.so, not found (try using -rpath or -rpath-link)

Let's change the configure command to point to the other hwloc:
> ./configure --prefix=/home/gba/slurm20
to
> export PKG_CONFIG_PATH=/usr/lib64/pkgconfig/:$PKG_CONFIG_PATH
> ./configure --prefix=/home/gba/slurm20

I also suggest re-installing the hwloc-devel-1.11.2-1.el7.x86_64 rpm before attempting a recompile.

Hi Nate,
I'm ready to have a zoom call at your earliest convenience.
I've set the pkg_config_path variable:

[root@slurmctld slurm-20.02.6]# echo $PKG_CONFIG_PATH
/usr/lib64/pkgconfig/:
[root@slurmctld slurm-20.02.6]#

...but the configure fails at the same place as before. I'll upload the config.log presently.
Thanks!
-Jeff

On Fri, Mar 26, 2021 at 11:49 AM <bugs@schedmd.com> wrote:
> Comment #39 <https://bugs.schedmd.com/show_bug.cgi?id=11134#c39> on bug 11134 from Nate Rini <nate@schedmd.com>

Created attachment 18695 [details]
config.log build attempt
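
The `echo $PKG_CONFIG_PATH` output earlier in the thread ends in a colon because the variable was empty before the export. That is probably harmless for pkg-config, but in other colon-separated search paths (`PATH`, `LD_LIBRARY_PATH`) an empty element is treated as the current directory, so it can be worth avoiding out of habit. A minimal sketch follows; `prepend_path` is a hypothetical helper, not something shipped with Slurm or pkg-config.

```shell
# Hypothetical helper: prepend a directory to a colon-separated search
# path without leaving an empty element when the existing value is empty.
# Usage: prepend_path DIR CURRENT_VALUE
prepend_path() {
    pp_dir="${1%/}"       # drop a single trailing slash for consistency
    pp_current="$2"
    if [ -z "$pp_current" ]; then
        printf '%s\n' "$pp_dir"
    else
        printf '%s\n' "$pp_dir:$pp_current"
    fi
}

# Example use (directory taken from this thread):
#   export PKG_CONFIG_PATH="$(prepend_path /usr/lib64/pkgconfig "${PKG_CONFIG_PATH:-}")"
#   pkg-config --modversion hwloc   # shows which hwloc.pc is now found first
```

`pkg-config --modversion hwloc` is a quick way to confirm the change took effect before rerunning configure.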
(In reply to Nate Rini from comment #39)
> I also suggest re-installing the hwloc-devel-1.11.2-1.el7.x86_64 rpm before
> attempting a recompile.

Was this done?

Yes, I did.

On Fri, Mar 26, 2021 at 12:31 PM <bugs@schedmd.com> wrote:
> Comment #42 <https://bugs.schedmd.com/show_bug.cgi?id=11134#c42> on bug 11134 from Nate Rini <nate@schedmd.com>

(In reply to Jeff Avila from comment #43)
> Yes, I did.

Looks like hwloc is now correct but pmix (install) is now the issue:
> HWLOC_CPPFLAGS='-I/usr/include'
> HWLOC_LDFLAGS='-Wl,-rpath -Wl,/usr/lib64 -L/usr/lib64'
> HWLOC_LIBS='-lhwloc'

> /usr/bin/ld: warning: libhwloc.so.15, needed by /usr/local/lib64/libpmix.so, not found (try using -rpath or -rpath-link)

You will now need to recompile pmix against the correct hwloc.

Neither pmix-3.1.5 nor pmix-3.2.1 build successfully; both configure properly and then the build ends the same way: make[2]: Entering directory `/home/gba/pmix-3.1.5/src/tools/pevent' CC pevent.o CCLD pevent ../../../src/.libs/libpmix.so: undefined reference to `hwloc_shmem_topology_write' ../../../src/.libs/libpmix.so: undefined reference to `hwloc_shmem_topology_get_length' ../../../src/.libs/libpmix.so: undefined reference to `hwloc_topology_set_io_types_filter' collect2: error: ld returned 1 exit status make[2]: *** [pevent] Error 1 make[2]: Leaving directory `/home/gba/pmix-3.1.5/src/tools/pevent' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/home/gba/pmix-3.1.5/src' make: *** [all-recursive] Error 1 [root@slurmctld pmix-3.1.5]# On Fri, Mar 26, 2021 at 12:38 PM <bugs@schedmd.com> wrote: > *Comment # 45 <https://bugs.schedmd.com/show_bug.cgi?id=11134#c45> on bug > 11134 <https://bugs.schedmd.com/show_bug.cgi?id=11134> from Nate Rini > <nate@schedmd.com> * > > (In reply to Jeff Avila from comment #43 <https://bugs.schedmd.com/show_bug.cgi?id=11134#c43>)> Yes, I did. > > Looks like hwloc is now correct but pmix (install) is now the issue:> HWLOC_CPPFLAGS='-I/usr/include' > > HWLOC_LDFLAGS='-Wl,-rpath -Wl,/usr/lib64 -L/usr/lib64' > > HWLOC_LIBS='-lhwloc' > >> /usr/bin/ld: warning: libhwloc.so.15, needed by /usr/local/lib64/libpmix.so, not found (try using -rpath or -rpath-link) > You will now need to recompile pmix now against the correct hwloc. > > ------------------------------ > You are receiving this mail because: > > - You reported the bug. > > More info: After neither pmix-3.2.1 nor pmix-3.1.5 built from the tarball I had; I followed the instructions at https://slurm.schedmd.com/mpi_guide.html#pmix; and built pmix-2.1 from the git repo. This version of pmix built! 
Unfortunately, slurm-20.02.6 still fails to build using the following configure command line:

[root@slurmctld slurm-20.02.6]# ./configure --prefix=/home/gba/slurm20 --with-pmix=/home/user/gba/pmix/install/2.1

This configures properly, and some time after running "make", I get the following:

../common/libslurmd_common.o: In function `xcpuinfo_hwloc_topo_load':
/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:224: undefined reference to `hwloc_topology_set_type_filter'
/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:226: undefined reference to `hwloc_topology_set_type_filter'
/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:228: undefined reference to `hwloc_topology_set_type_filter'
/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:230: undefined reference to `hwloc_topology_set_type_filter'
/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:232: undefined reference to `hwloc_topology_set_type_filter'
../common/libslurmd_common.o:/home/gba/slurm-20.02.6/src/slurmd/common/xcpuinfo.c:234: more undefined references to `hwloc_topology_set_type_filter' follow
collect2: error: ld returned 1 exit status
make[4]: *** [slurmd] Error 1
make[4]: Leaving directory `/home/gba/slurm-20.02.6/src/slurmd/slurmd'
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory `/home/gba/slurm-20.02.6/src/slurmd'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/gba/slurm-20.02.6/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/gba/slurm-20.02.6'
make: *** [all] Error 2

Ok, after building hwloc-2.3 from source, pointing the slurm config at both that and the new pmix-2.1, I got slurm to build.
At this point, I can run slurmrestd out of the new build directory like so:

[root@slurmctld sbin]# ./slurmrestd -vvv -f /usr/local/etc/slurm.conf localhost:10071
slurmrestd: debug2: _establish_config_source: using config_file=/usr/local/etc/slurm.conf (provided)
slurmrestd: debug: slurm_conf_init: using config_file=/usr/local/etc/slurm.conf
slurmrestd: debug: Reading slurm.conf file: /usr/local/etc/slurm.conf
slurmrestd: debug: Ignoring obsolete CacheGroups option.
slurmrestd: debug: Ignoring obsolete SchedulerPort option.
slurmrestd: debug: Interactive mode activated (TTY detected on STDIN)
slurmrestd: debug: main: server listen mode activated
slurmrestd: debug: Munge authentication plugin loaded
slurmrestd: debug: parse_http: [localhost:39284] Accepted HTTP connection
slurmrestd: error: parse_http: [localhost:39284] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
slurmrestd: error: _wrap_on_data: [localhost:39284] on_data returned rc: Unexpected message received
slurmrestd: debug: parse_http: [localhost:39838] Accepted HTTP connection
slurmrestd: error: parse_http: [localhost:39838] unexpected HTTP error HPE_INVALID_METHOD: invalid HTTP method
slurmrestd: error: _wrap_on_data: [localhost:39838] on_data returned rc: Unexpected message received

...I've tried poking at it with curl, but I am not a web developer, so I'm at a loss to see how to check functionality here...any ideas?

Thanks,
-Jeff

(In reply to Jeff Avila from comment #48)
> Ok, after building hwloc-2.3 from source, pointing the slurm config at both
> that and the new pmix-2.1, I got slurm to build. At this point, I can run
> slurmrestd out of the new build directory like so:

Great, I was just about to send the meeting invite as I just finished another meeting.

> ...I've tried poking at it with curl, but I am not a web developer, so I'm
> at a loss to see how to check functionality here...any ideas?
Please take a look at this presentation for examples:
> https://slurm.schedmd.com/SLUG20/REST_API.pdf

Note that if you're using munge authentication for slurmrestd, you will need to use a UNIX socket (denoted with unix:) instead of a TCP socket.

Thanks Nate,

So, if we're using a unix domain socket, how do we connect web clients to slurmrestd? Is there some way to do that via inetd?

-Jeff

(In reply to Jeff Avila from comment #50)
> So, if we're using a unix domain socket, how do we connect web clients to
> slurmrestd?

You can't with local authentication. The Linux kernel only supports authenticating UNIX socket peers directly, so another authentication method will be required.

> Is there some way to do that via inetd?
You will need to activate "JSON Web Token (JWT) Authentication":
> https://slurm.schedmd.com/rest.html
> https://slurm.schedmd.com/jwt.html

Please follow the docs above and comment if we missed anything.

Ok,

I've rebuilt slurmrestd with YAML and JWT support according to the instructions. I can run an ldd on the executable:

[root@slurmctld sbin]# ldd slurmrestd
linux-vdso.so.1 => (0x00007ffe0c7a9000)
libslurmfull.so => /home/gba/slurm20/lib/slurm/libslurmfull.so (0x00007fdb62ade000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fdb628da000)
libhttp_parser.so.2 => /lib64/libhttp_parser.so.2 (0x00007fdb626d2000)
libyaml-0.so.2 => /lib64/libyaml-0.so.2 (0x00007fdb624b2000)
libjson-c.so.2 => /lib64/libjson-c.so.2 (0x00007fdb622a7000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00007fdb6208d000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fdb61e71000)
libc.so.6 => /lib64/libc.so.6 (0x00007fdb61ab0000)
/lib64/ld-linux-x86-64.so.2 (0x00007fdb62ed7000)
[root@slurmctld sbin]#

I don't see a jwt library there; is that being loaded dynamically at startup? How do I verify JWT support is working?

According to the instructions at https://slurm.schedmd.com/jwt.html we need to put a system-wide key in the state-save space, have it owned by the slurm user, and then manually create tokens for each user.

How do we get the tokens to the users for their user-agents? Is this communicated to the users out-of-band? I guess I'm not sure how this JWT method is supposed to work...

(In reply to Jeff Avila from comment #52)
> I've rebuilt slurmrestd with YAML and JWT support according to the
> instructions. I can run an ldd on the executable:
> I don't see a jwt library there,is that being loaded dynamically at
> startup? How do I verify JWT support is working?

ldd will not find it, as libjwt is loaded at runtime.

Try this instead:
> pgrep slurmrestd | xargs -i grep -i jwt /proc/{}/maps

> How do we get the tokens to the users for their user-agents?
> Is this communicated to the users out-of-band? I guess I'm not sure
> how this JWT method is supposed to work...

Users can call 'scontrol token' directly if they have access to the cluster. If users do not have direct access, a separate mechanism outside of Slurm will be required.

Please note that an authenticating proxy is also an option (on top of auth/JWT) that lets a site use its existing single sign-on system, so users do not need to be given a JWT out of band.

A (trivial) example is provided here:
> https://gitlab.com/SchedMD/training/docker-scale-out/-/tree/master/proxy

Sites can also generate JWTs directly, as they are based on RFC7519. We provide an example here, which as of the next release will be on the normal documentation page for JWT:
> https://github.com/SchedMD/slurm/commit/c9e5ed775c2b5c1428f51844583fe77bd7aae3e7

This does not seem to be loading; does it need a separate cli flag?
[root@slurmctld sbin]# ps -aef | grep slurmrestd
root 2570 17994 0 14:50 pts/11 00:00:00 ./slurmrestd -f /usr/local/etc/slurm.conf localhost:10011
root 4912 17994 0 14:53 pts/11 00:00:00 grep --color=auto slurmrestd
[root@slurmctld sbin]# cat /proc/2570/maps | grep jwt
[root@slurmctld sbin]#

(In reply to Jeff Avila from comment #54)
> This does not seem to be loading; does it need a separate cli flag?

Call this:
> scontrol show config

Try this:
> slurmrestd -a jwt

I should add that I haven't made the changes in slurm.conf and restarted slurmctld yet; if that makes a difference...here's the output:

[root@slurmctld sbin]# scontrol show config
Configuration data as of 2021-03-29T14:57:24
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos
AccountingStorageHost = slurmctld
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = No
AuthAltTypes = (null)
AuthInfo = (null)
AuthType = auth/munge
BatchStartTimeout = 120 sec
BOOT_TIME = 2021-03-29T11:37:02
BurstBufferType = (null)
CliFilterPlugins = (null)
ClusterName = slurmctld
CommunicationParameters = (null)
CompleteWait = 0 sec
CoreSpecPlugin = core_spec/none
CpuFreqDef = Unknown
CpuFreqGovernors = Performance,OnDemand,UserSpace
CredType = cred/munge
DebugFlags = NO_CONF_HASH
DefMemPerCPU = 2800
DependencyParameters = (null)
DisableRootJobs = No
EioTimeout = 60
EnforcePartLimits = NO
Epilog = /usr/local/etc/slurm/epilog
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu
GpuFreqDef = high,memory=high
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 0 sec
JobAcctGatherFrequency = 30
JobAcctGatherType = jobacct_gather/linux
JobAcctGatherParams = (null)
JobCompHost = slurmctld
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults = (null)
JobFileAppend = 0
JobRequeue = 0
JobSubmitPlugins = (null)
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 0
KillWait = 60 sec
LaunchParameters = (null)
LaunchType = launch/slurm
Layouts =
Licenses = (null)
LogTimeFormat = iso8601_ms
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 10001
MaxDBDMsgs = 132842
MaxJobCount = 65535
MaxJobId = 67043328
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MessageTimeout = 60 sec
MinJobAge = 10 sec
MpiDefault = none
MpiParams = (null)
MsgAggregationParams = (null)
NEXT_JOB_ID = 948917
NodeFeaturesPlugins = (null)
OverTimeLimit = 0 min
PluginDir = /usr/local/lib/slurm
PlugStackConfig = (null)
PowerParameters = (null)
PowerPlugin =
PreemptMode = OFF
PreemptType = preempt/none
PreemptExemptTime = 00:00:00
PrEpParameters = (null)
PrEpPlugins = prep/script
PriorityParameters = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife = 7-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 3-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 1
PriorityWeightAssoc = 0
PriorityWeightFairShare = 8000
PriorityWeightJobSize = 1
PriorityWeightPartition = 1000
PriorityWeightQOS = 40000
PriorityWeightTRES = (null)
PrivateData = none
ProctrackType = proctrack/cgroup
Prolog = /usr/local/etc/slurm/prolog
PrologEpilogTimeout = 65534
PrologSlurmctld = /usr/local/etc/slurm/controller_prolog
PrologFlags = Alloc,Contain,X11
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = /sbin/reboot
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeFailProgram = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 2
RoutePlugin = route/default
SallocDefaultCommand = (null)
SbcastParameters = (null)
SchedulerParameters = defer,bf_max_job_assoc=10,bf_max_job_test=100,bf_continue,max_array_tasks=10001,sched_min_interval=1000,bf_interval=120,bf_max_job_array_resv=2
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY,CR_ONE_TASK_PER_CORE
SlurmUser = slurm(508)
SlurmctldAddr = (null)
SlurmctldDebug = error
SlurmctldHost[0] = slurmctld
SlurmctldLogFile = /var/log/slurm/slurmctld
SlurmctldPort = 6810-6817
SlurmctldSyslogDebug = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg = (null)
SlurmctldTimeout = 420 sec
SlurmctldParameters = (null)
SlurmdDebug = error
SlurmdLogFile = /var/log/slurm/slurmd
SlurmdParameters = (null)
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdSyslogDebug = unknown
SlurmdTimeout = 600 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /usr/local/etc/slurm.conf
SLURM_VERSION = 20.02.6
SrunEpilog = (null)
SrunPortRange = 0-0
SrunProlog = (null)
StateSaveLocation = /var/spool/slurmctld
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/cgroup
TaskPluginParam = (null type)
TaskProlog = /usr/local/etc/slurm/task_prolog
TCPTimeout = 2 sec
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/tree
TrackWCKey = No
TreeWidth = 50
UsePam = No
UnkillableStepProgram = (null)
UnkillableStepTimeout = 300 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
X11Parameters = (null)

Cgroup Support Configuration:
AllowedDevicesFile = /usr/local/etc/cgroup_allowed_devices_file.conf
AllowedKmemSpace = (null)
AllowedRAMSpace = 100.0%
AllowedSwapSpace = 0.0%
CgroupAutomount = yes
CgroupMountpoint = /sys/fs/cgroup
ConstrainCores = yes
ConstrainDevices = yes
ConstrainKmemSpace = no
ConstrainRAMSpace = yes
ConstrainSwapSpace = no
MaxKmemPercent = 100.0%
MaxRAMPercent = 100.0%
MaxSwapPercent = 100.0%
MemorySwappiness = (null)
MinKmemSpace = 30 MB
MinRAMSpace = 30 MB
TaskAffinity = no

Slurmctld(primary) at slurmctld is UP
[root@slurmctld sbin]# ./slurmrestd -a jwt
Usage: slurmrestd [OPTIONS] [host:port]...
  -f file Use specified file for slurmctld configuration
  -h Print this help message.
  -t <thread count> Number of threads to use for processing.
  -u <user> setuid() to user after opening sockets.
  -v Verbose mode. Multiple -v's increase verbosity.
  -V Print version information and exit.
[root@slurmctld sbin]#

(In reply to Jeff Avila from comment #56)
> I should add that I haven't made the changes in slurm.conf and restarted
> slurmctld yet; if that makes a difference...here's the output:

Yes, it does. Note that JWT is being added as a secondary auth and does not replace munge.

> [root@slurmctld sbin]# ./slurmrestd -a jwt

The argument to require JWT auth is '-a jwt', but the previous arguments are still needed to get it running.

I'm still confused here.

I have been trying to build slurmrestd properly. Our current slurmctld, the one currently controlling our cluster, isn't the same binary as the slurmctld that I just built in the process of building a slurmrestd with JWT and YAML support.

Do we have to use the newly-built slurmctld binary in concert with the new slurmrestd, or can we continue to use our old slurmctld binary in concert with the new slurmrestd?

Likewise, is it necessary to pass the "-a jwt" string on the cli to slurmrestd, before giving it the ipaddr:port to bind to?

Thanks!

(In reply to Jeff Avila from comment #58)
> I have been trying to build slurmrestd properly.
> Our current slurmctld, the
> one currently controlling our cluster, isn't the same binary as the
> slurmctld that I just built in the process of building a slurmrestd with
> JWT and YAML support.

Does the current slurmctld have libjwt compiled in? Please attach the current slurm.conf.

> Do we have to use the newly-built slurmctld binary in concert with the new
> slurmrestd, or can we continue to use our old slurmctld binary in concert
> with the new slurmrestd?

I would suggest using the new binary, but if your current binary is at the same major version and has the JWT auth plugin compiled, then it should work.

> Likewise, is it necessary to pass the "-a jwt" string on the cli to
> slurmrestd, before giving it the ipaddr:port to bind to?

It is suggested if you're going to set it up as an HTTP server. Please remember, we do not suggest that slurmrestd be directly exposed to the internet.

Created attachment 18723 [details]
slurm.conf file
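The question of whether a running daemon has libjwt loaded can be answered with the same /proc maps check used earlier in the thread. A minimal sketch follows; the helper name `lib_is_mapped` is hypothetical, and it takes the maps file as an argument so it can be tried on a saved copy.

```shell
#!/bin/sh
# Succeed if a memory-maps listing mentions a shared object matching a pattern.
# $1: pattern to look for (case-insensitive), e.g. jwt
# $2: path to a maps file, e.g. /proc/<pid>/maps
lib_is_mapped() {
    grep -qi -- "$1" "$2"
}

# Possible use against a running daemon (pgrep -o selects the oldest match):
#   pid=$(pgrep -o slurmctld) && lib_is_mapped jwt "/proc/$pid/maps" && echo "jwt loaded"
```

No match only shows that nothing JWT-related is loaded right now; since the library is loaded at runtime, a daemon that was built with JWT support but started without JWT configured would presumably also show no match.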
I've attached our slurm.conf to the ticket.

Looking at our existing, running slurmctld binary, it doesn't seem to have any mention of a jwt library loaded, but again, I haven't made that modification to slurm.conf...

[root@slurmctld ~]# ps -aef | grep slurmctld
root 8776 17994 0 15:43 pts/11 00:00:00 grep --color=auto slurmctld
root 15286 1 0 Jan05 ? 00:01:39 tail -f slurmctld
slurm 29344 1 1 11:37 ? 00:03:35 /usr/local/sbin/slurmctld
[root@slurmctld ~]# cat /proc/29344/maps | grep jwt
[root@slurmctld ~]#

(In reply to Jeff Avila from comment #61)
> I've attached our slurm.conf to the ticket.
> Looking at our existing, running slurmctld binary, it doesn't seem to have
> any mention of a jwt library loaded, but again, I haven't made that
> modification to slurm.conf...

Please apply the instructions here:
> https://slurm.schedmd.com/jwt.html

Hi Nate,

Adding AuthAltTypes=auth/jwt to our current slurm.conf causes slurmctld to not restart successfully. The error is "fatal: failed to initialize authentication plugin".

I suppose this means that we have to rebuild *everything* and put the newly-configured slurmrestd into production along with the corresponding slurmctld, slurmd, slurmdbd etc.

Can you see any way around that?

Thanks,
-Jeff

(In reply to Jeff Avila from comment #63)
> Adding AuthAltTypes=auth/jwt to our current slurm.conf causes slurmctld to
> not restart successfully. The error is " fatal: failed to initialize
> authentication plugin".

Call 'slurmctld -Dvvvvvv' and post the log.

> I suppose this means that we have to rebuild *everything* and put the
> newly-configured slurmrestd into production along with the corresponding
> slurmctld, slurmd, slurmdbd etc.
>
> Can you see any way around that?

Probably not. I assume slurmctld was compiled for one of the previously mentioned RPMs?
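One possible cause of "failed to initialize authentication plugin" is that the installed build has no auth/jwt plugin file at all. A minimal sketch of a pre-check before setting AuthAltTypes follows. The helper name `plugin_installed` is hypothetical; the directory comes from the PluginDir value in the 'scontrol show config' output above, and the `<type>_<name>.so` naming (auth/jwt becoming auth_jwt.so) is an assumption about how Slurm lays out its plugin files.

```shell
#!/bin/sh
# Succeed if the plugin file for a "<type>/<name>" spec exists in a plugin directory.
# $1: plugin directory (the PluginDir setting)
# $2: plugin spec as written in slurm.conf, e.g. auth/jwt
plugin_installed() {
    # Assumed naming convention: auth/jwt -> auth_jwt.so
    file=$(printf '%s' "$2" | tr '/' '_').so
    [ -f "$1/$file" ]
}

# Possible use before editing slurm.conf:
#   plugin_installed /usr/local/lib/slurm auth/jwt || echo "auth/jwt plugin missing"
```

If the file is missing, the daemons in that install prefix were most likely built without libjwt available, which would be consistent with needing the rebuild discussed above.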
Er...is 'slurmctld -Dvvvvvv' safe to execute on the same host that's already running our production slurmctld?

(In reply to Jeff Avila from comment #65)
> safe to execute on the same host that's already running our production
> slurmctld?

Generally no. On startup it will request that the currently running slurmctld shut down, and then it will take over (unless it errors).

Jeff,

I'm going to time this ticket out while we wait for an outage window. Please reply and the ticket will automatically re-open.

Thanks,
--Nate
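For sites that generate tokens themselves, as mentioned earlier in the thread, an HS256 JWT per RFC 7519 can be assembled with openssl alone. This is a minimal sketch, not the example from the linked commit: the helper names `b64url` and `make_jwt` are hypothetical, the key file is whatever HS256 key the site configured for Slurm, and the claim names in the usage comment (`exp`, `iat`, `sun`) are an assumption to check against the JWT documentation.

```shell
#!/bin/sh
# base64url-encode stdin without padding, as used by compact JWTs.
b64url() {
    openssl base64 -A | tr '+/' '-_' | tr -d '='
}

# make_jwt <key-file> <payload-json>
# Print a compact JWT signed with HMAC-SHA256 using the raw bytes of <key-file>.
make_jwt() {
    # Hex-encode the key so that binary keys survive the command line.
    key_hex=$(od -An -v -tx1 "$1" | tr -d ' \n')
    header=$(printf '%s' '{"alg":"HS256","typ":"JWT"}' | b64url)
    payload=$(printf '%s' "$2" | b64url)
    # The signature covers "<header>.<payload>" exactly as transmitted.
    sig=$(printf '%s' "$header.$payload" |
          openssl dgst -sha256 -mac HMAC -macopt "hexkey:$key_hex" -binary | b64url)
    printf '%s.%s.%s\n' "$header" "$payload" "$sig"
}

# Possible use (key path and user name are placeholders):
#   now=$(date +%s)
#   make_jwt /path/to/jwt_hs256.key "{\"exp\":$((now + 3600)),\"iat\":$now,\"sun\":\"someuser\"}"
```

Anyone who can read the key file can mint a token for any user, so a script like this belongs on a trusted host or behind the authenticating proxy mentioned earlier, not on machines that users can log in to.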