| Summary: | slurmctld crash on "scontrol reconfigure" | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Felix Abecassis <fabecassis> |
| Component: | slurmctld | Assignee: | Carlos Tripiana Montes <tripiana> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | jbernauer, lyeager, taras.shapovalov |
| Version: | 22.05.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=14401 | ||
| Site: | NVIDIA (PSLA) | Slinky Site: | --- |
Description
Felix Abecassis
2022-06-08 11:43:14 MDT
Hi Felix,

> I was also wondering if dlopening libpmix from the slurmctld context is truly needed?

As part of the great improvements done in Bug 9395, which landed in 22.05, ctld now needs to load the MPI plugins at startup and on reconfigure. This is why you haven't seen this crash happening in the ctld until now.

I've taken a look at this and realized we can't do anything on our side to work around it. I've even gone down to the end of the road, glibc-2.35/stdlib/getenv.c line 84 in the sources of the Ubuntu package. I can't really see which pointer is corrupted or why it is corrupted, but it seems to be either __environ or one of the indexes. Whatever it is, I can't find an open bug for this on the Internet. I'm a bit puzzled, but I'm not able to extract more knowledge from this.

If you are OK with it, I can close the bug as info given. I'm afraid this is just too deep into the OS to be worked around in a better way.

Regards,
Carlos.

Hi Carlos,

> As part of the great improvements done in Bug 9395 landed in 22.05, ctld now needs to load the MPI plugins at startup, or if reconfigured.

I don't have access to Bug 9395, so I don't know what the context is (but it's not too important). Perhaps you can link the git commits instead?

> If you are OK I can close the bug as info given, I'm afraid this is just too deep into the OS to be workarounded in a better way.

Yes, I think that's fair, this was a long shot anyway. But I'm curious: were you able to reproduce the issue on your side, or did you investigate just with the stack trace?

Thanks

> I don't have access to Bug 9395, so I don't know what the context is (but it's
> not too important). Perhaps you can link the git commits instead?

For bug#9395 here are the relevant commits.
> https://github.com/SchedMD/slurm/commit/c67c071ffa994b0c1ebadccf637b804f51c753eb
> https://github.com/SchedMD/slurm/commit/442576f78bcc0ec91c482cfa2106a758b750af8d

These were in preparation for those changes.
> https://github.com/SchedMD/slurm/commit/92cc7a296db60ec6231ca9a01dbe19ffae5c5945
> https://github.com/SchedMD/slurm/commit/9efbf3b008ba4e82550679355f5fc92c01c14917

Carlos will reply to your other questions.

I am running an up-to-date Ubuntu 22.04, plus Slurm 22.05.0. My compilation is as follows:
./configure --prefix=/home/tripi/slurm/22.05.0_14276/inst --disable-optimizations --enable-debug --enable-memory-leak-debug --enable-developer --enable-multiple-slurmd --with-hwloc --with-ucx --with-pmix --with-munge --with-hdf5 --with-pam_dir=/home/tripi/slurm/22.05.0_14276/inst/lib/security

The PMIx in use is the same as yours, from system packages. Same for UCX and the rest.

Firing this crazy loop:
while [ 1 ]; do scontrol reconfigure; done
does not make my slurmctld crash.

Whatever is happening to you in this GDB stack trace has been traced down into glibc. I'm not sure, but it looks like a corrupt pointer in __environ is messing up the getenv function. How glibc got that corruption is a mystery to me, because I can't reproduce it using the same Ubuntu as you.

I'm sorry that I can't make it fail, but maybe just an "apt-get upgrade" will fix your issue? I hope so.

Regarding Bug 9395, Jason missed some commits, but that's not really important. The summary is: https://slurm.schedmd.com/mpi.conf.html. We now have this file (it supports configless), and it can be used to tune the config for the specific underlying PMI. For now, only PMIx can be tuned.

Regards,
Carlos.

Could you send the raw output of:
sudo gdb -batch -ex run -ex "thread apply all bt full" --args /usr/local/sbin/slurmctld -D -i -v

I've discovered this [1], so I want to be sure it is not causing problems for Slurm.
[1] https://github.com/xianyi/OpenBLAS/issues/716#issuecomment-164334498

Running this command did not reveal anything new, but you did send me in the right direction to finish investigating this bug. So thanks!
Running the application under gdb and setting a breakpoint in "getenv", I saw that "ZES_ENABLE_SYSMAN" was set during the first invocation of getenv("GNUTLS_NO_IMPLICIT_INIT") (no crash), but was not set during the second invocation of getenv("GNUTLS_NO_IMPLICIT_INIT").
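The same check can be done from inside a process rather than through a gdb breakpoint. The following is only a minimal sketch, not code from Slurm or hwloc: `env_index` and `env_report` are hypothetical helper names, and the sketch assumes a POSIX system that exposes the `environ` global.

```c
#include <stdio.h>
#include <string.h>

/* POSIX exposes the process environment through this global. */
extern char **environ;

/* Return the index of NAME in environ, or -1 if it is not set. */
int env_index(const char *name)
{
    size_t len = strlen(name);

    if (environ == NULL)
        return -1;
    for (int i = 0; environ[i] != NULL; i++) {
        /* An entry matches when it starts with "NAME=". */
        if (strncmp(environ[i], name, len) == 0 && environ[i][len] == '=')
            return i;
    }
    return -1;
}

/* Print where the entry's string lives, e.g. to compare the address
 * with the mappings listed in /proc/<pid>/maps. */
void env_report(const char *name)
{
    int i = env_index(name);

    if (i < 0)
        printf("%s: not set\n", name);
    else
        printf("%s: slot %d, string at %p\n", name, i, (void *)environ[i]);
}
```

Calling `env_report("ZES_ENABLE_SYSMAN")` while the variable is still visible shows which mapping owns the string. Note that the walk dereferences every entry, so it can itself crash once an entry points into unmapped memory.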
At first I suspected the OneAPI plugin in Slurm:
https://github.com/SchedMD/slurm/blob/e54b6d224c7873ba38a0fcfd2b41bbba0eaeb58b/src/plugins/gpu/oneapi/gpu_oneapi.c#L967
But commenting out this line did not solve the problem, and then I also realized this plugin was not even active on my setup.
However: this pattern might still be dangerous given that, as you pointed out, this call to setenv() is unsafe. But that's not the problem I was facing.
Digging further, I noticed that hwloc is also setting this environment variable unconditionally, and the hwloc version on Ubuntu 22.04 (2.7.0) is using putenv():
https://github.com/open-mpi/hwloc/blob/hwloc-2.7.0/hwloc/topology.c#L85
And from man putenv:
> The string pointed to by string becomes part of the environment, so altering the string changes the environment.
Hence the corruption after a dlclose(): the environment now references a string from an unloaded library.
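The difference between putenv() and setenv() can be seen without any shared library. This is a minimal sketch, assuming glibc-like behaviour as described in the man page quoted above; the function names and the DEMO_* variable names are made up for illustration.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>

/* Returns 1 when editing the buffer passed to putenv() changes what
 * getenv() reports, i.e. the environment aliases the caller's memory. */
int putenv_aliases_buffer(void)
{
    /* static: the buffer must outlive this call, because putenv()
     * stores the pointer itself and makes no copy. */
    static char entry[] = "DEMO_PUTENV=first";

    if (putenv(entry) != 0)
        return -1;
    /* Overwrite the value part of the buffer in place. */
    memcpy(entry + strlen("DEMO_PUTENV="), "other", 5);

    const char *seen = getenv("DEMO_PUTENV");
    return seen != NULL && strcmp(seen, "other") == 0;
}

/* Returns 1 when setenv() kept its own copy, so that editing the
 * caller's buffer afterwards has no effect on the environment. */
int setenv_copies_value(void)
{
    char value[] = "first";

    if (setenv("DEMO_SETENV", value, 1) != 0)
        return -1;
    memcpy(value, "other", 5);

    const char *seen = getenv("DEMO_SETENV");
    return seen != NULL && strcmp(seen, "first") == 0;
}
```

With putenv(), the string passed in must stay valid for as long as the variable is set. A string literal inside a shared library stops being valid as soon as the library is unmapped.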
The following code is a simpler repro to generate the segfault:
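#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
int main(void)
{
    /* Loading hwloc 2.7.0 is enough to trigger its putenv() call. */
    void *lib = dlopen("libhwloc.so.15", RTLD_NOW);
    printf("dlopen: %p\n", lib);
    if (lib == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    printf("getenv: %p\n", (void *)getenv("GNUTLS_NO_IMPLICIT_INIT"));
    /* After dlclose(), the environment still points into the unloaded library. */
    printf("dlclose: %d\n", dlclose(lib));
    /* This lookup walks the stale entry and segfaults. */
    printf("getenv: %p\n", (void *)getenv("GNUTLS_NO_IMPLICIT_INIT"));
    return 0;
}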
Now that I knew where to look (hwloc), I noticed that they were aware of the problem with putenv:
https://github.com/open-mpi/hwloc/pull/514
And another user reported a similar problem too with dlopen/dlclose:
https://github.com/open-mpi/hwloc/issues/533
So the issue will be solved when Ubuntu upgrades hwloc to 2.7.1, so that's good news.
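For illustration, here is a sketch of a copy-based way for a library to provide a default value. It is not taken from hwloc and is not a description of what the hwloc fix does; `set_default_from_temporary` is a hypothetical helper, and the heap buffer merely stands in for memory that disappears later, such as a string inside an unloaded library.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>

/* Set NAME to VALUE only if NAME is not already set.
 * Returns 0 on success, -1 on failure. */
int set_default_from_temporary(const char *name, const char *value)
{
    /* Temporary buffer that will not survive this function. */
    char *tmp = strdup(value);

    if (tmp == NULL)
        return -1;
    /* overwrite = 0 keeps any value the user already exported;
     * setenv() copies the string into storage owned by libc. */
    int rc = setenv(name, tmp, 0);
    /* Freeing is safe here: the environment does not reference tmp. */
    free(tmp);
    return rc;
}
```

Because setenv() owns its copy, a later getenv() does not depend on the lifetime of the caller's buffer, and a value exported before the process starts is left untouched.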
Setting as RESOLVED, unfortunately there is no status for INFOGIVENTOMYSELF :)

BTW I noticed there is already an Ubuntu request to upgrade libhwloc for this bug, so I pinged it: https://bugs.launchpad.net/ubuntu/+source/hwloc/+bug/1968742?comments=all

I'm glad this was your issue. I was aware of that lore, and I wanted the full backtrace to see if something related showed up. My main concern, though, was not being able to reproduce the issue myself with the same packages/OS. That's a matter of thread execution order, and I might have been lucky... or not, because I was not able to get a reproducer.

Btw, I don't expect the oneapi init to cause trouble with this, because gpu_plugin_init uses locks, so the setenv call is in a locked area and thus thread safe.

At the end of the day a fix is going to be released sooner or later, so that's good news. Good job with the investigation on your side. This was helpful.

Cheers,
Carlos.

We constantly reproduce the issue on Rocky8 and CentOS7 with hwloc 2.7.0.

> when Ubuntu upgrades hwloc to 2.7.1

I see the hwloc developers are not really in a hurry to fix the issue; in both 2.7.1 and 2.8.0 they still use putenv: https://github.com/open-mpi/hwloc/blob/hwloc-2.8.0/tests/hwloc/levelzero.c#L25

Our workaround:
[root@ts-tr-c7 ~]# cat /etc/sysconfig/slurmctld
ZES_ENABLE_SYSMAN=1
[root@ts-tr-c7 ~]#

Taking into account that it may take time until hwloc is really fixed, do you think Slurm 22.05 can just set this automatically on start?

Please disregard my last question, updating to 2.7.1 fixed the issue.