Created attachment 7587 [details]
strace from seg faulting srun

We're seeing some odd segmentation faults from srun on our test clusters running Slurm 17.11.8. Our best reproducer involves an sbatch script to get an interactive xterm session with, for example:

[day36@opal186:srun_segfault]$ sbatch -N1 -n24 -pplustre28 bx
Submitted batch job 31728
[day36@opal186:srun_segfault]$ cat bx
#!/bin/bash
xterm
[day36@opal186:srun_segfault]$

'srun' commands that are run within the resulting xterm seg fault. Interestingly, the seg faults can be avoided by defining random environment variables:

sh-4.2$ srun -n1 /usr/bin/hostname
Segmentation fault
sh-4.2$ A=1 srun -n1 /usr/bin/hostname
opal96
sh-4.2$ A=1 B=1 srun -n1 /usr/bin/hostname
Segmentation fault
sh-4.2$ A=1 B=1 C=1 srun -n1 /usr/bin/hostname
opal96
sh-4.2$

Also, if the number of tasks in the srun statement is equal to the number specified on the sbatch line, I don't see the seg fault:

sh-4.2$ srun -n24 /usr/bin/hostname
opal96
opal96
opal96
...

If I run the sbatch without the -n24, I don't see the seg fault initially, but if I define SLURM_NTASKS=24 and SLURM_NPROCS=24, I do. It doesn't appear to be just environment size, though, as I can define ALURM_NTASKS and BLURM_NPROCS and things go fine. Adding the 'A=1' also rescues this scenario:

sh-4.2$ srun -n1 hostname
opal96
sh-4.2$ SLURM_NTASKS=24 SLURM_NPROCS=24 srun -n1 hostname
Segmentation fault
sh-4.2$ ALURM_NTASKS=24 BLURM_NPROCS=24 srun -n1 hostname
opal96
sh-4.2$ SLURM_NTASKS=36 SLURM_NPROCS=36 srun -n1 hostname
Segmentation fault
sh-4.2$ SLURM_NTASKS=1 SLURM_NPROCS=1 srun -n1 hostname
Segmentation fault
sh-4.2$ SLURM_NTASKS=24 SLURM_NPROCS=24 A=1 srun -n1 hostname
opal96

I've attached an strace of a seg faulting srun from the first scenario described, but it's not telling me much. Does that give you all an idea of what might be going on here?
Hi Ryan,

Do you have a core file from srun? You can generate a backtrace by running srun under gdb:

gdb srun
(gdb) r -N1 hostname
<Let the command run>
(gdb) bt

If you have a core file you can gather the backtrace with:

gdb path/to/the/binary path/to/the/core
thread apply all bt full

Based on the output you sent, the crash seems to happen right after an OS call:

getgid()    = 36075
brk(NULL)   = 0x6d1000
brk(0x711000) = 0x711000
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
+++ killed by SIGSEGV +++

-Jason
Created attachment 7592 [details]
backtrace from core file
Created attachment 7593 [details]
environment in sbatch

Also, here's what my environment looks like in the sbatched xterm.
Hi Ryan,

I find it odd to use sbatch for interactive purposes when salloc and/or srun --pty are available; I assume you do so for reproducibility. I can't reproduce this so far. Can you attach your slurm.conf?

I also see this potentially related commit, available since slurm-18-08-0-0pre1:

https://github.com/SchedMD/slurm/commit/f431ca59a9970411a3b9c1f05dede353891ecbe3

(In reply to Ryan Day from comment #3)
> Created attachment 7593 [details]
> environment in sbatch
>
> Also, here's what my environment looks like in the sbatched xterm.

I see no SLURM_NTASKS variable defined in there. Is this environment the result of 'sbatch -N1 -n24 -pplustre28 bx' or the result of a different sbatch without specifying -n?
Ryan, besides the slurm.conf request in my previous comment, we'd be interested in knowing what version of Slurm you were running before, and in confirming that this problem started after the last upgrade. Although unlikely, we're wondering if the upgrade failed on any node, which could leave old libraries/plugins in use. Thanks.
Created attachment 7601 [details]
output of scontrol show config from inside the sbatched xterm
The xterm inside of an sbatch is a common thing for our tools and code developers to do when they want to use GUI-based debuggers or other development tools. There are no doubt better ways to do it now, but old habits... I haven't been able to reproduce it from an salloc either.

Slurm 17.11.8 was installed as part of a fresh image of our latest OS release candidate (TOSS-3.3-3rc1), and I don't know of any failed installs. I did rebuild all of our packages that said they depend on slurm, although I suppose it's possible that there was something I missed. Some of our lua plugins do set things in the job control environment, and I didn't touch those, so I can try moving them aside to see if that changes anything. The test cluster was previously running 17.02.10. We haven't seen any problems with that in the previous OS version or in a version of the release candidate with 17.02.10.

I believe SLURM_NTASKS is present in the environment that I attached:

[benicia:~/myslurm] day36% grep -i slurm_nt ~/Downloads/env_in_sbatch.txt
SLURM_NTASKS=24
[benicia:~/myslurm] day36%
Does srun also segfault if using --mpi=none?
(In reply to Alejandro Sanchez from comment #10)
> Does srun also segfault if using --mpi=none?

Yes:

sh-4.2$ srun -n1 --mpi=none hostname
opal96
sh-4.2$ A=1 srun -n1 --mpi=none hostname
Segmentation fault (core dumped)
sh-4.2$ A=1 B=1 srun -n1 --mpi=none hostname
opal96
sh-4.2$ A=1 B=1 srun -n1 hostname
opal96
sh-4.2$ srun -n1 hostname
opal96
sh-4.2$ A=1 srun -n1 hostname
Segmentation fault (core dumped)
sh-4.2$
Interesting. Can you attach your full slurm.conf? I'd like to see the node/partition definitions as well (scontrol show conf doesn't show them).

Are you doing any kind of env manipulation inside this?

JobCompLoc = /usr/libexec/sqlog/slurm-joblog

Are your environment modules doing anything special with SLURM_ variables?

Is your submission host shell the same as the one where xterm is executed? From your env output I see /bin/sh:

SHELL=/bin/sh
XTERM_SHELL=/bin/sh

and the bx script has the #!/bin/bash shebang, although I can't come up with any reason at first sight why that should make a difference. Not sure if changing some of the shell values has any impact on this.
Created attachment 7605 [details]
slurm.conf with node, partition descriptions etc.
Created attachment 7606 [details]
slurm.conf.common included from slurm.conf
I found the problem. We do have a spank plugin that allows users to change the environment in their job. Commenting it out of the plugstack.conf makes the seg faults go away. Probably should have started with that, but I forgot that it existed. Thanks for taking the time on this. I'll dig into that and try to figure out why it's breaking under 17.11.8.
(In reply to Ryan Day from comment #15)
> I found the problem. We do have a spank plugin that allows users to change
> the environment in their job. Commenting it out of the plugstack.conf makes
> the seg faults go away. Probably should have started with that, but I forgot
> that it existed. Thanks for taking the time on this. I'll dig into that and
> try to figure out why it's breaking under 17.11.8.

Note: SPANK plugins using the Slurm APIs need to be recompiled when upgrading Slurm to a new major release. I'm marking the bug as resolved/infogiven. Please, reopen if you have anything else. Thanks.
(In reply to Alejandro Sanchez from comment #16)
> (In reply to Ryan Day from comment #15)
> > I found the problem. We do have a spank plugin that allows users to change
> > the environment in their job. Commenting it out of the plugstack.conf makes
> > the seg faults go away. Probably should have started with that, but I forgot
> > that it existed. Thanks for taking the time on this. I'll dig into that and
> > try to figure out why it's breaking under 17.11.8.
>
> Note: SPANK plugins using the Slurm APIs need to be recompiled when
> upgrading Slurm to a new major release. I'm marking the bug as
> resolved/infogiven. Please, reopen if you have anything else. Thanks.

The SPANK plugins were recompiled with 17.11.8. It looks like all that plugin does is parse a file and then set the environment variables it finds with setenv() in a non-remote context (it would use spank_setenv() in a remote context). Not sure why that would've broken, but I'll keep digging.
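For what it's worth, the pattern described above might look roughly like the following minimal sketch. The file parsing is omitted and all names are hypothetical; this is not the actual use-env source:

/* Hypothetical sketch of the setenv()/spank_setenv() split described
 * above; not the actual use-env plugin code. */
#include <stdlib.h>
#include <slurm/spank.h>

SPANK_PLUGIN(env_setter_sketch, 1);

/* Set one variable in the appropriate environment for this context. */
static int set_one(spank_t sp, const char *var, const char *val)
{
	if (spank_remote(sp))
		/* Remote (slurmstepd) context: modify the task environment. */
		return (spank_setenv(sp, var, val, 1) == ESPANK_SUCCESS) ? 0 : -1;
	/* Local (srun) context: modify srun's own environment. */
	return setenv(var, val, 1);
}

/* Local-context hook: runs in srun after the allocation is created. */
int slurm_spank_local_user_init(spank_t sp, int ac, char **av)
{
	/* In the real plugin the (var, val) pairs come from a parsed file. */
	if (set_one(sp, "EXAMPLE_VAR", "1") < 0)
		slurm_error("env_setter_sketch: failed to set EXAMPLE_VAR");
	return ESPANK_SUCCESS;
}

/* Remote-context hook: runs in slurmstepd for each task. */
int slurm_spank_task_init(spank_t sp, int ac, char **av)
{
	if (set_one(sp, "EXAMPLE_VAR", "1") < 0)
		slurm_error("env_setter_sketch: failed to set EXAMPLE_VAR");
	return ESPANK_SUCCESS;
}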
I'm re-opening this because I have a simple reproducer derived from our use-env.so plugin. I made a plugin that just does a bunch of setenv()'s in the slurm_spank_local_user_init and that reproduces the issue. See attachment.

I also found that you can reproduce this within a 'normal' sbatch script. The xterm part of it is not important (opal96 has a modified plugstack.conf to point at my setenv_test.so plugin instead of the use-env.so plugin):

[day36@opal186:srun_segfault]$ sbatch -N1 -n24 -pplustre28 -w opal96 bc
Submitted batch job 31752
[day36@opal186:srun_segfault]$ cat bc
#!/bin/sh
srun -n1 hostname
A=1 srun -n1 hostname
A=1 B=1 srun -n1 hostname
[day36@opal186:srun_segfault]$ cat slurm-31752.out
opal96
/var/spool/slurmd/job31752/slurm_script: line 3: 33308 Segmentation fault      A=1 srun -n1 hostname
opal96
[day36@opal186:srun_segfault]$

The problem appears to be with having the setenvs inside of the slurm_spank_local_user_init. If I just export the variables in my sbatch script, I don't see seg faults (this is with the use-env.so and the setenv_test.so commented out of plugstack.conf):

[day36@opal186:srun_segfault]$ sbatch -N1 -n24 -pplustre28 -w opal96 --reservation=test bd
Submitted batch job 31758
[day36@opal186:srun_segfault]$ cat slurm-31758.out
opal96
opal96
opal96
[day36@opal186:srun_segfault]$ cat bd
#!/bin/sh
export VIADEV_DEFAULT_TIME_OUT=22
export VIADEV_NUM_RDMA_BUFFER=4
export VIADEV_DEFAULT_MIN_RNR_TIMER=25
export IBV_FORK_SAFE=1
export IPATH_NO_CPUAFFINITY=1
export IPATH_NO_BACKTRACE=1
export MV2_ENABLE_AFFINITY=0
export HFI_NO_CPUAFFINITY=1
export PSM2_CLOSE_GRACE_INTERVAL=0
export PSM2_CLOSE_TIMEOUT=1
export I_MPI_FABRICS=shm:tmi
export TMI_CONFIG=/etc/tmi.conf
export CENTER_JOB_ID=${SLURM_CLUSTER_NAME}-${SLURM_JOB_ID}
srun -n1 hostname
A=1 srun -n1 hostname
A=1 B=1 srun -n1 hostname
[day36@opal186:srun_segfault]$
Created attachment 7616 [details]
slurm spank plugin that reproduces seg faults

Compiled with:

sh-4.2$ gcc -Wall -o setenv.o -fPIC -c setenv.c
sh-4.2$ gcc -shared -o setenv_test.so setenv.o
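For reference, a hypothetical sketch of what a reproducer along these lines might look like is below (based only on the description above; the actual attached setenv.c may differ):

/* Hypothetical sketch of a reproducer plugin: just do a bunch of
 * setenv() calls from slurm_spank_local_user_init(). The actual
 * attached setenv.c may differ. */
#include <stdio.h>
#include <stdlib.h>
#include <slurm/spank.h>

SPANK_PLUGIN(setenv_test, 1);

int slurm_spank_local_user_init(spank_t sp, int ac, char **av)
{
	char name[32];
	int i;

	/* Grow the environment enough that glibc reallocates environ. */
	for (i = 0; i < 16; i++) {
		snprintf(name, sizeof(name), "SETENV_TEST_VAR_%d", i);
		if (setenv(name, "1", 1) != 0)
			slurm_error("setenv_test: setenv(%s) failed", name);
	}
	return ESPANK_SUCCESS;
}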
I've also been emailing with Mark Grondona, who originally developed the use-env plugin. He ran srun under valgrind to try to see more of what was happening. Here's his email on that:

If it is helpful, here's what valgrind caught:

==153910== Invalid read of size 8
==153910==    at 0x52C4C30: env_array_merge (env.c:1816)
==153910==    by 0x8DEBE21: _build_user_env (launch_slurm.c:625)
==153910==    by 0x8DEBE21: launch_p_step_launch (launch_slurm.c:761)
==153910==    by 0x40A504: launch_g_step_launch (launch.c:532)
==153910==    by 0x408187: _launch_one_app (srun.c:249)
==153910==    by 0x409425: _launch_app (srun.c:498)
==153910==    by 0x409425: srun (srun.c:203)
==153910==    by 0x409889: main (srun.wrapper.c:17)
==153910==  Address 0x95cf320 is 0 bytes inside a block of size 1,088 free'd
==153910==    at 0x4C2BBB8: realloc (vg_replace_malloc.c:785)
==153910==    by 0x586136F: __add_to_environ (setenv.c:142)
==153910==    by 0x4C316EF: setenv (vg_replace_strmem.c:2043)
==153910==    by 0x716EA6D: ??? (in /usr/lib64/slurm/use-env.so)
==153910==    by 0x716F076: ??? (in /usr/lib64/slurm/use-env.so)
==153910==    by 0x717103C: ??? (in /usr/lib64/slurm/use-env.so)
==153910==    by 0x716FA6C: ??? (in /usr/lib64/slurm/use-env.so)
==153910==    by 0x7171A26: slurm_spank_local_user_init (in /usr/lib64/slurm/use-env.so)
==153910==    by 0x53A3032: _do_call_stack (plugstack.c:747)
==153910==    by 0x53A45CA: spank_local_user (plugstack.c:865)
==153910==    by 0x414024: _call_spank_local_user (srun_job.c:1622)
==153910==    by 0x414024: pre_launch_srun_job (srun_job.c:1298)
==153910==    by 0x408072: _launch_one_app (srun.c:235)
==153910==  Block was alloc'd at
==153910==    at 0x4C2BBB8: realloc (vg_replace_malloc.c:785)
==153910==    by 0x586136F: __add_to_environ (setenv.c:142)
==153910==    by 0x4C316EF: setenv (vg_replace_strmem.c:2043)
==153910==    by 0x52C1F07: setenvf (env.c:291)
==153910==    by 0x52C3057: setup_env (env.c:761)
==153910==    by 0x407EC1: _setup_one_job_env (srun.c:599)
==153910==    by 0x4087A8: _setup_job_env (srun.c:636)
==153910==    by 0x4087A8: srun (srun.c:201)
==153910==    by 0x409889: main (srun.wrapper.c:17)

Looks like they are freeing the pointer to the environ!???!?

mark
One last thing, Mark found this commit:

commit aff20b90daafebf682412daa6b360b811ca048ce
Author: Felip Moll <felip.moll@schedmd.com>
Date:   Fri Jan 12 08:50:44 2018 -0700

    Global environment was not set correctly in srun

    Creating a copy of the actual environment in env->env defines a new
    pointer, then next call to setup_env and setenvf doesn't define
    variables in the global environment but in this new copy.

    Bug 4615

diff --git a/NEWS b/NEWS
index 7e90633..965deff 100644
--- a/NEWS
+++ b/NEWS
@@ -20,6 +20,7 @@ documents those changes that are of interest to users and administrators.
     reconfigured.
  -- node_feature/knl_cray - Fix memory leak that can occur during normal
    operation.
+ -- Fix srun environment variables for --prolog script.
 
 * Changes in Slurm 17.11.2
 ==========================
diff --git a/src/srun/srun.c b/src/srun/srun.c
index 633afbb..a478000 100644
--- a/src/srun/srun.c
+++ b/src/srun/srun.c
@@ -596,9 +596,8 @@ static void _setup_one_job_env(slurm_opt_t *opt_local, srun_job_t *job,
 		env->ws_row = job->ws_row;
 	}
 
-	env->env = env_array_copy((const char **) environ);
 	setup_env(env, srun_opt->preserve_env);
-	job->env = env->env;
+	job->env = environ;
 	xfree(env->task_count);
 	xfree(env);
 }

and wondered if it might be related. I reverted that, and it does indeed fix the seg fault issue:

[day36@opal186:srun_segfault]$ cat bc
#!/bin/sh
/usr/workspace/wsb/day36/myslurm/test-17.11.8/src/srun/srun -n1 hostname
A=1 /usr/workspace/wsb/day36/myslurm/test-17.11.8/src/srun/srun -n1 hostname
A=1 B=1 /usr/workspace/wsb/day36/myslurm/test-17.11.8/src/srun/srun -n1 hostname
echo "::"
srun -n1 hostname
A=1 srun -n1 hostname
A=1 B=1 srun -n1 hostname
[day36@opal186:srun_segfault]$ sbatch -N1 -n24 -pplustre28 bc
Submitted batch job 31767
[day36@opal186:srun_segfault]$ cat slurm-31767.out
opal97
opal97
opal97
::
opal97
/var/spool/slurmd/job31767/slurm_script: line 7: 167531 Segmentation fault      A=1 srun -n1 hostname
opal97
[day36@opal186:srun_segfault]$
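To illustrate the failure mode the valgrind trace and this commit point at: after the change, srun keeps a raw copy of the environ pointer (job->env = environ), and a later setenv() from a SPANK plugin can make glibc realloc() the environ array, leaving the saved pointer dangling when env_array_merge() reads through it. A standalone sketch of that pattern (plain C, not Slurm code) might look like:

/* Standalone sketch (not Slurm code) of the dangling-environ pattern:
 * cache the value of 'environ', then grow the environment with setenv().
 * glibc may realloc() the environ array, so the cached pointer can end
 * up pointing at freed memory. */
#include <stdio.h>
#include <stdlib.h>

extern char **environ;

int main(void)
{
	char name[32];
	char **saved;
	int i;

	/* A first setenv() moves environ onto the heap (analogous to
	 * srun's setup_env()/setenvf() calls before the SPANK stack runs). */
	setenv("FIRST_VAR", "1", 1);
	saved = environ;	/* analogous to job->env = environ */

	/* Adding many new variables (analogous to the SPANK plugin's
	 * setenv() calls) makes glibc realloc() the array, likely freeing
	 * the block that 'saved' still points at. */
	for (i = 0; i < 256; i++) {
		snprintf(name, sizeof(name), "DUMMY_VAR_%d", i);
		setenv(name, "1", 1);
	}

	if (saved != environ)
		printf("environ was reallocated; 'saved' is now dangling\n");

	/* Reading through 'saved' here (as env_array_merge() effectively
	 * did through job->env) is the invalid read valgrind flagged. */
	return 0;
}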
I also saw that commit today. I'll have to study further whether reverting it would make the bug 4615 issue appear again, though.
Hey Ryan, since you have different workarounds I'm lowering the severity of this to 3. Although reverting that commit stops srun from segfaulting, I think it reintroduces the issue the commit was fixing, and I also see valgrind reporting definitely lost blocks with the revert applied (although it is true that the invalid reads are gone, which is most probably why srun stops segfaulting). I'll investigate further and come back to you.
(In reply to Alejandro Sanchez from comment #23)
> Hey Ryan, since you have different workarounds I'm lowering the severity of
> this to 3. Although reverting that commit stops srun from segfaulting, I
> think it reintroduces the issue the commit was fixing, and I also see
> valgrind reporting definitely lost blocks with the revert applied (although
> it is true that the invalid reads are gone, which is most probably why srun
> stops segfaulting). I'll investigate further and come back to you.

Reverting the patch stops the seg faults, but then we're affected by the bug that it was intended to fix: the variables that we're trying to set in the job prolog don't show up in the global environment for the job. So it's not really a workaround for us, and this continues to hold up our deployment of 17.11.
Created attachment 7655 [details]
17.11 patch

Hi Ryan. Will you try out the attached patch and see if that solves things for you? It solves the invalid reads for me without reintroducing the issue from the other bug. I've also run the regression tests with this applied, and it doesn't seem to introduce any unintended side effects.
This looks good to me, Alejandro. I'm not getting seg faults, and I am getting all of the environment variables that I expect. Thank you for all your work on this!

(In reply to Alejandro Sanchez from comment #29)
> Created attachment 7655 [details]
> 17.11 patch
>
> Hi Ryan. Will you try out the attached patch and see if that solves things
> for you? It solves the invalid reads for me without reintroducing the issue
> from the other bug. I've also run the regression tests with this applied,
> and it doesn't seem to introduce any unintended side effects.
(In reply to Ryan Day from comment #30)
> This looks good to me, Alejandro. I'm not getting seg faults, and I am
> getting all of the environment variables that I expect. Thank you for all
> your work on this!

The patch will be available from 17.11.10 onwards:

https://github.com/SchedMD/slurm/commit/aa947d8096b55

Closing the bug. Thanks for reporting.