Ticket 5561 - srun segfaults
Summary: srun segfaults
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 17.11.8
Hardware: Linux
OS: Linux
Severity: 3 - Medium Impact
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-08-13 15:23 MDT by Ryan Day
Modified: 2018-08-22 02:47 MDT

See Also:
Site: LLNL
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 17.11.10, 18.08.0rc1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
strace from seg faulting srun (71.97 KB, text/plain)
2018-08-13 15:23 MDT, Ryan Day
Details
backtrace from core file (7.26 KB, text/plain)
2018-08-13 16:26 MDT, Ryan Day
Details
environment in sbatch (8.03 KB, text/plain)
2018-08-13 16:38 MDT, Ryan Day
Details
output of scontrol show config from inside of sbatched xterm (6.43 KB, text/plain)
2018-08-14 09:28 MDT, Ryan Day
Details
slurm.conf with node, partition descriptions etc (1.35 KB, text/x-matlab)
2018-08-14 11:24 MDT, Ryan Day
Details
slurm.conf.common included from slurm.conf (1.50 KB, text/x-matlab)
2018-08-14 11:25 MDT, Ryan Day
Details
slurm spank plugin that reproduces seg faults (723 bytes, text/x-csrc)
2018-08-15 11:03 MDT, Ryan Day
Details
17.11 patch (1.21 KB, patch)
2018-08-21 11:35 MDT, Alejandro Sanchez
Details | Diff

Description Ryan Day 2018-08-13 15:23:59 MDT
Created attachment 7587 [details]
strace from seg faulting srun

We're seeing some odd segmentation faults from srun on our test clusters running Slurm 17.11.8. Our best reproducer involves an sbatch script that opens an interactive xterm session, for example:

[day36@opal186:srun_segfault]$ sbatch -N1 -n24 -pplustre28 bx 
Submitted batch job 31728
[day36@opal186:srun_segfault]$ cat bx
#!/bin/bash
xterm
[day36@opal186:srun_segfault]$

'srun' commands that are run within the resulting xterm seg fault. Interestingly, the seg faults can be avoided by defining random environment variables:

sh-4.2$ srun -n1 /usr/bin/hostname
Segmentation fault
sh-4.2$ A=1 srun -n1 /usr/bin/hostname
opal96
sh-4.2$ A=1 B=1 srun -n1 /usr/bin/hostname
Segmentation fault
sh-4.2$ A=1 B=1 C=1 srun -n1 /usr/bin/hostname
opal96
sh-4.2$

Also, if the number of tasks in the srun statement is equal to the number specified on the sbatch line, I don't see the seg fault:

sh-4.2$ srun -n24 /usr/bin/hostname
opal96
opal96
opal96
...

If I run the sbatch without the -n24, I don't see the seg fault initially, but if I define SLURM_NTASKS=24 and SLURM_NPROCS=24, I do. It doesn't appear to be just environment size though, as I can define ALURM_NTASKS and BLURM_NPROCS and things go fine. Adding the 'A=1' also rescues this scenario:

sh-4.2$ srun -n1 hostname
opal96
sh-4.2$ SLURM_NTASKS=24 SLURM_NPROCS=24 srun -n1 hostname
Segmentation fault
sh-4.2$ ALURM_NTASKS=24 BLURM_NPROCS=24 srun -n1 hostname
opal96
sh-4.2$ SLURM_NTASKS=36 SLURM_NPROCS=36 srun -n1 hostname
Segmentation fault
sh-4.2$ SLURM_NTASKS=1 SLURM_NPROCS=1 srun -n1 hostname
Segmentation fault
sh-4.2$ SLURM_NTASKS=24 SLURM_NPROCS=24 A=1 srun -n1 hostname
opal96

I've attached an strace of a seg faulting srun from the first scenario described, but it's not telling me much. Does that give you all an idea of what might be going on here?
Comment 1 Jason Booth 2018-08-13 16:05:02 MDT
Hi Ryan,

 Do you have a core file from srun? You can generate a backtrace using gdb:

gdb srun
(gdb) r -N1 hostname
<Let the command run>
(gdb) bt


If you have a core file you can gather the backtrace with:
gdb path/to/the/binary path/to/the/core
thread apply all bt full

Based on the output you have sent, the issue seems to be related to an OS call.

getgid()                                = 36075
brk(NULL)                               = 0x6d1000
brk(0x711000)                           = 0x711000
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
+++ killed by SIGSEGV +++


-Jason
Comment 2 Ryan Day 2018-08-13 16:26:24 MDT
Created attachment 7592 [details]
backtrace from core file
Comment 3 Ryan Day 2018-08-13 16:38:45 MDT
Created attachment 7593 [details]
environment in sbatch

Also, here's what my environment looks like in the sbatched xterm
Comment 4 Alejandro Sanchez 2018-08-14 06:46:48 MDT
Hi Ryan,

I find it odd to use sbatch for interactive purposes when salloc and/or srun --pty are available; I guess you do so for reproducibility purposes.

I can't reproduce this so far. Can you attach your slurm.conf?

I also see this potentially related commit available since slurm-18-08-0-0pre1:

https://github.com/SchedMD/slurm/commit/f431ca59a9970411a3b9c1f05dede353891ecbe3

(In reply to Ryan Day from comment #3)
> Created attachment 7593 [details]
> environment in sbatch
> 
> Also, here's what my environment looks like in the sbatched xterm

I see no SLURM_NTASKS variable defined in there. Is this environment the result of 'sbatch -N1 -n24 -pplustre28 bx' or the result of a different sbatch without specifying -n?
Comment 7 Alejandro Sanchez 2018-08-14 08:57:50 MDT
Ryan, besides the slurm.conf request in my previous comment, we'd be interested in knowing what version of Slurm you were running before, and in confirming that this problem started after the last upgrade. Although unlikely, we're wondering if the upgrade failed on any node, which could leave it using old libraries/plugins. Thanks.
Comment 8 Ryan Day 2018-08-14 09:28:32 MDT
Created attachment 7601 [details]
output of scontrol show config from inside of sbatched xterm
Comment 9 Ryan Day 2018-08-14 09:57:32 MDT
The xterm inside of an sbatch is a common thing for our tools and code developers to do when they want to use GUI based debuggers or other development tools. There are no doubt better ways to do it now, but old habits... I haven't been able to reproduce it from an salloc either.

Slurm 17.11.8 was installed as part of a fresh image of our latest OS release candidate (TOSS-3.3-3rc1), and I don't know of any failed installs. I did rebuild all of our packages that said they depend on slurm, although I suppose it's possible that there was something that I missed. Some of our lua plugins do set things in the job control environment, and I didn't touch those, so I can try moving them aside to see if that changes anything.

The test cluster was previously running 17.02.10. We haven't seen any problems with that in the previous OS version or in a version of the release candidate with 17.02.10.

I believe SLURM_NTASKS is present in the environment that I attached:

[benicia:~/myslurm] day36% grep -i slurm_nt ~/Downloads/env_in_sbatch.txt 
SLURM_NTASKS=24
[benicia:~/myslurm] day36%
Comment 10 Alejandro Sanchez 2018-08-14 10:12:52 MDT
Does srun also segfault if using --mpi=none?
Comment 11 Ryan Day 2018-08-14 10:56:56 MDT
(In reply to Alejandro Sanchez from comment #10)
> Does srun also segfault if using --mpi=none?


Yes:

sh-4.2$ srun -n1 --mpi=none hostname
opal96
sh-4.2$ A=1 srun -n1 --mpi=none hostname
Segmentation fault (core dumped)
sh-4.2$ A=1 B=1 srun -n1 --mpi=none hostname
opal96
sh-4.2$ A=1 B=1 srun -n1 hostname
opal96
sh-4.2$ srun -n1 hostname
opal96
sh-4.2$ A=1 srun -n1 hostname
Segmentation fault (core dumped)
sh-4.2$
Comment 12 Alejandro Sanchez 2018-08-14 11:12:27 MDT
Interesting. Can you attach your full slurm.conf? I'd like to see the node/partition definitions as well (scontrol show conf doesn't show them).

Are you doing any kind of env manipulation inside this?

JobCompLoc              = /usr/libexec/sqlog/slurm-joblog

Are your environment modules doing anything special with SLURM_ variables?

Is your submission host shell the same as the one where xterm is executed? from your env output I see /bin/sh

SHELL=/bin/sh
XTERM_SHELL=/bin/sh

and the bx script has a #!/bin/bash shebang, although I can't think of any difference that would make at first sight. I'm not sure whether changing some of the shell values has any impact on this.
Comment 13 Ryan Day 2018-08-14 11:24:32 MDT
Created attachment 7605 [details]
slurm.conf with node, partition descriptions etc
Comment 14 Ryan Day 2018-08-14 11:25:07 MDT
Created attachment 7606 [details]
slurm.conf.common included from slurm.conf
Comment 15 Ryan Day 2018-08-14 11:43:34 MDT
I found the problem. We do have a spank plugin that allows users to change the environment in their job. Commenting it out of the plugstack.conf makes the seg faults go away. Probably should have started with that, but I forgot that it existed. Thanks for taking the time on this. I'll dig into that and try to figure out why it's breaking under 17.11.8.
Comment 16 Alejandro Sanchez 2018-08-15 02:04:14 MDT
(In reply to Ryan Day from comment #15)
> I found the problem. We do have a spank plugin that allows users to change
> the environment in their job. Commenting it out of the plugstack.conf makes
> the seg faults go away. Probably should have started with that, but I forgot
> that it existed. Thanks for taking the time on this. I'll dig into that and
> try to figure out why it's breaking under 17.11.8.

Note: SPANK plugins using the Slurm APIs need to be recompiled when upgrading Slurm to a new major release. I'm marking the bug as resolved/infogiven. Please reopen if you have anything else. Thanks.
Comment 17 Ryan Day 2018-08-15 09:33:30 MDT
(In reply to Alejandro Sanchez from comment #16)
> (In reply to Ryan Day from comment #15)
> > I found the problem. We do have a spank plugin that allows users to change
> > the environment in their job. Commenting it out of the plugstack.conf makes
> > the seg faults go away. Probably should have started with that, but I forgot
> > that it existed. Thanks for taking the time on this. I'll dig into that and
> > try to figure out why it's breaking under 17.11.8.
> 
> Note: SPANK plugins using the Slurm APIs need to be recompiled when
> upgrading Slurm to a new major release. I'm marking the bug as
> resolved/infogiven. Please, reopen if you have anything else. Thanks.

The SPANK plugins were recompiled with 17.11.8. It looks like all that plugin does is parse a file and then set the environment variables that it finds, using setenv() in a non-remote context (it would use spank_setenv() in a remote context). Not sure why that would've broken, but I'll keep digging.
Comment 18 Ryan Day 2018-08-15 11:01:57 MDT
I'm re-opening this because I have a simple reproducer derived from our use-env.so plugin. I made a plugin that just does a bunch of setenv()'s in the slurm_spank_local_user_init and that reproduces the issue. See attachment.

I also found that you can reproduce this within a 'normal' sbatch script. The xterm part of it is not important (opal96 has a modified plugstack.conf to point at my setenv_test.so plugin instead of the use-env.so plugin):

[day36@opal186:srun_segfault]$ sbatch -N1 -n24 -pplustre28 -w opal96 bc
Submitted batch job 31752
[day36@opal186:srun_segfault]$ cat bc    
#!/bin/sh
srun -n1 hostname
A=1 srun -n1 hostname
A=1 B=1 srun -n1 hostname
[day36@opal186:srun_segfault]$ cat slurm-31752.out 
opal96
/var/spool/slurmd/job31752/slurm_script: line 3: 33308 Segmentation fault      A=1 srun -n1 hostname
opal96
[day36@opal186:srun_segfault]$

The problem appears to be having the setenv() calls inside slurm_spank_local_user_init(). If I just export the variables in my sbatch script, I don't see seg faults (this is with both use-env.so and setenv_test.so commented out of plugstack.conf):

[day36@opal186:srun_segfault]$ sbatch -N1 -n24 -pplustre28 -w opal96 --reservation=test bd
Submitted batch job 31758
[day36@opal186:srun_segfault]$ cat slurm-31758.out 
opal96
opal96
opal96
[day36@opal186:srun_segfault]$ cat bd
#!/bin/sh
export VIADEV_DEFAULT_TIME_OUT=22
export VIADEV_NUM_RDMA_BUFFER=4
export VIADEV_DEFAULT_MIN_RNR_TIMER=25
export IBV_FORK_SAFE=1
export IPATH_NO_CPUAFFINITY=1
export IPATH_NO_BACKTRACE=1
export MV2_ENABLE_AFFINITY=0
export HFI_NO_CPUAFFINITY=1
export PSM2_CLOSE_GRACE_INTERVAL=0
export PSM2_CLOSE_TIMEOUT=1
export I_MPI_FABRICS=shm:tmi
export TMI_CONFIG=/etc/tmi.conf
export CENTER_JOB_ID=${SLURM_CLUSTER_NAME}-${SLURM_JOB_ID}
srun -n1 hostname
A=1 srun -n1 hostname
A=1 B=1 srun -n1 hostname
[day36@opal186:srun_segfault]$
Comment 19 Ryan Day 2018-08-15 11:03:39 MDT
Created attachment 7616 [details]
slurm spank plugin that reproduces seg faults

compiled with:

sh-4.2$ gcc -Wall -o setenv.o -fPIC -c setenv.c
sh-4.2$ gcc -shared -o setenv_test.so setenv.o
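
The attachment itself is not inlined in the ticket, so purely as a rough sketch: a minimal SPANK plugin of this shape, written against the standard SPANK plugin API and reusing a few of the variable names from the bd script above (the exact contents of the attached setenv_test.c may differ), could look like the following.

/* Hypothetical sketch of the reproducer plugin: call setenv() a number
 * of times from slurm_spank_local_user_init(), which runs in srun's own
 * (local) process context. */
#include <stdlib.h>
#include <slurm/spank.h>

SPANK_PLUGIN(setenv_test, 1);

int slurm_spank_local_user_init(spank_t sp, int ac, char **av)
{
    /* Local context: modify srun's process environment directly.
     * A remote context would use spank_setenv() instead. */
    setenv("VIADEV_DEFAULT_TIME_OUT", "22", 1);
    setenv("IBV_FORK_SAFE", "1", 1);
    setenv("MV2_ENABLE_AFFINITY", "0", 1);
    /* ...more setenv() calls, enough to force glibc to grow environ... */
    return ESPANK_SUCCESS;
}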
Comment 20 Ryan Day 2018-08-15 11:07:17 MDT
I've also been emailing with Mark Grondona, who originally developed the use-env plugin. He ran srun under valgrind to try to see more of what was happening. Here's his email on that:

If it is helpful, here's what valgrind caught:

==153910== Invalid read of size 8
==153910==    at 0x52C4C30: env_array_merge (env.c:1816)
==153910==    by 0x8DEBE21: _build_user_env (launch_slurm.c:625)
==153910==    by 0x8DEBE21: launch_p_step_launch (launch_slurm.c:761)
==153910==    by 0x40A504: launch_g_step_launch (launch.c:532)
==153910==    by 0x408187: _launch_one_app (srun.c:249)
==153910==    by 0x409425: _launch_app (srun.c:498)
==153910==    by 0x409425: srun (srun.c:203)
==153910==    by 0x409889: main (srun.wrapper.c:17)
==153910==  Address 0x95cf320 is 0 bytes inside a block of size 1,088
free'd
==153910==    at 0x4C2BBB8: realloc (vg_replace_malloc.c:785)
==153910==    by 0x586136F: __add_to_environ (setenv.c:142)
==153910==    by 0x4C316EF: setenv (vg_replace_strmem.c:2043)
==153910==    by 0x716EA6D: ??? (in /usr/lib64/slurm/use-env.so)
==153910==    by 0x716F076: ??? (in /usr/lib64/slurm/use-env.so)
==153910==    by 0x717103C: ??? (in /usr/lib64/slurm/use-env.so)
==153910==    by 0x716FA6C: ??? (in /usr/lib64/slurm/use-env.so)
==153910==    by 0x7171A26: slurm_spank_local_user_init (in
/usr/lib64/slurm/use-env.so)
==153910==    by 0x53A3032: _do_call_stack (plugstack.c:747)
==153910==    by 0x53A45CA: spank_local_user (plugstack.c:865)
==153910==    by 0x414024: _call_spank_local_user (srun_job.c:1622)
==153910==    by 0x414024: pre_launch_srun_job (srun_job.c:1298)
==153910==    by 0x408072: _launch_one_app (srun.c:235)
==153910==  Block was alloc'd at
==153910==    at 0x4C2BBB8: realloc (vg_replace_malloc.c:785)
==153910==    by 0x586136F: __add_to_environ (setenv.c:142)
==153910==    by 0x4C316EF: setenv (vg_replace_strmem.c:2043)
==153910==    by 0x52C1F07: setenvf (env.c:291)
==153910==    by 0x52C3057: setup_env (env.c:761)
==153910==    by 0x407EC1: _setup_one_job_env (srun.c:599)
==153910==    by 0x4087A8: _setup_job_env (srun.c:636)
==153910==    by 0x4087A8: srun (srun.c:201)
==153910==    by 0x409889: main (srun.wrapper.c:17)


Looks like they are freeing the pointer to the environ!???!?

mark
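
The hazard valgrind is describing can be reproduced outside of Slurm. A minimal standalone sketch (plain C, not Slurm code) of the same pattern: cache the environ pointer, then call setenv() until glibc relocates the array, leaving the cached pointer dangling.

/* Standalone illustration of the invalid read reported above: a cached
 * copy of environ can end up pointing at freed memory once setenv()
 * forces glibc to realloc() the environment array. */
#include <stdio.h>
#include <stdlib.h>

extern char **environ;

int main(void)
{
    char **cached = environ;   /* analogous to job->env = environ in srun */
    char name[64];

    for (int i = 0; i < 1000; i++) {
        snprintf(name, sizeof(name), "SPANK_TEST_VAR_%d", i);
        setenv(name, "1", 1);  /* may realloc() and move environ */
    }

    /* If environ was moved, any later read through 'cached' touches freed
     * memory -- the same invalid read env_array_merge() hits above. */
    printf("environ %s relocated\n", cached == environ ? "was not" : "was");
    return 0;
}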
Comment 21 Ryan Day 2018-08-16 10:33:25 MDT
One last thing, Mark found this commit:

commit aff20b90daafebf682412daa6b360b811ca048ce
Author: Felip Moll <felip.moll@schedmd.com>
Date:   Fri Jan 12 08:50:44 2018 -0700

    Global environment was not set correctly in srun
    
    Creating a copy of the actual environment in env->env defines a new pointer,
    then next call to setup_env and setenvf doesn't define variables in the global
    environment but in this new copy.
    
    Bug 4615

diff --git a/NEWS b/NEWS
index 7e90633..965deff 100644
--- a/NEWS
+++ b/NEWS
@@ -20,6 +20,7 @@ documents those changes that are of interest to users and administrators.
     reconfigured.
  -- node_feature/knl_cray - Fix memory leak that can occur during normal
     operation.
+ -- Fix srun environment variables for --prolog script.
* Changes in Slurm 17.11.2
==========================
diff --git a/src/srun/srun.c b/src/srun/srun.c
index 633afbb..a478000 100644
--- a/src/srun/srun.c
+++ b/src/srun/srun.c
@@ -596,9 +596,8 @@ static void _setup_one_job_env(slurm_opt_t *opt_local, srun_job_t *job,
                env->ws_row   = job->ws_row;
        }
-       env->env = env_array_copy((const char **) environ);
        setup_env(env, srun_opt->preserve_env);
-       job->env = env->env;
+       job->env = environ;
        xfree(env->task_count);
        xfree(env);
}

and wondered if it might be related. I reverted that, and it does indeed fix the seg fault issue:

[day36@opal186:srun_segfault]$ cat bc
#!/bin/sh
/usr/workspace/wsb/day36/myslurm/test-17.11.8/src/srun/srun -n1 hostname
A=1 /usr/workspace/wsb/day36/myslurm/test-17.11.8/src/srun/srun -n1 hostname
A=1 B=1 /usr/workspace/wsb/day36/myslurm/test-17.11.8/src/srun/srun -n1 hostname
echo "::"
srun -n1 hostname
A=1 srun -n1 hostname
A=1 B=1 srun -n1 hostname
[day36@opal186:srun_segfault]$ sbatch -N1 -n24 -pplustre28 bc 
Submitted batch job 31767
[day36@opal186:srun_segfault]$ cat slurm-31767.out 
opal97
opal97
opal97
::
opal97
/var/spool/slurmd/job31767/slurm_script: line 7: 167531 Segmentation fault      A=1 srun -n1 hostname
opal97
[day36@opal186:srun_segfault]$
Comment 22 Alejandro Sanchez 2018-08-16 10:38:02 MDT
I also saw that commit today. I have to study further whether reverting it would make the bug 4615 issue appear again, though.
Comment 23 Alejandro Sanchez 2018-08-20 04:57:39 MDT
Hey Ryan, since you have different workarounds I'm lowering the severity of this to 3. Although reverting that commit may stop srun from segfaulting, I think it reintroduces the issue that commit was fixing, and I also see valgrind reporting definitely lost blocks with the revert applied (although it is true the invalid reads are gone, and that is most probably why srun stops segfaulting). I'll investigate further and get back to you.
Comment 24 Ryan Day 2018-08-20 11:22:05 MDT
(In reply to Alejandro Sanchez from comment #23)
> Hey Ryan, since you have different workarounds I'm lowering the the severity
> of this to 3. Although perhaps reverting such commit stops srun from
> segfaulting, I think we are reintroducing the issue that commit was fixing
> and also I see valgrind reporting definitely lost blocks with such revert
> applied (although it is true the invalid reads are gone and most probably
> that's why srun stops segfaulting). I'll investigate further and come back
> to you.

Reverting the patch stops the seg faults, but then we're affected by the bug that it was intended to patch. The variables that we're trying to set in the job prolog don't show up in the global environment for the job. So, it's not really a workaround for us and continues to hold up our deployment of 17.11.
Comment 29 Alejandro Sanchez 2018-08-21 11:35:49 MDT
Created attachment 7655 [details]
17.11 patch

Hi Ryan. Will you try out the attached patch and see if it solves things for you? It fixes the invalid reads for me and preserves the fix for the issue in the other bug. I've also run the regression tests with the patch applied, and it doesn't seem to introduce any unintended side effects.
Comment 30 Ryan Day 2018-08-21 12:02:29 MDT
This looks good to me Alejandro. I'm not getting seg faults and I am getting all of the environment variables that I expect. Thank you for all your work on this!

(In reply to Alejandro Sanchez from comment #29)
> Created attachment 7655 [details]
> 17.11 patch
> 
> Hi Ryan. Will you try out the attached patch and see if that solves the
> things for you? It solves the invalid reads for me and preserves the issue
> in the other bug. Also I've run the regression tests with this applied and
> seems it is not introducing any unintended side effects.
Comment 33 Alejandro Sanchez 2018-08-22 02:47:00 MDT
(In reply to Ryan Day from comment #30)
> This looks good to me Alejandro. I'm not getting seg faults and I am getting
> all of the environment variables that I expect. Thank you for all your work
> on this!

The patch will be available from 17.11.10 onwards:

https://github.com/SchedMD/slurm/commit/aa947d8096b55

Closing the bug, thanks for reporting.