srun segfaults when SLURM_JOB_NODELIST is bracketed and terminated by a newline.

Example:

$ salloc -N 2 -pc2
salloc: Granted job allocation 24
salloc: Waiting for resource configuration
salloc: Nodes ahlintopol-c2-ghpc-[0-1] are ready for job
$ echo $SLURM_NODELIST
ahlintopol-c2-ghpc-[0,1]
$ srun /usr/bin/hostname
ahlintopol-c2-ghpc-0
ahlintopol-c2-ghpc-1
$ env SLURM_JOB_NODELIST="$SLURM_JOB_NODELIST
" srun /bin/hostname
Segmentation fault

Expectation:
While I agree that it is an error for SLURM_JOB_NODELIST to contain a newline, I would expect the parsing to produce an error rather than a segfault.

Impact:
The error is not easy to spot, since e.g. echoing SLURM_JOB_NODELIST with and without the newline produces the same output (at least in bash).

Diagnostics:
While the SIGSEGV occurs in hostlist_uniq, the underlying issue is that the result of hostlist_create at src/srun/libsrun/srun_job.c:293 is not checked.
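Since an unquoted echo hides the trailing newline, the following is a minimal sketch of how the stray byte could be made visible from the shell. The helper name has_trailing_newline is hypothetical and not part of Slurm; the sketch only assumes a POSIX shell and the standard od utility.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): returns 0 if the argument
# ends with a newline character, 1 otherwise.
has_trailing_newline() {
    # A literal newline stored in a variable, for POSIX sh portability.
    nl='
'
    case "$1" in
        *"$nl") return 0 ;;
        *)      return 1 ;;
    esac
}

# Use an empty default so the script also runs outside an allocation.
nodelist="${SLURM_JOB_NODELIST-}"

# The quotes matter: an unquoted $nodelist is word-split by the shell,
# which silently drops the trailing newline before the command sees it.
if has_trailing_newline "$nodelist"; then
    echo "SLURM_JOB_NODELIST ends with a newline" >&2
fi

# Byte-level dump: a trailing newline appears as an explicit \n.
printf '%s' "$nodelist" | od -c
```

Quoting the variable, as in echo "$SLURM_JOB_NODELIST", would also reveal the problem as an extra blank line, but the od output is less ambiguous.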
After posting I noticed that my example uses the legacy SLURM_NODELIST:

$ echo $SLURM_NODELIST
ahlintopol-c2-ghpc-[0,1]

To be clear, SLURM_JOB_NODELIST contains the same value:

$ echo $SLURM_JOB_NODELIST
ahlintopol-c2-ghpc-[0,1]
(In reply to Daniel Ahlin from comment #1)
> After posting I noted that my example contains the legacy SLURM_NODELIST

No problem with that env var. This is indeed a bug, and it is easy to reproduce from what you have provided:

>$ salloc -N2
>salloc: Granted job allocation 12
>salloc: Waiting for resource configuration
>salloc: Nodes vizino-[3-4] are ready for job
>(salloc) $ env|grep NODEL
>(salloc) $ export SLURM_JOB_NODELIST="$SLURM_JOB_NODELIST
>
> "
>(salloc) $ gdb --arg srun hostname
>...
>Starting program: /home/chad/slurm/master/vizino/bin/srun hostname
>...
>srun: error: ../../../../slurm/src/common/hostlist.c:2513: hostlist_uniq(): Assertion (hl != NULL) failed.
>
>Program received signal SIGABRT, Aborted.
>...
>(gdb) bt
>#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
>#1 __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
>#2 __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
>#3 0x00007ffff783bc46 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
>#4 0x00007ffff78227fc in __GI_abort () at ./stdlib/abort.c:79
>#5 0x00007ffff7e73e45 in __xassert_failed (expr=0x7ffff7e8fe82 "hl != NULL",
>    file=0x7ffff7e8fd88 "../../../../slurm/src/common/hostlist.c", line=2513,
>    func=0x7ffff7e90718 <__func__.44> "hostlist_uniq") at ../../../../slurm/src/common/xassert.c:57
>#6 0x00007ffff7cf0f13 in hostlist_uniq (hl=0x0) at ../../../../slurm/src/common/hostlist.c:2513
>#7 0x000055555556808d in job_step_create_allocation (resp=0x5555555afb00, opt_local=0x55555557d0c0 <opt>)
>    at ../../../../../slurm/src/srun/libsrun/srun_job.c:294
>#8 0x000055555556ad61 in create_srun_job (p_job=0x55555557e0a8 <job>, got_alloc=0x7fffffffd54f,
>    slurm_started=false, handle_signals=true) at ../../../../../slurm/src/srun/libsrun/srun_job.c:1348
>#9 0x000055555555d651 in srun (ac=2, av=0x7fffffffd6b8) at ../../../../slurm/src/srun/srun.c:193
>#10 0x0000555555560bd4 in main (argc=2, argv=0x7fffffffd6b8) at ../../../../slurm/src/srun/srun.wrapper.c:17
>(gdb)

This should be easy to fix; I will look into it.
Hi. This has been fixed in this commit:
>https://github.com/SchedMD/slurm/commit/0a614b8