Ticket 15305

Summary: srun segfault when SLURM_JOB_NODELIST is bracketed and terminated by a newline
Product: Slurm
Reporter: Daniel Ahlin <ahlin>
Component: User Commands
Assignee: Chad Vizino <chad>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: nick
Version: 22.05.3
Hardware: Linux
OS: Linux
Site: Google
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 23.02.0rc2
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---

Description Daniel Ahlin 2022-10-28 00:51:06 MDT
srun segfaults when SLURM_JOB_NODELIST is bracketed and terminated by a newline.

Example:
$ salloc -N 2 -pc2
salloc: Granted job allocation 24
salloc: Waiting for resource configuration
salloc: Nodes ahlintopol-c2-ghpc-[0-1] are ready for job
$ echo $SLURM_NODELIST                 
ahlintopol-c2-ghpc-[0,1]
$ srun /usr/bin/hostname               
ahlintopol-c2-ghpc-0
ahlintopol-c2-ghpc-1
$ env SLURM_JOB_NODELIST="$SLURM_JOB_NODELIST
" srun /bin/hostname
Segmentation fault

Expectation:
While I agree it is an error for SLURM_JOB_NODELIST to contain a newline, I would expect the parsing to produce an error rather than a segfault.
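
Until the parser rejects such input itself, a caller can fail early with a readable message. The following is a minimal sketch of such a guard, assuming a POSIX shell. `check_nodelist` is a hypothetical helper, not part of Slurm, and it only tests for the trailing newline described in this ticket:

```shell
# Hypothetical helper (not part of Slurm): refuse a node list that ends in a newline.
# nl holds a single newline character so it can be used inside a case pattern.
nl='
'
check_nodelist() {
    case "$1" in
        *"$nl")
            echo "error: node list ends with a newline" >&2
            return 1
            ;;
    esac
    return 0
}

# Usage sketch: run srun only if the value passes the check.
# check_nodelist "$SLURM_JOB_NODELIST" && srun /bin/hostname
```

The check is deliberately narrow. It does not attempt to validate the host list syntax, which is left to srun.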

Impact:
The error is not easy to spot because, for example, echoing SLURM_JOB_NODELIST with and without the newline produces the same output (at least in bash).
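
One way to make the stray newline visible is to dump the raw characters of the variable. A minimal sketch, assuming a POSIX shell with `od` available. `show_bytes` is a hypothetical helper name, and `sample` is a stand-in value rather than output from a real allocation:

```shell
# Hypothetical helper: print the raw characters of its argument with od -c,
# so that a trailing newline shows up as a literal \n in the dump.
show_bytes() {
    printf '%s' "$1" | od -c
}

# Stand-in value with a trailing newline (not taken from a real allocation).
sample='node-[0,1]
'

# Unquoted expansion undergoes word splitting, which drops the trailing newline,
# so this looks the same as it would without the newline.
echo $sample

# The character dump makes the trailing \n visible.
show_bytes "$sample"
```

In an allocation, the same helper could be called as `show_bytes "$SLURM_JOB_NODELIST"`. The quotes matter: without them the newline is lost before the helper sees it.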

Diagnostics:
While the SIGSEGV occurs in hostlist_uniq, the underlying issue is that the result of hostlist_create at src/srun/libsrun/srun_job.c:293 is not checked.
Comment 1 Daniel Ahlin 2022-10-28 00:53:49 MDT
After posting I noted that my example contains the legacy SLURM_NODELIST

$ echo $SLURM_NODELIST                 
ahlintopol-c2-ghpc-[0,1]

To be clear, SLURM_JOB_NODELIST contains the same:
$ echo $SLURM_JOB_NODELIST
ahlintopol-c2-ghpc-[0,1]
Comment 2 Chad Vizino 2022-10-28 14:07:08 MDT
(In reply to Daniel Ahlin from comment #1)
> After posting I noted that my example contains the legacy SLURM_NODELIST
No problem with that env var. But indeed, this is a bug, and it is easy to reproduce from what you have provided:

>$ salloc -N2
>salloc: Granted job allocation 12
>salloc: Waiting for resource configuration
>salloc: Nodes vizino-[3-4] are ready for job
>(salloc) $ env|grep NODEL
>(salloc) $ export SLURM_JOB_NODELIST="$SLURM_JOB_NODELIST
> > "
>(salloc) $ gdb --arg srun hostname
>...
>Starting program: /home/chad/slurm/master/vizino/bin/srun hostname
>...
>srun: error: ../../../../slurm/src/common/hostlist.c:2513: hostlist_uniq(): Assertion (hl != NULL) failed.
> 
>Program received signal SIGABRT, Aborted.
>...
>(gdb) bt
>#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
>#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
>#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
>#3  0x00007ffff783bc46 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
>#4  0x00007ffff78227fc in __GI_abort () at ./stdlib/abort.c:79
>#5  0x00007ffff7e73e45 in __xassert_failed (expr=0x7ffff7e8fe82 "hl != NULL", 
>    file=0x7ffff7e8fd88 "../../../../slurm/src/common/hostlist.c", line=2513, 
>    func=0x7ffff7e90718 <__func__.44> "hostlist_uniq") at ../../../../slurm/src/common/xassert.c:57
>#6  0x00007ffff7cf0f13 in hostlist_uniq (hl=0x0) at ../../../../slurm/src/common/hostlist.c:2513
>#7  0x000055555556808d in job_step_create_allocation (resp=0x5555555afb00, opt_local=0x55555557d0c0 <opt>)
>    at ../../../../../slurm/src/srun/libsrun/srun_job.c:294
>#8  0x000055555556ad61 in create_srun_job (p_job=0x55555557e0a8 <job>, got_alloc=0x7fffffffd54f, 
>    slurm_started=false, handle_signals=true) at ../../../../../slurm/src/srun/libsrun/srun_job.c:1348
>#9  0x000055555555d651 in srun (ac=2, av=0x7fffffffd6b8) at ../../../../slurm/src/srun/srun.c:193
>#10 0x0000555555560bd4 in main (argc=2, argv=0x7fffffffd6b8) at ../../../../slurm/src/srun/srun.wrapper.c:17
>(gdb)
This should be easy to fix; I will look into it.
Comment 17 Chad Vizino 2023-02-20 11:42:25 MST
Hi. This has been fixed in this commit:

>https://github.com/SchedMD/slurm/commit/0a614b8