| Summary: | srun sigfault when SLURM_JOB_NODELIST is bracketed and terminated by a newline | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Daniel Ahlin <ahlin> |
| Component: | User Commands | Assignee: | Chad Vizino <chad> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | nick |
| Version: | 22.05.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Version Fixed: | 23.02.0rc2 | Target Release: | --- |
Description
Daniel Ahlin
2022-10-28 00:51:06 MDT
After posting I noted that my example contains the legacy SLURM_NODELIST

$ echo $SLURM_NODELIST
ahlintopol-c2-ghpc-[0,1]

To be clear, SLURM_JOB_NODELIST contains the same:

$ echo $SLURM_JOB_NODELIST
ahlintopol-c2-ghpc-[0,1]

(In reply to Daniel Ahlin from comment #1)
> After posting I noted that my example contains the legacy SLURM_NODELIST

No problem with that env var. But indeed, this is a bug and is easy to reproduce from what you have provided:

>$ salloc -N2
>salloc: Granted job allocation 12
>salloc: Waiting for resource configuration
>salloc: Nodes vizino-[3-4] are ready for job
>(salloc) $ env|grep NODEL
>(salloc) $ export SLURM_JOB_NODELIST="$SLURM_JOB_NODELIST
>
> "
>(salloc) $ gdb --arg srun hostname
>...
>Starting program: /home/chad/slurm/master/vizino/bin/srun hostname
>...
>srun: error: ../../../../slurm/src/common/hostlist.c:2513: hostlist_uniq(): Assertion (hl != NULL) failed.
>
>Program received signal SIGABRT, Aborted.
>...
>(gdb) bt
>#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
>#1 __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
>#2 __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
>#3 0x00007ffff783bc46 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
>#4 0x00007ffff78227fc in __GI_abort () at ./stdlib/abort.c:79
>#5 0x00007ffff7e73e45 in __xassert_failed (expr=0x7ffff7e8fe82 "hl != NULL",
>    file=0x7ffff7e8fd88 "../../../../slurm/src/common/hostlist.c", line=2513,
>    func=0x7ffff7e90718 <__func__.44> "hostlist_uniq") at ../../../../slurm/src/common/xassert.c:57
>#6 0x00007ffff7cf0f13 in hostlist_uniq (hl=0x0) at ../../../../slurm/src/common/hostlist.c:2513
>#7 0x000055555556808d in job_step_create_allocation (resp=0x5555555afb00, opt_local=0x55555557d0c0 <opt>)
>    at ../../../../../slurm/src/srun/libsrun/srun_job.c:294
>#8 0x000055555556ad61 in create_srun_job (p_job=0x55555557e0a8 <job>, got_alloc=0x7fffffffd54f,
>    slurm_started=false, handle_signals=true) at ../../../../../slurm/src/srun/libsrun/srun_job.c:1348
>#9 0x000055555555d651 in srun (ac=2, av=0x7fffffffd6b8) at ../../../../slurm/src/srun/srun.c:193
>#10 0x0000555555560bd4 in main (argc=2, argv=0x7fffffffd6b8) at ../../../../slurm/src/srun/srun.wrapper.c:17
>(gdb)

Should be easy to fix; I will look into it.

Hi. This has been fixed in this commit:
>https://github.com/SchedMD/slurm/commit/0a614b8
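
On releases older than the fixed version (23.02.0rc2), a wrapper script that sets or rewrites the node list itself could sidestep the assertion by trimming the variable before it launches srun. The following is a minimal sketch of that idea only. It is not taken from the commit above and does not describe how the commit fixes the problem. The helper name `sanitized_env` is hypothetical, and the sketch assumes that surrounding whitespace is the only thing that makes the hostlist unparsable.

```python
import os

# Environment variables that carry the job's hostlist expression.
# SLURM_NODELIST is the legacy name mentioned in the report.
NODELIST_VARS = ("SLURM_JOB_NODELIST", "SLURM_NODELIST")


def sanitized_env(env=None):
    """Return a copy of env with whitespace trimmed from the node list variables.

    Hypothetical helper: it only strips leading and trailing whitespace
    (including newlines) and leaves the hostlist expression itself untouched.
    """
    cleaned = dict(os.environ if env is None else env)
    for name in NODELIST_VARS:
        if name in cleaned:
            cleaned[name] = cleaned[name].strip()
    return cleaned
```

A wrapper could then pass the result as the environment of the child process, for example `subprocess.run(["srun", "hostname"], env=sanitized_env())`. Upgrading to a release that contains the fix is the better option where possible.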