User had a faulty singularity job but it manages to crash srun:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=100G
#SBATCH --time=00:15:00
#SBATCH --gres=gpu:a100:1

valgrind srun singularity exec \
 -B $LOCAL_SCRATCH:$LOCAL_SCRATCH \
 /appl/soft/math/r-env-singularity/4.1.3/4.1.3.sif \
 Rscript --no-save cifar.R

------------------------------------------
 r-env-singularity 4.1.3
 https://docs.csc.fi/apps/r-env-singularity
------------------------------------------

==636091== Memcheck, a memory error detector
==636091== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==636091== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==636091== Command: srun singularity exec -B : /appl/soft/math/r-env-singularity/4.1.3/4.1.3.sif Rscript --no-save cifar.R
==636091==
srun: error: Allocation failure of 1 nodes: job size of 1, already allocated 1 nodes to previous components.
==636091== Invalid read of size 8
==636091==    at 0x411E30: _create_job_step (srun_job.c:945)
==636091==    by 0x414C26: create_srun_job (srun_job.c:1396)
==636091==    by 0x409440: srun (srun.c:195)
==636091==    by 0x40AA8A: main (srun.wrapper.c:17)
==636091==  Address 0x18 is not stack'd, malloc'd or (recently) free'd
==636091==
==636091==
==636091== Process terminating with default action of signal 11 (SIGSEGV)
==636091==  Access not within mapped region at address 0x18
==636091==    at 0x411E30: _create_job_step (srun_job.c:945)
==636091==    by 0x414C26: create_srun_job (srun_job.c:1396)
==636091==    by 0x409440: srun (srun.c:195)
==636091==    by 0x40AA8A: main (srun.wrapper.c:17)
==636091==  If you believe this happened as a result of a stack
==636091==  overflow in your program's main thread (unlikely but
==636091==  possible), you can try to increase the size of the
==636091==  main thread stack using the --main-stacksize= flag.
==636091==  The main thread stack size used in this run was 16777216.
==636091==
==636091== HEAP SUMMARY:
==636091==     in use at exit: 194,445 bytes in 2,275 blocks
==636091==   total heap usage: 18,824 allocs, 16,549 frees, 4,780,581 bytes allocated
==636091==
==636091== LEAK SUMMARY:
==636091==    definitely lost: 212 bytes in 6 blocks
==636091==    indirectly lost: 68 bytes in 3 blocks
==636091==      possibly lost: 132,228 bytes in 1,682 blocks
==636091==    still reachable: 61,937 bytes in 584 blocks
==636091==         suppressed: 0 bytes in 0 blocks
==636091== Rerun with --leak-check=full to see details of leaked memory
==636091==
==636091== For lists of detected and suppressed errors, rerun with: -s
==636091== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
/var/spool/slurmd/job1104689/slurm_script: line 31: 636091 Segmentation fault (core dumped) valgrind srun singularity exec -B $LOCAL_SCRATCH:$LOCAL_SCRATCH /appl/soft/math/r-env-singularity/4.1.3/4.1.3.sif Rscript --no-save cifar.R
Created attachment 24247 [details] slurm.conf
(In reply to Tommi Tervo from comment #0)
> User had a faulty singularity job but it manages to crash srun:
>
> #SBATCH --ntasks=1
> #SBATCH --cpus-per-task=8
> #SBATCH --mem=100G
> #SBATCH --time=00:15:00
> #SBATCH --gres=gpu:a100:1
>
> valgrind srun singularity exec \
>  -B $LOCAL_SCRATCH:$LOCAL_SCRATCH \
>  /appl/soft/math/r-env-singularity/4.1.3/4.1.3.sif \
>  Rscript --no-save cifar.R
> ...
> ==636091== Command: srun singularity exec -B : /appl/soft/math/r-env-singularity/4.1.3/4.1.3.sif Rscript --no-save cifar.R
> srun: error: Allocation failure of 1 nodes: job size of 1, already allocated 1 nodes to previous components.
> ...
> fault (core dumped) valgrind srun singularity exec -B

Hi. Good information supplied--thanks for the valgrind.

At first glance it appears that srun processed what it was supplied as a heterogeneous step. It looks like LOCAL_SCRATCH was not defined (had no value) in the job, so the -B spec expanded to a lone ":" token in the srun arg list (note the "Command:" line output by valgrind). The message from srun, "error: Allocation failure ... already allocated", comes from _handle_het_step_exclude(), which is an srun heterogeneous-step function and confirms this.

I'll work on reproducing this at my end to see what's going on (obviously srun shouldn't segfault), but a few questions:

* Has the job worked before?
* If LOCAL_SCRATCH is defined (not empty) within the job, does the srun work as expected?
* How is LOCAL_SCRATCH normally set before/within the job (from the TaskProlog script)?

For now, to let this user keep running, just make sure that LOCAL_SCRATCH is defined within the job and the problem can be avoided. Again, I will see what I can discover on my end.
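The empty-variable expansion described above can be checked in isolation; here is a minimal sketch (the variable name and bind-spec shape are taken from the job script in the report, everything else is illustrative):

```shell
#!/bin/sh
# Sketch: with LOCAL_SCRATCH unset (or empty), the bind spec
# "$LOCAL_SCRATCH:$LOCAL_SCRATCH" collapses to a lone ":" after
# expansion -- exactly the token shown on valgrind's "Command:" line.
unset LOCAL_SCRATCH
spec="$LOCAL_SCRATCH:$LOCAL_SCRATCH"
echo "bind spec is [$spec]"   # prints: bind spec is [:]
```

With the unquoted `-B $LOCAL_SCRATCH:$LOCAL_SCRATCH` in the real script, that lone ":" is what srun received and misparsed as a het-step separator.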
Ok--found the bug in srun. At first I couldn't reproduce it, but I studied the code and found that the bug is triggered in _create_job_step() when reserved ports are being used. You have them set in your slurm.conf:

> MpiParams=ports=13000-18000

I set those in my config, retested, and that indeed triggered the bug--so no further information is needed from you.

The fix looks very simple and has nothing to do with singularity (other than that using it with the lone ":" triggered a het step in srun, which in turn triggered the failing code path). Until the fix is in, to minimize the exposure from this particular use case (of course there are others), you may want to:

* Make sure TaskProlog is setting LOCAL_SCRATCH to a value (assuming that's where you are setting it and this is the reason why LOCAL_SCRATCH didn't have a value in the example output you sent).
* Have the user call singularity without the ":" when using -B, to be safe. The singularity docs say about the -B option spec:

> spec has the format src[:dest[:opts]], where src and dest are outside and inside paths. If dest is not given, it is set equal to src.

So call like this:

> singularity exec -B $LOCAL_SCRATCH ...

I will supply more info once the fix is formalized.
Hello again. A fix for this has been committed. See: https://github.com/SchedMD/slurm/commit/27cb6f3c59
Thanks for the really fast fix! -Tommi