User had a faulty singularity job but it manages to crash srun:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=100G
#SBATCH --time=00:15:00
#SBATCH --gres=gpu:a100:1

valgrind srun singularity exec \
 -B $LOCAL_SCRATCH:$LOCAL_SCRATCH \
 /appl/soft/math/r-env-singularity/4.1.3/4.1.3.sif \
 Rscript --no-save cifar.R

------------------------------------------
 r-env-singularity 4.1.3
 https://docs.csc.fi/apps/r-env-singularity
------------------------------------------

==636091== Memcheck, a memory error detector
==636091== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==636091== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==636091== Command: srun singularity exec -B : /appl/soft/math/r-env-singularity/4.1.3/4.1.3.sif Rscript --no-save cifar.R
==636091==
srun: error: Allocation failure of 1 nodes: job size of 1, already allocated 1 nodes to previous components.
==636091== Invalid read of size 8
==636091==    at 0x411E30: _create_job_step (srun_job.c:945)
==636091==    by 0x414C26: create_srun_job (srun_job.c:1396)
==636091==    by 0x409440: srun (srun.c:195)
==636091==    by 0x40AA8A: main (srun.wrapper.c:17)
==636091==  Address 0x18 is not stack'd, malloc'd or (recently) free'd
==636091==
==636091==
==636091== Process terminating with default action of signal 11 (SIGSEGV)
==636091==  Access not within mapped region at address 0x18
==636091==    at 0x411E30: _create_job_step (srun_job.c:945)
==636091==    by 0x414C26: create_srun_job (srun_job.c:1396)
==636091==    by 0x409440: srun (srun.c:195)
==636091==    by 0x40AA8A: main (srun.wrapper.c:17)
==636091==  If you believe this happened as a result of a stack
==636091==  overflow in your program's main thread (unlikely but
==636091==  possible), you can try to increase the size of the
==636091==  main thread stack using the --main-stacksize= flag.
==636091==  The main thread stack size used in this run was 16777216.
==636091==
==636091== HEAP SUMMARY:
==636091==     in use at exit: 194,445 bytes in 2,275 blocks
==636091==   total heap usage: 18,824 allocs, 16,549 frees, 4,780,581 bytes allocated
==636091==
==636091== LEAK SUMMARY:
==636091==    definitely lost: 212 bytes in 6 blocks
==636091==    indirectly lost: 68 bytes in 3 blocks
==636091==      possibly lost: 132,228 bytes in 1,682 blocks
==636091==    still reachable: 61,937 bytes in 584 blocks
==636091==         suppressed: 0 bytes in 0 blocks
==636091== Rerun with --leak-check=full to see details of leaked memory
==636091==
==636091== For lists of detected and suppressed errors, rerun with: -s
==636091== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
/var/spool/slurmd/job1104689/slurm_script: line 31: 636091 Segmentation fault (core dumped) valgrind srun singularity exec -B $LOCAL_SCRATCH:$LOCAL_SCRATCH /appl/soft/math/r-env-singularity/4.1.3/4.1.3.sif Rscript --no-save cifar.R
Created attachment 24247 [details] slurm.conf
(In reply to Tommi Tervo from comment #0)
> User had a faulty singularity job but it manages to crash srun:
>
> #SBATCH --ntasks=1
> #SBATCH --cpus-per-task=8
> #SBATCH --mem=100G
> #SBATCH --time=00:15:00
> #SBATCH --gres=gpu:a100:1
>
> valgrind srun singularity exec \
>  -B $LOCAL_SCRATCH:$LOCAL_SCRATCH \
>  /appl/soft/math/r-env-singularity/4.1.3/4.1.3.sif \
>  Rscript --no-save cifar.R
> ...
> ==636091== Command: srun singularity exec -B : /appl/soft/math/r-env-singularity/4.1.3/4.1.3.sif Rscript --no-save cifar.R
> srun: error: Allocation failure of 1 nodes: job size of 1, already allocated 1 nodes to previous components.
> ...
> fault (core dumped) valgrind srun singularity exec -B

Hi. Good information supplied--thanks for the valgrind.

At first glance it appears that srun processed what it was supplied as a heterogeneous step. It looks like LOCAL_SCRATCH was not defined (had no value) in the job, so the -B spec expanded to a lone ":" token in the srun arg list (note the "Command:" line output by valgrind). The message from srun, "error: Allocation failure ... already allocated", comes from _handle_het_step_exclude(), which is an srun heterogeneous-step function and confirms this.

I'll work on reproducing this at my end to see what's going on (obviously srun shouldn't segfault), but a few questions:

* Has the job worked before?
* If LOCAL_SCRATCH is defined (not empty) within the job, does the srun work as expected?
* How is LOCAL_SCRATCH normally set before/within the job (from the TaskProlog script)?

For now, to let this user keep running, just make sure that LOCAL_SCRATCH is defined within the job and the problem can be avoided. Again, I will see what I can discover on my end.
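The empty-variable expansion described above can be checked in isolation; here is a minimal sketch (the variable name and bind-spec shape are taken from the job script in the report, everything else is illustrative):

```shell
#!/bin/sh
# Sketch: with LOCAL_SCRATCH unset (or empty), the bind spec
# "$LOCAL_SCRATCH:$LOCAL_SCRATCH" collapses to a lone ":" after
# expansion -- exactly the token shown on valgrind's "Command:" line.
unset LOCAL_SCRATCH
spec="$LOCAL_SCRATCH:$LOCAL_SCRATCH"
echo "bind spec is [$spec]"   # prints: bind spec is [:]
```

With the unquoted `-B $LOCAL_SCRATCH:$LOCAL_SCRATCH` in the real script, that lone ":" is what srun received and misparsed as a het-step separator.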
Ok--found the bug in srun. At first I couldn't reproduce it, but I studied the code and found that the bug is triggered in _create_job_step() when reserved ports are being used. You have them set in your slurm.conf:

> MpiParams=ports=13000-18000

I set those in my config, retested, and that indeed triggered the bug--so no further information is needed from you.

The fix looks very simple and has nothing to do with singularity (other than that using it with the lone ":" triggered a het step in srun, which in turn triggered the failing code path). Until the fix is in, to minimize the exposure from this particular use case (of course there are others), you may want to:

* Make sure TaskProlog is setting LOCAL_SCRATCH to a value (assuming that's where you are setting it and this is the reason why LOCAL_SCRATCH didn't have a value in the example output you sent).
* Have the user call singularity without the ":" when using -B, to be safe. The singularity docs say about the -B option spec:

> spec has the format src[:dest[:opts]], where src and dest are outside and inside paths. If dest is not given, it is set equal to src.

So call like this:

> singularity exec -B $LOCAL_SCRATCH ...

I will supply more info once the fix is formalized.
Hello again. A fix for this has been committed. See: https://github.com/SchedMD/slurm/commit/27cb6f3c59
Thanks for the really fast fix! -Tommi