| Summary: | srun segfault | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | CSC sysadmins <csc-slurm-tickets> |
| Component: | User Commands | Assignee: | Chad Vizino <chad> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | chad |
| Version: | 21.08.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | CSC - IT Center for Science | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 21.08.7 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf | | |

Description
CSC sysadmins
2022-04-06 03:38:29 MDT
Created attachment 24247 [details]
slurm.conf
(In reply to Tommi Tervo from comment #0)
> User had a faulty singularity job but it manages to crash srun:
>
> #SBATCH --ntasks=1
> #SBATCH --cpus-per-task=8
> #SBATCH --mem=100G
> #SBATCH --time=00:15:00
> #SBATCH --gres=gpu:a100:1
>
> valgrind srun singularity exec \
>     -B $LOCAL_SCRATCH:$LOCAL_SCRATCH \
>     /appl/soft/math/r-env-singularity/4.1.3/4.1.3.sif \
>     Rscript --no-save cifar.R
> ...
> ==636091== Command: srun singularity exec -B : /appl/soft/math/r-env-singularity/4.1.3/4.1.3.sif Rscript --no-save cifar.R
> srun: error: Allocation failure of 1 nodes: job size of 1, already allocated 1 nodes to previous components.
> ...
> Segmentation fault (core dumped) valgrind srun singularity exec -B

Hi. Good information supplied--thanks for the valgrind.

At first glance it appears that srun processed what it was supplied as a heterogeneous step. It looks like LOCAL_SCRATCH was not defined (had no value) in the job: note the "Command:" output by valgrind, which shows a lone ":" token in the srun arg list. The srun message "error: Allocation failure ... already allocated" comes from _handle_het_step_exclude(), which is an srun heterogeneous-step function, and confirms this.

I'll work on reproducing this at my end to see what's going on (obviously srun shouldn't segfault), but a few questions:

* Has the job worked before?
* If LOCAL_SCRATCH is defined (not empty) within the job, does the srun work as expected?
* How is LOCAL_SCRATCH normally set before/within the job (from the TaskProlog script)?

For now, to let this user keep running, just make sure that LOCAL_SCRATCH is defined within the job and the problem can be avoided. Again, I will see what I can discover on my end.

Ok--found the bug in srun. At first I couldn't reproduce it, but after studying the code I found that the bug is triggered in _create_job_step() when reserved ports are being used.
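The diagnosis above can be illustrated in plain shell (a minimal sketch, not from the ticket): with LOCAL_SCRATCH empty, the `src:dest` bind spec collapses to a lone ":", and srun uses a bare ":" argument to separate heterogeneous step components.

```shell
# Minimal illustration (not from the ticket): with LOCAL_SCRATCH unset or
# empty, the bind spec "src:dest" collapses to a lone ":" argument.
unset LOCAL_SCRATCH
spec="${LOCAL_SCRATCH}:${LOCAL_SCRATCH}"
echo "bind spec seen by srun: <${spec}>"
# prints: bind spec seen by srun: <:>

# srun treats a bare ":" argument as the separator between heterogeneous
# step components, so "srun singularity exec -B : image ..." is parsed as
# two components -- which is what steered srun into the het-step code path.
```
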
You have them set in your slurm.conf:
> MpiParams=ports=13000-18000

I set those in my config, retested, and that indeed triggered the bug--so no further information is needed from you. The fix looks very simple and has nothing to do with singularity (other than that using it with the lone ":" triggered a het step in srun, which in turn triggered the failing code path). Until then, to minimize the exposure from this particular use case (of course there are others), you may want to:

* Make sure TaskProlog is setting LOCAL_SCRATCH to a value (assuming that's where you are setting it and this is the reason why LOCAL_SCRATCH didn't have a value in the example output you sent).
* Have the user call singularity without the ":" when using -B, to be safe. The singularity docs say about the -B option spec:
> spec has the format src[:dest[:opts]], where src and dest are outside and inside paths. If dest is not given, it is set equal to src.

So call like this:
> singularity exec -B $LOCAL_SCRATCH ...

I will supply more info once the fix is formalized.

Hello again. A fix for this has been committed. See:
https://github.com/SchedMD/slurm/commit/27cb6f3c59

Thanks for the really fast fix!
-Tommi
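As a footnote to the workaround discussed above, the two suggestions could be combined in the job script roughly like this (a hypothetical sketch, not the ticket's actual fix; the `echo` stands in for actually running the command):

```shell
# Hypothetical defensive job-script fragment: fail fast when LOCAL_SCRATCH
# is unset or empty so a bad bind spec never reaches srun.
LOCAL_SCRATCH=${LOCAL_SCRATCH:-/tmp/demo_scratch}  # demo default only; TaskProlog sets the real value
: "${LOCAL_SCRATCH:?LOCAL_SCRATCH is empty -- check TaskProlog}"

# Single-path -B: per the singularity docs, dest defaults to src when
# omitted, so no ":" can appear in srun's argument list.
echo srun singularity exec -B "$LOCAL_SCRATCH" \
    /appl/soft/math/r-env-singularity/4.1.3/4.1.3.sif \
    Rscript --no-save cifar.R
```

The `${LOCAL_SCRATCH:?...}` expansion aborts the script with a diagnostic if the variable is unset or empty, which turns a confusing downstream srun failure into an immediate, readable error.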