| Summary: | Unable to launch srun steps when job requests a GRES | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Trey Dockendorf <tdockendorf> |
| Component: | User Commands | Assignee: | Scott Hilton <scott> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | alex |
| Version: | 20.02.3 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=9358 | | |
| Site: | Ohio State OSC | Slinky Site: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 20.02.4 20.11.0pre1 |
| Target Release: | --- | DevPrio: | --- |
| Attachments: | slurm.conf, gres.conf, parallel.sbatch, Proposed fix | | |
Created attachment 14950 [details]
gres.conf

Trey,

I'm trying to figure out how to reproduce your error. I assume it only involves gpfs:scratch and not the other options you showed? In other words:

- Does this error only happen with 2 nodes?
- Does this error only happen with partition parallel?
- Does this error only happen with $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw?

Also, does it have to be salloc, or does this error happen with sbatch as well?

-Scott

I tried the serial partition with 1 task and 1 node, and the issue does not occur:

    $ salloc -p serial --gres=gpfs:scratch:1
    $ srun hostname
    p0001.ten.osc.edu

The same is true of the debug partition, which allows partial-node allocations across multiple nodes:

    $ salloc -N 2 -p debug --gres=gpfs:scratch:1
    $ srun hostname
    p0001.ten.osc.edu
    p0002.ten.osc.edu

The issue occurs with the parallel partition, where the partition configuration forces jobs to be exclusive. The error is also not specific to the OSU benchmarks; it happens when simply executing hostname:

    $ salloc -N 2 -p parallel --gres=gpfs:scratch:1
    $ srun hostname
    srun: error: Unable to create step for job 1077: Invalid generic resource (gres) specification

The GPU partition does not have this issue. We currently have only one GPU node in our test bed, so I cannot yet test whether a parallel GPU partition is affected:

    $ salloc --gpus=1 -p gpuserial --gres=gpfs:scratch:1
    $ srun hostname
    p0253.ten.osc.edu

I verified that this happens under the same conditions using sbatch (script attached as parallel.sbatch):

    srun: error: Unable to create step for job 896: Invalid generic resource (gres) specification

The only part not in the submit script is --gres=gpfs:scratch:1, which I passed on the command line.

Created attachment 14954 [details]
parallel.sbatch

One part of our configuration that may not be clear: the parallel partition acts like a Torque routing queue. When you submit to parallel, the Lua job submit filter rewrites the request to partition=parallel-40core,parallel-48core. Using one of those partitions directly does not change the behavior:

    $ salloc -N 2 -p parallel-40core --gres=gpfs:scratch:1
    $ srun hostname
    srun: error: Unable to create step for job 1080: Invalid generic resource (gres) specification

Trey,

Thanks for the extra info. I have now reproduced the bug on my machine. I agree with you: "OverSubscribe=EXCLUSIVE" seems to be the common factor. I will look into the bug and let you know what I find.

-Scott

Trey,

I found the issue and am working out a proper fix now. The bug comes up when you allocate a whole node (through --exclusive or "OverSubscribe=EXCLUSIVE") that has GRES marked no_consume. The current code skips all the no_consume GRES and allocates only the others, so when srun looks for the GRES, it isn't there. For now you could run without either the exclusive allocation or the no_consume flag; together they trigger the bug.

-Scott

Trey,

I uploaded the patch and you are free to try it. If you do, let us know how it works for you. It should come out in the next release of 20.02.

Take care,
Scott

I patched our RPM builds and tested that the patch fixes the issue.

Thanks,
- Trey

Trey,

I'm glad it worked for you. Let us know if you have any other questions.

-Scott
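Scott's diagnosis above can be illustrated with a small, purely hypothetical Python model. This is not Slurm's actual code; the function names (`allocate_job_gres`, `create_step`) and the node layout are invented for illustration of why dropping no_consume GRES from a whole-node allocation makes the subsequent step request fail:

```python
def allocate_job_gres(node_gres, exclusive, fixed=False):
    """Return the set of GRES names recorded in a job's allocation.

    Models the reported bug: on an exclusive (whole-node) allocation,
    GRES marked no_consume are skipped and never enter the allocation.
    """
    allocated = set()
    for name, no_consume in node_gres:
        if exclusive and no_consume and not fixed:
            continue  # the bug: no_consume GRES dropped on whole-node allocations
        allocated.add(name)
    return allocated


def create_step(requested, allocated):
    """Mimic srun step creation: every requested GRES must be in the allocation."""
    if not requested <= allocated:
        return "srun: error: Invalid generic resource (gres) specification"
    return "step created"


# A node with a no_consume filesystem GRES and a normal consumable GRES.
node = [("gpfs", True), ("gpu", False)]

# Non-exclusive allocation: the no_consume GRES survives, the step starts.
print(create_step({"gpfs"}, allocate_job_gres(node, exclusive=False)))

# Exclusive allocation with the bug: the gpfs GRES is missing, the step fails.
print(create_step({"gpfs"}, allocate_job_gres(node, exclusive=True)))

# Exclusive allocation with the fix: no_consume GRES are kept.
print(create_step({"gpfs"}, allocate_job_gres(node, exclusive=True, fixed=True)))
```

In this model, only the combination of an exclusive allocation and a no_consume GRES produces the failure, matching the observation that the serial and debug partitions were unaffected while the exclusive parallel partitions were not.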
Created attachment 14949 [details]
slurm.conf

At OSC we are trying to assign a GRES to each job based on the filesystems the job uses. This is done in the submit filter. What we've discovered is that if a salloc or sbatch command is assigned the GRES gpfs:scratch:1 and you execute srun without --gres=none, the step fails. I have commented out our submit filter's automatic GRES assignment so that I can illustrate the issue more clearly:

    $ salloc -N 2 -p parallel --gres=gpfs:scratch:1
    $ srun $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw
    srun: error: Unable to create step for job 897: Invalid generic resource (gres) specification
    $ srun --gres=none $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw
    <NO ERROR>

Because these GRESes are assigned automatically, asking our users to add --gres=none is not a workable solution. The idea of using GRES for filesystems came from quick-start consulting SchedMD provided to OSC on ways to schedule filesystem maintenance windows and block jobs that use specific filesystems: define a GRES per filesystem and set its count to 0 during maintenance. In practice this appears like it might not work.
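For reference, a no_consume filesystem GRES of the kind described above might be configured roughly along these lines. This is a hypothetical sketch based on the slurm.conf/gres.conf documentation; the node names and counts are assumptions, and the actual attached configuration files are not reproduced here:

```
# slurm.conf (sketch; node names are assumptions)
GresTypes=gpfs
NodeName=p[0001-0002] Gres=gpfs:scratch:no_consume:1

# gres.conf (sketch)
NodeName=p[0001-0002] Name=gpfs Type=scratch Count=1
```

Under the scheme the report describes, the count would be lowered to 0 during a filesystem maintenance window so that jobs requesting --gres=gpfs:scratch:1 remain pending until the filesystem returns.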