Ticket 9357

Summary: Unable to launch srun steps when job requests a GRES
Product: Slurm Reporter: Trey Dockendorf <tdockendorf>
Component: User Commands Assignee: Scott Hilton <scott>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: alex
Version: 20.02.3   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=9358
Site: Ohio State OSC
Version Fixed: 20.02.4 20.11.0pre1
Attachments: slurm.conf
gres.conf
parallel.sbatch
Proposed fix.

Description Trey Dockendorf 2020-07-08 07:30:09 MDT
Created attachment 14949 [details]
slurm.conf

At OSC we are trying to assign a GRES to each job based on the filesystems the job uses. This is done in the submit filter. What we've discovered is that if a salloc or sbatch job is assigned the GRES gpfs:scratch:1 and you execute srun without --gres=none, the step fails. I have commented out our submit filter's automatic GRES assignment so that I can illustrate the issue more clearly:

$ salloc -N 2 -p parallel --gres=gpfs:scratch:1
$ srun $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw
srun: error: Unable to create step for job 897: Invalid generic resource (gres) specification
$ srun --gres=none $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw
<NO ERROR>

Because these GRESes are assigned automatically, adding --gres=none is not something we can ask our users to do. The idea of using GRESes for filesystems came out of quick-start consulting SchedMD did with OSC on ways to have filesystem maintenance windows and block jobs using specific filesystems. The plan was to use a GRES per filesystem and set its count to 0 during maintenance. In practice this appears like it might not work.
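
For reference, this is roughly the kind of configuration involved. The gpfs:scratch name is from above, but the node names and counts here are placeholders, not our actual slurm.conf/gres.conf (attached):

# slurm.conf (sketch) -- Gres spec format is name[:type][:no_consume]:count
GresTypes=gpfs
NodeName=p[0001-0002] Gres=gpfs:scratch:no_consume:1

# gres.conf (sketch)
NodeName=p[0001-0002] Name=gpfs Type=scratch Count=1

During a maintenance window the count would be set to 0 so new jobs requesting gpfs:scratch:1 cannot be scheduled.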
Comment 1 Trey Dockendorf 2020-07-08 07:30:33 MDT
Created attachment 14950 [details]
gres.conf
Comment 2 Scott Hilton 2020-07-08 10:56:30 MDT
Trey,

I'm trying to figure out how to reproduce your error. 

I assume it has only to do with gpfs:scratch and not the other options you showed? In other words:
Does this error only happen with 2 nodes? 
Does this error only happen with partition parallel?
Does this error only happen with $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw?

Also, does it have to be salloc, or does this error happen with sbatch as well?

-Scott
Comment 3 Trey Dockendorf 2020-07-08 11:26:14 MDT
I tried serial partition with 1 task & 1 node and the issue does not occur:

$ salloc -p serial --gres=gpfs:scratch:1
$ srun hostname
p0001.ten.osc.edu

The same with the debug partition, which allows partial-node allocation across multiple nodes:

$ salloc -N 2 -p debug --gres=gpfs:scratch:1
$ srun hostname
p0001.ten.osc.edu
p0002.ten.osc.edu

The issue seems to occur with the parallel partition, where we force jobs to be exclusive via the partition config. Also, the error is not limited to the OSU benchmarks; it happens even when just executing hostname:

$ salloc -N 2 -p parallel --gres=gpfs:scratch:1
$ srun hostname
srun: error: Unable to create step for job 1077: Invalid generic resource (gres) specification

Also, the GPU partition doesn't have this issue. We currently have only 1 GPU node in our test bed, so I can't yet test whether a parallel GPU partition is affected:

$ salloc --gpus=1 -p gpuserial --gres=gpfs:scratch:1
$ srun hostname
p0253.ten.osc.edu

I verified this happens under the same conditions as above using sbatch:

This was the error from sbatch (script attached as parallel.sbatch):

srun: error: Unable to create step for job 896: Invalid generic resource (gres) specification

The one part not in the submit script is that I passed --gres=gpfs:scratch:1 on the command line.
Comment 4 Trey Dockendorf 2020-07-08 11:26:52 MDT
Created attachment 14954 [details]
parallel.sbatch
Comment 5 Trey Dockendorf 2020-07-08 11:28:59 MDT
One part of our config that may not be clear is that the parallel partition acts like a Torque routing queue: when you submit to parallel, the Lua job submit filter rewrites the request to partition=parallel-40core,parallel-48core. Using those partitions directly does not change the behavior:

$ salloc -N 2 -p parallel-40core --gres=gpfs:scratch:1
$ srun hostname
srun: error: Unable to create step for job 1080: Invalid generic resource (gres) specification
Comment 6 Scott Hilton 2020-07-08 11:45:15 MDT
Trey, 

Thanks for the extra info. I have now reproduced the bug on my machine.

I agree with you; "OverSubscribe=EXCLUSIVE" seems to be the common factor.
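
For anyone else hitting this, a minimal reproducing partition definition looks roughly like the following (node names are placeholders; the OverSubscribe=EXCLUSIVE setting is the relevant part):

# slurm.conf (sketch)
PartitionName=parallel Nodes=p[0001-0002] OverSubscribe=EXCLUSIVE State=UP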

I will look into the bug and let you know what I find.

-Scott
Comment 8 Scott Hilton 2020-07-08 17:16:41 MDT
Trey,

I found the issue and am working out how to make a proper fix now.

The bug comes up when you allocate a whole node (through --exclusive or "OverSubscribe=EXCLUSIVE") that has GRES marked no_consume. The current code skips all the no_consume GRES and allocates only the others, so when srun looks for the GRES, it isn't there.

For now you could work around it by dropping either the exclusive allocation or the no_consume flag; the bug only appears when both are used together.
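
Concretely, either of these changes should avoid it until the fix is out (sketches only; node names are placeholders):

# Option 1: stop forcing whole-node allocation for the partition
PartitionName=parallel Nodes=p[0001-0002] OverSubscribe=NO

# Option 2: drop no_consume from the Gres spec so the GRES is consumable
NodeName=p[0001-0002] Gres=gpfs:scratch:1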

-Scott
Comment 11 Scott Hilton 2020-07-09 11:04:15 MDT
Trey, 

I uploaded the patch and you are free to try it. If you do, let us know how it works for you.

It should come out on the next release of 20.02.

Take care,

Scott
Comment 12 Trey Dockendorf 2020-07-09 15:52:34 MDT
I patched our RPM builds and tested that the patch fixes the issue.

Thanks,
- Trey
Comment 21 Scott Hilton 2020-07-22 12:36:12 MDT
Trey, 

I'm glad it worked for you.

Let us know if you have any other questions,

-Scott