Ticket 9357

Summary: Unable to launch srun steps when job requests a GRES
Product: Slurm Reporter: Trey Dockendorf <tdockendorf>
Component: User Commands Assignee: Scott Hilton <scott>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: alex
Version: 20.02.3   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=9358
Site: Ohio State OSC
Version Fixed: 20.02.4 20.11.0pre1
Attachments: slurm.conf
gres.conf
parallel.sbatch
Proposed fix.

Description Trey Dockendorf 2020-07-08 07:30:09 MDT
Created attachment 14949 [details]
slurm.conf

At OSC we are trying to assign a GRES to each job based on the filesystems the job uses. This is done in the submit filter. What we've discovered is that if a salloc or sbatch job is assigned the GRES gpfs:scratch:1 and you execute srun without --gres=none, the step fails. I have commented out our submit filter's automatic GRES assignment so that I can illustrate the issue more clearly:

$ salloc -N 2 -p parallel --gres=gpfs:scratch:1
$ srun $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw
srun: error: Unable to create step for job 897: Invalid generic resource (gres) specification
$ srun --gres=none $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw
<NO ERROR>

Because these GRESes are assigned automatically, adding --gres=none is not something we can ask our users to do. The idea of using GRESes for filesystems came out of quick-start consulting SchedMD did with OSC on ways to have filesystem maintenance windows and block jobs using specific filesystems. The plan was to use a GRES per filesystem and set its count to 0 during maintenance. In practice this appears like it might not work.
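
For reference, this is roughly the kind of configuration involved. The gpfs:scratch name is from above, but the node names and counts here are placeholders, not our actual slurm.conf/gres.conf (attached):

# slurm.conf (sketch) -- Gres spec format is name[:type][:no_consume]:count
GresTypes=gpfs
NodeName=p[0001-0002] Gres=gpfs:scratch:no_consume:1

# gres.conf (sketch)
NodeName=p[0001-0002] Name=gpfs Type=scratch Count=1

During a maintenance window the count would be set to 0 so new jobs requesting gpfs:scratch:1 cannot be scheduled.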
Comment 1 Trey Dockendorf 2020-07-08 07:30:33 MDT
Created attachment 14950 [details]
gres.conf
Comment 2 Scott Hilton 2020-07-08 10:56:30 MDT
Trey,

I'm trying to figure out how to reproduce your error. 

I assume it has only to do with gpfs:scratch and not the other options you showed? In other words:
Does this error only happen with 2 nodes? 
Does this error only happen with partition parallel?
Does this error only happen with $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw?

Also, does it have to be salloc, or does this error happen with sbatch as well?

-Scott
Comment 3 Trey Dockendorf 2020-07-08 11:26:14 MDT
I tried serial partition with 1 task & 1 node and the issue does not occur:

$ salloc -p serial --gres=gpfs:scratch:1
$ srun hostname
p0001.ten.osc.edu

The same with the debug partition, which allows partial-node allocation across multiple nodes:

$ salloc -N 2 -p debug --gres=gpfs:scratch:1
$ srun hostname
p0001.ten.osc.edu
p0002.ten.osc.edu

The issue seems to occur with the parallel partition, where we force jobs to be exclusive via the partition config. Also, the error is not limited to the OSU benchmarks; it happens even when just executing hostname:

$ salloc -N 2 -p parallel --gres=gpfs:scratch:1
$ srun hostname
srun: error: Unable to create step for job 1077: Invalid generic resource (gres) specification

Also, the GPU partition doesn't have this issue. We currently have only 1 GPU node in our test bed, so I can't yet test whether a parallel GPU partition is affected:

$ salloc --gpus=1 -p gpuserial --gres=gpfs:scratch:1
$ srun hostname
p0253.ten.osc.edu

I verified this happens under the same conditions as above using sbatch:

This was the error from sbatch (script attached as parallel.sbatch):

srun: error: Unable to create step for job 896: Invalid generic resource (gres) specification

The one part not in the submit script is that I passed --gres=gpfs:scratch:1 on the command line.
Comment 4 Trey Dockendorf 2020-07-08 11:26:52 MDT
Created attachment 14954 [details]
parallel.sbatch
Comment 5 Trey Dockendorf 2020-07-08 11:28:59 MDT
One part of our config that may not be clear is that the parallel partition acts like a Torque routing queue: when you submit to parallel, the Lua job submit filter rewrites the request to partition=parallel-40core,parallel-48core. Using those partitions directly does not change the behavior:

$ salloc -N 2 -p parallel-40core --gres=gpfs:scratch:1
$ srun hostname
srun: error: Unable to create step for job 1080: Invalid generic resource (gres) specification
Comment 6 Scott Hilton 2020-07-08 11:45:15 MDT
Trey, 

Thanks for the extra info. I have now reproduced the bug on my machine.

I agree with you; "OverSubscribe=EXCLUSIVE" seems to be the common factor.
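
For anyone else hitting this, a minimal reproducing partition definition looks roughly like the following (node names are placeholders; the OverSubscribe=EXCLUSIVE setting is the relevant part):

# slurm.conf (sketch)
PartitionName=parallel Nodes=p[0001-0002] OverSubscribe=EXCLUSIVE State=UP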

I will look into the bug and let you know what I find.

-Scott
Comment 8 Scott Hilton 2020-07-08 17:16:41 MDT
Trey,

I found the issue and am working out how to make a proper fix now.

The bug comes up when you allocate a whole node (through --exclusive or "OverSubscribe=EXCLUSIVE") that has GRES marked no_consume. The current code skips all the no_consume GRES and allocates only the others, so when srun looks for the GRES, it isn't there.

For now you could work around it by dropping either the exclusive allocation or the no_consume flag; the bug only appears when both are used together.
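
Concretely, either of these changes should avoid it until the fix is out (sketches only; node names are placeholders):

# Option 1: stop forcing whole-node allocation for the partition
PartitionName=parallel Nodes=p[0001-0002] OverSubscribe=NO

# Option 2: drop no_consume from the Gres spec so the GRES is consumable
NodeName=p[0001-0002] Gres=gpfs:scratch:1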

-Scott
Comment 11 Scott Hilton 2020-07-09 11:04:15 MDT
Trey, 

I uploaded the patch and you are free to try it. If you do, let us know how it works for you.

It should come out on the next release of 20.02.

Take care,

Scott
Comment 12 Trey Dockendorf 2020-07-09 15:52:34 MDT
I patched our RPM builds and tested that the patch fixes the issue.

Thanks,
- Trey
Comment 21 Scott Hilton 2020-07-22 12:36:12 MDT
Trey, 

I'm glad it worked for you.

Let us know if you have any other questions,

-Scott