Ticket 9691

Summary:	job dependency type checking
Product:	Slurm	Reporter:	Michael DiDomenico <mdidomenico>
Component:	User Commands	Assignee:	Scott Hilton <scott>
Status:	RESOLVED FIXED	QA Contact:
Severity:	4 - Minor Issue
Priority:	---
Version:	20.02.4
Hardware:	Linux
OS:	Linux
Site:	IDACCR	Slinky Site:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	20.02.6 20.11.0pre1
Target Release:	---	DevPrio:	---

Description Michael DiDomenico 2020-08-28 10:53:06 MDT

i'm not sure if this is a bug or not.

sbatch -n1 <somejob>
jobid 1

sbatch -d afterof:1 -n1 <somejob>
jobid 2

so both jobs were submitted and ran, but the dependency was ignored.  this is clearly because 'afterof' should have been 'afterok'

but it surprised me that sbatch didn't kick out an error, saying something like 'afterof is not a recognized type'

Comment 4 Scott Hilton 2020-09-04 16:58:16 MDT

Hi Michael,

Sending a user message sounds like a good idea. This data is processed on the server side but there are ways we could send back a message. I haven't looked deep enough yet to say if we will add it or not.

There was also a second issue. 'afterof' is actually interpreted as 'after' in this instance. Similarly, if you had 'afterokx' it would be interpreted as 'afterok'. We may fix this, depending on whether fixing it would break anything else.

Thanks for pointing this out. I will let you know if/when we fix these issues.

-Scott
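
The behavior described in Comment 4 is what a prefix comparison would produce. The sketch below is only an illustration in Python, not Slurm's actual parser: the function names and the shortened list of dependency types are assumptions made for the example. It shows how matching on a leading substring accepts 'afterof' as 'after' and 'afterokx' as 'afterok', while an exact comparison rejects both.

```python
# Hypothetical illustration only; this is NOT Slurm's parsing code.
# Longer names come first so that "afterok" is tried before "after".
KNOWN_TYPES = ["afternotok", "afterany", "aftercorr", "afterok", "after"]


def match_by_prefix(word):
    """Return the first known type that `word` starts with, or None."""
    for name in KNOWN_TYPES:
        if word.startswith(name):
            return name
    return None


def match_exact(word):
    """Return `word` only if it is exactly one of the known types."""
    return word if word in KNOWN_TYPES else None


if __name__ == "__main__":
    for word in ("afterok", "afterof", "afterokx"):
        print(word, "prefix:", match_by_prefix(word), "exact:", match_exact(word))
```

With prefix matching the trailing characters are silently ignored, so a typo changes the dependency type instead of producing an error.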

Comment 5 Michael DiDomenico 2020-09-05 05:48:25 MDT

I'll only offer that, based on the way the manpage is written, the 'type' field in the dependencies appears to be an enumerated field.  at the very least i would have expected the job to be rejected with a bad param error rather than run.

in my specific case i had dependencies and large job arrays in the same command, which were meant to control the job.  in my instance it basically submitted everything all at once and caused the scheduler to thrash.

i understand the need to parse on the server, but even a really simple check on the client would have avoided 30 mins of self-inflicted pain
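
A simple client-side check of the kind suggested in Comment 5 might look like the sketch below. It is a hypothetical pre-submission helper, not part of sbatch: the function name is invented, and the set of type names is an assumption taken from the dependency types listed in the sbatch man page, so it may not match every Slurm version. It only checks the type names and does not validate job IDs or time offsets.

```python
import re

# Assumed list of dependency types (from the sbatch man page); adjust per version.
VALID_TYPES = {
    "after", "afterany", "afterburstbuffer", "aftercorr",
    "afternotok", "afterok", "expand", "singleton",
}


def unknown_dependency_types(spec):
    """Return the type names in a --dependency string that are not recognized.

    Terms are separated by ',' or '?'; the type is the text before the first ':'.
    """
    unknown = []
    for term in re.split(r"[,?]", spec):
        dep_type = term.split(":", 1)[0].strip()
        if dep_type not in VALID_TYPES:
            unknown.append(dep_type)
    return unknown


if __name__ == "__main__":
    bad = unknown_dependency_types("afterof:1")
    if bad:
        print("unrecognized dependency type(s):", ", ".join(bad))
```

A wrapper script could run this on the dependency string and refuse to call sbatch when the returned list is not empty.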

Comment 9 Scott Hilton 2020-09-17 12:22:08 MDT

Michael,

We fixed the issue where 'afterof' was interpreted as 'after'. Likewise, 'afterokx' is no longer interpreted as 'afterok'.

Other errors should already have a user message. Misspelled options should now have these messages as well.

This fix is coming in slurm 20.02.6, in commit 81bd3ca68382937b2dc58e314d57ec14f5709301.

-Scott