Ticket 9691

Summary: job dependency type checking
Product: Slurm
Reporter: Michael DiDomenico <mdidomenico>
Component: User Commands
Assignee: Scott Hilton <scott>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 20.02.4   
Hardware: Linux   
OS: Linux   
Site: IDACCR
Version Fixed: 20.02.6 20.11.0pre1

Description Michael DiDomenico 2020-08-28 10:53:06 MDT
I'm not sure whether this is a bug or not.

sbatch -n1 <somejob>
jobid 1

sbatch -d afterof:1 -n1 <somejob>
jobid 2

Both jobs were submitted and ran, but the dependency was ignored. This is clearly because 'afterof' should have been 'afterok'.

It surprised me, though, that sbatch didn't kick out an error saying something like 'afterof is not a recognized type'.
Comment 4 Scott Hilton 2020-09-04 16:58:16 MDT
Hi Michael,

Sending a user message sounds like a good idea. This data is processed on the server side, but there are ways we could send back a message. I haven't looked deeply enough yet to say whether we will add it or not.

There was also a second issue. 'afterof' is actually interpreted as 'after' in this instance. Similarly, if you had 'afterokx' it would be interpreted as 'afterok'. We may fix this, depending on whether fixing it would break anything else.

Thanks for pointing this out. I will let you know if/when we fix these issues.

-Scott
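
For readers following the thread, the behavior described in Comment 4 can be sketched as the difference between prefix matching and exact matching of the type keyword. This is a minimal illustration only, not Slurm's actual parser: the function names `match_by_prefix` and `match_exact` are hypothetical, and only two of the dependency types are modeled.

```python
# Illustrative sketch only; this is NOT Slurm's parsing code.
# Only two dependency types are modeled, longest keyword first.
KEYWORDS = ("afterok", "after")


def match_by_prefix(spec):
    """Return the first keyword that `spec` merely starts with.

    This mimics a prefix-style comparison, which silently accepts
    trailing characters such as the 'of' in 'afterof'.
    """
    for keyword in KEYWORDS:
        if spec.startswith(keyword):
            return keyword
    return None


def match_exact(spec):
    """Return the keyword only if the whole text before ':' matches it."""
    type_name, separator, _job_ids = spec.partition(":")
    if separator and type_name in KEYWORDS:
        return type_name
    return None


if __name__ == "__main__":
    for spec in ("afterok:1", "afterof:1", "afterokx:1"):
        print(spec, match_by_prefix(spec), match_exact(spec))
```

With prefix matching, 'afterof:1' falls through to 'after' and 'afterokx:1' is taken as 'afterok', which matches the symptoms reported here. Requiring the full text before the colon to equal a known keyword rejects both.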
Comment 5 Michael DiDomenico 2020-09-05 05:48:25 MDT
I'll only offer this: based on the way the manpage is written, the 'type' field in the dependencies appears to be an enumerated field. At the very least, I would have expected the job not to run and to fail with a bad-parameter error.

In my specific case I had dependencies and large job arrays in the same command, and the dependencies were meant to control the jobs. In my instance it basically submitted everything all at once and caused the scheduler to thrash.

I understand the need to parse on the server, but even a really simple check on the client would have avoided 30 minutes of self-inflicted pain.
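
A simple client-side check of the kind suggested above might look like the following sketch, which a site could run in a submission wrapper before calling sbatch. It is not part of Slurm. The helper name `unknown_dependency_types` is hypothetical, and the `KNOWN_TYPES` set is an assumption based on the dependency types listed in the sbatch manpage, so it should be verified against the manpage of the installed version.

```python
import re

# ASSUMPTION: list taken from the sbatch manpage for --dependency.
# Verify it against the manpage of your installed Slurm version.
KNOWN_TYPES = {
    "after",
    "afterany",
    "afterburstbuffer",
    "aftercorr",
    "afternotok",
    "afterok",
    "expand",
    "singleton",
}


def unknown_dependency_types(dependency):
    """Return the type names in a --dependency string that are not known.

    Items are separated by ',' (all must be satisfied) or '?' (any may
    be satisfied); the type name is the text before the first ':'.
    """
    unknown = []
    for item in re.split(r"[,?]", dependency):
        type_name = item.split(":", 1)[0].strip()
        if type_name not in KNOWN_TYPES:
            unknown.append(type_name)
    return unknown


if __name__ == "__main__":
    bad = unknown_dependency_types("afterof:1")
    if bad:
        print("unrecognized dependency type(s):", ", ".join(bad))
```

The check only validates the type keyword, not the job IDs or any time offsets, so the server remains the authority on whether the dependency is valid.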
Comment 9 Scott Hilton 2020-09-17 12:22:08 MDT
Michael,

We fixed the issue where 'afterof' was interpreted as 'after'. The same fix covers the similar case where 'afterokx' was interpreted as 'afterok'.

Other errors should already have a user message. Misspelled options should now have these messages as well.

This fix is coming in Slurm 20.02.6, in commit 81bd3ca68382937b2dc58e314d57ec14f5709301.

-Scott