Ticket 1324

Summary: Job scheduled with afterok actually starts BEFORE all job array tasks have completed
Product: Slurm Reporter: daniele.didomizio
Component: Scheduling Assignee: David Bigagli <david>
Status: RESOLVED FIXED QA Contact:
Severity: 6 - No support contract    
Priority: --- CC: brian, da, jacob
Version: 14.11.1   
Hardware: Linux   
OS: Linux   
Site: -Other-
Version Fixed: 14.11.3

Description daniele.didomizio 2014-12-15 02:32:48 MST
On SLURM 14.11, when scheduling a job with an "afterok" dependency on a job array, the dependent job sometimes starts BEFORE all of the array tasks have run to completion (with an exit code of 0).

This does not happen every time, but it happens quite often. You can try to reproduce it this way:

#!/bin/bash

# schedule a jobarray with two commands for every job:
## sleep a random number of seconds between 1 and 10
## print the time
jobid=$(sbatch --parsable --array=0-300 --wrap "sleep \$(( ( RANDOM % 10 ) + 1 )); echo \"ending job \$SLURM_ARRAY_TASK_ID at \$(date +%s.%N)\"")

# start the job depending on the jobarray and just print the date
sbatch --dependency="afterok:$jobid" --wrap "echo starting afterok dependency job at \$(date +%s.%N)"

If you look at the file contents and/or the modification times of the slurm-*.out files, you will see that sometimes the timestamp written in the output file of the dependent job is not the most recent one: some of the array tasks actually finished AFTER the "afterok" job started.
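The timestamp check described above can be scripted. This is a sketch only, assuming the "ending job ... at <epoch>" and "starting afterok ... at <epoch>" lines produced by the --wrap commands in the repro script; the files created here are stand-ins for real slurm-*.out files, with timestamps chosen to illustrate a violation:

```shell
#!/bin/bash
# Sketch: detect the afterok ordering violation from job output files.
# The fake files below stand in for the real slurm-*.out output.
set -eu
dir=$(mktemp -d)

# Fake array-task outputs (the last task finishes at t=105.2).
echo "ending job 0 at 100.5" > "$dir/slurm-42_0.out"
echo "ending job 1 at 105.2" > "$dir/slurm-42_1.out"
# Fake dependent-job output: it started at t=103.0, i.e. too early.
echo "starting afterok dependency job at 103.0" > "$dir/slurm-43.out"

# Latest completion time among the array tasks.
last_task=$(awk '/^ending job/ {print $NF}' "$dir"/slurm-*.out | sort -n | tail -1)
# Start time of the dependent job.
dep_start=$(awk '/^starting afterok/ {print $NF}' "$dir"/slurm-*.out)

# The dependency is violated if the afterok job started before the
# last array task ended.
if awk -v a="$dep_start" -v b="$last_task" 'BEGIN {exit !(a < b)}'; then
    echo "VIOLATION: afterok job started at $dep_start, before last task ended at $last_task"
else
    echo "OK: afterok job started after all array tasks completed"
fi
rm -rf "$dir"
```

With real output files, pointing the awk commands at the actual slurm-*.out files from the repro script should flag the runs where the bug occurred.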

It also seems to me that this only happens when the number of tasks in the job array exceeds the number of available cores, so that tasks must wait for resources before starting.

This did not happen on previous SLURM releases (SLURM 14.03.4 worked perfectly fine).
Comment 1 Moe Jette 2015-01-05 10:52:49 MST
Still testing, but this patch should fix the problem:
https://github.com/SchedMD/slurm/commit/744f114b1a2e0c9cb1ef9eab8e9e023d18b9cf4a
Comment 2 Moe Jette 2015-01-06 03:32:41 MST
There was one more change needed:
https://github.com/SchedMD/slurm/commit/745208e8c0520deba8f866ac41ccf485230d248d