Ticket 8073

Summary: Job steps fail due to erroneously reported "completing job" when a time limit is combined with OverTimeLimit
Product: Slurm
Reporter: Mark Titorenko <mark.titorenko>
Component: slurmctld
Assignee: Jacob Jenson <jacob>
Status: RESOLVED FIXED
Severity: 6 - No support contract    
Priority: ---    
Version: 20.02.x   
Hardware: Linux   
OS: Linux   
Site: -Other-
Version Fixed: 22.05.x

Description Mark Titorenko 2019-11-08 06:23:34 MST
Hi!

After finding a simple reproducible case and performing some diagnostics, we believe that we've discovered a bug in step creation when OverTimeLimit is in use.

In summary: job step requests within an allocation may be erroneously rejected with the error "Job/step already completing or completed" while the job is still running and is within the hard limit of "time limit + OverTimeLimit".

This can be reproduced as follows:

1. Set OverTimeLimit to 2 (minutes), either for a partition or globally. Any larger value, including UNLIMITED, will also suffice for this repro.
2. Create a job script such as the following:

---cut---
#!/bin/bash
date
sleep 10

echo "Step 1"
date
srun hostname
sleep 60

echo "Step 2"
date
srun hostname
sleep 60

echo "Step 3"
date
srun hostname

echo "Script complete"
---cut---

3. Submit the job script with a time limit of 2 (minutes):

sbatch --time=2 test.sh

Expected results:

All steps should run, and the job should run to completion; the first two steps should start within the soft time limit of 2 minutes specified for the job (at ~20s in and ~80s in), while the third step should start within the first additional minute allowed by the OverTimeLimit configuration (at ~140s in).
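The expected timeline can be checked with a little arithmetic. The sketch below is an illustration only: the names classify_offset and step_offsets are hypothetical, and it counts nothing but the script's sleep durations, so the offsets it produces (10 s, 70 s, 130 s) are lower bounds. Scheduling latency and srun runtime push the real start times slightly later.

```c
#include <assert.h>
#include <stddef.h>

/* Where a step start falls relative to the job's limits. */
enum limit_zone {
	WITHIN_SOFT_LIMIT, /* before the --time limit */
	WITHIN_OVER_TIME,  /* past --time, inside the OverTimeLimit grace period */
	PAST_HARD_LIMIT    /* at or past --time + OverTimeLimit */
};

/* Classify an offset (seconds since job start); both limits are in minutes. */
enum limit_zone classify_offset(long offset_s, long time_limit_min,
				long over_time_min)
{
	long soft_s = time_limit_min * 60;
	long hard_s = soft_s + over_time_min * 60;

	if (offset_s < soft_s)
		return WITHIN_SOFT_LIMIT;
	if (offset_s < hard_s)
		return WITHIN_OVER_TIME;
	return PAST_HARD_LIMIT;
}

/*
 * Fill offsets[i] with the earliest start of step i, given the sleep that
 * precedes each step. The runtime of the steps themselves is ignored.
 */
void step_offsets(const long *sleeps_s, size_t n, long *offsets)
{
	long elapsed = 0;

	assert(sleeps_s != NULL && offsets != NULL);
	for (size_t i = 0; i < n; i++) {
		elapsed += sleeps_s[i];
		offsets[i] = elapsed;
	}
}
```

With sleeps of {10, 60, 60} and --time=2 plus OverTimeLimit=2, the first two steps classify as WITHIN_SOFT_LIMIT and the third as WITHIN_OVER_TIME, which is why all three are expected to start.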

Actual results:

The first two steps run as expected. The third step fails with an error:

srun: error: Unable to create step for job <id>: Job/step already completing or completed

Diagnosis:

Lines 2400-2402 of src/slurmctld/step_mgr.c (in the 'step_create' function) read as follows:

	if (IS_JOB_FINISHED(job_ptr) ||
	    ((job_ptr->end_time <= time(NULL)) && !IS_JOB_CONFIGURING(job_ptr)))
		return ESLURM_ALREADY_DONE;

Ref: https://github.com/SchedMD/slurm/blob/master/src/slurmctld/step_mgr.c#L2400-2402

The job_ptr->end_time is compared against the current time without taking any configured OverTimeLimit into account, so once the job's time limit has passed, step creation is refused with ESLURM_ALREADY_DONE even though the job is still running.
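To make the difference concrete, here is a minimal standalone model of the check. It is a sketch under stated assumptions, not Slurm code and not the patch that was eventually merged: struct job_model, OVER_TIME_UNLIMITED and both function names are hypothetical stand-ins, and OverTimeLimit is assumed to be available as a per-job value in minutes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical sentinel for OverTimeLimit=UNLIMITED; not Slurm's own constant. */
#define OVER_TIME_UNLIMITED UINT16_MAX

/* Hypothetical, reduced stand-in for the job record fields the check reads. */
struct job_model {
	time_t end_time;          /* start time + soft time limit */
	uint16_t over_time_limit; /* minutes, or OVER_TIME_UNLIMITED */
	bool finished;            /* models IS_JOB_FINISHED() */
	bool configuring;         /* models IS_JOB_CONFIGURING() */
};

/* Same shape as the quoted condition: OverTimeLimit is ignored. */
bool step_rejected_current(const struct job_model *job, time_t now)
{
	assert(job != NULL);
	return job->finished ||
	       ((job->end_time <= now) && !job->configuring);
}

/* One possible OverTimeLimit-aware variant: compare against the hard limit. */
bool step_rejected_with_grace(const struct job_model *job, time_t now)
{
	assert(job != NULL);
	if (job->finished)
		return true;
	if (job->configuring)
		return false;
	if (job->over_time_limit == OVER_TIME_UNLIMITED)
		return false; /* no hard limit to exceed */
	return (job->end_time + (time_t)job->over_time_limit * 60) <= now;
}
```

For a job with end_time at 120 s and an OverTimeLimit of 2 minutes, a step requested at 130 s is rejected by the first function and accepted by the second; the second only starts rejecting at 240 s.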

As mentioned above, any OverTimeLimit value (including UNLIMITED) exhibits this failure, which unfortunately makes the combination of time-limited jobs, OverTimeLimit and srun (including srun invocations by OpenMPI) non-functional.

Please let me know if you'd like any further details.

Thanks,

Mark.