Ticket 8073 - Job steps fail due to erroneously reported "completing job" when time limit combined with OverTimeLimit
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 20.02.x
Hardware: Linux Linux
: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2019-11-08 06:23 MST by Mark Titorenko
Modified: 2022-06-22 07:17 MDT
Site: -Other-
Version Fixed: 22.05.x


Description Mark Titorenko 2019-11-08 06:23:34 MST
Hi!

After finding a simple reproducible case and performing some diagnostics, we believe that we've discovered a bug in step creation when OverTimeLimit is in use.

In summary: job step requests within an allocation may be erroneously rejected with the error "Job/step already completing or completed" even though the job is still running and is within the extended limit of "time limit + OverTimeLimit".

This can be reproduced as follows:

1. Set OverTimeLimit to 2 (minutes), either for a partition or globally. Any larger value, including UNLIMITED, will also suffice for this repro.
2. Create a job script such as the following:

---cut---
#!/bin/bash
date
sleep 10

echo "Step 1"
date
srun hostname
sleep 60

echo "Step 2"
date
srun hostname
sleep 60

echo "Step 3"
date
srun hostname

echo "Script complete"
---cut---

3. Submit the job script with a time limit of 2 (minutes):

sbatch --time=2 test.sh

Expected results:

All steps should run, and the job should run to completion. The first two steps should start within the soft time limit of 2 minutes specified for the job (at ~10s and ~70s in, going by the sleeps in the script), while the third step should start within the first additional minute allowed by the OverTimeLimit configuration (at ~130s in).

Actual results:

The first two steps run as expected. The third step fails with an error:

srun: error: Unable to create step for job <id>: Job/step already completing or completed

Diagnosis:

Lines 2400-2402 of src/slurmctld/step_mgr.c (in the 'step_create' function) read as follows:

	if (IS_JOB_FINISHED(job_ptr) ||
	    ((job_ptr->end_time <= time(NULL)) && !IS_JOB_CONFIGURING(job_ptr)))
		return ESLURM_ALREADY_DONE;

Ref: https://github.com/SchedMD/slurm/blob/master/src/slurmctld/step_mgr.c#L2400-2402

Here job_ptr->end_time is compared against the current time without taking any configured OverTimeLimit into account, so step creation is refused with ESLURM_ALREADY_DONE as soon as the job's own time limit has passed.
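
To make the difference concrete, here is a minimal standalone sketch that contrasts the check as quoted above with one that extends the deadline by OverTimeLimit. It is an illustration only, not a proposed patch: the struct, the function names and the UNLIMITED sentinel are all hypothetical stand-ins, and it assumes OverTimeLimit is expressed in minutes and that end_time holds the soft end time (start time + time limit).

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical sentinel for OverTimeLimit=UNLIMITED; not a real Slurm constant. */
#define SKETCH_OVER_TIME_UNLIMITED UINT16_MAX

/* Hypothetical, trimmed-down stand-in for the job record fields used by the check. */
struct sketch_job {
	bool finished;    /* models IS_JOB_FINISHED(job_ptr) */
	bool configuring; /* models IS_JOB_CONFIGURING(job_ptr) */
	time_t end_time;  /* assumed: start time + job time limit */
};

/* Mirrors the quoted check: OverTimeLimit plays no part in the decision. */
bool step_rejected_current(const struct sketch_job *job, time_t now)
{
	return job->finished ||
	       ((job->end_time <= now) && !job->configuring);
}

/* Sketch of a check that honours OverTimeLimit (given in minutes). */
bool step_rejected_with_grace(const struct sketch_job *job, time_t now,
			      uint16_t over_time_limit_min)
{
	if (job->finished)
		return true;
	if (job->configuring)
		return false;
	/* With UNLIMITED there is no deadline to compare against. */
	if (over_time_limit_min == SKETCH_OVER_TIME_UNLIMITED)
		return false;
	/* Reject only once the extended deadline has passed. */
	return (job->end_time + (time_t)over_time_limit_min * 60) <= now;
}
```

For a running job whose end_time passed ten seconds ago, step_rejected_current() returns true, while step_rejected_with_grace() with an OverTimeLimit of 2 returns false until 120 seconds after end_time. That matches the behaviour we would expect for the third step in the repro above.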

As mentioned above, any value for OverTimeLimit (including UNLIMITED) exhibits this failure, which unfortunately makes the combination of time-limited jobs, OverTimeLimit and srun (including srun invocations by OpenMPI) non-functional.

Please let me know if you'd like any further details.

Thanks,

Mark.