Ticket 4542

Summary: advanced reservation automated nodelist updates may prevent previously submitted jobs from starting
Product: Slurm
Reporter: Doug Jacobsen <dmjacobsen>
Component: slurmctld
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED TIMEDOUT
QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 17.02.9   
Hardware: Cray XC   
OS: Linux   
Site: NERSC

Description Doug Jacobsen 2017-12-19 12:55:31 MST
Hello,

We have a full-scale reservation today (again).  The user submitted a job, then some time afterwards a node was marked down and slurmctld came up with a new list of nodes for the reservation.

The pending jobs that had already been submitted were not able to run and stayed pending with reason "Reservation".  New jobs could start.

In the end I updated the time limit of the job and it started (and then I set the time limit back).
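
For reference, the workaround described above could be sketched as two scontrol calls. This is only a minimal sketch, not the exact commands run at the time: the function name nudge_job, the DRY_RUN switch, and the job ID and time limits in the example call are hypothetical placeholders. It assumes the caller has administrator privileges, since raising a job's time limit requires them.

```shell
# Sketch of the workaround: change a pending job's time limit, then restore it.
# nudge_job and DRY_RUN are hypothetical names introduced for this example.
nudge_job() {
    jobid=$1        # Slurm job ID of the stuck pending job
    orig_limit=$2   # the job's current time limit, restored at the end
    tmp_limit=$3    # any different time limit, applied temporarily

    # By default only print the commands; set DRY_RUN=0 to run them for real.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        cmd="echo scontrol"
    else
        cmd="scontrol"
    fi

    $cmd update JobId="$jobid" TimeLimit="$tmp_limit" &&
    $cmd update JobId="$jobid" TimeLimit="$orig_limit"
}

# Example with placeholder values (prints the two commands):
nudge_job 123456 12:00:00 12:01:00
```

In the case reported here the limit was set back only after the job had started, so on a real system it may make more sense to run the two updates separately and check the job state in between.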

My guess is that the job update caused the reservation data structure in slurmctld to be re-evaluated, allowing the job to run.  I suspect that the nodelist update may have invalidated a pointer or other reference to the reservation for the preexisting pending job.

Thanks,
Doug
Comment 2 Dominik Bartkiewicz 2017-12-20 08:58:29 MST
Hi

I can't recreate this.
Could you send us the slurmctld.log
and the output from 'scontrol show res'?

Dominik
Comment 3 Dominik Bartkiewicz 2018-01-03 04:34:10 MST
Hi

Any news?

Dominik
Comment 4 Dominik Bartkiewicz 2018-01-08 05:03:49 MST
Hi
Doug, could you send me the slurmctld.log covering this situation?
Could you describe how you created this reservation?
Thanks
Dominik
Comment 5 Dominik Bartkiewicz 2018-01-15 07:11:51 MST
Hi
Doug, I know you are busy, but I need more info to move on with this.
Dominik
Comment 6 Dominik Bartkiewicz 2018-01-18 06:52:03 MST
Hi

This is the last call :)

Dominik
Comment 7 Dominik Bartkiewicz 2018-01-19 03:05:11 MST
I am closing this as timed out. Please reopen if needed.

Dominik