Ticket 4542 - advanced reservation automated nodelist updates may prevent previously submitted jobs from starting
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 17.02.9
Hardware: Cray XC Linux
Severity: 4 - Minor Issue
Assignee: Dominik Bartkiewicz
Reported: 2017-12-19 12:55 MST by Doug Jacobsen
Modified: 2018-01-19 03:05 MST
Site: NERSC


Description Doug Jacobsen 2017-12-19 12:55:31 MST
Hello,

We have a full-scale reservation today (again). The user submitted a job; some time later a node was marked down and slurmctld generated a new node list for the reservation.

The jobs that were already pending at that point could not start and showed reason "Reservation". Newly submitted jobs could start.

In the end I updated the time limit of the job and it started (I then set the time limit back).

My guess is that the job update caused slurmctld to re-evaluate the reservation data structure, which allowed the job to run. I suspect the nodelist update invalidated a pointer or other reference to the reservation held by the preexisting pending job.

Thanks,
Doug
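
The workaround described in the report above (temporarily changing the pending job's time limit, then restoring it) could be sketched roughly as follows. This is an illustration only: the ticket does not record the exact commands used, and the function name nudge_job, the RUN switch, the job ID, and the time limits are all hypothetical placeholders. The sketch only prints the scontrol commands unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch of the time-limit "nudge" workaround; not taken from the ticket.
# nudge_job, RUN, and all argument values are hypothetical placeholders.
# Dry run by default: commands are printed. Set RUN=1 to execute them.

nudge_job() {
    jobid=$1   # pending job stuck with reason "Reservation"
    orig=$2    # the job's original time limit, e.g. 04:00:00
    temp=$3    # a different, temporary time limit, e.g. 04:01:00

    # Apply the temporary limit first, then restore the original one.
    for limit in "$temp" "$orig"; do
        if [ "${RUN:-0}" = "1" ]; then
            scontrol update "JobId=$jobid" "TimeLimit=$limit"
        else
            echo "scontrol update JobId=$jobid TimeLimit=$limit"
        fi
    done
}
```

For example, `nudge_job 123456 04:00:00 04:01:00` prints the two update commands for a hypothetical job 123456. Raising a job's time limit normally requires administrator privileges, so a regular user would likely need to pick a shorter temporary limit or ask an operator.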
Comment 2 Dominik Bartkiewicz 2017-12-20 08:58:29 MST
Hi

I can't reproduce this.
Could you send us slurmctld.log
and the output of 'scontrol show res'?

Dominik
Comment 3 Dominik Bartkiewicz 2018-01-03 04:34:10 MST
Hi

Any news?

Dominik
Comment 4 Dominik Bartkiewicz 2018-01-08 05:03:49 MST
Hi
Doug, could you send me the slurmctld.log covering this situation?
Could you also describe how you created this reservation?
Thanks
Dominik
Comment 5 Dominik Bartkiewicz 2018-01-15 07:11:51 MST
Hi
Doug, I know you are busy, but I need more info to move on with this.
Dominik
Comment 6 Dominik Bartkiewicz 2018-01-18 06:52:03 MST
Hi

This is the last call :)

Dominik
Comment 7 Dominik Bartkiewicz 2018-01-19 03:05:11 MST
I am closing this as timed out. Please reopen if needed.

Dominik