Ticket 9859

Summary: Unexpected reservation PURGE_COMP flag behavior
Product: Slurm
Reporter: Troy Baer <troy>
Component: reservations
Assignee: Albert Gil <albert.gil>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
CC: tdockendorf
Version: 20.02.4
Hardware: Linux
OS: Linux
Site: Ohio State OSC

Description Troy Baer 2020-09-18 12:16:25 MDT
I'm seeing some behavior with the PURGE_COMP flag on reservations that I don't understand.

I have the following reservation present on one of my systems:

troy@pitzer-login04:~$ scontrol show reservation rsv-test
ReservationName=rsv-test StartTime=2020-09-18T14:00:00 EndTime=2020-09-18T14:45:00 Duration=00:45:00
   Nodes=p0501 NodeCnt=1 CoreCnt=48 Features=c6420&48core PartitionName=batch Flags=MAINT,DAILY,PURGE_COMP=00:05:00
   TRES=cpu=48
   Users=(null) Accounts=CLIENTACCT Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

Note that I have PURGE_COMP set to 5 minutes.  There have been no jobs requesting that reservation today, so I would have thought that it would have expired and moved on to the next day after 5 minutes.  However, that is not the case:

troy@pitzer-login04:~$ date ; sinfo -T
Fri Sep 18 14:11:42 EDT 2020
RESV_NAME          STATE           START_TIME             END_TIME     DURATION  NODELIST
[...]
rsv-test          ACTIVE  2020-09-18T14:00:00  2020-09-18T14:45:00     00:45:00  p0501

Is this the expected behavior?  Have I misunderstood how PURGE_COMP is supposed to work?
Comment 2 Jason Booth 2020-09-18 16:30:49 MDT
Troy, I will have Albert look into this in more detail. In the meantime, your understanding is correct and this reservation should have been purged.


The code in "job_resv_check" is responsible for this.
https://github.com/SchedMD/slurm/blob/4ac3db0e90912521c6ef5a3aed35880cb44cc419/src/slurmctld/reservation.c#L6830


There may still be running or pending jobs counted against the reservation; we could place an info() statement here to check.

https://github.com/SchedMD/slurm/blob/4ac3db0e90912521c6ef5a3aed35880cb44cc419/src/slurmctld/reservation.c#L6824


-			if (resv_ptr->job_run_cnt || resv_ptr->job_pend_cnt)
+			if (resv_ptr->job_run_cnt || resv_ptr->job_pend_cnt) {
 				resv_ptr->idle_start_time = 0;
+				info("Reservation %s run_cnt=%u pend_cnt=%u",
+				     resv_ptr->name, resv_ptr->job_run_cnt,
+				     resv_ptr->job_pend_cnt);
+			}



Or just query all jobs with "scontrol show jobs | grep Reservation=rsv-test".
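The same check can be scripted by filtering captured scontrol output. A minimal sketch; the JobId lines below are fabricated sample data, not output from Troy's cluster:

```shell
# Count jobs still tied to the reservation by filtering `scontrol show jobs`
# output. The here-doc is made-up sample data; on a live system replace it
# with the real command:  scontrol show jobs
count=$(cat <<'EOF' | grep -c 'Reservation=rsv-test'
JobId=1001 JobName=wrf
   Reservation=rsv-test Partition=batch
JobId=1002 JobName=vasp
   Reservation=(null) Partition=batch
EOF
)
echo "$count"    # number of job records still referencing rsv-test
```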
Comment 3 Troy Baer 2020-09-21 08:58:29 MDT
FWIW, we've applied that logging patch in our test environment, but we haven't been able to reproduce this behavior there:

-----
# circa 10:19 local time
troy@pitzer-slurm-submit-test:~$ scontrol create reservation=rsv-test nodecnt=1 starttime=10:30:00 duration=00:45:00 flags=maint,daily,purge_comp=00:05:00 users=root
Reservation created: rsv-test
 
troy@pitzer-slurm-submit-test:~$ scontrol show reservation rsv-test
ReservationName=rsv-test StartTime=2020-09-21T10:30:00 EndTime=2020-09-21T11:15:00 Duration=00:45:00
   Nodes=p0301 NodeCnt=1 CoreCnt=48 Features=(null) PartitionName=batch Flags=MAINT,DAILY,PURGE_COMP=00:05:00
   TRES=cpu=48
   Users=root Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)
 
troy@pitzer-slurm-submit-test:~$ date ; sinfo -T
Mon Sep 21 10:31:57 EDT 2020
RESV_NAME     STATE           START_TIME             END_TIME     DURATION  NODELIST
troy       INACTIVE  2020-09-21T13:00:00  2020-09-21T13:10:00     00:10:00  slurmd02
rsv-test     ACTIVE  2020-09-21T10:30:00  2020-09-21T11:15:00     00:45:00  p0301
 
[…wait ~5 minutes…]
 
troy@pitzer-slurm-submit-test:~$ date ; sinfo -T
Mon Sep 21 10:35:43 EDT 2020
RESV_NAME     STATE           START_TIME             END_TIME     DURATION  NODELIST
troy       INACTIVE  2020-09-21T13:00:00  2020-09-21T13:10:00     00:10:00  slurmd02
rsv-test   INACTIVE  2020-09-22T10:30:00  2020-09-22T11:15:00     00:45:00  p0301

# However there was a slurmctld restart during this,
# so that might not be a proper test.  Try again:

troy@pitzer-slurm-submit-test:~$ scontrol update reservation=rsv-test starttime=10:45:00
Reservation updated.
 
troy@pitzer-slurm-submit-test:~$ scontrol show reservation rsv-test
ReservationName=rsv-test StartTime=2020-09-21T10:45:00 EndTime=2020-09-21T11:30:00 Duration=00:45:00
   Nodes=p0301 NodeCnt=1 CoreCnt=48 Features=(null) PartitionName=batch Flags=MAINT,DAILY,PURGE_COMP=00:05:00
   TRES=cpu=48
   Users=root Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)
 
troy@pitzer-slurm-submit-test:~$ date ; sinfo -T
Mon Sep 21 10:42:25 EDT 2020
RESV_NAME     STATE           START_TIME             END_TIME     DURATION  NODELIST
troy       INACTIVE  2020-09-21T13:00:00  2020-09-21T13:10:00     00:10:00  slurmd02
rsv-test   INACTIVE  2020-09-21T10:45:00  2020-09-21T11:30:00     00:45:00  p0301
 
troy@pitzer-slurm-submit-test:~$ date ; sinfo -T
Mon Sep 21 10:45:14 EDT 2020
RESV_NAME     STATE           START_TIME             END_TIME     DURATION  NODELIST
troy       INACTIVE  2020-09-21T13:00:00  2020-09-21T13:10:00     00:10:00  slurmd02
rsv-test     ACTIVE  2020-09-21T10:45:00  2020-09-21T11:30:00     00:45:00  p0301
 
troy@pitzer-slurm-submit-test:~$ date ; sinfo -T
Mon Sep 21 10:50:28 EDT 2020
RESV_NAME     STATE           START_TIME             END_TIME     DURATION  NODELIST
troy       INACTIVE  2020-09-21T13:00:00  2020-09-21T13:10:00     00:10:00  slurmd02
rsv-test   INACTIVE  2020-09-22T10:45:00  2020-09-22T11:30:00     00:45:00  p0301
-----
 
So there is something different between test and production, perhaps that production has more (read: any) jobs in flight, or a much larger number of nodes.

I'll watch when this reservation in the production environment comes around again this afternoon and see if the same behavior recurs.
Comment 4 Troy Baer 2020-09-21 10:42:49 MDT
Looking through our production logs more closely, we see the following:

[...]
Sep 18 14:31:26 pitzer-slurm01 slurmctld[45082]: Reservation rsv-test has no more jobs for 00:05:00, ending it
[...]
Sep 19 14:05:15 pitzer-slurm01 slurmctld[45082]: Reservation rsv-test has no more jobs for 00:05:00, ending it
[...]
Sep 20 14:05:05 pitzer-slurm01 slurmctld[45082]: Reservation rsv-test has no more jobs for 00:05:00, ending it
[...]

So on the 18th, this reservation *was* eventually released by purge_comp, but much later than I would have expected; on Saturday and Sunday, it was released after 5 minutes as I had expected.  It turns out that there were 3 jobs that used this reservation on Friday for ~25 minutes, and then it was released 5 minutes after that.  Mea culpa.

(It looks like part of my mistake was that I forgot the -a flag to sacct, so it was filtering out other users' jobs that ran in this timeframe and used these reservations.)
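For anyone hitting the same pitfall: sacct shows only the invoking user's jobs unless -a/--allusers is given. The filtering step can be sketched against fabricated sacct output; the job IDs and users below are made up, and on a real cluster you would pipe something like "sacct -a -X --format=JobID,User,Reservation,State" (field names assumed per sacct --helpformat) into the awk filter instead:

```shell
# Count jobs that used rsv-test in simulated sacct output.
# NR > 2 skips the header line and the dashed separator line.
matches=$(cat <<'EOF' | awk 'NR > 2 && $3 == "rsv-test"' | wc -l
     JobID      User  Reservation      State
---------- --------- ------------ ----------
      1001     alice     rsv-test  COMPLETED
      1002       bob       (null)    RUNNING
      1003     carol     rsv-test  COMPLETED
EOF
)
echo "$matches"   # jobs that referenced the reservation
```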
Comment 5 Troy Baer 2020-09-29 08:37:55 MDT
I think this ticket can be closed.
Comment 6 Albert Gil 2020-09-30 10:29:52 MDT
Thanks Troy.

Let me just recommend that, if you haven't already removed the custom patch Jason provided with the extra info() line, you remove it now.
We'll discuss internally if we can add some similar debug message only visible when DebugFlags=Reservation, and I don't want you to have merge conflicts! ;-)

Regards,
Albert
Comment 8 Albert Gil 2020-10-07 09:33:37 MDT
Hi Troy,

> We'll discuss internally if we can add some similar debug message only
> visible when DebugFlags=Reservation, and I don't want you to have merge
> conflicts! ;-)

I just want to let you know that we've improved the logging of PURGE_COMP reservations a bit (for 20.11) in this commit:
- https://github.com/SchedMD/slurm/commit/738fcbb99b3cd4cf6fa2f6896a43f4eb4e62120d

In the next release, 20.11, if you enable DebugFlags=Reservation, you'll see logs when a reservation with PURGE_COMP starts to accrue idle time (because there are no jobs in the reservation), and when that time is reset to 0 again (because a job has been added to the reservation).

It's just a minor improvement, but it could help troubleshoot scenarios similar to yours.
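For reference, once on 20.11 those messages only appear with the reservation debug flag enabled, e.g. via slurm.conf; this is a config fragment based on the DebugFlags setting Albert mentions, picked up on an scontrol reconfigure:

```conf
# slurm.conf (20.11+): enable extra reservation debug messages
DebugFlags=Reservation
```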