| Summary: | Unexpected reservation PURGE_COMP flag behavior | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Troy Baer <troy> |
| Component: | reservations | Assignee: | Albert Gil <albert.gil> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | tdockendorf |
| Version: | 20.02.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Ohio State OSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Troy Baer  2020-09-18 12:16:25 MDT
Troy, I will have Albert look into this in more detail. In the meantime, your understanding is correct and this reservation should have been purged. The code in "job_resv_check" is responsible for this:

https://github.com/SchedMD/slurm/blob/4ac3db0e90912521c6ef5a3aed35880cb44cc419/src/slurmctld/reservation.c#L6830

There may still be an active job count, and we could place an info statement here to check:

https://github.com/SchedMD/slurm/blob/4ac3db0e90912521c6ef5a3aed35880cb44cc419/src/slurmctld/reservation.c#L6824

	if (resv_ptr->job_run_cnt || resv_ptr->job_pend_cnt) {
		resv_ptr->idle_start_time = 0;
+		info("Reservation %s job count %d",
+		     resv_ptr->name, resv_ptr->job_pend_cnt);
	}

Or just query all jobs: "scontrol show jobs | grep Reservation\=rsv-test".


FWIW, we've applied that logging patch in our test environment, but we haven't been able to reproduce this behavior there:

-----
# circa 10:19 local time
troy@pitzer-slurm-submit-test:~$ scontrol create reservation=rsv-test nodecnt=1 starttime=10:30:00 duration=00:45:00 flags=maint,daily,purge_comp=00:05:00 users=root
Reservation created: rsv-test

troy@pitzer-slurm-submit-test:~$ scontrol show reservation rsv-test
ReservationName=rsv-test StartTime=2020-09-21T10:30:00 EndTime=2020-09-21T11:15:00 Duration=00:45:00
   Nodes=p0301 NodeCnt=1 CoreCnt=48 Features=(null) PartitionName=batch
   Flags=MAINT,DAILY,PURGE_COMP=00:05:00 TRES=cpu=48
   Users=root Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null)
   Watts=n/a MaxStartDelay=(null)

troy@pitzer-slurm-submit-test:~$ date ; sinfo -T
Mon Sep 21 10:31:57 EDT 2020
RESV_NAME  STATE     START_TIME           END_TIME             DURATION  NODELIST
troy       INACTIVE  2020-09-21T13:00:00  2020-09-21T13:10:00  00:10:00  slurmd02
rsv-test   ACTIVE    2020-09-21T10:30:00  2020-09-21T11:15:00  00:45:00  p0301

[…wait ~5 minutes…]

troy@pitzer-slurm-submit-test:~$ date ; sinfo -T
Mon Sep 21 10:35:43 EDT 2020
RESV_NAME  STATE     START_TIME           END_TIME             DURATION  NODELIST
troy       INACTIVE  2020-09-21T13:00:00  2020-09-21T13:10:00  00:10:00  slurmd02
rsv-test   INACTIVE  2020-09-22T10:30:00  2020-09-22T11:15:00  00:45:00  p0301

# However there was a slurmctld restart during this,
# so that might not be a proper test.  Try again:

troy@pitzer-slurm-submit-test:~$ scontrol update reservation=rsv-test starttime=10:45:00
Reservation updated.

troy@pitzer-slurm-submit-test:~$ scontrol show reservation rsv-test
ReservationName=rsv-test StartTime=2020-09-21T10:45:00 EndTime=2020-09-21T11:30:00 Duration=00:45:00
   Nodes=p0301 NodeCnt=1 CoreCnt=48 Features=(null) PartitionName=batch
   Flags=MAINT,DAILY,PURGE_COMP=00:05:00 TRES=cpu=48
   Users=root Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null)
   Watts=n/a MaxStartDelay=(null)

troy@pitzer-slurm-submit-test:~$ date ; sinfo -T
Mon Sep 21 10:42:25 EDT 2020
RESV_NAME  STATE     START_TIME           END_TIME             DURATION  NODELIST
troy       INACTIVE  2020-09-21T13:00:00  2020-09-21T13:10:00  00:10:00  slurmd02
rsv-test   INACTIVE  2020-09-21T10:45:00  2020-09-21T11:30:00  00:45:00  p0301

troy@pitzer-slurm-submit-test:~$ date ; sinfo -T
Mon Sep 21 10:45:14 EDT 2020
RESV_NAME  STATE     START_TIME           END_TIME             DURATION  NODELIST
troy       INACTIVE  2020-09-21T13:00:00  2020-09-21T13:10:00  00:10:00  slurmd02
rsv-test   ACTIVE    2020-09-21T10:45:00  2020-09-21T11:30:00  00:45:00  p0301

troy@pitzer-slurm-submit-test:~$ date ; sinfo -T
Mon Sep 21 10:50:28 EDT 2020
RESV_NAME  STATE     START_TIME           END_TIME             DURATION  NODELIST
troy       INACTIVE  2020-09-21T13:00:00  2020-09-21T13:10:00  00:10:00  slurmd02
rsv-test   INACTIVE  2020-09-22T10:45:00  2020-09-22T11:30:00  00:45:00  p0301
-----

So something is different between the test and production environments, presumably because production has either more (read: any) jobs in flight or a much greater number of nodes. I'll watch when this reservation comes around again in production this afternoon and see if the same behavior recurs.

Looking through our production logs more closely, we see the following:

[...]
Sep 18 14:31:26 pitzer-slurm01 slurmctld[45082]: Reservation rsv-test has no more jobs for 00:05:00, ending it
[...]
Sep 19 14:05:15 pitzer-slurm01 slurmctld[45082]: Reservation rsv-test has no more jobs for 00:05:00, ending it
[...]
Sep 20 14:05:05 pitzer-slurm01 slurmctld[45082]: Reservation rsv-test has no more jobs for 00:05:00, ending it
[...]

So on the 18th, this reservation *was* eventually released by purge_comp, but much later than I would have expected; on Saturday and Sunday, it was released after 5 minutes, as I had expected. It turns out that 3 jobs used this reservation on Friday for ~25 minutes, and it was released 5 minutes after they finished. Mea culpa. (Part of my mistake was that I forgot the -a flag to sacct, so it was filtering out the jobs that ran in this timeframe and used these reservations.) I think this ticket can be closed.


Thanks Troy. Let me just recommend that, if you haven't already removed the custom patch Jason provided with the extra info() line, you remove it now. We'll discuss internally whether we can add a similar debug message visible only when DebugFlags=Reservation, and I don't want you to have merge conflicts! ;-)

Regards,
Albert


Hi Troy,

> We'll discuss internally if we can add some similar debug message only
> visible when DebugFlags=Reservation, and I don't want you to have merge
> conflicts! ;-)

I just want to let you know that we've slightly improved the logging of PURGE_COMP reservations (for 20.11) in this commit:

- https://github.com/SchedMD/slurm/commit/738fcbb99b3cd4cf6fa2f6896a43f4eb4e62120d

In the next release, 20.11, if you enable DebugFlags=Reservation, you'll see log messages when a reservation with PURGE_COMP starts to accrue idle time (because there are no jobs in the reservation), and when that time is reset to 0 (because a job was added to the reservation). Just a minor improvement, but it could help when troubleshooting scenarios similar to yours.