Ticket 11831

Summary: Jobs stuck with Reason "ReqNodeNotAvail, Reserved for maintenance"
Product: Slurm
Reporter: Juergen Salk <juergen.salk>
Component: reservations
Assignee: Skyler Malinowski <skyler>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Version: 20.11.5
Hardware: Linux
OS: Linux
Site: Ulm University
Attachments: slurmctld.log

Description Juergen Salk 2021-06-15 08:20:22 MDT
Hi,

we'll have to migrate some home directories, and I wanted to create an "inverted" reservation that blacklists the affected users from the nodes during a certain time window.

So I ran a test in which I denied one user account access to the reservation and then submitted jobs both as this user and as another user.

This is how I created the reservation:

$ scontrol create reservation starttime=now+1hours duration=1:00:00 flags=ignore_jobs,magnetic,flex ReservationName=migrate_home nodes=ALL users=-xxx
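For several affected users, the users= field takes a comma-separated list of negated names (the command above uses a single users=-xxx). A small shell sketch that builds such an argument — the usernames are hypothetical placeholders, not from this ticket:

```shell
# Build a comma-separated deny list for scontrol's users= field.
# The usernames below are hypothetical placeholders.
affected="alice bob carol"

deny=""
for u in $affected; do
    deny="${deny}-${u},"
done
deny=${deny%,}          # strip the trailing comma

echo "users=$deny"      # -> users=-alice,-bob,-carol
```

The resulting string can be passed directly as the users= argument of scontrol create reservation.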

$ scontrol show res
ReservationName=migrate_home StartTime=2021-06-15T14:50:09 EndTime=2021-06-15T15:50:09 Duration=01:00:00
   Nodes=n[0101-0108,0111-0114,0121-0124,0201-0240,0301-0340,0401-0440,0501-0528,0601-0640,0701-0740,0801-0840,0901-0940,1001-1040,1101-1140,1201-1240,1301-1340,1401-1440,1501-1540,1601-1640,1701-1740,1801-1840,1901-1910] NodeCnt=694 CoreCnt=33312 Features=(null) PartitionName=(null) Flags=FLEX,IGNORE_JOBS,SPEC_NODES,ALL_NODES,MAGNETIC
   TRES=cpu=33312
   Users=-xxx Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)

This all worked as expected, but I noticed that pending jobs of unaffected users also suddenly showed a Reason of "ReqNodeNotAvail, Reserved for maintenance" in squeue. After some searching I found bug #11404, which seems to address the very same issue.

So I have removed the reservation before its regular end time in order to prevent potential annoyance for (and with) users:

$ scontrol delete ReservationName=migrate_home

I expected this to unlink all jobs from the no-longer-existing reservation. However, it seems that pending jobs that once got Reason "ReqNodeNotAvail, Reserved for maintenance" are stuck with it and are not automatically reset to their original state reason. For example:
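To see which pending jobs are still carrying the stale reason, the default squeue output can simply be filtered. A sketch operating on captured sample output — the job lines below are inlined for illustration; in practice one would pipe `squeue -t pd` directly:

```shell
# Extract the JOBID column of pending jobs whose reason mentions the
# maintenance reservation. Sample squeue output is inlined for illustration.
squeue_output='  2754012 standard lisrp- xxxx PD 0:00 6 (ReqNodeNotAvail, Reserved for maintenance)
  2754014 standard lisrp+ xxxx PD 0:00 6 (ReqNodeNotAvail, Reserved for maintenance)
  2754020 standard other  yyyy PD 0:00 1 (Priority)'

stuck=$(printf '%s\n' "$squeue_output" |
        awk '/Reserved for maintenance/ {print $1}')
echo "$stuck"           # job IDs 2754012 and 2754014, one per line
```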

 $ scontrol show job 2727930
JobId=2727930 JobName=dir_482
   UserId=yyy(942983) GroupId=(500004) MCS_label=N/A
   Priority=6131 Nice=0 Account=bw14l004 QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_Reserved_for_maintenance Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=20:00:00 TimeMin=N/A
   SubmitTime=2021-06-11T14:05:15 EligibleTime=2021-06-11T14:05:15
   AccrueTime=2021-06-11T14:05:15
   StartTime=2021-06-16T11:41:31 EndTime=2021-06-17T07:41:31 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-06-15T16:12:26
   Partition=standard AllocNode:Sid=login03:2448445
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=n0902
   NumNodes=1-1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,mem=96G,node=1,billing=48
   Socks/Node=* NtasksPerN:B:S:C=48:0:*:* CoreSpec=*
   MinCPUsNode=48 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=USER Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/yyy/1_scripts/1_MP2/MP2-singlepoints-charge0
   WorkDir=/lustre/work/ws/ws1/yyy-qm/1_methylsulfid/1_iodine/martin
   StdErr=/lustre/work/ws/ws1/yyy-qm/1_methylsulfid/1_iodine/martin/slurm-2727930.out
   StdIn=/dev/null
   StdOut=/lustre/work/ws/ws1/yyy-qm/1_methylsulfid/1_iodine/martin/slurm-2727930.out
   Power=
   NtasksPerTRES:0

Can you offer me advice on how to trigger an update of the job state reason for affected jobs in the queue? 

Best regards
Jürgen
Comment 1 Skyler Malinowski 2021-06-16 12:31:36 MDT
Hello Jürgen,

> So I have removed the reservation before its regular end time in order to
> prevent potential annoyance for (and with) users:
> 
> $ scontrol delete ReservationName=migrate_home
> 
> I expected this to unlink all jobs from the no-longer-existing
> reservation. However, it seems that pending jobs that once got Reason
> "ReqNodeNotAvail, Reserved for maintenance" are stuck with it and are not
> automatically reset to their original state reason.
I cannot reproduce the behavior. For me, the jobs of the user who was blacklisted from the reservation move from pending to running, and the remaining users' jobs get new pending state reasons.

If the scheduler has not tested those jobs again because of the number of jobs in the queue, I would expect time to resolve this with no impact on production. If this is not the case, please provide logs capturing the event. If you can reproduce the issue, please increase the log level to at least debug2 and send the resulting logs.

Thanks,
Skyler
Comment 2 Juergen Salk 2021-06-16 12:31:42 MDT
Thank you for your e-mail.

I am out of office until June 18th 2021. I will have limited access to my e-mail during this period but will answer your message as soon as possible.

If you have immediate questions or concerns, please contact
kiz.hpc-admin@uni-ulm.de

Best regards,

Juergen Salk
Comment 3 Juergen Salk 2021-06-16 14:26:19 MDT
(In reply to Skyler Malinowski from comment #1)

> > I expected this to unlink all jobs from the no-longer-existing
> > reservation. However, it seems that pending jobs that once got Reason
> > "ReqNodeNotAvail, Reserved for maintenance" are stuck with it and are not
> > automatically reset to their original state reason.
> I cannot reproduce the behavior. For me, the jobs of the user who was
> blacklisted from the reservation move from pending to running, and the
> remaining users' jobs get new pending state reasons.
> 
> If the scheduler has not tested those jobs again because of the number of
> jobs in the queue, I would expect time to resolve this with no impact on
> production. If this is not the case, please provide logs capturing the
> event. If you can reproduce the issue, please increase the log level to at
> least debug2 and send the resulting logs.
> 

Hi Skyler,

thanks for taking care of this.

Ok, most of the jobs that had reason "ReqNodeNotAvail, Reserved for maintenance" have completed, are running, or - if still pending - turned to reason "AssocGrpCPURunMinutesLimit". However, there were still a couple of jobs pending with reason "ReqNodeNotAvail, Reserved for maintenance" about 7 hours after I had deleted the reservation. And even now, more than 24 hours later, there are still 2 jobs showing that reason.

$ date
Mi 16. Jun 22:00:20 CEST 2021
$ squeue -t pd | grep Reser
           2754012  standard   lisrp- xxxx PD       0:00      6 (ReqNodeNotAvail, Reserved for maintenance)
           2754014  standard   lisrp+ xxxx PD       0:00      6 (ReqNodeNotAvail, Reserved for maintenance)

We currently have about 2000 jobs running and 3000 jobs pending according to squeue:
 
$ squeue -h -t r | wc -l
2048
$ squeue -h -t pd | wc -l
3116
$

This is what sdiag scheduling stats shows:

$ sdiag 
*******************************************************
sdiag output at Wed Jun 16 22:01:01 2021 (1623873661)
Data since      Wed Jun 16 02:00:00 2021 (1623801600)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 6551
Jobs started:   4340
Jobs completed: 4831
Jobs canceled:  1166
Jobs failed:    0

Job states ts:  Wed Jun 16 22:00:37 2021 (1623873637)
Jobs pending:   4098
Jobs running:   2046

Main schedule statistics (microseconds):
        Last cycle:   17613
        Max cycle:    261497
        Total cycles: 7545
        Mean cycle:   31752
        Mean depth cycle:  518
        Cycles per minute: 6
        Last queue length: 3045

Backfilling stats
        Total backfilled jobs (since last slurm start): 284666
        Total backfilled jobs (since last stats cycle start): 4138
        Total backfilled heterogeneous job components: 0
        Total cycles: 2000
        Last cycle when: Wed Jun 16 22:00:37 2021 (1623873637)
        Last cycle: 4016229
        Max cycle:  9326470
        Mean cycle: 4983926
        Last depth cycle: 3046
        Last depth cycle (try sched): 1715
        Depth Mean: 2762
        Depth Mean (try depth): 1709
        Last queue length: 3046
        Queue length mean: 2743
        Last table size: 77
        Mean table size: 123
[...]
$

Does that look suspicious in some way? I see that "Last depth cycle" equals "Last queue length", which should indicate that all pending jobs have already been tested. Or am I missing something?

I will lower the severity to "minor issue", as this issue does not seem to impact scheduling in general (though it does cause confusion for users trying to understand why their jobs are pending).
  
Best regards
Jürgen
Comment 4 Skyler Malinowski 2021-06-17 08:29:08 MDT
> Does that look suspicious in some way? I see that "Last depth cycle" equals
> "Last queue length" which should indicate that all pending jobs should have
> been tested already. Or am I missing something?
It looks like the scheduler has considered all the jobs pending in the queue. But the fact that some jobs remain with Reason=ReqNodeNotAvail,_Reserved_for_maintenance is curious to say the least. It may be worth holding those jobs and then releasing them; this should reset the job state reason.
> scontrol hold {JobId}
> scontrol release {JobId}
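The hold/release step can be scripted for all affected jobs at once. A minimal sketch that reads "jobid reason" lines (e.g. from `squeue -h -t PD -o '%i %r'`) and emits the corresponding commands — it only echoes them, so pipe the output into `sh` to actually run them:

```shell
# Emit scontrol hold/release pairs for every job whose pending reason
# mentions the maintenance reservation. Reads "jobid reason" lines on stdin.
reset_stuck_jobs() {
    awk '/Reserved for maintenance/ {print $1}' |
    while read -r jobid; do
        echo "scontrol hold $jobid"
        echo "scontrol release $jobid"
    done
}

# Example with inlined sample input (job IDs are illustrative):
out=$(printf '%s\n' \
  '2754012 ReqNodeNotAvail, Reserved for maintenance' \
  '2754014 Priority' |
  reset_stuck_jobs)
echo "$out"   # hold/release pair for 2754012 only
```

In production the input would come from `squeue -h -t PD -o '%i %r' | reset_stuck_jobs | sh`; note that hold/release also resets a job's accrued age, as discussed later in this ticket.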

If those jobs settle back to Reason=ReqNodeNotAvail,_Reserved_for_maintenance, then I would most certainly like to see the slurmctld.log capturing the scheduling of the jobs in question.

Thanks,
Skyler
Comment 5 Juergen Salk 2021-06-18 12:47:06 MDT
(In reply to Skyler Malinowski from comment #4)

> It looks like the scheduler has considered all the jobs pending in the
> queue. But the fact that some jobs remain with
> Reason=ReqNodeNotAvail,_Reserved_for_maintenance is curious to say the
> least. It may be worth holding those jobs and then releasing them; this
> should reset the job state reason.
> > scontrol hold {JobId}
> > scontrol release {JobId}
> 
> If those jobs settle back to
> Reason=ReqNodeNotAvail,_Reserved_for_maintenance, then I would most
> certainly like to see the slurmctld.log capturing the scheduling of the
> jobs in question.

Hi Skyler,

there is still one job pending with reason "ReqNodeNotAvail,_Reserved_for_maintenance".

$ squeue | grep Reser
           2754014  standard   lisrp+ xxxx PD       0:00      6 (ReqNodeNotAvail, Reserved for maintenance)
$
$ scontrol show job 2754014
JobId=2754014 JobName=lisrp+
   UserId=xxxx(933314) GroupId=tu_tu(500004) MCS_label=N/A
   Priority=15915 Nice=0 Account=bw18j004 QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_Reserved_for_maintenance Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=4-04:00:00 TimeMin=N/A
   SubmitTime=2021-06-15T14:19:05 EligibleTime=2021-06-15T14:19:05
   AccrueTime=2021-06-15T14:19:05
   StartTime=2021-06-22T01:19:36 EndTime=2021-06-26T05:19:36 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-06-18T20:07:01
   Partition=standard AllocNode:Sid=login01:911362
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=n[1120,1123,1130,1133,1136,1138]
   NumNodes=6-6 NumCPUs=144 NumTasks=144 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=144,mem=187.50G,node=6,billing=144
   Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
   MinCPUsNode=24 MinMemoryNode=32000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=USER Contiguous=0 Licenses=(null) Network=(null)
   Command=/lustre/work/ws/ws1/xxxx-abinit/lisrb4cn2/phon+/justus.run
   WorkDir=/lustre/work/ws/ws1/xxxx-abinit/lisrb4cn2/phon+
   StdErr=/lustre/work/ws/ws1/xxxx-abinit/lisrb4cn2/phon+/slurm-2754014.out
   StdIn=/dev/null
   StdOut=/lustre/work/ws/ws1/xxxx-abinit/lisrb4cn2/phon+/slurm-2754014.out
   Power=
   MailUser=xxxx@dal.ca MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT
   NtasksPerTRES:0

$

I increased the log level to debug3 for about two minutes and believe I captured scheduling attempts for this job.
I'll attach the corresponding slurmctld.log.

I have then put the job on hold and released it. This did reset its reason back to 'Priority'. 

$ scontrol hold 2754014
$ squeue | grep 2754014
           2754014  standard   lisrp+ xxxx PD       0:00      6 (JobHeldAdmin)

$ scontrol release 2754014
$ squeue | grep 2754014
           2754014  standard   lisrp+ xxxx PD       0:00      6 (Priority)
$

That's good to know. But the question remains why the pending reason for the affected jobs did not revert to 'Priority' automatically in our case.

BTW: I also noticed that an scontrol hold/release <job> sequence always resets the accrue time (Age) of a job back to 0.
Is there a way to prevent this, i.e. to retain the accrue time that the job has already accumulated?

Best regards
Jürgen
Comment 6 Juergen Salk 2021-06-18 12:48:06 MDT
Created attachment 20022 [details]
slurmctld.log
Comment 7 Skyler Malinowski 2021-06-21 09:00:04 MDT
Thank you for sending the log.

> That's good to know. But the question still stands why the pending reason
> for the affected jobs did not revert to 'Priority' automatically in our
> cases?
Without being able to reproduce the issue, it is tough to say with certainty.

With that being said, there does not appear to be anything unusual in the log. From the scontrol show job output, the job is a rather large request, and Slurm has allotted SchedNodeList=n[1120,1123,1130,1133,1136,1138]. The reservation from comment #0 reserved all nodes in the job's SchedNodeList, which should be the initial reason why the job had Reason=ReqNodeNotAvail. Without knowing the current cluster state it is tough to say more. Can you please provide "sinfo" output for those nodes?

Also, it looks like JobId=2754014 has StartTime=2021-06-22T01:19:36. The job should start tomorrow morning. It may be worth seeing if Slurm can make good on that estimation.

> BTW: I also noticed that an scontrol hold/release <job> sequence always
> resets the accrue time (Age) of a job back to 0.
> Is there way to prevent this, i.e. to retain the accrue time that the job
> already had collected?
Unfortunately scontrol hold/release will reset the accrue time.
Comment 8 Juergen Salk 2021-06-21 14:54:53 MDT
(In reply to Skyler Malinowski from comment #7)

> Without being able to reproduce the issue, it is tough to say with certainty.

Hi Skyler,

yes, I do totally understand that.

> With that being said, there does not appear to be anything unusual in the
> log. From the scontrol show job output, the job is a rather large request
> and Slurm has allotted SchedNodeList=n[1120,1123,1130,1133,1136,1138]. The
> reservation from comment #0 reserved all nodes in the job's SchedNodeList
> which should be the initial reason why the job had Reason=ReqNodeNotAvail.
> Without knowing the current cluster state it is tough to say more. Can you
> please provide an "sinfo" output containing those nodes.
> 
> Also, it looks like JobId=2754014 has StartTime=2021-06-22T01:19:36. The job
> should start tomorrow morning. It may be worth seeing if Slurm can make good
> on that estimation.

Well, I think the current state no longer accurately reflects the state we had when the problem occurred. Also, job 2754014 started yesterday and is currently running, but on a different set of nodes:

$ scontrol show job 2754014
JobId=2754014 JobName=lisrp+
   UserId=xxxx(933314) GroupId=tu_tu(500004) MCS_label=N/A
   Priority=13866 Nice=0 Account=bw18j004 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-06:15:22 TimeLimit=4-04:00:00 TimeMin=N/A
   SubmitTime=2021-06-15T14:19:05 EligibleTime=2021-06-18T20:30:15
   AccrueTime=2021-06-18T20:30:15
   StartTime=2021-06-20T15:54:39 EndTime=2021-06-24T19:54:39 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-06-20T15:54:39
   Partition=standard AllocNode:Sid=login01:911362
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=n[0603-0604,0614,0630,0636-0637]
   BatchHost=n0603
   NumNodes=6 NumCPUs=144 NumTasks=144 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=144,mem=187.50G,node=6,billing=144
   Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
   MinCPUsNode=24 MinMemoryNode=32000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=USER Contiguous=0 Licenses=(null) Network=(null)
   Command=/lustre/work/ws/ws1/xxxxx-abinit/lisrb4cn2/phon+/justus.run
   WorkDir=/lustre/work/ws/ws1/xxxxx-abinit/lisrb4cn2/phon+
   StdErr=/lustre/work/ws/ws1/xxxxx-abinit/lisrb4cn2/phon+/slurm-2754014.out
   StdIn=/dev/null
   StdOut=/lustre/work/ws/ws1/xxxxx-abinit/lisrb4cn2/phon+/slurm-2754014.out
   Power=
   MailUser=xxxxx@dal.ca MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT
   NtasksPerTRES:0

$

Since the scenario that triggered the problem is rather rare, the problem itself was not that severe in the end, and it has more or less vanished into thin air, I suggest we stop chasing the root cause and close the ticket.

Best regards
Jürgen
Comment 9 Skyler Malinowski 2021-06-22 06:52:59 MDT
(In reply to Juergen Salk from comment #8)
> Since the scenario that triggered the problem is rather rare, the problem
> itself was not that severe in the end, and it has more or less vanished
> into thin air, I suggest we stop chasing the root cause and close the
> ticket.
Understood. If you manage to reproduce the issue, then by all means submit a new ticket with that information. In the meantime, I will close this ticket.

Best,
Skyler