| Summary: | Jobs stuck with Reason "ReqNodeNotAvail, Reserved for maintenance" | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Juergen Salk <juergen.salk> |
| Component: | reservations | Assignee: | Skyler Malinowski <skyler> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 20.11.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Ulm University | ||
| Attachments: | slurmctld.log | ||
Description
Juergen Salk 2021-06-15 08:20:22 MDT

Comment 1
Skyler Malinowski

Hello Jürgen,
> So I have removed the reservation before its regular end time in order to
> prevent potential annoyance for (and with) users:
>
> $ scontrol delete ReservationName=migrate_home
>
> I did expect this would unlink all the jobs from the no longer existing
> reservation. However, it seems, pending jobs that once got Reason
> "ReqNodeNotAvail, Reserved for maintenance" stick with that and are not
> automatically reset to their original state reason.
I cannot reproduce the behavior. In my tests, after the reservation is deleted, jobs belonging to a user who was excluded from the reservation move from pending to running, and the remaining jobs get new pending state reasons.

One possibility is that the scheduler simply has not re-evaluated those jobs yet because of the number of jobs in the queue; in that case I would expect time to resolve this with no impact on production. If that is not the case, then please provide logs capturing the event. If you can reproduce the issue and capture it in the logs, please increase the log level to at least debug2 and send those logs.
Thanks,
Skyler
Comment 2
Juergen Salk (automatic out-of-office reply)

Thank you for your e-mail. I am out of office until June 18th 2021. I will have limited access to my e-mail during this period but will answer your message as soon as possible. If you have immediate questions or concerns, please contact kiz.hpc-admin@uni-ulm.de

Best regards, Juergen Salk

Comment 3
Juergen Salk

(In reply to Skyler Malinowski from comment #1)
> > I did expect this would unlink all the jobs from the no longer existing
> > reservation. However, it seems, pending jobs that once got Reason
> > "ReqNodeNotAvail, Reserved for maintenance" stick with that and are not
> > automatically reset to their original state reason.
> I cannot reproduce the behavior. For me the user who was blacklisted from
> the reservation does have their jobs exit pending into running and the
> remainder job their jobs get new pending state reasons.
>
> If the scheduler had not tested those jobs again because of the amount of
> jobs in queue. I would expect time to resolve this with no impact to
> production.
> If this is not the case, then please provide logs capturing the
> event. If you can reproduce the issue and can capture it in the logs, then
> please increase the loglevel to at least debug2 and send those logs.

Hi Skyler,

thanks for taking care of this. Most of the jobs that had reason "ReqNodeNotAvail, Reserved for maintenance" have completed, are running, or, if still pending, have turned to reason "AssocGrpCPURunMinutesLimit". However, there were still a couple of jobs pending with reason "ReqNodeNotAvail, Reserved for maintenance" about 7 hours after I had deleted the reservation.
And even now, more than 24 hours later, there are still 2 jobs that show that reason.

$ date
Mi 16. Jun 22:00:20 CEST 2021
$ squeue -t pd | grep Reser
2754012  standard  lisrp-  xxxx  PD  0:00  6  (ReqNodeNotAvail, Reserved for maintenance)
2754014  standard  lisrp+  xxxx  PD  0:00  6  (ReqNodeNotAvail, Reserved for maintenance)

We currently have about 2000 jobs running and 3000 jobs pending according to squeue:

$ squeue -h -t r | wc -l
2048
$ squeue -h -t pd | wc -l
3116
$

This is what sdiag scheduling stats shows:

$ sdiag
*******************************************************
sdiag output at Wed Jun 16 22:01:01 2021 (1623873661)
Data since      Wed Jun 16 02:00:00 2021 (1623801600)
*******************************************************
Server thread count:  3
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 6551
Jobs started:   4340
Jobs completed: 4831
Jobs canceled:  1166
Jobs failed:    0

Job states ts:  Wed Jun 16 22:00:37 2021 (1623873637)
Jobs pending:   4098
Jobs running:   2046

Main schedule statistics (microseconds):
        Last cycle:        17613
        Max cycle:         261497
        Total cycles:      7545
        Mean cycle:        31752
        Mean depth cycle:  518
        Cycles per minute: 6
        Last queue length: 3045

Backfilling stats
        Total backfilled jobs (since last slurm start): 284666
        Total backfilled jobs (since last stats cycle start): 4138
        Total backfilled heterogeneous job components: 0
        Total cycles: 2000
        Last cycle when: Wed Jun 16 22:00:37 2021 (1623873637)
        Last cycle: 4016229
        Max cycle:  9326470
        Mean cycle: 4983926
        Last depth cycle: 3046
        Last depth cycle (try sched): 1715
        Depth Mean: 2762
        Depth Mean (try depth): 1709
        Last queue length: 3046
        Queue length mean: 2743
        Last table size: 77
        Mean table size: 123
[...]
$

Does that look suspicious in some way? I see that "Last depth cycle" equals "Last queue length", which should indicate that all pending jobs have been tested already. Or am I missing something?
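The check Juergen describes can be automated. Below is a minimal sketch (not an official Slurm tool): it pulls integer-valued `key: value` pairs out of sdiag's text output and compares "Last depth cycle" with "Last queue length". The field names are taken from the sdiag output quoted above; the flat dict keeps only the last occurrence of a repeated key, so with full sdiag output these are the Backfilling-section values.

```python
import re

# Minimal sketch: parse "key:   <integer>" lines from sdiag text output.
# A flat dict keeps only the *last* occurrence of a repeated key, so with
# full sdiag output these end up being the Backfilling-section values.
def parse_sdiag(text):
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*([A-Za-z][A-Za-z ()/]*?):\s+(\d+)\s*$", line)
        if m:
            stats[m.group(1).strip()] = int(m.group(2))
    return stats

def queue_fully_considered(text):
    # If the last backfill depth equals the queue length, every pending
    # job was examined in that scheduling cycle.
    stats = parse_sdiag(text)
    return stats.get("Last depth cycle") == stats.get("Last queue length")

sample = """\
Last depth cycle: 3046
Last depth cycle (try sched): 1715
Last queue length: 3046
"""
print(queue_fully_considered(sample))  # True: 3046 == 3046
```

For the sdiag snapshot above this yields True, which matches Juergen's reading that all pending jobs had already been tested.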
I will reassign severity to "minor issue", as this issue does not seem to impact scheduling in general (but it does cause confusion for users who try to understand why their jobs are pending).

Best regards
Jürgen

Comment 4
Skyler Malinowski

> Does that look suspicious in some way? I see that "Last depth cycle" equals
> "Last queue length" which should indicate that all pending jobs should have
> been tested already. Or am I missing something?

It looks like the scheduler has considered all the jobs pending in the queue, so the fact that some jobs remain with Reason=ReqNodeNotAvail,_Reserved_for_maintenance is curious, to say the least. It may be worth holding those jobs and then releasing them; this should reset the job state.

> scontrol hold {JobId}
> scontrol release {JobId}

If those jobs settle back to Reason=ReqNodeNotAvail,_Reserved_for_maintenance, then I would most certainly like to see the slurmctld.log capturing the scheduling of the jobs in question.

Thanks,
Skyler

Comment 5
Juergen Salk

(In reply to Skyler Malinowski from comment #4)
> It looks like the scheduler has considered all the jobs pending in queue.
> But the fact that some jobs remain with
> Reason=ReqNodeNotAvail,_Reserved_for_maintenance is curious to say the
> least. It may be worth trying to hold those jobs then release them. This
> should reset the job state.
> > scontrol hold {JobId}
> > scontrol release {JobId}
>
> If those jobs settle back to
> Reason=ReqNodeNotAvail,_Reserved_for_maintenance, then I would most
> certainly like to see the slurmctld.log capturing the scheduling of those
> jobs in questions.

Hi Skyler,

there is still one job pending with reason "ReqNodeNotAvail,_Reserved_for_maintenance".
$ squeue | grep Reser
2754014  standard  lisrp+  xxxx  PD  0:00  6  (ReqNodeNotAvail, Reserved for maintenance)
$
$ scontrol show job 2754014
JobId=2754014 JobName=lisrp+
   UserId=xxxx(933314) GroupId=tu_tu(500004) MCS_label=N/A
   Priority=15915 Nice=0 Account=bw18j004 QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_Reserved_for_maintenance Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=4-04:00:00 TimeMin=N/A
   SubmitTime=2021-06-15T14:19:05 EligibleTime=2021-06-15T14:19:05
   AccrueTime=2021-06-15T14:19:05
   StartTime=2021-06-22T01:19:36 EndTime=2021-06-26T05:19:36 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-06-18T20:07:01
   Partition=standard AllocNode:Sid=login01:911362
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=n[1120,1123,1130,1133,1136,1138]
   NumNodes=6-6 NumCPUs=144 NumTasks=144 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=144,mem=187.50G,node=6,billing=144
   Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
   MinCPUsNode=24 MinMemoryNode=32000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=USER Contiguous=0 Licenses=(null) Network=(null)
   Command=/lustre/work/ws/ws1/xxxx-abinit/lisrb4cn2/phon+/justus.run
   WorkDir=/lustre/work/ws/ws1/xxxx-abinit/lisrb4cn2/phon+
   StdErr=/lustre/work/ws/ws1/xxxx-abinit/lisrb4cn2/phon+/slurm-2754014.out
   StdIn=/dev/null
   StdOut=/lustre/work/ws/ws1/xxxx-abinit/lisrb4cn2/phon+/slurm-2754014.out
   Power=
   MailUser=xxxx@dal.ca MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT
   NtasksPerTRES:0
$

I have increased the log level to debug3 for about two minutes and think I could capture scheduling attempts for this job. I'll attach the corresponding slurmctld.log.

I have then put the job on hold and released it. This did reset its reason back to 'Priority'.
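For scripted checks, the flat Key=Value record printed by scontrol show job can be split into a dict. This is a naive sketch, not a Slurm API: it truncates values that themselves contain spaces, but works for single-token fields such as JobState, Reason, and StartTime.

```python
# Naive sketch (not a Slurm API): split the whitespace-separated
# Key=Value tokens of `scontrol show job` output into a dict.
# Values containing spaces would be truncated by this approach.
def parse_scontrol_job(text):
    fields = {}
    for token in text.split():
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
    return fields

# Abbreviated record taken from the scontrol output above.
record = ("JobId=2754014 JobName=lisrp+ JobState=PENDING "
          "Reason=ReqNodeNotAvail,_Reserved_for_maintenance "
          "StartTime=2021-06-22T01:19:36")
job = parse_scontrol_job(record)
print(job["Reason"])  # ReqNodeNotAvail,_Reserved_for_maintenance
```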
$ scontrol hold 2754014
$ squeue | grep 2754014
2754014  standard  lisrp+  xxxx  PD  0:00  6  (JobHeldAdmin)
$ scontrol release 2754014
$ squeue | grep 2754014
2754014  standard  lisrp+  xxxx  PD  0:00  6  (Priority)
$

That's good to know. But the question still stands: why did the pending reason for the affected jobs not revert to 'Priority' automatically in our case?

BTW: I also noticed that an scontrol hold/release <job> sequence always resets the accrue time (Age) of a job back to 0. Is there a way to prevent this, i.e. to retain the accrue time that the job has already accumulated?

Best regards
Jürgen

Comment 6
Juergen Salk

Created attachment 20022 [details]
slurmctld.log
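The hold/release workaround discussed above can be scripted across all affected jobs. Below is a hypothetical helper, assuming `squeue -h -o "%i %r"` output (job id, then the pending reason); it only prints the scontrol command pairs for review rather than executing them. Note the caveat from this thread: hold/release resets the job's accrue time.

```python
# Hypothetical helper: given `squeue -h -o "%i %r"` output (job id and
# pending reason), emit the scontrol hold/release pair for every job
# still stuck on the stale reservation reason. Commands are printed for
# review, not executed. Caveat: hold/release resets the job's accrue time.
STALE_REASON = "ReqNodeNotAvail, Reserved for maintenance"

def remediation_commands(squeue_output):
    cmds = []
    for line in squeue_output.splitlines():
        if not line.strip():
            continue
        # Split off the job id; the rest of the line is the reason.
        job_id, _, reason = line.strip().partition(" ")
        if reason == STALE_REASON:
            cmds.append(f"scontrol hold {job_id}")
            cmds.append(f"scontrol release {job_id}")
    return cmds

sample = """\
2754012 ReqNodeNotAvail, Reserved for maintenance
2754014 ReqNodeNotAvail, Reserved for maintenance
2754020 Priority
"""
for cmd in remediation_commands(sample):
    print(cmd)
```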
Comment 7
Skyler Malinowski

Thank you for sending the log.

> That's good to know. But the question still stands why the pending reason
> for the affected jobs did not revert to 'Priority' automatically in our
> cases?

Without being able to reproduce the issue, it is tough to say anything with certainty. That said, there does not appear to be anything unusual in the log.

From the scontrol show job output, the job is a rather large request, and Slurm has allotted SchedNodeList=n[1120,1123,1130,1133,1136,1138]. The reservation from comment #0 reserved all nodes in the job's SchedNodeList, which should be the initial reason why the job had Reason=ReqNodeNotAvail. Without knowing the current cluster state it is tough to say more. Can you please provide "sinfo" output containing those nodes?

Also, it looks like JobId=2754014 has StartTime=2021-06-22T01:19:36, so the job should start tomorrow morning. It may be worth seeing if Slurm can make good on that estimate.

> BTW: I also noticed that an scontrol hold/release <job> sequence always
> resets the accrue time (Age) of a job back to 0.
> Is there way to prevent this, i.e. to retain the accrue time that the job
> already had collected?

Unfortunately, scontrol hold/release will reset the accrue time.

Comment 8
Juergen Salk

(In reply to Skyler Malinowski from comment #7)
> Without being able to reproduce the issue, it is tough to say with certainty.

Hi Skyler,

yes, I totally understand that.

> With that being said, there does not appear to be anything unusual in the
> log. From the scontrol show job output, the job is a rather large request
> and Slurm has allotted SchedNodeList=n[1120,1123,1130,1133,1136,1138]. The
> reservation from comment #0 reserved all nodes in the job's SchedNodeList
> which should be the initial reason why the job had Reason=ReqNodeNotAvail.
> Without knowing the current cluster state it is tough to say more. Can you
> please provide an "sinfo" output containing those nodes.
>
> Also, it looks like JobId=2754014 has StartTime=2021-06-22T01:19:36.
> The job should start tomorrow morning. It may be worth seeing if Slurm can
> make good on that estimation.

Well, I think the current state no longer accurately reflects the state we had when the problem occurred. Also, job 2754014 already started yesterday and is currently running, but on a different set of nodes:

$ scontrol show job 2754014
JobId=2754014 JobName=lisrp+
   UserId=xxxx(933314) GroupId=tu_tu(500004) MCS_label=N/A
   Priority=13866 Nice=0 Account=bw18j004 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-06:15:22 TimeLimit=4-04:00:00 TimeMin=N/A
   SubmitTime=2021-06-15T14:19:05 EligibleTime=2021-06-18T20:30:15
   AccrueTime=2021-06-18T20:30:15
   StartTime=2021-06-20T15:54:39 EndTime=2021-06-24T19:54:39 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-06-20T15:54:39
   Partition=standard AllocNode:Sid=login01:911362
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=n[0603-0604,0614,0630,0636-0637]
   BatchHost=n0603
   NumNodes=6 NumCPUs=144 NumTasks=144 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=144,mem=187.50G,node=6,billing=144
   Socks/Node=* NtasksPerN:B:S:C=24:0:*:* CoreSpec=*
   MinCPUsNode=24 MinMemoryNode=32000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=USER Contiguous=0 Licenses=(null) Network=(null)
   Command=/lustre/work/ws/ws1/xxxxx-abinit/lisrb4cn2/phon+/justus.run
   WorkDir=/lustre/work/ws/ws1/xxxxx-abinit/lisrb4cn2/phon+
   StdErr=/lustre/work/ws/ws1/xxxxx-abinit/lisrb4cn2/phon+/slurm-2754014.out
   StdIn=/dev/null
   StdOut=/lustre/work/ws/ws1/xxxxx-abinit/lisrb4cn2/phon+/slurm-2754014.out
   Power=
   MailUser=xxxxx@dal.ca MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT
   NtasksPerTRES:0
$

Since the scenario that triggered the problem is rather rare, and the problem itself was not that severe in the end and also kind of disappeared into thin air, I would suggest that we stop chasing the root cause and close the ticket.
Best regards
Jürgen

Comment 9
Skyler Malinowski

(In reply to Juergen Salk from comment #8)
> Since the scenario that triggered the problem is rather rare and the problem
> itself was not that severe in the end and also kind of disappeared into thin
> air, I would suggest that we stop chasing the root cause and close the
> ticket.

Understood. If you manage to reproduce the issue, then by all means submit a new ticket with that information. In the meantime, I will close this ticket.

Best,
Skyler