Created attachment 20246 [details]
slurm.conf

Since last Friday slurmctld again and again goes into a state where jobs are no longer scheduled even though nodes are available. A restart of slurmctld fixes the issue for a few hours, but after 24h the situation is the same.

Symptoms: jobs with high priority show in "squeue" that nodes are not available, listing even nodes that do not meet the constraints, or nodes that certainly are up!

153387 inteldevq 1run ysmorodi PD 0:00 20 (ReqNodeNotAvail, UnavailableNodes:eea[001-018,020-021,023-024,026-036,038-040],eia[071-079,081-082,084-099,101-104,106-126,145-149,151-153,155-173,175-177,179-182,184-189,191-203,205,207-209,211-216,287-288],eid[356-357],eie[362-369],eif[372-378,395-396],eii[142-144,217-225,227,229-235,237-267,271-286,307-309,311,313-314,316-317,319,321-322,324,328-354,379],eik[387-389],eil[390-393],eit[289,291,293,295-306],ekln[01-02],els02,epb[087,136,147,160,177,812])

I wrote a tool that uses "scontrol show" to print out priorities and start times as well. When the cluster is in this condition, no start time is displayed for some jobs:

[mhebenst@endeavour5 ~]$ /opt/slurm/crtdc/bjobs
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
155380 atsq interact rozanova R 6:28:38 2 eii[310,319]
155888 cnvq bash achaibi R 13:43 1 einv001
155876 inteldevq bash knektyag R 51:42 1 eii249

Priority JobID Partition UserId Features NumNodes Reason ExpectedStartTime Dependency
4294755883 146012 workq artemaro clx8280 1-1 ReqNodeNotAvail,_Res Unknown (null)
4294755534 146361 workq artemaro clx8280 1-1 ReqNodeNotAvail,_Res Unknown (null)
4294755153 146838 workq artemaro clx8280 1-1 ReqNodeNotAvail,_Res Unknown (null)
4294752200 149789 inteldevq dkuts icx8352Y 12 Resources 2021-07-06T07:23:22 (null)
4294752199 149790 inteldevq dkuts icx8352Y 12 Priority 2021-07-06T17:00:00 (null)
4294752198 149791 inteldevq dkuts icx8352Y 12-12 Priority 2021-07-06T18:30:00 (null)
4294752197 149792 inteldevq dkuts icx8352Y 12-12 ReqNodeNotAvail,_Una 2021-07-06T20:00:00 (null)
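The /opt/slurm/crtdc/bjobs tool itself is site specific and not attached; roughly, the same pending-job listing can be reproduced with standard squeue format fields alone (a sketch only, not the actual tool):

# Rough equivalent of the pending-job listing above, built from squeue format fields
# (%Q priority, %i job id, %P partition, %u user, %f requested features,
#  %D node count, %r reason, %S expected start time, %E dependency)
squeue --state=PENDING --noheader -o "%Q %i %P %u %f %D %r %S %E" | sort -rn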
Created attachment 20247 [details] sched.log
Created attachment 20248 [details] slurmctl.log
After restart - now displaying correctly "Reserved for maintenance":

[root@endeavour5 icsmoke3.1]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
143697 atsq hpc2021 emelnich PD 0:00 1 (PartitionConfig)
155380 atsq interact rozanova R 6:42:23 2 eii[310,319]
155056 cnvq bash akundu PD 0:00 1 (ReqNodeNotAvail, Reserved for maintenance)
155888 cnvq bash achaibi R 27:28 1 einv001
155873 idealq starccm Xkdean PD 0:00 13 (ReqNodeNotAvail, Reserved for maintenance)
154329 inteldevq run-omp1 dkuts CG 0:00 1 eit289
149789 inteldevq run-base dkuts PD 0:00 12 (ReqNodeNotAvail, Reserved for maintenance)
149790 inteldevq run-snc2 dkuts PD 0:00 12 (ReqNodeNotAvail, Reserved for maintenance)
149791 inteldevq run-snc4 dkuts PD 0:00 12 (ReqNodeNotAvail, Reserved for maintenance)
149792 inteldevq run-ht.s dkuts PD 0:00 12 (ReqNodeNotAvail, Reserved for maintenance)
152707 inteldevq OFMa_185 dmishura PD 0:00 29 (ReqNodeNotAvail, Reserved for maintenance)
153387 inteldevq 1run ysmorodi PD 0:00 20 (ReqNodeNotAvail, Reserved for maintenance)
153388 inteldevq 1run ysmorodi PD 0:00 40 (ReqNodeNotAvail, Reserved for maintenance)
153392 inteldevq 1run ysmorodi PD 0:00 10 (ReqNodeNotAvail, Reserved for maintenance)
153393 inteldevq 1run ysmorodi PD 0:00 20 (ReqNodeNotAvail, Reserved for maintenance)
153394 inteldevq 1run ysmorodi PD 0:00 40 (ReqNodeNotAvail, Reserved for maintenance)
.....
Hi

Could you send us output from "scontrol show job", "scontrol show nodes", and "scontrol show res"? It looks like you have a planned maintenance reservation, which can block jobs from starting.

Dominik
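P.S. For example, something along these lines would capture all three outputs in one compressed file to attach (the file name is arbitrary):

# Capture the requested controller state in a single compressed file
{ scontrol show job; scontrol show nodes; scontrol show res; } > commands.out
gzip commands.out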
Created attachment 20251 [details]
commands.out.gz

Yes, I currently have a maintenance window. But before I restarted slurmctld, squeue was showing

153388 inteldevq 1run ysmorodi PD 0:00 40 (ReqNodeNotAvail, UnavailableNodes:eea[001-018,020-021,023-024,026-036,038-040],eia[071-079,081-082,084-099,101-104,106-126,145-149,151-153,155-173,175-177,179-182,184-189,191-203,205,207-209,211-216,287-288],eid[356-357],eie[362-369],eif[372-378,395-396],eii[142-144,217-225,227,229-235,237-267,271-286,307-309,311,313-314,316-317,319,321-322,324,328-354,379],eik[387-389],eil[390-393],eit[289,291,293,295-306],ekln[01-02],els02,epb[087,136,147,160,177,812])

After the restart the same job was listed as:

153388 inteldevq 1run ysmorodi PD 0:00 40 (ReqNodeNotAvail, Reserved for maintenance)

The "ReqNodeNotAvail" message has certainly been in the logs for some time for jobs not affected by the maintenance window.

Attaching output from the requested commands.
Michael
My apologies for the last message. I have Michael Hinton looking into this issue for you.
Hi Michael,

I'll go ahead and look into this.

(In reply to Michael Hebenstreit from comment #0)
> since last Friday slurmctld again and again goes into a state, where jobs
> are no longer scheduled even though nodes are available. A restart of
> slurmctld fixes the issue for a few hours, but after 24h the situation is
> the same.

The fact that you have a workaround means that this is not a severity 1 issue. I'm going to go ahead and reduce this to a severity 2 while we look into it. From https://www.schedmd.com/support.php:

"Severity 1 — Major Impact
"A Severity 1 issue occurs when there is a continued system outage that affects a large set of end users. The system is down and non-functional due to Slurm problem(s) and no procedural workaround exists.

"Severity 2 — High Impact
"A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system."

Thanks!
-Michael
(In reply to Michael Hebenstreit from comment #5)
> Yes, I currently have a maintenance window. But before I restarted slurmctld
> squeue was showing...

So it sounds like the underlying behavior is correct, but just that scontrol is showing the wrong reasoning for what it's doing. Is this correct? Or is the behavior itself wrong?
The behavior itself is wrong – no jobs are scheduled.
Michael,

I'm having a hard time seeing an example of the problem with what you provided.

Let's take job 153387, for example. It was submitted around 2021-07-04 at 2 AM, and then was stuck pending due to priority for over a day. Then, on 2021-07-05 at 10 AM, it started pending (I believe) because of the maintenance reservation, with the reason ReqNodeNotAvail. The maintenance reservation started on 2021-07-06 at 9 AM, so I'm guessing that job 153387 may have been pushed back (or maybe it was already scheduled to run later). Its expected StartTime is 2021-07-06 6:20 PM (after the maintenance reservation ends), and it will run for 5 hours.

The only thing I see potentially wrong is that squeue shows "ReqNodeNotAvail, UnavailableNodes: eea[001-018]..." instead of "ReqNodeNotAvail, Reserved for maintenance" - however, the UnavailableNodes are all nodes that are part of the maintenance reservation, so I don't really see a difference in effect, just in presentation.

Could you give more examples of a job not running when it should? Any more insight would be helpful.

Thanks!
-Michael
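P.S. One way to double-check on your side that the UnavailableNodes list is fully covered by the maintenance reservation is to expand both host lists and compare them. This is only a sketch: the UnavailableNodes argument is shortened here, and "maint_resv" is a placeholder for the real reservation name from "scontrol show res".

# Expand both host lists to one node per line, then print any node that is
# listed as unavailable but NOT part of the reservation (expected: no output).
scontrol show hostnames "eea[001-018,020-021],eia[071-079]" | sort > unavail.txt
scontrol show res maint_resv | tr ' ' '\n' | grep '^Nodes=' | cut -d= -f2 | \
    xargs scontrol show hostnames | sort > resv.txt
comm -23 unavail.txt resv.txt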
Not currently – I'll try to find something tomorrow.
So far after rebooting the master node we did not see the problem again – knocking on wood.
Ok. I'll go ahead and reduce the severity while we wait to see if it will reoccur.
Hey Michael, any updates? Thanks, -Michael
We had one more occurrence, but I was on vacation that day and could not capture anything.
(In reply to Michael Hebenstreit from comment #18)
> We had one more occurrence but I was on vacation that day and could not
> capture anything

Any recent occurrences that you can share?

Thanks,
-Michael
No – all occurrences that were reported since then turned out to be priority issues. Still monitoring the situation.

If you want you can close the issue as "cannot reproduce" – I'll re-open it if I have new data.
(In reply to Michael Hebenstreit from comment #20)
> No – all occurrences that were reported since then turned out to be priority
> issues. Still monitoring situation

Ok, great.

> If you want you can close the issue as "cannot reproduce" – I'll re-open it
> if I have new data

Will do. Thanks!
new data
It happened again. I wrote a tool that draws information from "scontrol show". Job 166740 is ready to run and enough nodes are available - but it is not scheduled. After restarting slurmctld it is started immediately.

[root@endeavour5 icsmoke3.1]# /opt/slurm/crtdc/jobstat | grep clxap9242
4294734813 166740 inteldevq knektyag clxap9242 36-36 02:00:00 (null) 2021-07-24T07:37:26 Priority
4294734811 166742 inteldevq knektyag clxap9242 34-34 02:00:00 (null) 2021-07-24T09:37:00 Priority
4294734765 166788 inteldevq knektyag clxap9242 34-34 02:00:00 (null) 2021-07-24T11:25:33 Priority
4294734493 167066 inteldevq dmishura clxap9242 20 03:00:00 (null) 2021-07-24T11:37:00 Priority
4294734492 167067 inteldevq dmishura clxap9242 24-24 05:00:00 (null) 2021-07-24T13:25:00 Priority
4294734490 167069 inteldevq dmishura clxap9242 32-32 05:00:00 (null) 2021-07-24T14:37:00 Priority
4294734489 167070 inteldevq dmishura clxap9242 32-32 05:00:00 (null) 2021-07-24T18:25:00 Priority
4294734485 167074 inteldevq dmishura clxap9242 7-7 05:00:00 (null) 2021-07-24T19:37:00 Priority
4294734366 167194 inteldevq aknyaze1 clxap9242 21-21 10:00:00 (null) 2021-07-24T19:37:00 Priority
4294734347 167213 inteldevq msazhin clxap9242 12-12 06:00:00 (null) 2021-07-24T23:25:00 Priority

[root@endeavour5 icsmoke3.1]# bhosts -p inteldevq | grep clxap9242
eca[002-003] 2 192/0/0/192 inteldevq comp leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[046-047] 2 192/0/0/192 inteldevq comp leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca050 1 96/0/0/96 inteldevq comp leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[064-066] 3 288/0/0/288 inteldevq comp leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[004-012,035-036] 11 1056/0/0/1056 inteldevq alloc leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[013-018,038-042] 11 1056/0/0/1056 inteldevq alloc leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[019-022,024] 5 480/0/0/480 inteldevq alloc leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[025-027,029-030] 5 480/0/0/480 inteldevq alloc leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[001,031-034,055-060] 11 0/1056/0/1056 inteldevq idle leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[023,043-044,048,067-072] 10 0/960/0/960 inteldevq idle leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[028,049,051-054] 6 0/576/0/576 inteldevq idle leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[061-062] 2 0/192/0/192 inteldevq idle leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5

[root@endeavour5 icsmoke3.1]# /opt/slurm/crtdc/jobstat -p intelevq | grep clxap9242
[root@endeavour5 icsmoke3.1]# /opt/slurm/crtdc/jobstat -p inteldevq | grep clxap9242
4294734813 166740 inteldevq knektyag clxap9242 36-36 02:00:00 (null) 2021-07-24T07:38:42 Priority
4294734811 166742 inteldevq knektyag clxap9242 34-34 02:00:00 (null) 2021-07-24T09:38:00 Priority
4294734765 166788 inteldevq knektyag clxap9242 34-34 02:00:00 (null) 2021-07-24T11:25:33 Priority
4294734493 167066 inteldevq dmishura clxap9242 20 03:00:00 (null) 2021-07-24T11:38:00 Priority
4294734492 167067 inteldevq dmishura clxap9242 24-24 05:00:00 (null) 2021-07-24T13:25:00 Priority
4294734490 167069 inteldevq dmishura clxap9242 32-32 05:00:00 (null) 2021-07-24T14:38:00 Priority
4294734489 167070 inteldevq dmishura clxap9242 32-32 05:00:00 (null) 2021-07-24T18:25:00 Priority
4294734485 167074 inteldevq dmishura clxap9242 7-7 05:00:00 (null) 2021-07-24T19:38:00 Priority
4294734366 167194 inteldevq aknyaze1 clxap9242 21-21 10:00:00 (null) 2021-07-24T19:38:00 Priority
4294734347 167213 inteldevq msazhin clxap9242 12-12 06:00:00 (null) 2021-07-24T23:25:00 Priority

[root@endeavour5 icsmoke3.1]# bhosts -p inteldevq | grep clxap9242
eca[002-003] 2 192/0/0/192 inteldevq comp leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[046-047] 2 192/0/0/192 inteldevq comp leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca050 1 96/0/0/96 inteldevq comp leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[064-066] 3 288/0/0/288 inteldevq comp leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[004-012,035-036] 11 1056/0/0/1056 inteldevq alloc leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[013-018,038-042] 11 1056/0/0/1056 inteldevq alloc leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[019-022,024] 5 480/0/0/480 inteldevq alloc leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[025-027,029-030] 5 480/0/0/480 inteldevq alloc leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[001,031-034,055-060] 11 0/1056/0/1056 inteldevq idle leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[023,043-044,048,067-072] 10 0/960/0/960 inteldevq idle leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[028,049,051-054] 6 0/576/0/576 inteldevq idle leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[061-062] 2 0/192/0/192 inteldevq idle leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5

[root@endeavour5 icsmoke3.1]# /opt/slurm/crtdc/jobstat -p inteldevq | grep clxap9242
4294734813 166740 inteldevq knektyag clxap9242 36-36 02:00:00 (null) 2021-07-24T07:39:33 Priority
4294734811 166742 inteldevq knektyag clxap9242 34-34 02:00:00 (null) 2021-07-24T09:39:00 Priority
4294734765 166788 inteldevq knektyag clxap9242 34-34 02:00:00 (null) 2021-07-24T11:25:33 Priority
4294734493 167066 inteldevq dmishura clxap9242 20 03:00:00 (null) 2021-07-24T11:39:00 Priority
4294734492 167067 inteldevq dmishura clxap9242 24-24 05:00:00 (null) 2021-07-24T13:25:00 Priority
4294734490 167069 inteldevq dmishura clxap9242 32-32 05:00:00 (null) 2021-07-24T14:39:00 Priority
4294734489 167070 inteldevq dmishura clxap9242 32-32 05:00:00 (null) 2021-07-24T18:25:00 Priority
4294734485 167074 inteldevq dmishura clxap9242 7-7 05:00:00 (null) 2021-07-24T19:39:00 Priority
4294734366 167194 inteldevq aknyaze1 clxap9242 21-21 10:00:00 (null) 2021-07-24T19:39:00 Priority
4294734347 167213 inteldevq msazhin clxap9242 12-12 06:00:00 (null) 2021-07-24T23:25:00 Priority

[root@endeavour5 icsmoke3.1]# bhosts -p inteldevq | grep clxap9242
eca[002-003] 2 192/0/0/192 inteldevq comp leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[046-047] 2 192/0/0/192 inteldevq comp leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca050 1 96/0/0/96 inteldevq comp leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[064-066] 3 288/0/0/288 inteldevq comp leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[004-012,035-036] 11 1056/0/0/1056 inteldevq alloc leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[013-018,038-042] 11 1056/0/0/1056 inteldevq alloc leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[019-022,024] 5 480/0/0/480 inteldevq alloc leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[025-027,029-030] 5 480/0/0/480 inteldevq alloc leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[001,031-034,055-060] 11 0/1056/0/1056 inteldevq idle leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[023,043-044,048,067-072] 10 0/960/0/960 inteldevq idle leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[028,049,051-054] 6 0/576/0/576 inteldevq idle leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[061-062] 2 0/192/0/192 inteldevq idle leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5

After restart of slurmctld:

[root@endeavour5 icsmoke3.1]# /opt/slurm/crtdc/jobstat -p inteldevq | grep clxap9242
166740 inteldevq knektyag clxap9242 36 02:00:00 00:02:31 2021-07-24T09:41:49 eca[001-003,023,031-034,043-044,046-062,064-072]
4294734811 166742 inteldevq knektyag clxap9242 34-34 02:00:00 (null) 2021-07-24T09:41:49 Priority
4294734765 166788 inteldevq knektyag clxap9242 34-34 02:00:00 (null) 2021-07-24T11:25:33 Priority
4294734493 167066 inteldevq dmishura clxap9242 20-20 03:00:00 (null) 2021-07-24T11:41:00 Priority
4294734492 167067 inteldevq dmishura clxap9242 24-24 05:00:00 (null) 2021-07-24T13:25:00 Priority
4294734490 167069 inteldevq dmishura clxap9242 32-32 05:00:00 (null) 2021-07-24T14:41:00 Priority
4294734489 167070 inteldevq dmishura clxap9242 32-32 05:00:00 (null) 2021-07-24T18:25:00 Priority
4294734485 167074 inteldevq dmishura clxap9242 7-7 05:00:00 (null) 2021-07-24T19:41:00 Priority
4294734366 167194 inteldevq aknyaze1 clxap9242 21-21 10:00:00 (null) 2021-07-24T19:41:00 Priority
4294734347 167213 inteldevq msazhin clxap9242 12-12 06:00:00 (null) 2021-07-24T23:25:00 Priority

[root@endeavour5 icsmoke3.1]# bhosts -p inteldevq | grep clxap9242
eca[001-012,031-036,055-060] 24 2304/0/0/2304 inteldevq alloc leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[013-018,038-042,061-062,064-066] 16 1536/0/0/1536 inteldevq alloc leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[019-024,043-044,046-048,067-072] 17 1632/0/0/1632 inteldevq alloc leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[025-027,029-030,049-054] 11 1056/0/0/1056 inteldevq alloc leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca028 1 0/96/0/96 inteldevq idle leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
Created attachment 20525 [details] current slurm.con
Created attachment 20526 [details] current sched.log
Created attachment 20527 [details] current slurmctl.log
Hi Michael, sorry for the delay. I'm looking into it. Any changes with this issue, or do all your previous comments still apply?
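If it does reoccur, it would help to capture the following before restarting slurmctld. This is just a sketch using standard Slurm commands, with job 166740 and the eca nodes taken from your last report as examples:

# Snapshot to take the next time a job sits pending despite idle nodes,
# before slurmctld is restarted (job id and node names below are examples):
scontrol -o show job 166740           # full job record, incl. Reason and expected StartTime
scontrol show node eca[001,031-034]   # state of a few of the idle candidate nodes
scontrol show res                     # any active or upcoming reservations
sdiag                                 # main and backfill scheduler statistics
scontrol setdebugflags +Backfill      # optional: verbose backfill logging (revert with -Backfill)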
This behaviour did not show up again. I think for the moment this ticket can be closed; if I see it again I'll re-open it.
Ok, great. Closing out. Thanks! -Michael