Created attachment 20246 [details]
slurm.conf

Since last Friday slurmctld again and again goes into a state where jobs are no longer scheduled even though nodes are available. A restart of slurmctld fixes the issue for a few hours, but after 24h the situation is the same.

Symptoms: jobs with high priority show in "squeue" that nodes are not available, listing even nodes that do not meet the constraints, or nodes that certainly are up!

153387 inteldevq 1run ysmorodi PD 0:00 20 (ReqNodeNotAvail, UnavailableNodes:eea[001-018,020-021,023-024,026-036,038-040],eia[071-079,081-082,084-099,101-104,106-126,145-149,151-153,155-173,175-177,179-182,184-189,191-203,205,207-209,211-216,287-288],eid[356-357],eie[362-369],eif[372-378,395-396],eii[142-144,217-225,227,229-235,237-267,271-286,307-309,311,313-314,316-317,319,321-322,324,328-354,379],eik[387-389],eil[390-393],eit[289,291,293,295-306],ekln[01-02],els02,epb[087,136,147,160,177,812])

I wrote a tool that uses "scontrol show" to print out priorities and start times as well. When the cluster is in this condition, no start time is displayed for some jobs:

[mhebenst@endeavour5 ~]$ /opt/slurm/crtdc/bjobs
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
155380 atsq interact rozanova R 6:28:38 2 eii[310,319]
155888 cnvq bash achaibi R 13:43 1 einv001
155876 inteldevq bash knektyag R 51:42 1 eii249

Priority JobID Partition UserId Features NumNodes Reason ExpectedStartTime Dependency
4294755883 146012 workq artemaro clx8280 1-1 ReqNodeNotAvail,_Res Unknown (null)
4294755534 146361 workq artemaro clx8280 1-1 ReqNodeNotAvail,_Res Unknown (null)
4294755153 146838 workq artemaro clx8280 1-1 ReqNodeNotAvail,_Res Unknown (null)
4294752200 149789 inteldevq dkuts icx8352Y 12 Resources 2021-07-06T07:23:22 (null)
4294752199 149790 inteldevq dkuts icx8352Y 12 Priority 2021-07-06T17:00:00 (null)
4294752198 149791 inteldevq dkuts icx8352Y 12-12 Priority 2021-07-06T18:30:00 (null)
4294752197 149792 inteldevq dkuts icx8352Y 12-12 ReqNodeNotAvail,_Una 2021-07-06T20:00:00 (null)
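The /opt/slurm/crtdc/bjobs tool itself is site specific and not attached; roughly, the same pending-job listing can be reproduced with standard squeue format fields alone (a sketch only, not the actual tool):

# Rough equivalent of the pending-job listing above, built from squeue format fields
# (%Q priority, %i job id, %P partition, %u user, %f requested features,
#  %D node count, %r reason, %S expected start time, %E dependency)
squeue --state=PENDING --noheader -o "%Q %i %P %u %f %D %r %S %E" | sort -rn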
Created attachment 20247 [details] sched.log
Created attachment 20248 [details] slurmctl.log
After restart - now displaying correctly "Reserved for maintenance":

[root@endeavour5 icsmoke3.1]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
143697 atsq hpc2021 emelnich PD 0:00 1 (PartitionConfig)
155380 atsq interact rozanova R 6:42:23 2 eii[310,319]
155056 cnvq bash akundu PD 0:00 1 (ReqNodeNotAvail, Reserved for maintenance)
155888 cnvq bash achaibi R 27:28 1 einv001
155873 idealq starccm Xkdean PD 0:00 13 (ReqNodeNotAvail, Reserved for maintenance)
154329 inteldevq run-omp1 dkuts CG 0:00 1 eit289
149789 inteldevq run-base dkuts PD 0:00 12 (ReqNodeNotAvail, Reserved for maintenance)
149790 inteldevq run-snc2 dkuts PD 0:00 12 (ReqNodeNotAvail, Reserved for maintenance)
149791 inteldevq run-snc4 dkuts PD 0:00 12 (ReqNodeNotAvail, Reserved for maintenance)
149792 inteldevq run-ht.s dkuts PD 0:00 12 (ReqNodeNotAvail, Reserved for maintenance)
152707 inteldevq OFMa_185 dmishura PD 0:00 29 (ReqNodeNotAvail, Reserved for maintenance)
153387 inteldevq 1run ysmorodi PD 0:00 20 (ReqNodeNotAvail, Reserved for maintenance)
153388 inteldevq 1run ysmorodi PD 0:00 40 (ReqNodeNotAvail, Reserved for maintenance)
153392 inteldevq 1run ysmorodi PD 0:00 10 (ReqNodeNotAvail, Reserved for maintenance)
153393 inteldevq 1run ysmorodi PD 0:00 20 (ReqNodeNotAvail, Reserved for maintenance)
153394 inteldevq 1run ysmorodi PD 0:00 40 (ReqNodeNotAvail, Reserved for maintenance)
.....
Hi

Could you send us output from "scontrol show job", "scontrol show nodes", and "scontrol show res"? It looks like you have a planned maintenance reservation, which can block jobs from starting.

Dominik
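P.S. For example, something along these lines would capture all three outputs in one compressed file to attach (the file name is arbitrary):

# Capture the requested controller state in a single compressed file
{ scontrol show job; scontrol show nodes; scontrol show res; } > commands.out
gzip commands.out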
Created attachment 20251 [details]
commands.out.gz

Yes, I currently have a maintenance window. But before I restarted slurmctld, squeue was showing

153388 inteldevq 1run ysmorodi PD 0:00 40 (ReqNodeNotAvail, UnavailableNodes:eea[001-018,020-021,023-024,026-036,038-040],eia[071-079,081-082,084-099,101-104,106-126,145-149,151-153,155-173,175-177,179-182,184-189,191-203,205,207-209,211-216,287-288],eid[356-357],eie[362-369],eif[372-378,395-396],eii[142-144,217-225,227,229-235,237-267,271-286,307-309,311,313-314,316-317,319,321-322,324,328-354,379],eik[387-389],eil[390-393],eit[289,291,293,295-306],ekln[01-02],els02,epb[087,136,147,160,177,812])

After the restart the same job was listed as:

153388 inteldevq 1run ysmorodi PD 0:00 40 (ReqNodeNotAvail, Reserved for maintenance)

The "ReqNodeNotAvail" message has certainly been in the logs for some time for jobs not affected by the maintenance window.

Attaching output from the requested commands.
Michael
My apologies for the last message. I have Michael Hinton looking into this issue for you.
Hi Michael,

I'll go ahead and look into this.

(In reply to Michael Hebenstreit from comment #0)
> since last Friday slurmctld again and again goes into a state, where jobs
> are no longer scheduled even though nodes are available. A restart of
> slurmctld fixes the issue for a few hours, but after 24h the situation is
> the same.

The fact that you have a workaround means that this is not a severity 1 issue. I'm going to go ahead and reduce this to a severity 2 while we look into it. From https://www.schedmd.com/support.php:

"Severity 1 — Major Impact
"A Severity 1 issue occurs when there is a continued system outage that affects a large set of end users. The system is down and non-functional due to Slurm problem(s) and no procedural workaround exists.

"Severity 2 — High Impact
"A Severity 2 issue is a high-impact problem that is causing sporadic outages or is consistently encountered by end users with adverse impact to end user interaction with the system."

Thanks!
-Michael
(In reply to Michael Hebenstreit from comment #5)
> Yes, I currently have a maintenance window. But before I restarted slurmctld
> squeue was showing...

So it sounds like the underlying behavior is correct, but just that scontrol is showing the wrong reasoning for what it's doing. Is this correct? Or is the behavior itself wrong?
The behavior itself is wrong – no jobs are scheduled.
Michael,

I'm having a hard time seeing an example of the problem with what you provided.

Let's take job 153387, for example. It was submitted around 2021-07-04 at 2 AM, and then was stuck pending due to priority for over a day. Then, on 2021-07-05 at 10 AM, it started pending (I believe) because of the maintenance reservation, with the reason ReqNodeNotAvail. The maintenance reservation started on 2021-07-06 at 9 AM, so I'm guessing that job 153387 may have been pushed back (or maybe it was already scheduled to run later). Its expected StartTime is 2021-07-06 6:20 PM (after the maintenance reservation ends), and it will run for 5 hours.

The only thing I see potentially wrong is that squeue shows "ReqNodeNotAvail, UnavailableNodes: eea[001-018]..." instead of "ReqNodeNotAvail, Reserved for maintenance" - however, the UnavailableNodes are all nodes that are part of the maintenance reservation, so I don't really see a difference in effect, just in presentation.

Could you give more examples of a job not running when it should? Any more insight would be helpful.

Thanks!
-Michael
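P.S. One way to double-check on your side that the UnavailableNodes list is fully covered by the maintenance reservation is to expand both host lists and compare them. This is only a sketch: the UnavailableNodes argument is shortened here, and "maint_resv" is a placeholder for the real reservation name from "scontrol show res".

# Expand both host lists to one node per line, then print any node that is
# listed as unavailable but NOT part of the reservation (expected: no output).
scontrol show hostnames "eea[001-018,020-021],eia[071-079]" | sort > unavail.txt
scontrol show res maint_resv | tr ' ' '\n' | grep '^Nodes=' | cut -d= -f2 | \
    xargs scontrol show hostnames | sort > resv.txt
comm -23 unavail.txt resv.txt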
Not currently – I'll try to find something tomorrow.
So far after rebooting the master node we did not see the problem again – knocking on wood.
Ok. I'll go ahead and reduce the severity while we wait to see if it will reoccur.
Hey Michael, any updates? Thanks, -Michael
We had one more occurrence, but I was on vacation that day and could not capture anything.
(In reply to Michael Hebenstreit from comment #18)
> We had one more occurrence but I was on vacation that day and could not
> capture anything

Any recent occurrences that you can share?

Thanks,
-Michael
No – all occurrences that were reported since then turned out to be priority issues. Still monitoring the situation.

If you want you can close the issue as "cannot reproduce" – I'll re-open it if I have new data.
(In reply to Michael Hebenstreit from comment #20)
> No – all occurrences that were reported since then turned out to be priority
> issues. Still monitoring situation

Ok, great.

> If you want you can close the issue as "cannot reproduce" – I'll re-open it
> if I have new data

Will do. Thanks!
new data
It happened again. I wrote a tool that draws information from "scontrol show". Job 166740 is ready to run and enough nodes are available - but it is not scheduled. After restarting slurmctld it is started immediately.

[root@endeavour5 icsmoke3.1]# /opt/slurm/crtdc/jobstat | grep clxap9242
4294734813 166740 inteldevq knektyag clxap9242 36-36 02:00:00 (null) 2021-07-24T07:37:26 Priority
4294734811 166742 inteldevq knektyag clxap9242 34-34 02:00:00 (null) 2021-07-24T09:37:00 Priority
4294734765 166788 inteldevq knektyag clxap9242 34-34 02:00:00 (null) 2021-07-24T11:25:33 Priority
4294734493 167066 inteldevq dmishura clxap9242 20 03:00:00 (null) 2021-07-24T11:37:00 Priority
4294734492 167067 inteldevq dmishura clxap9242 24-24 05:00:00 (null) 2021-07-24T13:25:00 Priority
4294734490 167069 inteldevq dmishura clxap9242 32-32 05:00:00 (null) 2021-07-24T14:37:00 Priority
4294734489 167070 inteldevq dmishura clxap9242 32-32 05:00:00 (null) 2021-07-24T18:25:00 Priority
4294734485 167074 inteldevq dmishura clxap9242 7-7 05:00:00 (null) 2021-07-24T19:37:00 Priority
4294734366 167194 inteldevq aknyaze1 clxap9242 21-21 10:00:00 (null) 2021-07-24T19:37:00 Priority
4294734347 167213 inteldevq msazhin clxap9242 12-12 06:00:00 (null) 2021-07-24T23:25:00 Priority

[root@endeavour5 icsmoke3.1]# bhosts -p inteldevq | grep clxap9242
eca[002-003] 2 192/0/0/192 inteldevq comp leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[046-047] 2 192/0/0/192 inteldevq comp leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca050 1 96/0/0/96 inteldevq comp leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[064-066] 3 288/0/0/288 inteldevq comp leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[004-012,035-036] 11 1056/0/0/1056 inteldevq alloc leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[013-018,038-042] 11 1056/0/0/1056 inteldevq alloc leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[019-022,024] 5 480/0/0/480 inteldevq alloc leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[025-027,029-030] 5 480/0/0/480 inteldevq alloc leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[001,031-034,055-060] 11 0/1056/0/1056 inteldevq idle leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[023,043-044,048,067-072] 10 0/960/0/960 inteldevq idle leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[028,049,051-054] 6 0/576/0/576 inteldevq idle leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[061-062] 2 0/192/0/192 inteldevq idle leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5

[root@endeavour5 icsmoke3.1]# /opt/slurm/crtdc/jobstat -p intelevq | grep clxap9242
[root@endeavour5 icsmoke3.1]# /opt/slurm/crtdc/jobstat -p inteldevq | grep clxap9242
4294734813 166740 inteldevq knektyag clxap9242 36-36 02:00:00 (null) 2021-07-24T07:38:42 Priority
4294734811 166742 inteldevq knektyag clxap9242 34-34 02:00:00 (null) 2021-07-24T09:38:00 Priority
4294734765 166788 inteldevq knektyag clxap9242 34-34 02:00:00 (null) 2021-07-24T11:25:33 Priority
4294734493 167066 inteldevq dmishura clxap9242 20 03:00:00 (null) 2021-07-24T11:38:00 Priority
4294734492 167067 inteldevq dmishura clxap9242 24-24 05:00:00 (null) 2021-07-24T13:25:00 Priority
4294734490 167069 inteldevq dmishura clxap9242 32-32 05:00:00 (null) 2021-07-24T14:38:00 Priority
4294734489 167070 inteldevq dmishura clxap9242 32-32 05:00:00 (null) 2021-07-24T18:25:00 Priority
4294734485 167074 inteldevq dmishura clxap9242 7-7 05:00:00 (null) 2021-07-24T19:38:00 Priority
4294734366 167194 inteldevq aknyaze1 clxap9242 21-21 10:00:00 (null) 2021-07-24T19:38:00 Priority
4294734347 167213 inteldevq msazhin clxap9242 12-12 06:00:00 (null) 2021-07-24T23:25:00 Priority

[root@endeavour5 icsmoke3.1]# bhosts -p inteldevq | grep clxap9242
eca[002-003] 2 192/0/0/192 inteldevq comp leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[046-047] 2 192/0/0/192 inteldevq comp leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca050 1 96/0/0/96 inteldevq comp leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[064-066] 3 288/0/0/288 inteldevq comp leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[004-012,035-036] 11 1056/0/0/1056 inteldevq alloc leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[013-018,038-042] 11 1056/0/0/1056 inteldevq alloc leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[019-022,024] 5 480/0/0/480 inteldevq alloc leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[025-027,029-030] 5 480/0/0/480 inteldevq alloc leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[001,031-034,055-060] 11 0/1056/0/1056 inteldevq idle leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[023,043-044,048,067-072] 10 0/960/0/960 inteldevq idle leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[028,049,051-054] 6 0/576/0/576 inteldevq idle leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[061-062] 2 0/192/0/192 inteldevq idle leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5

[root@endeavour5 icsmoke3.1]# /opt/slurm/crtdc/jobstat -p inteldevq | grep clxap9242
4294734813 166740 inteldevq knektyag clxap9242 36-36 02:00:00 (null) 2021-07-24T07:39:33 Priority
4294734811 166742 inteldevq knektyag clxap9242 34-34 02:00:00 (null) 2021-07-24T09:39:00 Priority
4294734765 166788 inteldevq knektyag clxap9242 34-34 02:00:00 (null) 2021-07-24T11:25:33 Priority
4294734493 167066 inteldevq dmishura clxap9242 20 03:00:00 (null) 2021-07-24T11:39:00 Priority
4294734492 167067 inteldevq dmishura clxap9242 24-24 05:00:00 (null) 2021-07-24T13:25:00 Priority
4294734490 167069 inteldevq dmishura clxap9242 32-32 05:00:00 (null) 2021-07-24T14:39:00 Priority
4294734489 167070 inteldevq dmishura clxap9242 32-32 05:00:00 (null) 2021-07-24T18:25:00 Priority
4294734485 167074 inteldevq dmishura clxap9242 7-7 05:00:00 (null) 2021-07-24T19:39:00 Priority
4294734366 167194 inteldevq aknyaze1 clxap9242 21-21 10:00:00 (null) 2021-07-24T19:39:00 Priority
4294734347 167213 inteldevq msazhin clxap9242 12-12 06:00:00 (null) 2021-07-24T23:25:00 Priority

[root@endeavour5 icsmoke3.1]# bhosts -p inteldevq | grep clxap9242
eca[002-003] 2 192/0/0/192 inteldevq comp leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[046-047] 2 192/0/0/192 inteldevq comp leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca050 1 96/0/0/96 inteldevq comp leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[064-066] 3 288/0/0/288 inteldevq comp leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[004-012,035-036] 11 1056/0/0/1056 inteldevq alloc leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[013-018,038-042] 11 1056/0/0/1056 inteldevq alloc leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[019-022,024] 5 480/0/0/480 inteldevq alloc leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[025-027,029-030] 5 480/0/0/480 inteldevq alloc leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[001,031-034,055-060] 11 0/1056/0/1056 inteldevq idle leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[023,043-044,048,067-072] 10 0/960/0/960 inteldevq idle leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[028,049,051-054] 6 0/576/0/576 inteldevq idle leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[061-062] 2 0/192/0/192 inteldevq idle leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5

After restart of slurmctld:

[root@endeavour5 icsmoke3.1]# /opt/slurm/crtdc/jobstat -p inteldevq | grep clxap9242
166740 inteldevq knektyag clxap9242 36 02:00:00 00:02:31 2021-07-24T09:41:49 eca[001-003,023,031-034,043-044,046-062,064-072]
4294734811 166742 inteldevq knektyag clxap9242 34-34 02:00:00 (null) 2021-07-24T09:41:49 Priority
4294734765 166788 inteldevq knektyag clxap9242 34-34 02:00:00 (null) 2021-07-24T11:25:33 Priority
4294734493 167066 inteldevq dmishura clxap9242 20-20 03:00:00 (null) 2021-07-24T11:41:00 Priority
4294734492 167067 inteldevq dmishura clxap9242 24-24 05:00:00 (null) 2021-07-24T13:25:00 Priority
4294734490 167069 inteldevq dmishura clxap9242 32-32 05:00:00 (null) 2021-07-24T14:41:00 Priority
4294734489 167070 inteldevq dmishura clxap9242 32-32 05:00:00 (null) 2021-07-24T18:25:00 Priority
4294734485 167074 inteldevq dmishura clxap9242 7-7 05:00:00 (null) 2021-07-24T19:41:00 Priority
4294734366 167194 inteldevq aknyaze1 clxap9242 21-21 10:00:00 (null) 2021-07-24T19:41:00 Priority
4294734347 167213 inteldevq msazhin clxap9242 12-12 06:00:00 (null) 2021-07-24T23:25:00 Priority

[root@endeavour5 icsmoke3.1]# bhosts -p inteldevq | grep clxap9242
eca[001-012,031-036,055-060] 24 2304/0/0/2304 inteldevq alloc leaf5c,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[013-018,038-042,061-062,064-066] 16 1536/0/0/1536 inteldevq alloc leaf5b,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[019-024,043-044,046-048,067-072] 17 1632/0/0/1632 inteldevq alloc leaf5d,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca[025-027,029-030,049-054] 11 1056/0/0/1056 inteldevq alloc leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eca028 1 0/96/0/96 inteldevq idle leaf5a,reconfig,corenode,clxap9242,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
Created attachment 20525 [details] current slurm.con
Created attachment 20526 [details] current sched.log
Created attachment 20527 [details] current slurmctl.log
Hi Michael, sorry for the delay. I'm looking into it. Any changes with this issue, or do all your previous comments still apply?
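If it does reoccur, it would help to capture the following before restarting slurmctld. This is just a sketch using standard Slurm commands, with job 166740 and the eca nodes taken from your last report as examples:

# Snapshot to take the next time a job sits pending despite idle nodes,
# before slurmctld is restarted (job id and node names below are examples):
scontrol -o show job 166740           # full job record, incl. Reason and expected StartTime
scontrol show node eca[001,031-034]   # state of a few of the idle candidate nodes
scontrol show res                     # any active or upcoming reservations
sdiag                                 # main and backfill scheduler statistics
scontrol setdebugflags +Backfill      # optional: verbose backfill logging (revert with -Backfill)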
This behaviour did not show up again. I think for the moment this ticket can be closed; if I see it again I'll re-open it.
Ok, great. Closing out. Thanks! -Michael