Ticket 12102 - nodes scheduled to 2 jobs although everything should be exclusive
Summary: nodes scheduled to 2 jobs although everything should be exclusive
Status: RESOLVED WONTFIX
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 20.11.7
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Marcin Stolarek
Reported: 2021-07-22 09:30 MDT by Michael Hebenstreit
Modified: 2021-10-05 11:50 MDT

See Also: https://bugs.schedmd.com/show_bug.cgi?id=11521
Site: Intel CRT


Attachments
slurm.conf (5.38 KB, application/gzip) - 2021-07-23 13:22 MDT, Michael Hebenstreit
slurmctl.log (2.13 MB, application/gzip) - 2021-07-27 08:56 MDT, Michael Hebenstreit
sched.log (4.92 MB, application/gzip) - 2021-07-27 08:57 MDT, Michael Hebenstreit

Description Michael Hebenstreit 2021-07-22 09:30:28 MDT
I have instances where nodes are used by multiple jobs in parallel, although everything should be exclusive.

[root@eslurm1 crtdc]# sacct -j 166326,166336 --format JobID,Start,End,NodeList
       JobID               Start                 End        NodeList
------------ ------------------- ------------------- ---------------
166326       2021-07-22T04:56:29 2021-07-22T04:56:46 eca[027-036,03+
166326.batch 2021-07-22T04:56:29 2021-07-22T05:16:52          eca027
166336       2021-07-22T05:11:42 2021-07-22T05:27:06 eca[009-025,02+


For job 166326 I don't really understand why the base job is very short yet has a far longer node list, but the main point is 166326.batch, which kept using eca027 until 5:16. At 5:11, however, job 166336 was scheduled with the complete nodelist:

SLURM_NODELIST=eca[003-036,038-044,046-062,064-069]

# sacct -N eca027 --format JobID,Start,End,NodeList
       JobID               Start                 End        NodeList
------------ ------------------- ------------------- ---------------
.....
166326       2021-07-22T04:56:29 2021-07-22T04:56:46 eca[027-036,03+
166326.batch 2021-07-22T04:56:29 2021-07-22T05:16:52          eca027
166336       2021-07-22T05:11:42 2021-07-22T05:27:06 eca[009-025,02+
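(Editorial aside, not part of the original report: the short 166326 record is the job allocation and 166326.batch is its batch step. A sketch like the following, using sacct's -X/--allocations flag, would limit the output to the allocation records only and make the comparison clearer; the job IDs are simply the ones quoted above.)

# Show only the allocation records (no steps) for the two jobs in question.
sacct -X -j 166326,166336 --format JobID,Start,End,NodeList,State -p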


Job 166336 was certainly submitted with --exclusive:

srun --pty --exclusive   "-t" "2880" "-N" "20" "--constraint=clxap9242" "-p" "inteldevq" "/bin/bash"
srun: environment addon enabled
srun: found Resource_List_CRTDC_PROJECTCODE=O:
srun: found Resource_List_RESTART_IBSTACK=yes
srun: found Resource_List_IBSTACK=10.11.0.0.577_240.22.1.3_2.12.5
srun: job 166336 queued and waiting for resources
srun: job 166336 has been allocated resources
Comment 1 Skyler Malinowski 2021-07-23 12:22:43 MDT
What version of Slurm are you using?

Would you please attach your slurm.conf?
Comment 2 Skyler Malinowski 2021-07-23 13:05:11 MDT
Additionally, can you provide the following output:
> sacct -j 166326,166336 --format JobID,Start,End,NodeList -p
> sacct -N eca027 --format JobID,Start,End,NodeList -p

Thanks,
Skyler
Comment 3 Michael Hebenstreit 2021-07-23 13:19:53 MDT
[root@eslurm1 crtdc]# sacct -j 166326,166336 --format JobID,Start,End,NodeList -p
JobID|Start|End|NodeList|
166326|2021-07-22T04:56:29|2021-07-22T04:56:46|eca[027-036,039-044,049-050,055-060]|
166326.batch|2021-07-22T04:56:29|2021-07-22T05:16:52|eca027|
166336|2021-07-22T05:11:42|2021-07-22T05:27:06|eca[009-025,027-029]|
[root@eslurm1 crtdc]# sacct -N eca027 --format JobID,Start,End,NodeList -p
JobID|Start|End|NodeList|
166662|2021-07-23T00:04:41|2021-07-23T00:25:19|eca[023-028,035-036,038-044,046-054]|
166686|2021-07-23T01:55:35|Unknown|eca[004-022,024-027,029-030,035-036,038-042]|
166686.0|2021-07-23T02:06:09|Unknown|eca[004-022,024-027,029-030,035-036,038-042]|


Comment 4 Michael Hebenstreit 2021-07-23 13:22:16 MDT
Created attachment 20521 [details]
slurm.conf
Comment 5 Michael Hebenstreit 2021-07-23 13:22:52 MDT
version 20.11.7
Comment 7 Marcin Stolarek 2021-07-27 03:40:52 MDT
Michael,

Could you please attach the slurmctld log from the time when the jobs were submitted/running?
I'd like to make sure whether this is a scheduling issue or just a batch-step accounting issue. If the allocation (what you called the "base job") was done, all the steps were signalled/terminated, and I'd expect the allocation to be marked as done when the resources were released.

Could you please share the results of:
>sacct -p -a -o jobid,state,start,end -j 166326
I'd like to narrow down the code paths that resulted in the non-obvious behaviour.

cheers,
Marcin
Comment 8 Michael Hebenstreit 2021-07-27 08:40:28 MDT
[root@eslurm1 crtdc]# sacct -p -a -o jobid,state,start,end -j 166326
JobID|State|Start|End|
166326|NODE_FAIL|2021-07-22T04:56:29|2021-07-22T04:56:46|
166326.batch|CANCELLED|2021-07-22T04:56:29|2021-07-22T05:16:52|

Comment 9 Michael Hebenstreit 2021-07-27 08:54:35 MDT
[root@eslurm1 crtdc]# grep 166326 /opt/slurm/current/logs/slurm/slurmctl.log /opt/slurm/current/logs/slurm/slurmctl.log.6
/opt/slurm/current/logs/slurm/slurmctl.log:[2021-07-27T02:36:07.942] update_job_str: invalid JobId=166326
/opt/slurm/current/logs/slurm/slurmctl.log:[2021-07-27T02:36:07.943] _slurm_rpc_update_job: JobId=166326 uid=28990: Invalid job id specified
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:50:35.859] _slurm_rpc_submit_batch_job: JobId=166326 InitPrio=4294735248 usec=527
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:53:07.981] sched: Allocate JobId=166326 NodeList=eca[020-034,036,039-044,049-050] #CPUs=2304 Partition=inteldevq
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:53:08.032] prolog_running_decr: Configuration for JobId=166326 is complete
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:53:27.002] requeue job JobId=166326 due to failure of node eca023
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:54:11.959] Requeuing JobId=166326
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:56:29.892] sched/backfill: _start_job: Started JobId=166326 in inteldevq on eca[027-036,039-044,049-050,055-060]
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:56:29.951] prolog_running_decr: Configuration for JobId=166326 is complete
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:56:46.909] requeue job JobId=166326 due to failure of node eca057
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:58:09.217] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:00:09.553] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:01:09.040] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:02:09.616] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:03:09.229] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:04:09.798] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:05:09.430] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:06:09.803] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:07:09.290] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:08:09.114] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:09:10.303] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:10:10.929] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:11:10.393] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:12:10.044] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:13:10.473] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:13:24.413] update_node: node eca057 reason set to: downed by 166326 prologue, failed could not get report from node with 254
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:13:26.624] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=166326 uid 0
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:14:10.148] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:15:10.501] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:16:10.203] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:16:52.875] cleanup_completing: JobId=166326 completion process took 1206 seconds
[root@eslurm1 crtdc]# grep 166336 /opt/slurm/current/logs/slurm/slurmctl.log /opt/slurm/current/logs/slurm/slurmctl.log.6
/opt/slurm/current/logs/slurm/slurmctl.log:[2021-07-27T02:36:08.224] update_job_str: invalid JobId=166336
/opt/slurm/current/logs/slurm/slurmctl.log:[2021-07-27T02:36:08.224] _slurm_rpc_update_job: JobId=166336 uid=28990: Invalid job id specified
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:11:39.260] sched: _slurm_rpc_allocate_resources JobId=166336 NodeList=(null) usec=536
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:11:42.199] sched: Allocate JobId=166336 NodeList=eca[009-025,027-029] #CPUs=1920 Partition=inteldevq
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:11:42.251] prolog_running_decr: Configuration for JobId=166336 is complete
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:27:06.638] _job_complete: JobId=166336 WEXITSTATUS 0
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:27:06.641] _job_complete: JobId=166336 done
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:28:11.837] Resending TERMINATE_JOB request JobId=166336 Nodelist=eca009
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:30:49.905] cleanup_completing: JobId=166336 completion process took 223 seconds
Comment 10 Michael Hebenstreit 2021-07-27 08:56:51 MDT
Created attachment 20556 [details]
slurmctl.log
Comment 11 Michael Hebenstreit 2021-07-27 08:57:21 MDT
Created attachment 20557 [details]
sched.log
Comment 13 Michael Hebenstreit 2021-08-03 14:59:53 MDT
I had such a situation again.

[root@eslurm1 crtdc]# sacct -j 169771,169905 --format JobID,Start,End,user,NodeList -p
JobID|Start|End|User|NodeList|
169771|2021-08-03T12:11:13|2021-08-03T13:57:06|dmishura|eca[001,013,015,017,026-028,041-044,046-056]|
169771.batch|2021-08-03T12:11:13|2021-08-03T13:57:06||eca001|
169905|2021-08-03T13:58:27|Unknown|dmishura|eca[051-052]|
169905.batch|2021-08-03T13:58:27|Unknown||eca051|

But the job epilogue for 169771 ran until 14:13, and that caused issues.
Comment 14 Marcin Stolarek 2021-08-03 15:09:55 MDT
Michael,

I was looking into this and I think the issue is actually more related to accounting than to jobs actually sharing a node.

Could you please share sacct with job states (as I asked before) for this situation?

I've also noticed that you may be modifying node state quite frequently via some external automation. Am I correct? If so, could you please explain how this works?

cheers,
Marcin
Comment 15 Michael Hebenstreit 2021-08-03 15:58:47 MDT
What commands exactly do you need to run?

In the epilogue, nodes are taken offline while the epilogue is running, until it is complete (using scontrol commands).

Comment 18 Michael Hebenstreit 2021-08-05 15:37:59 MDT
That last answer might be confusing because I replied via e-mail. Here is my answer again:

What commands exactly do you need to run?

In the epilogue, nodes are taken offline while the epilogue is running, until it is complete (using scontrol commands).
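(For reference, a minimal sketch - not the site's actual script - of the kind of drain/resume epilog described above. The node name, reason string and cleanup step are placeholders, and as noted later in this ticket, calling scontrol from prolog/epilog is exactly the pattern SchedMD advises against.)

#!/bin/bash
# Hypothetical epilog sketch: take the node offline for the duration of its
# cleanup work, then return it to service.
NODE=${SLURMD_NODENAME:-$(hostname -s)}

# Stop the scheduler from placing new jobs on this node while cleanup runs.
scontrol update NodeName="$NODE" State=DRAIN Reason="epilog cleanup for job $SLURM_JOB_ID"

/opt/site/cleanup_node.sh          # placeholder for the real cleanup/reboot logic

# Put the node back into service once cleanup has finished.
scontrol update NodeName="$NODE" State=RESUME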
Comment 19 Jason Booth 2021-08-05 16:23:50 MDT
Michael, I am stepping in for Marcin as it is past his working hours at the moment.

>What commands exactly do you need to run?

 sacct -j 169771,169905 --format JobID,Start,End,user,NodeList,State -p


After looking over this bug, this seems to be an accounting issue - the end time of the first job is probably not accurate. It is very likely caused by the automation done on your side. It is important to note that we discussed this with you and, as far as I know, you understood that this is not only not recommended but also results in limited support.

I'm afraid this is one of the edge-case issues you experience because of that.
Comment 20 Michael Hebenstreit 2021-08-05 16:35:18 MDT
[root@eslurm1 crtdc]# sacct -j 169771,169905 --format JobID,Start,End,user,NodeList,State -p
JobID|Start|End|User|NodeList|State|
169771|2021-08-03T12:11:13|2021-08-03T13:57:06|dmishura|eca[001,013,015,017,026-028,041-044,046-056]|COMPLETED|
169771.batch|2021-08-03T12:11:13|2021-08-03T13:57:06||eca001|COMPLETED|
169905|2021-08-03T16:39:00|2021-08-03T17:13:26|dmishura|eca[020,038]|COMPLETED|
169905.batch|2021-08-03T16:39:00|2021-08-03T17:13:26||eca020|COMPLETED|
Comment 21 Marcin Stolarek 2021-08-06 00:32:00 MDT
Michael,

I'm looking at the Start/End times you just shared and I don't see anything strange here. (I simplified the output, for instance removing the date since it was all 2021-08-03):

>Job/Step    |Start   |  End     |
>169771      |12:11:13|  13:57:06|
>169771.batch|12:11:13|  13:57:06|
>169905      |16:39:00|  17:13:26|
>169905.batch|16:39:00|  17:13:26|

The way I read it, the batch-step and allocation start times are aligned.
Job 169771 started at 12:11 and ended at 13:57; job 169905 started at 16:39 and ended at 17:13, so there is no time overlap?

I see that in comment 13 you stated:
>but job epilogue for 169771 run until 14:13 and that caused issues.
Is that something you measured your own way? Is it part of the configuration that falls under Jason's remark (comment 19):
>[...] It is important to note that we discussed this with you and as far as I know you understood this not only not recommended, but also results in limited support.
I'm asking just to make sure we don't spend a few hours only to reach very similar conclusions.

I see that the 169905 times differ between the two sacct executions; could you please check:
>sacct -j 169771,169905 -D --format JobID,Start,End,user,NodeList,State -p

cheers,
Marcin
Comment 22 Michael Hebenstreit 2021-08-06 05:56:59 MDT
> >but job epilogue for 169771 run until 14:13 and that caused issues.
> Is that something measured your own way? Is it a part of your configuration that fills in....

I know this from the epilogue log, which has a last-modification timestamp of 14:13.
Comment 23 Michael Hebenstreit 2021-08-06 06:44:35 MDT
[root@eslurm1 crtdc]# sacct -j 169771,169905 -D --format JobID,Start,End,user,NodeList,State -p
JobID|Start|End|User|NodeList|State|
169771|2021-08-03T12:11:13|2021-08-03T13:57:06|dmishura|eca[001,013,015,017,026-028,041-044,046-056]|COMPLETED|
169771.batch|2021-08-03T12:11:13|2021-08-03T13:57:06||eca001|COMPLETED|
169905|2021-08-03T13:58:27|2021-08-03T15:03:38|dmishura|eca[051-052]|NODE_FAIL|
169905.batch|2021-08-03T13:58:27|2021-08-03T15:04:54||eca051|CANCELLED|
169905|2021-08-03T16:39:00|2021-08-03T17:13:26|dmishura|eca[020,038]|COMPLETED|
169905.batch|2021-08-03T16:39:00|2021-08-03T17:13:26||eca020|COMPLETED|
Comment 24 Marcin Stolarek 2021-08-11 09:09:56 MDT
Michael,

What we see in both cases is that the allocation was cancelled because of a node-down (NODE_FAIL) event. Do you know what might be causing the nodes to go down?

cheers,
Marcin
Comment 25 Michael Hebenstreit 2021-08-11 10:41:12 MDT
We have pretty extensive prologue/epilogue procedures (runtime 30s to 15min including a reboot). I suspect now something like this happened:

Job A starts on node1/node2/node3
Job A prologue runs
Job A user code starts
Job A epilogue runs but takes longer on node1/node3
Job B starts on node2
Job B prologue runs
Job B user code starts
Job A epilogue finishes on node1/node3, clears all nodes[1-3] and possibly restarts slurmd on the nodes
Job B's node2 is now considered faulty by Slurm and Job B is cancelled

I made some adjustments to the epilogue to ensure that it runs for the same amount of time on all nodes.
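(A minimal sketch - not the actual site script - of one way to pad an epilog to a fixed wall time so it finishes at roughly the same moment on every node. The 900-second budget and the cleanup command are placeholders.)

#!/bin/bash
# Hypothetical epilog wrapper: run the real cleanup, then sleep out the rest of
# a fixed time budget so all nodes of a job release at about the same time.
BUDGET=900                       # placeholder: worst-case epilog runtime in seconds
START=$(date +%s)

/opt/site/epilog_real.sh         # placeholder for the actual cleanup/reboot work

ELAPSED=$(( $(date +%s) - START ))
if [ "$ELAPSED" -lt "$BUDGET" ]; then
    sleep $(( BUDGET - ELAPSED ))
fi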
Comment 26 Michael Hebenstreit 2021-08-12 13:45:17 MDT
I just caught this situation:
[root@endeavour5 icsmoke3.1]# squeue -w eia059
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            173372  nextgenq bld_LAMM schernyk  R      15:52      1 eia059
            173308     workq gnn_laun hmostafa CG      17:44      2 eia[059-060]
[root@endeavour5 icsmoke3.1]# date
Thu Aug 12 13:43:15 MDT 2021
[root@endeavour5 icsmoke3.1]# squeue -w eia059
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            173372  nextgenq bld_LAMM schernyk  R      16:44      1 eia059
            173308     workq gnn_laun hmostafa CG      17:44      2 eia[059-060]
[root@endeavour5 icsmoke3.1]# date
Thu Aug 12 13:43:19 MDT 2021

173308 is not yet done, but 173372 is already running. How can I avoid such a situation?
Comment 28 Michael Hebenstreit 2021-08-16 08:24:33 MDT
Today I had another one of those cases:

[root@endeavour5 icsmoke3.1]# squeue -p idealq
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            174481    idealq   mgraph   Xgcisn CG       9:43     16 eii[339-354]
            174489    idealq   mgraph   Xgcisn PD       0:00     16 (Dependency)
            174486    idealq   mgraph   Xgcisn  R       5:28     16 eii[339-354]

174486 was started before the epilogue of 174481 had run to completion!
Comment 29 Jason Booth 2021-08-16 19:54:02 MDT
Marcin and Michael - I took some time to go over this bug and tried to duplicate it today. I was able to get something close to what Michael reported. What happens in my instance is that the first job/node has some type of failure and the node's state is set to DOWN. I think* that you, Michael, or some automated script, are setting the nodes back to idle. For example:


>[2021-07-21T06:54:00.459] update_node: node eca027 state set to DOWN
>[2021-07-21T10:16:02.103] update_node: node eca027 state set to IDLE
>[2021-07-21T10:30:02.710] update_node: node eca027 state set to IDLE
>[2021-07-21T10:44:49.026] update_node: node eca027 state set to IDLE
>[2021-07-21T11:32:57.289] update_node: node eca027 state set to IDLE
>[2021-07-21T19:34:54.640] update_node: node eca027 state set to IDLE
>[2021-07-21T23:06:08.775] update_node: node eca027 state set to IDLE
>[2021-07-21T23:43:18.720] update_node: node eca027 state set to IDLE
>[2021-07-22T00:17:44.746] update_node: node eca027 state set to IDLE
>[2021-07-22T00:18:55.941] update_node: node eca027 state set to IDLE
>[2021-07-22T02:38:01.853] update_node: node eca027 state set to IDLE
>[2021-07-22T03:55:58.125] update_node: node eca027 state set to IDLE
>[2021-07-22T03:57:11.764] update_node: node eca027 state set to IDLE
>[2021-07-22T04:02:55.420] update_node: node eca027 state set to IDLE
>[2021-07-22T04:05:55.947] update_node: node eca027 state set to IDLE
>[2021-07-22T04:08:36.385] update_node: node eca027 state set to DOWN
>[2021-07-22T04:09:24.259] update_node: node eca027 state set to DOWN
>[2021-07-22T04:17:46.091] update_node: node eca027 state set to IDLE
>[2021-07-22T04:20:44.068] update_node: node eca027 state set to IDLE
>[2021-07-22T04:29:42.502] update_node: node eca027 state set to IDLE
>[2021-07-22T04:53:58.797] update_node: node eca027 state set to IDLE
>[2021-07-22T04:57:23.467] update_node: node eca027 state set to IDLE
>[2021-07-22T05:00:34.423] update_node: node eca027 state set to IDLE
>[2021-07-22T05:29:41.979] update_node: node eca027 state set to IDLE
>[2021-07-22T06:17:19.462] update_node: node eca027 state set to IDLE
>[2021-07-22T06:59:04.146] update_node: node eca027 state set to IDLE
>[2021-07-22T07:35:56.323] update_node: node eca027 state set to IDLE
>[2021-07-22T09:15:25.796] update_node: node eca027 state set to IDLE
>[2021-07-22T11:26:37.214] update_node: node eca027 state set to IDLE
>[2021-07-22T14:55:48.290] update_node: node eca027 state set to IDLE
>[2021-07-22T15:36:23.582] update_node: node eca027 state set to IDLE
>[2021-07-22T16:03:59.347] update_node: node eca027 state set to DOWN
>[2021-07-22T16:32:26.009] update_node: node eca027 state set to IDLE
>[2021-07-22T19:27:24.682] update_node: node eca027 state set to IDLE
>[2021-07-22T20:12:27.913] update_node: node eca027 state set to IDLE
>[2021-07-22T21:32:25.096] update_node: node eca027 state set to IDLE
>[2021-07-22T22:08:54.775] update_node: node eca027 state set to IDLE
>[2021-07-22T22:49:53.980] update_node: node eca027 state set to IDLE
>[2021-07-23T00:27:51.422] update_node: node eca027 state set to IDLE
>[2021-07-23T01:08:14.706] update_node: node eca027 state set to IDLE




There is an edge case here where, if a node was downed due to some failure (job or prolog) while the prolog is still running, and the node is then updated from DOWN -> IDLE, slurmctld will start a new job on that node. You will then see the overlapping jobs mentioned in the slurmd logs.

So, my first comments here are as follows. Michael, we cannot support long-running epilog or prolog scripts.
> https://slurm.schedmd.com/prolog_epilog.html
> "Prolog and Epilog scripts should be designed to be as short as possible and should not call Slurm commands"

This can lead to strange behavior such as what you are seeing. What complicates this further is that nodes are legitimately being marked down due to some type of node failure and then, to the best of our understanding, being pushed back into an idle state, which then allows new jobs to start.

This workflow is outside the currently supported workflow of Slurm and I don't think we will make you happy here, since we cannot support this solution.
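(For illustration only, not something proposed in the ticket: if external automation does flip nodes from DOWN back to IDLE, a guard along these lines could avoid resuming a node that still has running or completing jobs on it. The node name is a placeholder.)

#!/bin/bash
# Hypothetical guard for external automation: only resume a node if Slurm
# reports no running or completing jobs on it.
NODE=eca027                      # placeholder

if [ -z "$(squeue -h -w "$NODE" -t running,completing)" ]; then
    scontrol update NodeName="$NODE" State=RESUME
else
    echo "jobs still active on $NODE; not resuming" >&2
fi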
Comment 30 Michael Hebenstreit 2021-08-16 20:13:38 MDT
that's taking the easy way out :(

In today's case this certainly did not happen - the prologue on the main node was still running, and nodes are only restored to state idle at the very end of the prologue - but ALL nodes were used in duplicate.

Also - none of these problems were visible with the older version 20.02.5. 

So you are essentially saying that if we have prologues running for 15 min, we should go back to LSF?
Comment 31 Michael Hebenstreit 2021-08-17 09:43:20 MDT
We just found a new error - a job that was scheduled to completely wrong nodes. Requested was constraint leaf113, but it was scheduled to leaf213 nodes (bhosts is a script calling sinfo with a different format to show node constraints):

[root@eslurm1 crtdc]# bhosts -p idealq -n eii[217-232]
NODELIST      NODES CPUS(A/I/O/T)    PARTITION        STATE      AVAIL_FEATURES
eii[219-232]     14 1008/0/0/1008    idealq           alloc      reconfig,leaf213,icx8360Y,icx8360Yf3,icx8360Yopa,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eii[217-218]      2 0/144/0/144      idealq           idle       reconfig,leaf213,icx8360Y,icx8360Yf3,icx8360Yopa,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
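(For context, a sinfo invocation roughly equivalent to what such a bhosts wrapper might run - the actual wrapper is not shown in this ticket:)

# Hypothetical equivalent of the bhosts wrapper: nodelist, node count,
# CPU usage (A/I/O/T), partition, state and available features.
sinfo -p idealq -n eii[217-232] -o "%N %D %C %P %t %f"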

[root@eslurm1 crtdc]# sacct -j 174615 --format Constraints,NodeList
        Constraints        NodeList
------------------- ---------------
 [leaf113&reconfig]    eii[217-232]
                             eii217
                       eii[218-232]
                       eii[218-232]
                       eii[218-232]
[root@eslurm1 crtdc]# grep leaf213 slurm.conf
NodeName=eii[217-234]  Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=reconfig,leaf213,icx8360Y,icx8360Yf3,icx8360Yopa,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
[root@eslurm1 crtdc]# grep leaf113 slurm.conf
NodeName=eia[073-077]  Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=leaf113,icx8360Y,icx8360Yf2,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
NodeName=eia[078-080]  Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=leaf113,icx8360Y,icx8360Yf2,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
NodeName=eia[081-084]  Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=leaf113,icx8360Y,icx8360Yf2,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
NodeName=eia[085-093]  Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=leaf113,icx8360Y,icx8360Yf2,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5

[root@eslurm1 crtdc]# grep 174615 /opt/slurm/current/logs/slurm/slurmctl.log
[2021-08-16T23:15:42.523] _slurm_rpc_submit_batch_job: JobId=174615 InitPrio=4294726966 usec=762
[2021-08-16T23:15:45.809] sched: Allocate JobId=174615 NodeList=eii[217-232] #CPUs=1152 Partition=idealq
[2021-08-16T23:24:10.821] JobId=174615 boot complete for all 16 nodes
[2021-08-16T23:24:10.821] prolog_running_decr: Configuration for JobId=174615 is complete
[2021-08-16T23:24:49.968] _job_complete: JobId=174615 WEXITSTATUS 1
[2021-08-16T23:24:49.970] _job_complete: JobId=174615 done
[2021-08-16T23:26:03.364] cleanup_completing: JobId=174615 completion process took 74 seconds

[root@eslurm1 crtdc]# grep 174615 /opt/slurm/current/logs/slurm/sched.log
sched: [2021-08-16T23:15:42.523] JobId=174615 allocated resources: NodeList=(null)
sched: [2021-08-16T23:15:45.809] JobId=174615 initiated
sched: [2021-08-16T23:15:45.809] Allocate JobId=174615 NodeList=eii[217-232] #CPUs=1152 Partition=idealq


Despite all you are saying, I think you have had a serious problem in your scheduler since we switched to 20.11.7.
Comment 32 Jason Booth 2021-08-17 11:00:14 MDT
> We just found a new error - a job that was scheduled to completely wrong nodes. Requested was constraint leaf113, but it was scheduled to leaf213 nodes (bhosts is a script calling sinfo with a different format to show node constraints)

Please do log this as a separate issue. This is something we will want to look into.
Comment 33 Michael Hebenstreit 2021-08-17 12:01:26 MDT
https://bugs.schedmd.com/show_bug.cgi?id=12295
Comment 34 Michael Hebenstreit 2021-08-26 11:44:52 MDT
From the Slurm 19.05.3 release notes:

“Nodes in COMPLETING state treated as being currently available for job will-run test.”

(Reference: https://github.com/SchedMD/slurm/blob/master/NEWS.) As jobs still performing the epilog are in COMPLETING, this makes it possible for the scheduler to schedule onto nodes still working through the epilog. In my case, as the nodes reset the ATS cards during the epilog, I ran into cases where I got a node with no GPU until a few seconds after the allocation start. For scripting this is very bad, as the run starts quickly enough to not find any GPUs and fails.

Is there a way to prevent that, i.e. to configure Slurm not to schedule anything on any of the nodes before a job is completely done?
Comment 35 Michael Hebenstreit 2021-08-26 14:35:33 MDT
Let's say I set "CompleteWait" to 3000. Will nodes be available again as soon as they are idle, or will the scheduler wait until the 3000 seconds are up?
Comment 36 Michael Hebenstreit 2021-08-26 14:50:52 MDT
tested it, that works. 

Can you please confirm my understanding of the prologue/epilogue order and the impact of setting "CompleteWait":

slurmd starts prologue and epilogue on each node.
If the prologue succeeds on EVERY node, the user code will start; if the prologue fails on a single node, execution of the user code will be skipped. Then the epilogue will be executed on each node and the status will go to "COMPLETING" aka "CG".
"CompleteWait" seconds after the epilogue begins, the node is considered free for the next job, even if the epilogue is still running.

So by setting "CompleteWait" to a (ridiculous) value like 1000 I can prevent this behaviour?
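(For reference, a minimal slurm.conf sketch of the setting being discussed; the 1000-second value is just the example from this comment and would need to exceed the site's worst-case epilog runtime.)

# slurm.conf (sketch): while any job is COMPLETING, delay scheduling of further
# jobs for up to this many seconds, giving long epilogs time to finish.
CompleteWait=1000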
Comment 37 Michael Hebenstreit 2021-09-05 08:25:12 MDT
Still waiting for an answer to my last comment on the "CompleteWait" setting.
Comment 38 Jason Booth 2021-09-07 11:38:06 MDT
Michael, would you please log this additional request for information in a new ticket?

> Still waiting for an answer to my last comment on the "CompleteWait" setting.

This helps us keep bug reports on topic and easier to follow as more comments are added over time.
Comment 39 Michael Hebenstreit 2021-09-08 09:22:31 MDT
https://bugs.schedmd.com/show_bug.cgi?id=12445
Comment 42 Marcin Stolarek 2021-10-05 11:50:40 MDT
Michael,

The last issues we discussed here got split and are handled separately.

I took one more look at the initial report (comments 0 - 12). What we see there is a non-obvious accounting record showing:
1) the .batch step EndTime is after the allocation EndTime;
2) judging by that (non-obvious) end time, it looks like two jobs were running on a node at the same time, while the allocation times of the two jobs do not actually overlap.

From the slurmctld logs we see repeated messages like:
>[2021-07-22T04:58:09.217] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
This happens as part of a periodic activity (a slurmctld background thread) attempting to purge completed jobs. My guess is that it happens because the slurmstepd of the batch step didn't send its completion message back to slurmctld (maybe it got killed or the node was rebooted). Finally we assume the job is already done once we notice that the node is down [1]:
>[2021-07-22T05:16:52.875] cleanup_completing: JobId=166326 completion process took 1206 seconds

After all we have discussed, I believe we're unable to reproduce the exact issue without "tricks" that we consider not only inadvisable but also something we can't fully support.

As I mentioned in Bug 12445, it looks like the new node_features/helpers plugin may be a way forward for you. If you want to discuss this, please create a new ticket and describe the goals you're trying to achieve so we can help you in a ticket with a clean history.
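(For orientation only - a guess at the kind of configuration the plugin involves, not something spelled out in this ticket; the exact helpers.conf syntax should be taken from the node_features/helpers documentation.)

# slurm.conf (sketch): enable the helpers plugin so node features can be managed
# through site-provided helper scripts instead of prolog/epilog workarounds.
NodeFeaturesPlugins=node_features/helpers
# The features and their helper scripts are then described in a separate
# helpers.conf file (see the Slurm documentation for its syntax).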

I'm closing the bug report now as "won't fix".

cheers,
Marcin
[1] https://github.com/SchedMD/slurm/blob/slurm-20.11/src/slurmctld/node_scheduler.c#L4255