I have instances where nodes are used by multiple jobs in parallel although everything should be exclusive.

[root@eslurm1 crtdc]# sacct -j 166326,166336 --format JobID,Start,End,NodeList
JobID                      Start                 End        NodeList
------------ ------------------- ------------------- ---------------
166326       2021-07-22T04:56:29 2021-07-22T04:56:46 eca[027-036,03+
166326.batch 2021-07-22T04:56:29 2021-07-22T05:16:52 eca027
166336       2021-07-22T05:11:42 2021-07-22T05:27:06 eca[009-025,02+

For job 166326 I don't really understand why the base job is very short and has a far longer node list, but the main point is that 166326.batch was using eca027 until 05:16. At 05:11, job 166336 was nevertheless scheduled, with the complete node list SLURM_NODELIST=eca[003-036,038-044,046-062,064-069].

# sacct -N eca027 --format JobID,Start,End,NodeList
JobID                      Start                 End        NodeList
------------ ------------------- ------------------- ---------------
.....
166326       2021-07-22T04:56:29 2021-07-22T04:56:46 eca[027-036,03+
166326.batch 2021-07-22T04:56:29 2021-07-22T05:16:52 eca027
166336       2021-07-22T05:11:42 2021-07-22T05:27:06 eca[009-025,02+

Job 166336 was certainly submitted with --exclusive:

srun --pty --exclusive "-t" "2880" "-N" "20" "--constraint=clxap9242" "-p" "inteldevq" "/bin/bash"
srun: environment addon enabled
srun: found Resource_List_CRTDC_PROJECTCODE=O:
srun: found Resource_List_RESTART_IBSTACK=yes
srun: found Resource_List_IBSTACK=10.11.0.0.577_240.22.1.3_2.12.5
srun: job 166336 queued and waiting for resources
srun: job 166336 has been allocated resources
What version of Slurm are you using? Would you please attach your slurm.conf?
Additionally, can you provide the following output:
> sacct -j 166326,166336 --format JobID,Start,End,NodeList -p
> sacct -N eca027 --format JobID,Start,End,NodeList -p

Thanks,
Skyler
[root@eslurm1 crtdc]# sacct -j 166326,166336 --format JobID,Start,End,NodeList -p
JobID|Start|End|NodeList|
166326|2021-07-22T04:56:29|2021-07-22T04:56:46|eca[027-036,039-044,049-050,055-060]|
166326.batch|2021-07-22T04:56:29|2021-07-22T05:16:52|eca027|
166336|2021-07-22T05:11:42|2021-07-22T05:27:06|eca[009-025,027-029]|

[root@eslurm1 crtdc]# sacct -N eca027 --format JobID,Start,End,NodeList -p
JobID|Start|End|NodeList|
166662|2021-07-23T00:04:41|2021-07-23T00:25:19|eca[023-028,035-036,038-044,046-054]|
166686|2021-07-23T01:55:35|Unknown|eca[004-022,024-027,029-030,035-036,038-042]|
166686.0|2021-07-23T02:06:09|Unknown|eca[004-022,024-027,029-030,035-036,038-042]|
Created attachment 20521 [details] slurm.conf
version 20.11.7
Michael,

Could you please attach the slurmctld log from the time when the jobs were submitted/running? I'd like to make sure whether it's a scheduling issue or just a batch step accounting issue. If the allocation (what you called the "base job") was done, all the steps were signalled/terminated and I'd expect the allocation to be marked as done when the resources were released.

Could you please share the results of:
>sacct -p -a -o jobid,state,start,end -j 166326

I'd like to narrow down the code paths that resulted in the non-obvious behaviour.

cheers,
Marcin
[root@eslurm1 crtdc]# sacct -p -a -o jobid,state,start,end -j 166326
JobID|State|Start|End|
166326|NODE_FAIL|2021-07-22T04:56:29|2021-07-22T04:56:46|
166326.batch|CANCELLED|2021-07-22T04:56:29|2021-07-22T05:16:52|
[root@eslurm1 crtdc]# grep 166326 /opt/slurm/current/logs/slurm/slurmctl.log /opt/slurm/current/logs/slurm/slurmctl.log.6
/opt/slurm/current/logs/slurm/slurmctl.log:[2021-07-27T02:36:07.942] update_job_str: invalid JobId=166326
/opt/slurm/current/logs/slurm/slurmctl.log:[2021-07-27T02:36:07.943] _slurm_rpc_update_job: JobId=166326 uid=28990: Invalid job id specified
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:50:35.859] _slurm_rpc_submit_batch_job: JobId=166326 InitPrio=4294735248 usec=527
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:53:07.981] sched: Allocate JobId=166326 NodeList=eca[020-034,036,039-044,049-050] #CPUs=2304 Partition=inteldevq
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:53:08.032] prolog_running_decr: Configuration for JobId=166326 is complete
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:53:27.002] requeue job JobId=166326 due to failure of node eca023
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:54:11.959] Requeuing JobId=166326
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:56:29.892] sched/backfill: _start_job: Started JobId=166326 in inteldevq on eca[027-036,039-044,049-050,055-060]
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:56:29.951] prolog_running_decr: Configuration for JobId=166326 is complete
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:56:46.909] requeue job JobId=166326 due to failure of node eca057
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T04:58:09.217] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:00:09.553] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:01:09.040] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:02:09.616] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:03:09.229] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:04:09.798] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:05:09.430] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:06:09.803] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:07:09.290] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:08:09.114] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:09:10.303] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:10:10.929] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:11:10.393] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:12:10.044] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:13:10.473] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:13:24.413] update_node: node eca057 reason set to: downed by 166326 prologue, failed could not get report from node with 254
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:13:26.624] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=166326 uid 0
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:14:10.148] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:15:10.501] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:16:10.203] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:16:52.875] cleanup_completing: JobId=166326 completion process took 1206 seconds

[root@eslurm1 crtdc]# grep 166336 /opt/slurm/current/logs/slurm/slurmctl.log /opt/slurm/current/logs/slurm/slurmctl.log.6
/opt/slurm/current/logs/slurm/slurmctl.log:[2021-07-27T02:36:08.224] update_job_str: invalid JobId=166336
/opt/slurm/current/logs/slurm/slurmctl.log:[2021-07-27T02:36:08.224] _slurm_rpc_update_job: JobId=166336 uid=28990: Invalid job id specified
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:11:39.260] sched: _slurm_rpc_allocate_resources JobId=166336 NodeList=(null) usec=536
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:11:42.199] sched: Allocate JobId=166336 NodeList=eca[009-025,027-029] #CPUs=1920 Partition=inteldevq
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:11:42.251] prolog_running_decr: Configuration for JobId=166336 is complete
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:27:06.638] _job_complete: JobId=166336 WEXITSTATUS 0
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:27:06.641] _job_complete: JobId=166336 done
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:28:11.837] Resending TERMINATE_JOB request JobId=166336 Nodelist=eca009
/opt/slurm/current/logs/slurm/slurmctl.log.6:[2021-07-22T05:30:49.905] cleanup_completing: JobId=166336 completion process took 223 seconds
Created attachment 20556 [details] slurmctl.log
Created attachment 20557 [details] sched.log
I had such a situation again.

[root@eslurm1 crtdc]# sacct -j 169771,169905 --format JobID,Start,End,user,NodeList -p
JobID|Start|End|User|NodeList|
169771|2021-08-03T12:11:13|2021-08-03T13:57:06|dmishura|eca[001,013,015,017,026-028,041-044,046-056]|
169771.batch|2021-08-03T12:11:13|2021-08-03T13:57:06||eca001|
169905|2021-08-03T13:58:27|Unknown|dmishura|eca[051-052]|
169905.batch|2021-08-03T13:58:27|Unknown||eca051|

But the job epilogue for 169771 ran until 14:13, and that caused issues.
Michael,

I was looking into that and I think the issue is actually more related to accounting than to jobs sharing the node. Could you please share the sacct output with job states (as I asked before) for this situation?

I've also noticed that you may be modifying node state quite frequently with some external automation. Am I correct? If so, could you please explain how this works?

cheers,
Marcin
What commands exactly do you need me to run?

In the epilogue, nodes are taken offline (using scontrol commands) while the epilogue is running, until the epilogue is complete.
That last answer might be confusing, as I replied via e-mail. Here is my answer:

What commands exactly do you need me to run?

In the epilogue, nodes are taken offline (using scontrol commands) while the epilogue is running, until the epilogue is complete.
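For illustration only, here is a minimal sketch of the kind of epilog workflow described above (a site-specific approach reflecting what is reported in this ticket, not a SchedMD recommendation; the reason string and the cleanup step are placeholders):

#!/bin/bash
# Sketch of the epilog workflow described above. Assumes the epilog runs as
# root with scontrol in PATH, and that SLURM_JOB_ID is set in the epilog
# environment.

NODE=$(hostname -s)

# Take the node out of scheduling while the cleanup work runs.
scontrol update NodeName="$NODE" State=DRAIN Reason="epilog for job ${SLURM_JOB_ID}"

# ... long-running cleanup / reconfiguration / reboot logic goes here ...

# Return the node to service once the epilog work has finished.
scontrol update NodeName="$NODE" State=RESUME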
Michael,

I am stepping in for Marcin as it is past his working hours at the moment.

>What commands exactly do you need to run?

sacct -j 169771,169905 --format JobID,Start,End,user,NodeList,State -p

After looking over this bug, this seems to be an accounting issue - the end time of the first job is probably not accurate. It's very likely caused by the automation done on your side. It is important to note that we discussed this with you and, as far as I know, you understood that this is not only not recommended but also results in limited support. I'm afraid this is one of the edge-case issues you are experiencing because of that.
[root@eslurm1 crtdc]# sacct -j 169771,169905 --format JobID,Start,End,user,NodeList,State -p
JobID|Start|End|User|NodeList|State|
169771|2021-08-03T12:11:13|2021-08-03T13:57:06|dmishura|eca[001,013,015,017,026-028,041-044,046-056]|COMPLETED|
169771.batch|2021-08-03T12:11:13|2021-08-03T13:57:06||eca001|COMPLETED|
169905|2021-08-03T16:39:00|2021-08-03T17:13:26|dmishura|eca[020,038]|COMPLETED|
169905.batch|2021-08-03T16:39:00|2021-08-03T17:13:26||eca020|COMPLETED|
Michael,

I'm looking at the Start/End times you just shared and I don't see anything strange here (I simplified the output, for instance removing the date, since it was all 2021-08-03):

>Job/Step    |Start   | End     |
>169771      |12:11:13| 13:57:06|
>169771.batch|12:11:13| 13:57:06|
>169905      |16:39:00| 17:13:26|
>169905.batch|16:39:00| 17:13:26|

The way I read it, the batch step and allocation start times are aligned. Job 169771 started at 12:11 and ended at 13:57; job 169905 started at 16:39 and ended at 17:13, so there is no time overlap.

I see that in comment 13 you stated:
>but job epilogue for 169771 ran until 14:13, and that caused issues.

Is that something you measured your own way? Is it part of your configuration that falls under (Jason's comment 18):
>[...] It is important to note that we discussed this with you and, as far as I know, you understood that this is not only not recommended but also results in limited support.

I'm asking just to make sure that we won't spend a few hours only to reach very similar conclusions.

I see that the 169905 times differ between the two sacct executions; could you please check:
>sacct -j 169771,169905 -D --format JobID,Start,End,user,NodeList,State -p

cheers,
Marcin
> >but job epilogue for 169771 ran until 14:13, and that caused issues.
> Is that something you measured your own way? Is it part of your configuration that falls under....

I know this from the epilogue log, which has a last modification timestamp of 14:13.
[root@eslurm1 crtdc]# sacct -j 169771,169905 -D --format JobID,Start,End,user,NodeList,State -p
JobID|Start|End|User|NodeList|State|
169771|2021-08-03T12:11:13|2021-08-03T13:57:06|dmishura|eca[001,013,015,017,026-028,041-044,046-056]|COMPLETED|
169771.batch|2021-08-03T12:11:13|2021-08-03T13:57:06||eca001|COMPLETED|
169905|2021-08-03T13:58:27|2021-08-03T15:03:38|dmishura|eca[051-052]|NODE_FAIL|
169905.batch|2021-08-03T13:58:27|2021-08-03T15:04:54||eca051|CANCELLED|
169905|2021-08-03T16:39:00|2021-08-03T17:13:26|dmishura|eca[020,038]|COMPLETED|
169905.batch|2021-08-03T16:39:00|2021-08-03T17:13:26||eca020|COMPLETED|
Michael,

What we see in both cases is that the allocation was canceled because of a NODE_DOWN event. Do you know what may be the reason for the nodes going down?

cheers,
Marcin
We have pretty extensive prologue/epilogue procedures (runtime 30s to 15min, including a reboot). I suspect now that something like this happened:

Job A starts on node1/node2/node3
Job A prologue runs
Job A user code starts
Job A epilogue runs but takes longer on node1/node3
Job B starts on node2
Job B prologue runs
Job B user code starts
Job A epilogue finishes on node1/node3, clears all of nodes[1-3] and possibly restarts slurmd on those nodes
Job B's node2 is now considered faulty by Slurm and Job B is canceled

I made some adjustments to the epilogue to ensure that the epilogue on all nodes runs for the same amount of time.
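As a rough illustration of the adjustment mentioned above, such a barrier could be implemented via a shared filesystem. This is a hedged sketch only: /shared/epilog-sync is a hypothetical path, the 900-second timeout is arbitrary, and it assumes SLURM_JOB_ID and SLURM_JOB_NODELIST are available in the epilog environment.

#!/bin/bash
# Hedged sketch: keep the per-node epilogs of one job roughly in lockstep by
# waiting until every node of the job has checked in on a shared filesystem.

SYNCDIR=/shared/epilog-sync/${SLURM_JOB_ID}        # hypothetical shared path
NNODES=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l)
mkdir -p "$SYNCDIR"

# ... the per-node cleanup work runs before this point ...

# Announce that this node has finished its cleanup.
touch "$SYNCDIR/$(hostname -s)"

# Wait (with a timeout) until all nodes of the job have checked in.
for i in $(seq 1 900); do
    [ "$(ls "$SYNCDIR" | wc -l)" -ge "$NNODES" ] && break
    sleep 1
done
# (cleanup of $SYNCDIR after the job is left out of this sketch)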
I just caught this situation:

[root@endeavour5 icsmoke3.1]# squeue -w eia059
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
 173372  nextgenq bld_LAMM schernyk  R  15:52      1 eia059
 173308     workq gnn_laun hmostafa CG  17:44      2 eia[059-060]
[root@endeavour5 icsmoke3.1]# date
Thu Aug 12 13:43:15 MDT 2021
[root@endeavour5 icsmoke3.1]# squeue -w eia059
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
 173372  nextgenq bld_LAMM schernyk  R  16:44      1 eia059
 173308     workq gnn_laun hmostafa CG  17:44      2 eia[059-060]
[root@endeavour5 icsmoke3.1]# date
Thu Aug 12 13:43:19 MDT 2021

173308 is not yet done, but 173372 is already running. How can I avoid such a situation?
Today I had another of those cases:

[root@endeavour5 icsmoke3.1]# squeue -p idealq
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
 174481    idealq   mgraph   Xgcisn CG   9:43     16 eii[339-354]
 174489    idealq   mgraph   Xgcisn PD   0:00     16 (Dependency)
 174486    idealq   mgraph   Xgcisn  R   5:28     16 eii[339-354]

174486 was started before the epilogue of 174481 had run to completion!
Marcin and Michael - I took some time to go over this bug and try to duplicate it today. I was able to get something close to what Michael reported.

What happens in my instance is that the first job/node has some type of failure and the node's state is set to down. I think* you, Michael, or some automated script, are setting the nodes back to idle. For example:

>[2021-07-21T06:54:00.459] update_node: node eca027 state set to DOWN
>[2021-07-21T10:16:02.103] update_node: node eca027 state set to IDLE
>[2021-07-21T10:30:02.710] update_node: node eca027 state set to IDLE
>[2021-07-21T10:44:49.026] update_node: node eca027 state set to IDLE
>[2021-07-21T11:32:57.289] update_node: node eca027 state set to IDLE
>[2021-07-21T19:34:54.640] update_node: node eca027 state set to IDLE
>[2021-07-21T23:06:08.775] update_node: node eca027 state set to IDLE
>[2021-07-21T23:43:18.720] update_node: node eca027 state set to IDLE
>[2021-07-22T00:17:44.746] update_node: node eca027 state set to IDLE
>[2021-07-22T00:18:55.941] update_node: node eca027 state set to IDLE
>[2021-07-22T02:38:01.853] update_node: node eca027 state set to IDLE
>[2021-07-22T03:55:58.125] update_node: node eca027 state set to IDLE
>[2021-07-22T03:57:11.764] update_node: node eca027 state set to IDLE
>[2021-07-22T04:02:55.420] update_node: node eca027 state set to IDLE
>[2021-07-22T04:05:55.947] update_node: node eca027 state set to IDLE
>[2021-07-22T04:08:36.385] update_node: node eca027 state set to DOWN
>[2021-07-22T04:09:24.259] update_node: node eca027 state set to DOWN
>[2021-07-22T04:17:46.091] update_node: node eca027 state set to IDLE
>[2021-07-22T04:20:44.068] update_node: node eca027 state set to IDLE
>[2021-07-22T04:29:42.502] update_node: node eca027 state set to IDLE
>[2021-07-22T04:53:58.797] update_node: node eca027 state set to IDLE
>[2021-07-22T04:57:23.467] update_node: node eca027 state set to IDLE
>[2021-07-22T05:00:34.423] update_node: node eca027 state set to IDLE
>[2021-07-22T05:29:41.979] update_node: node eca027 state set to IDLE
>[2021-07-22T06:17:19.462] update_node: node eca027 state set to IDLE
>[2021-07-22T06:59:04.146] update_node: node eca027 state set to IDLE
>[2021-07-22T07:35:56.323] update_node: node eca027 state set to IDLE
>[2021-07-22T09:15:25.796] update_node: node eca027 state set to IDLE
>[2021-07-22T11:26:37.214] update_node: node eca027 state set to IDLE
>[2021-07-22T14:55:48.290] update_node: node eca027 state set to IDLE
>[2021-07-22T15:36:23.582] update_node: node eca027 state set to IDLE
>[2021-07-22T16:03:59.347] update_node: node eca027 state set to DOWN
>[2021-07-22T16:32:26.009] update_node: node eca027 state set to IDLE
>[2021-07-22T19:27:24.682] update_node: node eca027 state set to IDLE
>[2021-07-22T20:12:27.913] update_node: node eca027 state set to IDLE
>[2021-07-22T21:32:25.096] update_node: node eca027 state set to IDLE
>[2021-07-22T22:08:54.775] update_node: node eca027 state set to IDLE
>[2021-07-22T22:49:53.980] update_node: node eca027 state set to IDLE
>[2021-07-23T00:27:51.422] update_node: node eca027 state set to IDLE
>[2021-07-23T01:08:14.706] update_node: node eca027 state set to IDLE

There is an edge case here where, if a node was downed due to some failure (job or prolog), the prolog is still running, and the node is then updated from DOWN -> IDLE, slurmctld will start a new job on that node. You will then see the overlapping jobs mentioned in the slurmd logs.

So, my first comments here are as follows.
Michael, we cannot support long-running epilog or prolog scripts.

> https://slurm.schedmd.com/prolog_epilog.html
> "Prolog and Epilog scripts should be designed to be as short as possible and should not call Slurm commands"

This can lead to strange behavior such as what you are seeing. What complicates this further is that nodes are legitimately being marked down due to some type of node failure and then, to the best of our understanding, being pushed back into an idle state, which then allows new jobs to start.

This workflow is outside the currently supported workflow of Slurm, and I don't think we will make you happy here, since we cannot support this solution.
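As an illustration of one way the external automation could avoid the DOWN -> IDLE window described in the previous comment, here is a hedged sketch of a guard. The script name, argument handling and resume logic are assumptions for illustration, not the site's actual tooling or anything suggested in this ticket.

#!/bin/bash
# Hedged sketch of a guard for an external "set node back to idle" script:
# refuse to resume a node while Slurm still has jobs running or completing
# on it.

node="$1"

# Any jobs still RUNNING or COMPLETING on this node?
busy=$(squeue -h -w "$node" -t RUNNING,COMPLETING -o %i)

if [ -n "$busy" ]; then
    echo "not resuming $node: jobs still active or completing: $busy" >&2
    exit 1
fi

scontrol update NodeName="$node" State=RESUME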
That's taking the easy way out :( In the case from today this certainly did not happen - the prologue on the main node was still running, and only at the very end of the prologue are nodes restored to state idle - yet ALL nodes were used in duplicate. Also, none of these problems were visible with the older version 20.02.5. So you are essentially saying that if we have prologues running for 15 minutes, we should go back to LSF?
We just found a new error - a job that was scheduled to completely wrong nodes. Requested was constraint leaf113, but it was scheduled to leaf213 nodes (bhosts is a script calling sinfo with a different format to show node constraints):

[root@eslurm1 crtdc]# bhosts -p idealq -n eii[217-232]
NODELIST     NODES CPUS(A/I/O/T) PARTITION STATE AVAIL_FEATURES
eii[219-232] 14    1008/0/0/1008 idealq    alloc reconfig,leaf213,icx8360Y,icx8360Yf3,icx8360Yopa,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
eii[217-218] 2     0/144/0/144   idealq    idle  reconfig,leaf213,icx8360Y,icx8360Yf3,icx8360Yopa,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5

[root@eslurm1 crtdc]# sacct -j 174615 --format Constraints,NodeList
        Constraints        NodeList
------------------- ---------------
 [leaf113&reconfig]    eii[217-232]
                             eii217
                       eii[218-232]
                       eii[218-232]
                       eii[218-232]

[root@eslurm1 crtdc]# grep leaf213 slurm.conf
NodeName=eii[217-234] Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=reconfig,leaf213,icx8360Y,icx8360Yf3,icx8360Yopa,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5

[root@eslurm1 crtdc]# grep leaf113 slurm.conf
NodeName=eia[073-077] Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=leaf113,icx8360Y,icx8360Yf2,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
NodeName=eia[078-080] Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=leaf113,icx8360Y,icx8360Yf2,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
NodeName=eia[081-084] Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=leaf113,icx8360Y,icx8360Yf2,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5
NodeName=eia[085-093] Boards=1 SocketsPerBoard=2 CoresPerSocket=36 State=UNKNOWN Feature=leaf113,icx8360Y,icx8360Yf2,corenode,IBSTACK=mlnx-5.1-2.5.8.0_240.22.1.3_2.12.5

[root@eslurm1 crtdc]# grep 174615 /opt/slurm/current/logs/slurm/slurmctl.log
[2021-08-16T23:15:42.523] _slurm_rpc_submit_batch_job: JobId=174615 InitPrio=4294726966 usec=762
[2021-08-16T23:15:45.809] sched: Allocate JobId=174615 NodeList=eii[217-232] #CPUs=1152 Partition=idealq
[2021-08-16T23:24:10.821] JobId=174615 boot complete for all 16 nodes
[2021-08-16T23:24:10.821] prolog_running_decr: Configuration for JobId=174615 is complete
[2021-08-16T23:24:49.968] _job_complete: JobId=174615 WEXITSTATUS 1
[2021-08-16T23:24:49.970] _job_complete: JobId=174615 done
[2021-08-16T23:26:03.364] cleanup_completing: JobId=174615 completion process took 74 seconds

[root@eslurm1 crtdc]# grep 174615 /opt/slurm/current/logs/slurm/sched.log
sched: [2021-08-16T23:15:42.523] JobId=174615 allocated resources: NodeList=(null)
sched: [2021-08-16T23:15:45.809] JobId=174615 initiated
sched: [2021-08-16T23:15:45.809] Allocate JobId=174615 NodeList=eii[217-232] #CPUs=1152 Partition=idealq

Despite all that you are saying, I think you have a serious problem in your scheduler since we switched to 20.11.7.
> We just found a new error - a job that was scheduled to completely wrong nodes. Requested was constraint leaf113, but it was scheduled to leaf213 nodes (bhosts is a script calling sinfo with a different format to show node constraints):

Please do log this as a separate issue. This is something we will want to look into.
https://bugs.schedmd.com/show_bug.cgi?id=12295
The release notes for Slurm 19.05.3 say: "Nodes in COMPLETING state treated as being currently available for job will-run test." (reference https://github.com/SchedMD/slurm/blob/master/NEWS). As jobs still performing the epilog are COMPLETING, this makes it possible for the scheduler to schedule on nodes still processing the epilog. In my case, as the nodes reset the ATS cards during the epilog, I ran into cases where I got a node with no GPU until a few seconds after the allocation start. For scripting this is very bad, as the run starts quickly enough to not find any GPUs and fails.

Is there a way to prevent that? I.e., can Slurm be configured not to schedule anything on any of the nodes before a job is completely done?
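As a purely user-side mitigation for the race described above, a job script could wait briefly for the GPUs to reappear before launching the application. A hedged sketch follows; gpu_ready is a hypothetical, site-specific check (device files, vendor tooling, etc.) and the 120-second limit is arbitrary.

#!/bin/bash
# Hedged sketch: wait at job start for GPUs to come back after the previous
# job's epilog reset the cards.

gpu_ready() {
    ls /dev/dri/renderD* >/dev/null 2>&1    # placeholder check, site-specific
}

for i in $(seq 1 120); do
    gpu_ready && break
    sleep 1
done

# ... launch the actual application here, e.g. via srun ...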
Let's say I set "CompleteWait" to 3000. Will nodes be available again as soon as they are idle or will the scheduler wait until the 3000s are up?
Tested it; that works. Can you please confirm my understanding of the prologue/epilogue order and the impact of setting "CompleteWait":

slurmd starts the prologue and epilogue on each node. If the prologue succeeds on EVERY node, the user code will start; if the prologue fails on a single node, execution of the user code will be skipped. Then the epilogue is executed on each node and the job's status goes to "COMPLETING" aka "CG". "CompleteWait" seconds after the epilogue begins, the node is considered free for the next job, even if the epilogue could still be running. So by setting "CompleteWait" to a (ridiculous) value like 1000 I can prevent this behaviour?
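For reference, the setting under discussion lives in slurm.conf; a minimal sketch is shown below. The value is only an example, not a recommendation, and the interpretation follows the understanding stated in the comment above.

# slurm.conf excerpt (sketch only): delay the scheduling of additional jobs
# for up to this many seconds while jobs are still in COMPLETING state.
CompleteWait=300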
Still waiting for an answer to my last comment on the "CompleteWait" setting.
Michael, would you please log this additional request for information in a new ticket?

> still waiting for an answer to my last comment on "completewait" setting

This helps us keep bug reports on topic and easier to follow as more comments are added over time.
https://bugs.schedmd.com/show_bug.cgi?id=12445
Michael,

The last issues we discussed here were split off and are handled separately. I took one more look at the initial report (comments 0 - 12). What we see there is a non-obvious accounting record showing:
1) a .batch step EndTime that is after the allocation EndTime;
2) based on that (non-obvious) end time, it looks like there were two jobs running on a node at a time.

The allocation times of the two jobs don't overlap. From the slurmctld logs we see repeated messages like:
>[2021-07-22T04:58:09.217] Resending TERMINATE_JOB request JobId=166326 Nodelist=eca027

This happens as part of periodic activity (a slurmctld background thread) attempting to purge completed jobs. My guess is that it happens because the slurmstepd of the batch step didn't send its completion message back to slurmctld (maybe it got killed or the node was rebooted). Finally, we assume that the job is already done once we notice that the node is down:
>[2021-07-22T05:16:52.875] cleanup_completing: JobId=166326 completion process took 1206 seconds
[1]

After all we discussed, I believe we're unable to reproduce the exact issue without "tricks" we consider not only inadvisable, but also something we can't fully support. As I mentioned in Bug 12445, it looks like the new node_features/helpers plugin may be a way forward for you. If you want to discuss this, please create a new ticket and describe the goals you're trying to achieve so we can help you in a ticket with a clean history.

I'm closing the bug report now with "won't fix" status.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/blob/slurm-20.11/src/slurmctld/node_scheduler.c#L4255