On a production cluster running Slurm 20.11.8 with "PrologFlags=Alloc,Serial", we suddenly had ~50 nodes drained for epilog errors. The epilog script failed because a filesystem that should have been mounted was not present.

Looking at the log of one of these nodes, we saw that Slurm scheduled 2 prologs in sequence for jobs 2056194 and 2057060, then 2 epilogs for those same jobs:

Oct 29 22:50:22 node-0442 slurm-prolog: START user=A job=2056194
Oct 29 22:51:41 node-0442 slurm-prolog: END user=A job=2056194
Oct 29 22:51:41 node-0442 slurm-prolog: START user=B job=2057060
Oct 29 22:52:30 node-0442 slurm-prolog: END user=B job=2057060
Oct 29 22:52:30 node-0442 slurm-epilog: START user=root job=2056194
Oct 29 22:52:54 node-0442 slurm-epilog: END user=root job=2056194
Oct 29 22:53:30 node-0442 slurm-epilog: START user=B job=2057060
Oct 29 22:53:51 node-0442 slurm-epilog: END user=B job=2057060

We expected: prolog job 1 -> epilog job 1 -> prolog job 2 -> epilog job 2.

The first job shrank its initial node allocation with the approach described in the Slurm FAQ (see https://bugs.schedmd.com/show_bug.cgi?id=12266 for why we do something like this). node-0442 was part of the extra nodes relinquished by this job, but it seems Slurm immediately scheduled a new job on this node without going through the epilog of the first job. This also means that the epilog of the first job was executed while the user application of the second job was running:

$ sacct -j 2057060 -o jobid,state,start,end,nnodes
       JobID      State               Start                 End   NNodes
------------ ---------- ------------------- ------------------- --------
2057060       COMPLETED 2021-10-29T22:51:40 2021-10-29T22:53:25       40
2057060.bat+  COMPLETED 2021-10-29T22:51:40 2021-10-29T22:53:25        1
2057060.0        FAILED 2021-10-29T22:52:30 2021-10-29T22:54:42       40

And copy/pasting from the log above:

Oct 29 22:52:30 node-0442 slurm-epilog: START user=root job=2056194
Oct 29 22:52:54 node-0442 slurm-epilog: END user=root job=2056194

(note: user=root because of https://bugs.schedmd.com/show_bug.cgi?id=12268)

As we always run node-exclusive, our epilogs change many things on the system. Running an epilog at the same time as the job caused the job to fail *and* the epilog for this job then also failed for the same reason.

I was able to craft a repro on a single node with multiple slurmds, but it is a bit involved so I'll put it in a separate comment.
For the standalone repro with multiple slurmds, I have the following in slurm.conf:
```
PrologFlags=Alloc,Serial
Prolog=/etc/slurm/prolog.sh
Epilog=/etc/slurm/epilog.sh

NodeName=node-[1-4] NodeHostname=ioctl Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN Port=[17001-17004] Weight=1
PartitionName=normal PriorityTier=1 PreemptMode=OFF Nodes=node-[1-4] Default=YES MaxTime=INFINITE Hidden=NO State=UP OverSubscribe=EXCLUSIVE
```

/etc/slurm/prolog.sh is:
```
#!/bin/bash
logger -s -t slurm-prolog "START user=$SLURM_JOB_USER job=$SLURM_JOB_ID node=$SLURMD_NODENAME"
if [ "${SLURMD_NODENAME}" == "node-3" ]; then
    sleep 10
fi
logger -s -t slurm-prolog "END user=$SLURM_JOB_USER job=$SLURM_JOB_ID node=$SLURMD_NODENAME"
```
You can see that an artificial delay was added for node-3 only; it is necessary to trigger the repro on a single system, but I don't think this is really what is happening on our production cluster.

/etc/slurm/epilog.sh is:
```
#!/bin/bash
logger -s -t slurm-epilog "START user=$SLURM_JOB_USER job=$SLURM_JOB_ID node=$SLURMD_NODENAME"
sleep 30s
logger -s -t slurm-epilog "END user=$SLURM_JOB_USER job=$SLURM_JOB_ID node=$SLURMD_NODENAME"
```
In this case the epilog always takes 30s, on all nodes.

Here is the run.sub that shrinks the job from 4 nodes to 2 (it does not use a flexible number of nodes at submission time, as that is not required for the repro):
```
#!/bin/bash -eu
#SBATCH -N4

scontrol update jobid=${SLURM_JOBID} NumNodes=2 >/dev/null
source ./slurm_job_${SLURM_JOBID}_resize.sh
rm -f ./slurm_job_${SLURM_JOBID}_resize.{csh,sh}

srun sleep 60s
```

And the command to trigger the repro:
$ sbatch run.sub ; sleep 2s ; srun -N2 -t 5 bash -c 'date ; echo $SLURM_NODELIST; sleep 120s'

This indeed causes the second job to start executing immediately on the nodes relinquished by the first job; this shouldn't be the case, since the epilog takes 30s on all nodes:

$ squeue
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
    295    normal  run.sub fabecass  R   0:02      2 node-[1-2]
    296    normal     bash fabecass  R   0:01      2 node-[3-4]

In the system log, we can indeed see that the delayed node executed 2 prologs then 2 epilogs, like on the production cluster:

$ journalctl | grep node-3
Nov 01 17:20:09 slurm-prolog[616812]: START user=fabecassis job=295 node=node-3
Nov 01 17:20:19 slurm-prolog[616902]: END user=fabecassis job=295 node=node-3
Nov 01 17:20:19 slurm-prolog[616906]: START user=fabecassis job=296 node=node-3
Nov 01 17:20:29 slurm-prolog[616915]: END user=fabecassis job=296 node=node-3
Nov 01 17:20:29 slurm-epilog[616918]: START user=root job=295 node=node-3
Nov 01 17:20:59 slurm-epilog[616956]: END user=root job=295 node=node-3
Nov 01 17:22:39 slurm-epilog[617034]: START user=fabecassis job=296 node=node-3
Nov 01 17:23:09 slurm-epilog[617037]: END user=fabecassis job=296 node=node-3

And the second job (296) also overlapped with the epilog of the first job; its output was:

$ srun -N2 -t 5 bash -c 'date ; echo $SLURM_NODELIST; sleep 120s ; date'
Mon Nov 1 05:20:29 PM PDT 2021
node-[3-4]
Mon Nov 1 05:20:39 PM PDT 2021
node-[3-4]
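If it helps while running the repro, the prolog/epilog ordering can also be watched live; a minimal sketch, assuming systemd-journald as in the logs above and the logger tags used in the scripts:
```
# Follow prolog/epilog events as they happen, filtered by the logger tags
# set in prolog.sh and epilog.sh above.
journalctl -f -t slurm-prolog -t slurm-epilog
```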
BTW, I tried 20.11.8, 21.08.2, and the current master branch; the bug was present in all cases.
Felix, this may just be a dupe of bug#11656, which is fixed in 21.08.3 by commits cc3344ee5c8b19 and 41464671062fc1f:
https://github.com/schedMD/slurm/commit/cc3344ee5c8b19
https://github.com/schedMD/slurm/commit/41464671062fc1f

Prolog and epilog scripts are supposed to run in parallel unless you have PrologFlags=Serial. However, last week we discovered that prolog and epilog scripts were being run serially due to a flaw in the use of locks.

21.08.3 is scheduled for release today, pending any final major issues we run into. I have Michael assigned to this to follow up with you on any further comments or to answer any follow-up questions.
I don't think it's the same: as I mentioned in the previous comment, the Slurm master branch still shows this bug, and we are running with the Serial flag (as our jobs are node-exclusive).
To my knowledge, having prolog/epilog synchronization, especially when you downsize a job, is not something we've tested or considered. This seems like a similar problem to bug 11182, though not the same.

When would you expect the prologs to run on nodes that are removed from the job allocation? At the time of removal? And would you expect that to block the node from having another job scheduled on it? That seems reasonable to me, but again, it's not really a use case we've considered much, especially since the resizing has been neglected and only recently have we fixed some long-standing bugs in it.

I'll look into it some more and see what we can do.
-Michael
> When would you expect the prologs to run on nodes that are removed from the job allocation? At the time of removal? And would you expect that to block the node from having another job scheduled on it?

Do you mean the epilog? I would expect the removed nodes to behave similarly to the situation where the job finishes or is cancelled: the node goes through the full epilog before becoming allocatable to a new job, and once it's allocated, the prolog runs. In this case, it looks like the allocation of the new job is concurrent with the epilog of the previous job.

In a normal situation we have:
job 1 prolog -> job 1 starts -> job 1 finishes -> job 1 epilog -> job 2 prolog -> ...

But here we have:
job 1 prolog -> job 1 starts -> job 1 relinquishes this node -> (job 2 prolog || job 1 epilog) -> ...

> it's not really a use case we've considered much, especially since the resizing has been neglected and only recently have we fixed some long-standing bugs in it.

Sure; we have been filing some of these issues recently, as we are trying to give users more flexibility in the shape of their jobs according to the current cluster utilization.
(In reply to Felix Abecassis from comment #6)
> Do you mean the epilog?
Yes, sorry, the epilog, not the prolog.
As a workaround, you could probably add some synchronization into your prolog and epilog scripts so that no prolog can run while an epilog is still running on the same node, and vice versa. Do you think this is something that you can implement while we investigate this? -Michael
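For illustration, a minimal sketch of the kind of synchronization being suggested, assuming flock(1) is available on the compute nodes (the lock file path and the 300s timeout below are placeholders, not recommendations):
```
#!/bin/bash
# Hypothetical sketch: serialize prolog and epilog on the same node by taking
# an exclusive lock before doing any real work.
LOCKFILE=/run/slurm/prolog_epilog.lock

exec 9>"${LOCKFILE}"
# Block (up to 300s) until no other prolog/epilog on this node holds the lock.
if ! flock -w 300 9; then
    logger -s -t slurm-prolog "could not acquire ${LOCKFILE}, giving up"
    exit 1
fi

# ... existing prolog (or epilog) body runs here, under the lock ...

# The lock is released automatically when fd 9 is closed at script exit.
```
The same wrapper would go at the top of both the prolog and the epilog, so whichever script starts second simply waits for the first one to finish on that node.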
> you could probably add some synchronization into your prolog and epilog scripts so that no prolog can run while an epilog is still running on the same node, and vice versa.

This part is working fine, I believe, so the prolog/epilog locking doesn't seem to be the issue here. The prolog and epilog never run concurrently, but we have an epilog running concurrently with the second job, and two prologs / two epilogs running back to back:

Oct 29 22:50:22 node-0442 slurm-prolog: START user=A job=2056194
Oct 29 22:51:41 node-0442 slurm-prolog: END user=A job=2056194
Oct 29 22:51:41 node-0442 slurm-prolog: START user=B job=2057060
Oct 29 22:52:30 node-0442 slurm-prolog: END user=B job=2057060
Oct 29 22:52:30 node-0442 slurm-epilog: START user=root job=2056194
Oct 29 22:52:54 node-0442 slurm-epilog: END user=root job=2056194
Oct 29 22:53:30 node-0442 slurm-epilog: START user=B job=2057060
Oct 29 22:53:51 node-0442 slurm-epilog: END user=B job=2057060

You can see that the locking seems fine (the START and END lines are always matched), but the order is wrong.
What is CompleteWait set to?
On the production cluster it is set to 240s. On my local machine it was initially set to 0s, but increasing the value doesn't seem to change the situation.
I'm able to reproduce this with your reproducer. After the resize, it looks like Slurm is able to allocate both freed nodes to a job as long as one of the nodes has already finished its prolog. If both prologs take the same amount of time, then I don't think the job can schedule early. I would expect that all the nodes in the job allocation would need to have their prologs finished before they are available for a new job, but this doesn't seem to be the case.

It looks like the workaround is to try to make sure that you don't resize a job right on the heels of a prolog, and/or to try to make sure that the prologs all end at the same time (maybe keep track of the run time, and make the prologs that finished early sleep up to some maximum time period for all prologs). Can you try that out?

Thanks!
-Michael
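The padding idea would look roughly like this, as a sketch only (the 120-second floor is an arbitrary placeholder, and whether it is long enough depends on the slowest prolog in the allocation):
```
#!/bin/bash
# Hypothetical sketch: make every prolog last at least MIN_PROLOG_SECONDS so
# that the prologs on all nodes of an allocation finish at roughly the same time.
MIN_PROLOG_SECONDS=120
start=$(date +%s)

# ... existing prolog work goes here ...

elapsed=$(( $(date +%s) - start ))
if (( elapsed < MIN_PROLOG_SECONDS )); then
    sleep $(( MIN_PROLOG_SECONDS - elapsed ))
fi
logger -s -t slurm-prolog "prolog padded to ${MIN_PROLOG_SECONDS}s (real work took ${elapsed}s)"
```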
> It looks like the workaround is to try to make sure that you don't resize a job right on the heels of a prolog

That seems difficult to enforce, as the current use case for resizing is rounding down to a number of nodes that is a multiple of N (e.g. rounding down to the closest multiple of N=20, from 70 nodes to 60 nodes); so it's not a use case where you need more nodes for phase 1 and then fewer nodes for phase 2 of the job: the resizing needs to happen immediately. We could ask those users to inject a small "sleep" in their sbatch script, but I'm not sure what a good value should be.

> and/or to try to make sure that the prologs all end at the same time (maybe keep track of the run time, and make the prologs that finished early sleep up to some maximum time period for all prologs)

That seems hard to implement too, our prologs are fairly complicated so the execution time may vary. And if we enforce that all prologs must last at least 120s, that could also waste a lot of node cycles. Also, as a starting point, are we guaranteed that the prologs will start at approximately the same time?

Overall, I think it might be simpler to ask our users to pause submitting resizing jobs until we get a better solution. We could even reject these jobs automatically in a job_submit lua plugin.

By the way, is there an option similar to sbatch_wait_nodes, but for prolog completion instead of reboot completion? That might work for this use case.
(In reply to Felix Abecassis from comment #14)
> That seems hard to implement too, our prologs are fairly complicated so the
> execution time may vary. And if we enforce that all prologs must last at
> least 120s, that could also waste a lot of node cycles.
It could waste node cycles, yes, but this is to prevent your nodes from going down until we can get a permanent solution. Having your nodes go down wastes even more cycles.

> Also, as a starting
> point, are we guaranteed that the prologs will start at approximately the
> same time?
The prologs should all roughly start at the same time as long as the Alloc PrologFlag is specified. If Alloc is not specified, then the prologs will run lazily - i.e. only when a step first runs on a node in the allocation, which is up to the job's discretion.

> Overall, I think it might be simpler to ask our users to pause submitting
> resizing jobs until we get a better solution. We could even reject these
> jobs automatically in a job_submit lua plugin.
Ok, that could also work.

> By the way, is there an option similar to sbatch_wait_nodes, but for prolog
> completion instead of reboot completion? That might work for this use case.
It doesn't look like there is, but once we fix this, maybe we can add that flag in to allow for either behavior.

Since we've come up with some workarounds to avoid the issue, I'm going to go ahead and downgrade this issue to a severity 3. But I will keep looking into it to see if I can create a permanent fix.

Thanks!
-Michael
Raising from sev3 back up to sev2. As a reminder, this bug is not only draining nodes - it is crashing jobs. The next job after the resize on the relinquished node can run in parallel with the epilog from the previous job, which causes all sorts of problems (removal of tempfiles, etc.) and can cause large jobs to fail for unexpected reasons. The jobs may be large both in terms of absolute size and in terms of scheduling cost.

Thanks for looking into this, Michael, but the workarounds you've proposed so far don't really work for us. Artificially increasing the prolog duration for all jobs just to work around this bug would be quite painful for all users, since the overall throughput of the cluster would go down. Also, based on your comment#5, I think we’d still be worried about the potential of more out-of-order epilog events even if the fix initially appears to work, since there has not been a lot of testing in this area.

We are considering asking our users to run ‘srun true’ before ‘scontrol update job=X numnodes=Y’. The reasoning is that each node should be forced to complete the prolog before the srun jobstep can be allocated, and that the proper codepath might be triggered ensuring that the epilog runs immediately after the job ends on this node. Does that seem like it should work to you? In a quick test it did appear to work for us, but we don’t feel confident. It’s certainly lower impact than your suggestion, but it does rely on users to police themselves to protect our nodes and other users’ jobs, so it’s not a great solution.

This resizing feature has been an unexpected boon to our users, who achieved a significant increase in cluster throughput because of it. Disabling the feature entirely, even temporarily, would also be quite painful for our advanced users in particular. We are in a difficult spot here. Do you have any ETA on a fix?
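For concreteness, what we would ask users to do is roughly the following, based on the run.sub from the repro (a sketch only: the extra srun before the resize is the only change, and whether it is actually sufficient is exactly what we are unsure about):
```
#!/bin/bash -eu
#SBATCH -N4

# Hypothetical workaround: run a trivial step on every node in the allocation
# so that each node must finish its prolog before any node is relinquished.
srun -N "${SLURM_JOB_NUM_NODES}" --ntasks-per-node=1 true

# Then shrink the job as before.
scontrol update jobid=${SLURM_JOBID} NumNodes=2 >/dev/null
source ./slurm_job_${SLURM_JOBID}_resize.sh
rm -f ./slurm_job_${SLURM_JOBID}_resize.{csh,sh}

srun sleep 60s
```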
Hi Luke, Felix,

(In reply to Felix Abecassis from comment #14)
> By the way, is there an option similar to sbatch_wait_nodes, but for prolog
> completion instead of reboot completion? That might work for this use case.
It turns out that Dominik in bug 11635 has been working internally on improvements to the prolog and epilog infrastructure, including a new "DeferBatch" prolog flag, which is exactly what you called for here. I was able to confirm that DeferBatch indeed fixes the problem.

The issue is that the fix currently requires a change in the RPC and packing code, so it's targeted for 22.05. I'm looking to see if there is any way the patch can be backported to 21.08, or if there is another workaround that can hold you over until 22.05.

(In reply to Luke Yeager from comment #16)
> We are considering asking our users to run ‘srun true’ before ‘scontrol
> update job=X numnodes=Y’. The reasoning is that each node should be forced
> to complete the prolog before the srun jobstep can be allocated, and that
> the proper codepath might be triggered ensuring that the epilog runs
> immediately after the job ends on this node. Does that seem like it should
> work to you? In a quick test it did appear to work for us, but we don’t feel
> confident.
That seems like it should work, since -N from sbatch is inherited by srun, so it should cover all the nodes in the allocation. In my reproducer, half the time it works as intended, and the other half of the time the srun on the command line jumps in front of the sbatch and starts first. Either way, the prologs and epilogs don't ever appear to interleave, which is good.

Alternatively, another idea I had is to remove the Alloc PrologFlag. This means that if you don't run an srun on certain nodes, then those nodes presumably will not have run a prolog yet when you resize. The epilog will still run on the nodes that are freed, but if your epilog is smart enough, it can skip the teardown, since nothing was set up to begin with on those nodes. So it would allow you to free the unneeded nodes very quickly. A drawback to this, though, is that you lose what Alloc does, so the prologs would be lazy again. You could get around that by doing "srun true" after the resize (or just using all nodes in the srun command immediately).

I played around with this alternative workaround, and it seems to work, but you'll have to play around with it to see if it works in your case. I'll continue to look into things and keep you posted.

Thanks!
-Michael
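To illustrate the "skip the teardown" idea, a minimal sketch of what the epilog could look like (the sentinel file path is a placeholder, and I'm not claiming this is robust for a complex epilog):
```
#!/bin/bash
# Hypothetical sketch: the prolog touches a per-job sentinel file on the node;
# the epilog skips teardown entirely when the sentinel is missing, i.e. when
# no prolog ever ran on this node before it was released by a resize.
SENTINEL="/run/slurm/prolog_done.${SLURM_JOB_ID}"

if [ ! -e "${SENTINEL}" ]; then
    logger -s -t slurm-epilog "no prolog ran for job ${SLURM_JOB_ID} on ${SLURMD_NODENAME}, skipping teardown"
    exit 0
fi

# ... existing teardown work goes here ...

rm -f "${SENTINEL}"
```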
(In reply to Michael Hinton from comment #20)
> It turns out that Dominik in bug 11635 has been working internally on
> improvements to the prolog and epilog infrastructure, including a new
> "DeferBatch" prolog flag, which is exactly what you called for here. I was
> able to confirm that DeferBatch indeed fixes the problem.
>
> The issue is that the fix currently requires a change in the RPC and packing
> code, so it's targeted for 22.05. I'm looking to see if there is any way the
> patch can be backported to 21.08, or if there is another workaround that can
> hold you over until 22.05.
That's good news that there's a fix in progress for bug#11635 and that it might help with this bug, too. We would obviously prefer to avoid backporting a patch vs. simply updating to a new minor release, but we do have a small number of patches in place already and could conceive of taking another one. And if we do need to take a new patch, breaking the RPC ABI is not really a problem for us - we don't need to worry about multiple client versions accessing the controller. However, we're currently running 20.11 on our production cluster.

> That seems like it should work
Ok, thanks for the confirmation that this is a decent idea.

> Alternatively, another idea I had is to remove the Alloc PrologFlag.
Unfortunately, that's a no-go for us. We use pam_slurm_adopt, so we need the Contain flag, which implicitly sets Alloc. And even if we didn't, unsetting the Alloc flag would lead to longer delays for multi-node jobs, where you have to wait for 2x the length of a prolog before any multi-node steps can begin (once for the batch host, and again for the other nodes), instead of the current 1x. Right?

> This means that if you don't run an srun on certain nodes, then those nodes
> presumably will not have run a prolog yet when you resize. The epilog will
> still run on the nodes that are freed, but if your epilog is smart enough,
> it can skip the teardown, since nothing was set up to begin with on those
> nodes.
This makes me very uncomfortable anyway. Our prologs and epilogs are fairly complicated, and ensuring that the epilog always works when a matching prolog was never run is not something that we would want to maintain, especially when it's so tricky to even set up the proper test situation to try it out (see comment#1).

> I'll continue to look into things and keep you posted.
Yes, please do. We're still struggling to find a workaround that we're happy with.
Luke, Felix,

How is the `srun true` workaround going?

I have been able to backport the 21.08/22.05 "DeferBatch" fix to 20.11 without messing with RPCs or adding new fields to any structs. But I wanted to see how you want the patch. The options are:

1) Have the "DeferBatch" functionality hardcoded to be always enabled. You will need to rebuild, but you won't need to rerun the configure script, and you will not need to change anything in slurm.conf.

2) Have the "DeferBatch" functionality only enabled if "DeferBatch" is specified in PrologFlags in slurm.conf. You will need to set that in slurm.conf, rerun the configure script, and rebuild.

Or, if you have any other thoughts on how you want the patch, just let me know.

Thanks!
-Michael
Created attachment 22176 [details]
20.11 v1

Here is a v1 for 20.11 that should solve the issue. If you want to go route 1), discard the last two "(Optional)" commits. If you want to go route 2), just use v1 as it is.

I should note that v1 may keep some jobs in prolog forever if you restart slurmctld with running jobs. I would recommend restarting the controller when you can guarantee that a job is in the middle of completing prologs.

Let me know how it goes!
Thanks,
Michael
I'm reducing this to a sev 3 since a workaround and a fix have both been provided. Thanks, -Michael
Quick update: the patch does fix the issue so far for my local machine with multiple slurmd, the next step will be for us to evaluate it on a test cluster.
(In reply to Felix Abecassis from comment #25)
> Quick update: the patch does fix the issue so far for my local machine with
> multiple slurmd, the next step will be for us to evaluate it on a test
> cluster.
Great! Keep us posted on how it goes.

(In reply to Michael Hinton from comment #23)
> I would recommend restarting the controller
> when you can guarantee that a job is in the middle of completing prologs.
I meant to say "is NOT in the middle of completing prologs."

v1 is a quick hack that simply counts the # of prolog completions until it matches the job's node count, and only then allows the batch script to run. So if you applied the patch in the middle of a job's nodes completing their prologs, you would miss counting some prolog completions and that job would wait forever in the WAIT_PROLOG state.
(In reply to Michael Hinton from comment #26)
> v1 is a quick hack that simply counts the # of prolog completions until it
> matches the job's node count, and only then allows the batch script to run.
> So if you applied the patch in the middle of a job's nodes completing their
> prologs, you would miss counting some prolog completions and that job would
> wait forever in the WAIT_PROLOG state.
I understand there is a risk during the initial upgrade since the value of a uint_16 in an RPC is changing. Will there also be an ongoing risk for as long as we have this patch? Will that accumulating counter of prolog completions be persisted or is it lost?
(In reply to Luke Yeager from comment #27)
> I understand there is a risk during the initial upgrade since the value of a
> uint_16 in an RPC is changing.
Well, I don't really see any risk here. Can you elaborate on the concern?

> Will there also be an ongoing risk for as
> long as we have this patch?
It's a pretty simple patch, and I think it should be safe, but you will still need to test it to make sure I didn't overlook anything.

> Will that accumulating counter of prolog
> completions be persisted or is it lost?
It is persistent across ctld reboots, since it's saved to the state file.

v1 doesn't actually touch the RPCs or add a new field to anything. It simply hacks a preexisting field that only used 1 of its 16 bits (kill_on_node_fail). kill_on_node_fail is already part of the RPC layer and is already being saved and restored to the state file. So the prolog counter will, too.

Oh, and there is no case where you will have a job with over 2^15 nodes, right? v1 will break if so.
(In reply to Michael Hinton from comment #28)
> (In reply to Luke Yeager from comment #27)
> > I understand there is a risk during the initial upgrade since the value of a
> > uint_16 in an RPC is changing.
> Well, I don't really see any risk here. Can you elaborate on the concern?
I was just trying to rephrase comment#26. Apparently I did it poorly :). Are the REQUEST_COMPLETE_PROLOG RPCs only sent once - is that why you're concerned about missing some? Or is it only an issue when the controller is down long enough for slurmd to give up trying to resend it?

> > Will that accumulating counter of prolog
> > completions be persisted or is it lost?
> It is persistent across ctld reboots, since it's saved to the state file.
Ok, great.

> Oh, and there is no case where you will have a job with over 2^15 nodes,
> right? v1 will break if so.
No, that's not a problem.
(In reply to Luke Yeager from comment #29)
> (In reply to Michael Hinton from comment #28)
> > (In reply to Luke Yeager from comment #27)
> > > I understand there is a risk during the initial upgrade since the value of a
> > > uint_16 in an RPC is changing.
> > Well, I don't really see any risk here. Can you elaborate on the concern?
> I was just trying to rephrase comment#26. Apparently I did it poorly :).
Ah, ok. Yeah, just being smart about upgrading should avoid that issue. I think you could put your cluster into drain and make sure no jobs are wait prolog, and you should be good.

> Are the REQUEST_COMPLETE_PROLOG RPCs only sent once - is that why you're
> concerned about missing some? Or is it only an issue when the controller is
> down long enough for slurmd to give up trying to resend it?
Basically, v1 assumes that the # of REQUEST_COMPLETE_PROLOG RPCs matches the number of nodes of the job.

The real patch from bug 11635 targeting 22.05 doesn't do a prolog counter, but rather adds a whole new prolog bitmap to the job record, so that we can properly track what nodes have responded. It also adds the node's name to the prolog complete RPC. Obviously, this is a lot of change, so it wasn't feasible to backport all that to 20.11. The prolog complete counter was a quick hack to hopefully achieve the same result for you, but it's not going to be the final fix.

My only real worry is if there is a situation where a prolog complete RPC is replayed by the same node, for whatever reason. I don't think this is likely, and I'm not even sure if it's possible, but if that happens, it could double count a node and start the batch script earlier than expected. This is where having a node bitmap is a more robust solution, because it would solve the issue of double counting.
We had to revert the patch on our test cluster running 20.11.8, it caused many jobs to become stuck at startup time for ~15-25 minutes before being cancelled:

$ sacct -X -j 5497 --format jobid,state,exitcode,start,elapsed
       JobID      State ExitCode               Start    Elapsed
------------ ---------- -------- ------------------- ----------
5497          NODE_FAIL      1:0 2021-11-16T13:50:19   00:27:36

This job usually takes 4-5 minutes to execute.

Many jobs were impacted since we deployed the patch on Friday:

2021-11-12T17:32:36-08:00 slurm-01 slurmctld[251593]: Batch JobId=5355 missing from batch node n0002 (not found BatchStartTime after startup)
2021-11-13T00:12:39-08:00 slurm-01 slurmctld[251593]: Batch JobId=5356 missing from batch node n0002 (not found BatchStartTime after startup)
2021-11-14T00:05:59-08:00 slurm-01 slurmctld[251593]: Batch JobId=5371 missing from batch node n0002 (not found BatchStartTime after startup)
2021-11-14T15:06:00-08:00 slurm-01 slurmctld[251593]: Batch JobId=5372 missing from batch node n0002 (not found BatchStartTime after startup)
2021-11-14T23:59:21-08:00 slurm-01 slurmctld[251593]: Batch JobId=5375 missing from batch node n0002 (not found BatchStartTime after startup)
2021-11-15T06:06:02-08:00 slurm-01 slurmctld[251593]: Batch JobId=5376 missing from batch node n0002 (not found BatchStartTime after startup)
2021-11-15T07:12:42-08:00 slurm-01 slurmctld[251593]: Batch JobId=5377 missing from batch node n0002 (not found BatchStartTime after startup)
2021-11-15T09:26:02-08:00 slurm-01 slurmctld[251593]: Batch JobId=5384 missing from batch node n0003 (not found BatchStartTime after startup)
2021-11-15T13:52:43-08:00 slurm-01 slurmctld[251593]: Batch JobId=5425 missing from batch node n0006 (not found BatchStartTime after startup)
2021-11-15T16:39:23-08:00 slurm-01 slurmctld[251593]: Batch JobId=5430 missing from batch node n0005 (not found BatchStartTime after startup)
2021-11-15T23:52:43-08:00 slurm-01 slurmctld[251593]: Batch JobId=5447 missing from batch node n0005 (not found BatchStartTime after startup)
2021-11-16T14:17:55-08:00 slurm-01 slurmctld[251593]: Batch JobId=5497 missing from batch node n0003 (not found BatchStartTime after startup)
2021-11-16T14:51:14-08:00 slurm-01 slurmctld[251593]: Batch JobId=5500 missing from batch node n0003 (not found BatchStartTime after startup)
2021-11-16T15:57:55-08:00 slurm-01 slurmctld[251593]: Batch JobId=5512 missing from batch node n0003 (not found BatchStartTime after startup)
2021-11-16T16:31:14-08:00 slurm-01 slurmctld[251593]: Batch JobId=5514 missing from batch node n0003 (not found BatchStartTime after startup)
2021-11-17T06:56:15-08:00 slurm-01 slurmctld[251593]: Batch JobId=5516 missing from batch node n0003 (not found BatchStartTime after startup)
2021-11-17T07:29:34-08:00 slurm-01 slurmctld[251593]: Batch JobId=5517 missing from batch node n0006 (not found BatchStartTime after startup)

In addition, some jobs failed for other reasons, but there were other errors in the slurmctld log:

2021-11-15T23:52:43-08:00 slurm-01 slurmctld[251593]: Batch JobId=5447 missing from batch node n0005 (not found BatchStartTime after startup)
2021-11-16T12:26:15-08:00 slurm-01 slurmctld[251593]: error: BUG-12801: prolog_complete: JobId=5475: Too many prolog completes received! Received 2; expected 0
2021-11-16T12:26:16-08:00 slurm-01 slurmctld[251593]: error: BUG-12801: prolog_complete: JobId=5475: Too many prolog completes received! Received 3; expected 0
2021-11-16T12:26:17-08:00 slurm-01 slurmctld[251593]: error: BUG-12801: prolog_complete: JobId=5475: Too many prolog completes received! Received 4; expected 0
2021-11-16T12:26:30-08:00 slurm-01 slurmctld[251593]: error: BUG-12801: prolog_complete: JobId=5475: Too many prolog completes received! Received 5; expected 0
2021-11-16T12:26:31-08:00 slurm-01 slurmctld[251593]: error: BUG-12801: prolog_complete: JobId=5475: Too many prolog completes received! Received 6; expected 0
[... more of the same ...]
2021-11-16T12:30:48-08:00 slurm-01 slurmctld[251593]: error: BUG-12801: prolog_complete: JobId=5475: Too many prolog completes received! Received 57; expected 0
2021-11-16T12:30:48-08:00 slurm-01 slurmctld[251593]: error: BUG-12801: prolog_complete: JobId=5475: Too many prolog completes received! Received 58; expected 0
2021-11-16T14:17:55-08:00 slurm-01 slurmctld[251593]: Batch JobId=5497 missing from batch node n0003 (not found BatchStartTime after startup)

Putting the bug back to "High Impact".
By the way, is there any plan to have a fix for this issue without relying on the workaround of using DeferBatch in 22.05?
Forgot to mention that we have the following:

BatchStartTimeout = 240 sec

which is significantly longer than the usual prolog time (~60s).
(In reply to Felix Abecassis from comment #32)
> By the way, is there any plan to have a fix for this issue without relying
> on the workaround of using DeferBatch in 22.05?
I'm not sure. We usually don't want to force a change in behavior on everyone, so I think we will have to keep the flag. Currently, it's opt-in with DeferBatch, but we could consider making the DeferBatch behavior the default and have a "NoDeferBatch" flag to opt in to the original behavior.
I wasn't suggesting that DeferBatch should become the default. But I'm under the impression that there is still an underlying problem with how the node state is handled when resizing a job, don't you agree? As illustrated by the start times of the jobs when resizing happens, from comment 1:

$ squeue
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
    295    normal  run.sub fabecass  R   0:02      2 node-[1-2]
    296    normal     bash fabecass  R   0:01      2 node-[3-4]

DeferBatch is a way to prevent the symptoms, but we hope that job resizing will be made to work with all possible Slurm configurations. Or if job resizing cannot be made to work without DeferBatch being enabled, it should probably be enforced in the code.
Can you attach complete logs for some of the jobs that failed, particularly for jobs 5497 and 5475? Do you know why job 5497 went into state NODE_FAIL initially?
(In reply to Felix Abecassis from comment #35)
> DeferBatch is a way to prevent the symptoms, but we hope that job resizing
> will be made to work with all possible Slurm configurations. Or if job
> resizing cannot be made to work without DeferBatch being enabled, it should
> probably be enforced in the code.
I think it will be much easier to make sure job resizing doesn't operate on nodes that are still running a prolog once we track those nodes with a bitmap. But yes, I see that as something separate that we should make sure never happens, if possible.
5497 and 5475 have empty job logs, as the user processes never started as far as I can tell.

5475 failed a check during a node prolog (we were deploying some SW upgrades concurrently) and was thus cancelled:

2021-11-16T12:25:24-08:00 slurm-01 slurmctld[251593]: sched: Allocate JobId=5475 NodeList=n[0003,0005,0007,0009-0010] #CPUs=1280 Partition=...
2021-11-16T12:25:44-08:00 slurm-01 slurmctld[251593]: update_node: node n0005 reason set to: ...
2021-11-16T12:25:44-08:00 slurm-01 slurmctld[251593]: update_node: node n0005 state set to DRAINING
2021-11-16T12:25:44-08:00 slurm-01 slurmctld[251593]: error: Prolog launch failure, JobId=5475
2021-11-16T12:25:44-08:00 slurm-01 slurmctld[251593]: _slurm_rpc_requeue: 5475: Requested operation is presently disabled
2021-11-16T12:25:44-08:00 slurm-01 slurmctld[251593]: _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=5475 uid 0

5497 went into NODE_FAIL because that is what happens after an error of type "not found BatchStartTime after startup":
https://github.com/SchedMD/slurm/blob/slurm-20-11-8-1/src/slurmctld/job_mgr.c#L14875-L14879

The 4th argument is "true", to indicate a node failure:
https://github.com/SchedMD/slurm/blob/slurm-20-11-8-1/src/slurmctld/job_mgr.c#L6342-L6343

From the slurmctld log:

2021-11-16T13:50:18-08:00 slurm-01 slurmctld[251593]: _slurm_rpc_submit_batch_job: JobId=5497 InitPrio=16976 usec=15380
2021-11-16T13:50:19-08:00 slurm-01 slurmctld[251593]: sched: Allocate JobId=5497 NodeList=n[0003,0005-0007,0009-0010] #CPUs=1536 Partition=backfill-low
2021-11-16T14:17:55-08:00 slurm-01 slurmctld[251593]: Batch JobId=5497 missing from batch node n0003 (not found BatchStartTime after startup)
2021-11-16T14:17:55-08:00 slurm-01 slurmctld[251593]: _job_complete: JobId=5497 WTERMSIG 126
2021-11-16T14:17:55-08:00 slurm-01 slurmctld[251593]: _job_complete: JobId=5497 cancelled by node failure
2021-11-16T14:17:55-08:00 slurm-01 slurmctld[251593]: _job_complete: JobId=5497 done
(In reply to Felix Abecassis from comment #32)
> By the way, is there any plan to have a fix for this issue without relying
> on the workaround of using DeferBatch in 22.05?
To be clear, the DeferBatch solution is still under review and being developed, so I can't guarantee a release date of 22.05 for that.
(In reply to Felix Abecassis from comment #31)
> We had to revert the patch on our test cluster running 20.11.8, it caused
> many jobs to become stuck at startup time for ~15-25 minutes before being
> cancelled:
>
> $ sacct -X -j 5497 --format jobid,state,exitcode,start,elapsed
>        JobID      State ExitCode               Start    Elapsed
> ------------ ---------- -------- ------------------- ----------
> 5497          NODE_FAIL      1:0 2021-11-16T13:50:19   00:27:36
>
> This job usually takes 4-5 minutes to execute.
Is it possible for you to provide a reproducer with v1 that shows this issue? Is it simply that v1 causes some jobs to never run?

Did job 5497 have a prolog failure? Or was job 5497 waiting for all prologs to finish before it could run? When a prolog fails, what actions does it take? What state was job 5497 in before it became NODE_FAIL? Is there anything else in slurmctld.log or slurmd.log regarding job 5497?

It sounds like the main problem is jobs are stuck in startup and don't run, and are eventually cancelled by a node registration with the "not found BatchStartTime after startup" cleanup error. But I'm still trying to understand what's going on, as the issues caused by v1 are not yet clear to me with the information you've provided so far.

The other issue - "Too many prolog completes received!" - may or may not really be an issue. Have you seen any real-world problems with those jobs other than that error message?

Thanks!
-Michael
> Is it possible for you to provide a reproducer with v1 that shows this issue? Is it simply that v1 causes some jobs to never run?

I don't have a simple repro so far in my single-node multiple-slurmd test environment. It only happened on our test cluster with 8 separate nodes. The only thing I noticed is that it's always the first node of the allocation that fails with BatchStartTime, i.e. the node that executes the sbatch script.

> Did job 5497 have a prolog failure? Or was job 5497 waiting for all prologs to finish before it could run?

No prolog failures. And it seems like all prologs terminated:

$ grep 'slurm-prolog: .* job=5497' /admin/logs/n00*/syslog/syslog.log
/admin/logs/n0003/syslog/syslog.log:2021-11-16T13:50:19-08:00 n0003 slurm-prolog: START user=user job=5497
/admin/logs/n0003/syslog/syslog.log:2021-11-16T13:50:49-08:00 n0003 slurm-prolog: END user=user job=5497
/admin/logs/n0005/syslog/syslog.log:2021-11-16T13:50:19-08:00 n0005 slurm-prolog: START user=user job=5497
/admin/logs/n0005/syslog/syslog.log:2021-11-16T13:50:49-08:00 n0005 slurm-prolog: END user=user job=5497
/admin/logs/n0006/syslog/syslog.log:2021-11-16T13:50:19-08:00 n0006 slurm-prolog: START user=user job=5497
/admin/logs/n0006/syslog/syslog.log:2021-11-16T13:50:50-08:00 n0006 slurm-prolog: END user=user job=5497
/admin/logs/n0007/syslog/syslog.log:2021-11-16T13:50:19-08:00 n0007 slurm-prolog: START user=user job=5497
/admin/logs/n0007/syslog/syslog.log:2021-11-16T13:50:51-08:00 n0007 slurm-prolog: END user=user job=5497
/admin/logs/n0009/syslog/syslog.log:2021-11-16T13:50:19-08:00 n0009 slurm-prolog: START user=user job=5497
/admin/logs/n0009/syslog/syslog.log:2021-11-16T13:50:50-08:00 n0009 slurm-prolog: END user=user job=5497
/admin/logs/n0010/syslog/syslog.log:2021-11-16T13:50:19-08:00 n0010 slurm-prolog: START user=user job=5497
/admin/logs/n0010/syslog/syslog.log:2021-11-16T13:50:50-08:00 n0010 slurm-prolog: END user=user job=5497

> When a prolog fails, what actions does it take? What state was job 5497 in before it became NODE_FAIL? Is there anything else in slurmctld.log or slurmd.log regarding job 5497?

It was in the 5497.sbatch state; the sbatch script is simple: it just launches 2 sruns. But no job steps were recorded:

$ sacct -j 5497 --format jobid,nnodes,nodelist%32,start,end,elapsed
       JobID   NNodes                         NodeList               Start                 End    Elapsed
------------ -------- -------------------------------- ------------------- ------------------- ----------
5497                6      n[0003,0005-0007,0009-0010] 2021-11-16T13:50:19 2021-11-16T14:17:55   00:27:36
5497.batch          1                            n0003 2021-11-16T13:50:19 2021-11-16T14:18:07   00:27:48

> The other issue - "Too many prolog completes received!" - may or may not really be an issue. Have you seen any real-world problems with those jobs other than that error message?

No, but we didn't look much more into it given the first issue. We can confirm that the problem is gone after reverting the patch.
I was finally able to get a repro on my one-node test machine. In slurm.conf, we are now using 6 nodes and BatchStartTimeout is 30s:
```
NodeName=node-[1-6] NodeHostname=ioctl Sockets=1 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN Port=[17001-17006] Weight=1
PartitionName=normal PriorityTier=1 PreemptMode=OFF Nodes=node-[1-6] Default=YES MaxTime=INFINITE Hidden=NO State=UP OverSubscribe=EXCLUSIVE

PrologFlags=Alloc,Serial,DeferBatch
BatchStartTimeout = 120
```

/etc/slurm/prolog.sh is now:
```
#!/bin/bash
logger -s -t slurm-prolog "START user=$SLURM_JOB_USER job=$SLURM_JOB_ID node=$SLURMD_NODENAME"
if [ "${SLURMD_NODENAME}" == "node-1" ]; then
    sleep 10
fi
logger -s -t slurm-prolog "END user=$SLURM_JOB_USER job=$SLURM_JOB_ID node=$SLURMD_NODENAME"
```
The difference with the original repro is that the first node is the one sleeping.

run2.sub:
```
#!/bin/bash -eux
#SBATCH -N6

echo "Running on hosts: $(echo $(scontrol show hostname))"
srun -N ${SLURM_JOB_NUM_NODES} --ntasks-per-node=1 sleep 30s
```

And here is the slurmctld log showing the issue:
```
[2021-11-18T16:05:34.544] _slurm_rpc_submit_batch_job: JobId=22 InitPrio=4294901739 usec=603
[2021-11-18T16:05:36.549] sched: Allocate JobId=22 NodeList=node-[1-6] #CPUs=72 Partition=normal
[2021-11-18T16:05:36.549] _batch_launch_defer: JobId=22 still waiting on node prologs (prolog completion count == 0; node_cnt = 6)
[2021-11-18T16:05:36.555] BUG-12801: prolog_complete: JobId=22: Keeping job in state Prolog until all prologs complete
[2021-11-18T16:05:36.555] BUG-12801: prolog_complete: JobId=22: Keeping job in state Prolog until all prologs complete
[2021-11-18T16:05:36.556] BUG-12801: prolog_complete: JobId=22: Keeping job in state Prolog until all prologs complete
[2021-11-18T16:05:36.556] BUG-12801: prolog_complete: JobId=22: Keeping job in state Prolog until all prologs complete
[2021-11-18T16:05:36.556] BUG-12801: prolog_complete: JobId=22: Keeping job in state Prolog until all prologs complete
[2021-11-18T16:05:46.560] BUG-12801: prolog_complete: JobId=22: Received prolog completes from all 6 nodes in the allocation
[2021-11-18T16:08:40.547] Batch JobId=22 missing from batch node node-1 (not found BatchStartTime after startup)
[2021-11-18T16:08:40.547] _job_complete: JobId=22 WTERMSIG 126
[2021-11-18T16:08:40.547] _job_complete: JobId=22 cancelled by node failure
[2021-11-18T16:08:40.547] _job_complete: JobId=22 done
[2021-11-18T16:08:40.547] agent(batch_launch): removed pending request for cancelled JobId=22
```
Also, to get a faster job cancellation, I suggest increasing the frequency at which slurmd pings slurmctld. Set this value to 1:
https://github.com/SchedMD/slurm/blob/15a9f494c9288f248a6f435a133ee96c6365e9d3/src/slurmctld/ping_nodes.c#L54-L55

Also, you can lower SlurmdTimeout in slurm.conf, or manually set ping_interval to 30 here:
https://github.com/SchedMD/slurm/blob/15a9f494c9288f248a6f435a133ee96c6365e9d3/src/slurmctld/controller.c#L1966-L1975
I went ahead and set ping_interval to 30 and MAX_REG_FREQUENCY to 1. Thanks for the pointers.

What values do you have for MessageTimeout, ResumeTimeout, and SuspendTime? Also, you said your BatchStartTimeout was 30, but you put down 120 in the conf. Do they both show the issue, or only the first one?

I am unable to reproduce the issue following your reproducer. Is this still the right command (just with run2.sub instead of run.sub)? Since we are no longer doing the resize, is everything after run2.sub even needed?

sbatch run2.sub ; sleep 2s ; srun -N2 -t 5 bash -c 'date ; echo $SLURM_NODELIST; sleep 120s'

What I'm still confused about is this 3-minute gap:

(In reply to Felix Abecassis from comment #44)
> [2021-11-18T16:05:46.560] BUG-12801: prolog_complete: JobId=22: Received
> prolog completes from all 6 nodes in the allocation
> [2021-11-18T16:08:40.547] Batch JobId=22 missing from batch node node-1 (not
> found BatchStartTime after startup)

What is the job doing while it's stuck in this longer-than-usual delay? Does it never finish running when it should? Or does it stay in completing forever? Or something else? When I try to reproduce the issue, the first 6-node batch job succeeds as expected; there is no 3-minute gap of it doing anything weird.

-Michael
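If it's easier, the relevant values can be pulled from the running configuration with something like the following (the grep pattern is just a suggestion):
```
# Print the timeout-related settings from the live config.
scontrol show config | grep -E 'BatchStartTimeout|MessageTimeout|ResumeTimeout|SuspendTime|CompleteWait'
```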
> What values do you have for MessageTimeout, ResumeTimeout, and SuspendTime?

MessageTimeout = 60 sec
ResumeTimeout = 1200 sec
SuspendTimeout = 30 sec

> Also, you said your BatchStartTimeout was 30, but you put down 120 in the conf. Do they both show the issue, or only the first one?

Apologies, BatchStartTimeout can be 30s or 120s; I get the repro in both cases. Perhaps set it to 30s if you currently can't get a repro with 120s.

> Since we are no longer doing the resize, is everything after run2.sub even needed?

Correct, only the sbatch is required for the repro.

> What is the job doing while it's stuck in this longer-than-usual delay? Does it never finish running when it should? Or does it stay in completing forever? Or something else?

No, it didn't start at all; the output file is not even created. The 3-minute gap is because of "BatchStartTimeout = 120", plus waiting for the next ping_interval. With a lower BatchStartTimeout, the gap will be smaller.

The repro doesn't trigger 100% of the time; how many times did you try?
(In reply to Felix Abecassis from comment #47)
> > What values do you have for MessageTimeout, ResumeTimeout, and SuspendTime?
>
> MessageTimeout = 60 sec
> ResumeTimeout = 1200 sec
> SuspendTimeout = 30 sec
Ah, ok. I think your huge resume timeout combined with suspend timeout is triggering this code in _purge_missing_jobs():

    if ((job_ptr->batch_flag != 0) &&
        (slurm_conf.suspend_time != 0) && /* power mgmt on */
        (job_ptr->start_time < node_boot_time)) {
            startup_time = batch_startup_time - slurm_conf.resume_timeout;

So startup_time is much less than it would be. Let me try to set a high resume timeout with suspend timeout to see if I can reproduce it.
Ok, I was able to reproduce it once. I believe a well-timed node registration is messing up _batch_launch_defer() so that the batch script never runs after that, and then things just time out (as expected).

Here's a timeline from my single time reproducing it:

[15:20:16] (batch submit RPC)
[15:20:17] (job allocation)
[15:20:17] (_batch_launch_defer() waits to run batch script until prologs all finish)
[15:20:17] (Prologs for nodes 2-6 finish)
[15:20:22] (_batch_launch_defer() runs, but doesn't launch batch script because prolog node 1 hasn't finished)
[15:20:25] (Node registration comes in)
[15:20:27] (Node 1 prolog finishes)
[15:20:55] (Node registration comes in)
[15:21:25] (Node registration comes in, BatchStartTimeout-related code hit, and "missing from batch node" error emitted)

I still need to put in more debug logs to understand what code paths are being taken that make it so _batch_launch_defer() no longer hits the v1 code to start the batch script. Does it exit early? Or does _batch_launch_defer() stop being called altogether?

I'm hopeful that there is a straightforward fix to this issue. I am still hopeful the approach of v1 can work if we fix it to handle this.

Thanks,
-Michael
I think I see the problem. _agent_defer() -> _batch_launch_defer() only gets called when the global var last_job_update is updated. I believe that we need to take a job write lock and set last_job_update whenever we update the prolog counter for a job (in job_ptr->kill_on_node_fail), or else _batch_launch_defer() will never run again until things time out and it's too late.

It might be as simple as adding:

    lock_slurmctld(job_write_lock);
    last_job_update = time(NULL);
    unlock_slurmctld(job_write_lock);

in the appropriate place in prolog_complete, assuming job_write_lock isn't already taken:

    if (!(msg->flags & CTLD_QUEUE_PROCESSING))
        lock_slurmctld(job_write_lock);
    error_code = prolog_complete(comp_msg->job_id, comp_msg->prolog_rc);
    if (!(msg->flags & CTLD_QUEUE_PROCESSING))
        unlock_slurmctld(job_write_lock);

I'm not sure what CTLD_QUEUE_PROCESSING is all about yet.

I'm going home for the weekend, but feel free to play around with this idea as needed. I'll pick this up first thing Monday morning and see if I can't get a v2 that works.

Thanks!
-Michael
Thanks for the update! I will be OOTO next week, so take your time!
Created attachment 22363 [details]
v2

Hey Felix,

Here is a v2 that fixes the issue in the latest reproducer you specified.

The issue was that the thread that checks if a job's batch script is ready to run (_agent_defer() -> _batch_launch_defer()) waits until a global var called last_job_update is updated (it's an optimization - why check up on job-related things when no jobs have changed?). v1 never touched it, so _batch_launch_defer() was never called to start the batch script after the last prolog finished. last_job_update would have to be set by some other job in order to run the job in question. This is probably why it was not 100% reproducible for you - some other unrelated job could finish or start and trigger _batch_launch_defer() before a node registration hits the BatchStartTime timeout.

I don't expect that you'll hit any BatchStartTime timeouts after v2, but if you do, then I think the various timeout values will need to be adjusted further.

Let me know how v2 goes.

Thanks!
-Michael
Oh, and I should mention that v2 requires that enable_rpc_queue is not specified. I'm assuming that you are not using that, but if you are, then we will need to get a v3 that can play nice with enable_rpc_queue.
As for the "too many prolog completes received" error, I don't think they hurt anything, but it would be good if you can find a reproducer for that as well once you can confirm v2 fixes the more important issue. I'm not sure why a job's node_cnt is 0 when those prolog complete RPCs come in, but maybe that is just an expected situation if a job gets cancelled or something while other prologs are still finishing up.
I confirm that patch v2 fixes the issue seen with v1 on my local machine, as described in comment 44. We are going to deploy on our test cluster next.
Unfortunately, we are still seeing hung jobs on the test cluster with the error "not found BatchStartTime after startup", but so far I cannot repro the issue locally. So for now we are going to go with the "srun true" workaround, hopefully the DeferBatch feature will make it to 22.05.
Thanks for the update, Felix. Michael will be back in the office tomorrow, and I will have him look into your update.
(In reply to Felix Abecassis from comment #56)
> Unfortunately, we are still seeing hung jobs on the test cluster with the
> error "not found BatchStartTime after startup", but so far I cannot repro
> the issue locally.
> So for now we are going to go with the "srun true" workaround, hopefully the
> DeferBatch feature will make it to 22.05.
It's possible that you still need to adjust some values affecting the BatchStartTime timeout.

Could you attach more information about this set of "not found BatchStartTime after startup" errors? I'm assuming they are less frequent with v2 than with v1? Or are they just as frequent?

-Michael
BatchStartTime would not help as there is a large gap between the end of the prolog and the job being cancelled:

2021-12-06T03:05:26-08:00 slurm-01 slurmctld[1792945]: _slurm_rpc_submit_batch_job: JobId=6207 InitPrio=16884 usec=9173
2021-12-06T03:05:27-08:00 slurm-01 slurmctld[1792945]: sched: Allocate JobId=6207 NodeList=n[0002-0003,0005-0007,0009-0011] #CPUs=2048 Partition=part
2021-12-06T03:20:34-08:00 slurm-01 slurmctld[1792945]: Batch JobId=6207 missing from batch node n0002 (not found BatchStartTime after startup)
2021-12-06T03:20:34-08:00 slurm-01 slurmctld[1792945]: _job_complete: JobId=6207 WTERMSIG 126
2021-12-06T03:20:34-08:00 slurm-01 slurmctld[1792945]: _job_complete: JobId=6207 cancelled by node failure
2021-12-06T03:20:34-08:00 slurm-01 slurmctld[1792945]: _job_complete: JobId=6207 done

It looks like it's less frequent than before, but hard to say for sure.
(In reply to Felix Abecassis from comment #59)
> BatchStartTime would not help as there is a large gap between the end of the
> prolog and the job being cancelled:
>
> 2021-12-06T03:05:26-08:00 slurm-01 slurmctld[1792945]: _slurm_rpc_submit_batch_job: JobId=6207 InitPrio=16884 usec=9173
> 2021-12-06T03:05:27-08:00 slurm-01 slurmctld[1792945]: sched: Allocate JobId=6207 NodeList=n[0002-0003,0005-0007,0009-0011] #CPUs=2048 Partition=part
> 2021-12-06T03:20:34-08:00 slurm-01 slurmctld[1792945]: Batch JobId=6207 missing from batch node n0002 (not found BatchStartTime after startup)
> 2021-12-06T03:20:34-08:00 slurm-01 slurmctld[1792945]: _job_complete: JobId=6207 WTERMSIG 126
> 2021-12-06T03:20:34-08:00 slurm-01 slurmctld[1792945]: _job_complete: JobId=6207 cancelled by node failure
> 2021-12-06T03:20:34-08:00 slurm-01 slurmctld[1792945]: _job_complete: JobId=6207 done
>
> It looks like it's less frequent than before, but hard to say for sure.
Ok, so are the jobs with the "missing from batch node" errors still exclusively correlated with job resizing/shrinking, or are jobs that don't need a resize also showing this error? Is there any more information you can give about the affected jobs?

Thanks,
-Michael
Unfortunately, I could not find more useful information about this failure, it looks very similar to the situation that led to patch v2: https://bugs.schedmd.com/show_bug.cgi?id=12801#c31 But this time I can't reproduce locally.
(In reply to Felix Abecassis from comment #61)
> Unfortunately, I could not find more useful information about this failure,
> it looks very similar to the situation that led to patch v2:
> https://bugs.schedmd.com/show_bug.cgi?id=12801#c31
> But this time I can't reproduce locally.
Would you be open to applying a debug patch on top of v2 to see what is causing the batch timeout to trigger? I think we need more visibility on that.
(In reply to Felix Abecassis from comment #56)
> So for now we are going to go with the "srun true" workaround, hopefully the
> DeferBatch feature will make it to 22.05.
How's the "srun true" workaround going?

Since you have a workaround, I'm going to go ahead and drop this down to a severity 3 until you are able to create a reproducer w/ v2 or are available to apply a further debug patch, so we can better see what's going on with the jobs that are causing BatchStartTime timeouts.

Thanks!
-Michael
Sorry for the delay. We weren't able to deploy v2, so we asked users to use the workaround instead. It seems fine so far, but there have been fewer resizing jobs than usual in the past few days, so we will continue monitoring. It's fine to drop to severity 3; we will circle back when we have time to deploy a debug patch.
Hi Felix,

(In reply to Michael Hinton from comment #63)
> Since you have a workaround, I'm going to go ahead and drop this down to a
> severity 3 until you are able to create a reproducer w/ v2 or are available
> to apply a further debug patch, so we can better see what's going on with
> the jobs that are causing BatchStartTime timeouts.
Have you been able to look into creating a reproducer with v2? Or if that has been elusive, we could supply you with a debug patch to help see what's going on.

(In reply to Felix Abecassis from comment #35)
> DeferBatch is a way to prevent the symptoms, but we hope that job resizing
> will be made to work with all possible Slurm configurations. Or if job
> resizing cannot be made to work without DeferBatch being enabled, it should
> probably be enforced in the code.
The resize code was never designed to be triggered this quickly. The expected resize workflow was to run a job for a while, then release nodes while some post-processing continued on a subset of the resources, then for the job to end. And that is all assuming reasonable durations for the Prolog and Epilog.

That being said, hopefully we can improve the resizing code to not allow a resize to start when it shouldn't. It should ideally be smart enough to not jump the gun and resize when prologs/epilogs are still finishing.

Thanks,
-Michael
> Have you been able to look into creating a reproducer with v2? Or if that has been elusive, we could supply you with a debug patch to help see what's going on.

No, sorry. I tried a bit more but I still couldn't get a repro locally. At this point I think we are going to wait until the 22.05 release, with hopefully the DeferBatch patch being merged.

> The resize code was never designed to be triggered this quickly. The expected resize workflow was to run a job for a while, then release nodes while some post-processing continued on a subset of the resources, then for the job to end. And that is all assuming reasonable durations for the Prolog and Epilog.

That makes sense, and that's why we mentioned that https://bugs.schedmd.com/show_bug.cgi?id=12266 would also fix this bug, as no resizing would be needed for this use case.
FYI, it seems like the v2 patch is causing this bug: https://bugs.schedmd.com/show_bug.cgi?id=13187 (I initially didn't realize our cluster still had this patch applied.)
Hi Felix, Have you been able to make any progress on a reproducer for the error you see with v2 (and now on master, according to bug 11635 comment 44 and bug 13187 comment 3)? If not, could you provide more information for another instance when you reproduced it? Thanks, -Michael
There are 2 different problems here.

1) https://bugs.schedmd.com/show_bug.cgi?id=12801#c56 is an error of type "not found BatchStartTime after startup". It happened with v2 of the patch, but I have *not* verified yet whether this issue is happening on master. I don't have a repro for this one, as discussed in the previous comments here.

2) https://bugs.schedmd.com/show_bug.cgi?id=13187 is an error of type "Duplicate jobid". It happened with v2 of the patch and it is happening on master. The repro is simple and is mentioned in the description of #13187.
Comment on attachment 22363 [details]
v2

I'm marking v2 as obsolete, because it was incorporated in commits 180e6c6525, 6fd7283818, and c8e56447ad, which landed on master a few months ago. So the DeferBatch flag should be in master and should solve the resizing problem described in this bug.

Dominik has another patch that he will soon make available in bug 11635 for you to test that should fix the other issue you are seeing, I think.

Is it alright if we close this bug and continue tracking down the other issue in bug 11635?

Thanks,
-Michael
Yes that's fine for me. I will test the master branch + Dominik's patch when it's available.
Ok great. I'll mark this as a duplicate of 11635, since it was fixed there and we'll continue there as well. Thanks! -Michael *** This ticket has been marked as a duplicate of ticket 11635 ***