Summary: | Discarded REQUEST_STEP_COMPLETE events leading to stalled job steps
---|---
Product: | Slurm
Reporter: | Felix Abecassis <fabecassis>
Component: | slurmstepd
Assignee: | Dominik Bartkiewicz <bart>
Status: | RESOLVED FIXED
Severity: | 3 - Medium Impact
Priority: | ---
CC: | albert.gil, jbernauer, lyeager
Version: | 20.11.6
Hardware: | Linux
OS: | Linux
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=12801
Site: | NVIDIA (PSLA)
Version Fixed: | 22.05.0pre1
Attachments: | patch proposal, patch proposal, slurmctld log for one job, patch proposal v13, slurmd.node-1.log
Description
Felix Abecassis
2021-05-17 13:01:24 MDT
Note that blocking later job steps is a new behavior in 20.11. The error message can show up in 20.02 as well, but there the next job step executed immediately.

Using this batch script:

#!/bin/bash -eux
#SBATCH -t 5 --nodes=8 --exclusive
srun --mpi=none --ntasks="$((SLURM_JOB_NUM_NODES * 8))" --ntasks-per-node=8 -l true
srun --mpi=none --ntasks="$((SLURM_JOB_NUM_NODES * 8))" --ntasks-per-node=8 -l echo "done"

On Slurm 20.11:

$ sacct -j 4 --format jobid,start,end,elapsed
JobID               Start                 End    Elapsed
------------ ------------------- ------------------- ----------
4            2021-05-17T13:31:09 2021-05-17T13:32:23   00:01:14
4.batch      2021-05-17T13:31:09 2021-05-17T13:32:23   00:01:14
4.0          2021-05-17T13:31:09 2021-05-17T13:32:22   00:01:13
4.1          2021-05-17T13:32:22 2021-05-17T13:32:23   00:00:01

The job took 01:14 to execute, as job step 4.1 was blocked on the termination of 4.0. Per bug#11446, it seems that 4.0 now holds on to the node resources for a long duration because of this bug.

On Slurm 20.02:

$ sacct -j 5 --format jobid,start,end,elapsed
JobID               Start                 End    Elapsed
------------ ------------------- ------------------- ----------
5            2021-05-17T13:34:51 2021-05-17T13:35:02   00:00:11
5.batch      2021-05-17T13:34:51 2021-05-17T13:35:02   00:00:11
5.0          2021-05-17T13:34:51 2021-05-17T13:36:03   00:01:12
5.1          2021-05-17T13:35:01 2021-05-17T13:35:02   00:00:01

The whole job took 00:11 to execute (because of the artificial 10s delay). Job step 0 had the same duration, but it didn't prevent job step 1 from starting.

Hi
I can recreate this issue. I don't know if we should treat this as a bug. You can modify the reverse tree timeouts (lower REVERSE_TREE_CHILDREN_TIMEOUT or increase REVERSE_TREE_PARENT_RETRY).
Dominik

> I don't know if we should treat this as a bug.
If this is the decision, please explain the rationale.
Thanks,
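For context on Dominik's suggestion: REVERSE_TREE_CHILDREN_TIMEOUT and REVERSE_TREE_PARENT_RETRY are compile-time constants in the slurmstepd source, not slurm.conf parameters, so adjusting them means patching and rebuilding Slurm. A minimal sketch of how one might locate them in a source checkout before editing; the exact file and default values vary by release and are not confirmed here:

$ # from the top of a Slurm source tree
$ grep -rn -e REVERSE_TREE_CHILDREN_TIMEOUT -e REVERSE_TREE_PARENT_RETRY src/
$ # edit the #define values reported above, then rebuild and redeploy slurmd/slurmstepd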
Hi
It is not a decision; it is my understanding of this issue. This sixty-second timeout exists to handle the situation where a stepd ends before its parent stepd in the reverse tree has started. This mechanism works fine. We can discuss whether the current values of those timeouts are optimal.
A bigger concern is why such a situation is occurring when you submit the job on your cluster. This is only speculation, but I suspect that this can be related to PrologFlags=Serial.
Could you send me slurmd.log with an affected job from the production cluster?
Dominik

> This mechanism works fine. We can discuss whether the current values of those timeouts are optimal.
Got it. It was indeed harmless until Slurm 20.11. But now that the default behavior has changed (no overlap of job steps), the job steps are delayed, and this is a problem as it is not an efficient use of resources. So it was not a bug in 20.02, but in 20.11 it is a bug.
If you can provide us with a single reliable knob (not through user environment variables) to disable the new 20.11 default and always allow overlapping job steps, then the issue will still exist but will go back to being harmless, like in 20.02.

> A bigger concern is why such a situation is occurring when you submit the job on your cluster.
This is fair, we can focus on that for now.

> This is only speculation, but I suspect that this can be related to PrologFlags=Serial.
How so? "Serial" applies to the prolog and epilog; are you saying it can have an impact on the execution between the prolog and the job?

> Could you send me slurmd.log with an affected job from the production cluster?
Here is an extract of the log for node-0507, which is one of the deepest nodes in the job tree:
$ grep -A2 '1407764' node-0507.log
18:55:11 slurm-prolog: START user=fabecassis job=1407764
18:55:11 slurm-prolog: /etc/slurm/prolog.d/...
--
18:55:56 slurmd[25055]: launch task StepId=1407764.0 request from UID:10009 GID:10001 HOST:10.248.0.2 PORT:47232
18:55:56 slurmd[25055]: task/affinity: lllp_distribution: JobId=1407764 manual binding: none
18:55:56 slurmd[25055]: error: stepd_connect to StepId=1407764.0 failed: No such file or directory
18:55:57 slurmd[25055]: message repeated 5 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
18:55:57 slurm-prolog: NHC end
18:55:57 slurm-prolog: END user=fabecassis job=1407764
18:55:59 slurmstepd[3655936]: done with job
18:56:16 slurmd[25055]: launch task StepId=1407764.1 request from UID:10009 GID:10001 HOST:10.248.0.2 PORT:47362
18:56:16 slurmd[25055]: task/affinity: lllp_distribution: JobId=1407764 manual binding: none
18:56:17 slurmstepd[3655976]: done with job
18:56:18 slurm-epilog: START user=fabecassis job=1407764
18:56:19 slurm-epilog: /etc/slurm/epilog.d/30-tuned-profile.sh:
18:56:19 slurm-epilog: Restoring default settings
--
18:56:57 slurm-epilog: END user=fabecassis job=1407764
The step 0 launch request is concurrent with the execution of the prolog and with the stepd_connect failures.
And below is the log for node-0002, which should be close to the root of the tree if I understand correctly:
$ grep -A2 '1407764' node-0002.log.1
18:55:11 slurm-prolog: START user=fabecassis job=1407764
18:55:11 slurm-prolog: /etc/slurm/prolog.d/10-fans.sh:
18:55:12 slurm-prolog: /etc/slurm/prolog.d/...
--
18:55:56 slurmd[25976]: launch task StepId=1407764.0 request from UID:10009 GID:10001 HOST:10.248.0.2 PORT:60674
18:55:56 slurmd[25976]: task/affinity: lllp_distribution: JobId=1407764 manual binding: none
18:55:56 slurm-prolog: NHC end
18:55:56 slurm-prolog: END user=fabecassis job=1407764
18:56:16 slurmd[25976]: launch task StepId=1407764.1 request from UID:10009 GID:10001 HOST:10.248.0.2 PORT:60804
18:56:16 slurmd[25976]: task/affinity: lllp_distribution: JobId=1407764 manual binding: none
18:56:18 slurmstepd[3361654]: done with job
18:57:06 slurmstepd[3361628]: done with job
18:57:07 slurmd[25976]: error: stepd_connect to StepId=1407764.0 failed: No such file or directory
18:57:09 slurmd[25976]: message repeated 5 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
18:57:09 slurm-epilog: START user=fabecassis job=1407764
18:57:09 slurm-epilog: /etc/slurm/epilog.d/...
--
18:57:10 slurmd[25976]: error: stepd_connect to StepId=1407764.0 failed: No such file or directory
18:57:10 slurmd[25976]: message repeated 4 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
18:57:11 slurmd[25976]: error: stepd_connect to StepId=1407764.0 failed: No such file or directory
18:57:17 slurmd[25976]: message repeated 36 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
18:57:18 slurmd[25976]: error: stepd_connect to StepId=1407764.0 failed: No such file or directory
18:57:18 slurmd[25976]: message repeated 5 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
[... more stepd_connect failures ...]
18:57:34 slurmd[25976]: message repeated 5 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
18:57:38 slurm-epilog: NHC end
18:57:38 slurm-epilog: END user=fabecassis job=1407764
It looks like it could be the reverse problem on this node: REVERSE_TREE_CHILDREN_TIMEOUT is *too low*, so node-0002 has already given up on this job but continues receiving events from its children?
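A side note on the overlap question raised earlier: starting with 20.11, srun has an --overlap option that lets a step share resources with other steps instead of waiting for them. It is a per-step flag rather than the single cluster-wide knob requested above, and it was not verified here whether it fully avoids the wait in this exact scenario; a sketch against the reproducer script from the description:

#!/bin/bash -eux
#SBATCH -t 5 --nodes=8 --exclusive
# --overlap (srun, 20.11+) asks each step to overlap with other steps,
# so step 1 should not be blocked waiting for step 0's stepd to finish.
srun --overlap --mpi=none --ntasks="$((SLURM_JOB_NUM_NODES * 8))" --ntasks-per-node=8 -l true
srun --overlap --mpi=none --ntasks="$((SLURM_JOB_NUM_NODES * 8))" --ntasks-per-node=8 -l echo "done"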
Hi
Could you send me the full slurmd.log from nodes node-0002 and node-0507? How was this job submitted, by sbatch or by salloc?
Dominik

The job was submitted through "sbatch". Unfortunately I cannot share the full slurmd.log. Please let me know if you are looking for something in particular in the log and then I can provide it.

Hi
Could you check if, after adding sbatch_wait_nodes to SchedulerParameters or using the sbatch "--wait-all-nodes" option, this problem still exists?
Dominik

Tested on the cluster, it doesn't seem to change the situation. With --wait-all-nodes=1 and *without* --overcommit, I still see:

srun: Job 1441883 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 1441883

$ sacct -j 1441883 --format jobid,start,end
JobID               Start                 End
------------ ------------------- -------------------
1441883      2021-06-04T10:54:36 2021-06-04T10:56:32
1441883.bat+ 2021-06-04T10:54:36 2021-06-04T10:56:32
1441883.0    2021-06-04T10:55:17 2021-06-04T10:56:32
1441883.1    2021-06-04T10:56:32 2021-06-04T10:56:32

Hi
To be sure: this was not run on the cluster with the reproducer patch from the initial report?
Dominik

No, the patch with the call to sleep() was only tested on a local machine with multiple slurmd. My latest test was on a production cluster, using sbatch --wait-all-nodes=1.

Hi
We added a new DeferBatch flag to master. It will force the batch step to start only after the prolog has finished on all nodes.
https://github.com/SchedMD/slurm/compare/16cf1d3e76..0c35e45d35
Let me know if we can close this ticket or if you have any related questions.
Dominik

Hi Dominik,
This patch seems to have the same bug as the patch provided by Michael in https://bugs.schedmd.com/show_bug.cgi?id=12801, so I have reopened https://bugs.schedmd.com/show_bug.cgi?id=13187, as it is now happening on the master branch.
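For reference, the DeferBatch flag mentioned above is enabled through PrologFlags in slurm.conf (the same setting shows up in the scontrol output later in this ticket). A minimal sketch, assuming a master/22.05-level controller and slurmds, with the other flags taken from this site's existing configuration:

# slurm.conf -- changing PrologFlags generally requires restarting the daemons.
# DeferBatch delays launching the batch step until the prolog has completed
# on all nodes of the allocation.
PrologFlags=Alloc,Contain,DeferBatch,Serial

$ scontrol show config | grep PrologFlags
PrologFlags             = Alloc,Contain,DeferBatch,Serial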
Hi
I can reproduce this issue. I will close bug 13187 as a duplicate of this ticket. This bug is not directly caused by these patches, but they make this race easier to hit. I will let you know when I have a patch for this.
Dominik

*** Ticket 13187 has been marked as a duplicate of this ticket. ***
*** Ticket 12801 has been marked as a duplicate of this ticket. ***

Created attachment 24546 [details]
patch proposal
Hi
This patch should fix this race.
Could you check if it works as we expect and correctly covers all your use cases?
Dominik
Hi Dominik,

The patch does not apply on the current master:

$ git am -3 patch
Applying: Allow other steps to start before the batch step
Using index info to reconstruct a base tree...
M	src/slurmd/slurmd/req.c
Falling back to patching base and 3-way merge...
Auto-merging src/slurmd/slurmd/req.c
CONFLICT (content): Merge conflict in src/slurmd/slurmd/req.c
error: Failed to merge in the changes.
Patch failed at 0001 Allow other steps to start before the batch step
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Created attachment 24608 [details]
patch proposal
Hi
Generated on a fresh master; I hope it will apply cleanly.
Dominik
It didn't work:

$ curl -s https://bugs.schedmd.com/attachment.cgi?id=24608 | git am -3
Applying: Allow other steps to start before the batch step
Using index info to reconstruct a base tree...
M	src/slurmd/slurmd/req.c
Falling back to patching base and 3-way merge...
Auto-merging src/slurmd/slurmd/req.c
CONFLICT (content): Merge conflict in src/slurmd/slurmd/req.c
error: Failed to merge in the changes.
Patch failed at 0001 Allow other steps to start before the batch step
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

I'm doing a manual merge.

Created attachment 24623 [details]
slurmctld log for one job
I'm facing issues with the DeferBatch flag on master (without the patch attached to this bug). I'm running 8 slurmd on my local workstation and I disabled prologs and epilogs.

Without DeferBatch, a simple job starts and finishes immediately:

$ scontrol show config | grep PrologFlags
PrologFlags             = Alloc,Contain,Serial
$ sbatch -N8 --wrap="srun true; srun sleep 2s;"
Submitted batch job 29450
$ sacct -j 29450 --format jobid,start,end,elapsed
JobID               Start                 End    Elapsed
------------ ------------------- ------------------- ----------
29450        2022-04-22T14:29:37 2022-04-22T14:29:39   00:00:02
29450.batch  2022-04-22T14:29:37 2022-04-22T14:29:39   00:00:02
29450.extern 2022-04-22T14:29:37 2022-04-22T14:29:39   00:00:02
29450.0      2022-04-22T14:29:37 2022-04-22T14:29:37   00:00:00
29450.1      2022-04-22T14:29:37 2022-04-22T14:29:39   00:00:02

But with DeferBatch, the batch script takes between 20 and 120s to start executing:

$ scontrol show config | grep PrologFlags
PrologFlags             = Alloc,Contain,DeferBatch,Serial
$ sbatch -N8 --wrap="srun true; srun sleep 2s;"
Submitted batch job 29453
$ sacct -j 29453 --format jobid,start,end,elapsed
JobID               Start                 End    Elapsed
------------ ------------------- ------------------- ----------
29453        2022-04-22T14:33:10 2022-04-22T14:34:57   00:01:47
29453.batch  2022-04-22T14:33:10 2022-04-22T14:34:57   00:01:47
29453.extern 2022-04-22T14:33:10 2022-04-22T14:34:57   00:01:47
29453.0      2022-04-22T14:34:55 2022-04-22T14:34:55   00:00:00
29453.1      2022-04-22T14:34:55 2022-04-22T14:34:57   00:00:02

On the slurmctld side, I see that all 8 nodes sent REQUEST_COMPLETE_PROLOG at 14:33:10 but REQUEST_BATCH_JOB_LAUNCH happened at 14:34:55. See partial slurmctld log attached.

Created attachment 24643 [details]
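The timestamps above were presumably pulled out of the controller log by RPC name. A hedged sketch of that kind of check, assuming the slurmctld log level is high enough to record RPC processing (exact message formats, and whether the job id appears on the same line, vary by version and debug flags):

$ # compare when the last REQUEST_COMPLETE_PROLOG arrived vs. when the batch launch went out
$ grep -E 'REQUEST_COMPLETE_PROLOG|REQUEST_BATCH_JOB_LAUNCH' slurmctld.log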
patch proposal v13
Hi
Thanks for catching this. I can recreate this issue; it can occur on synthetic/empty clusters.
This patch forces the agent to check deferred RPCs after the prolog is complete, which should prevent this delay.
Dominik
There might still be a problem with this case. I reused the setup described in the original bug description, with an artificial delay for node-1 in a multiple-slurmd setup (I did not try on a real cluster), and reused the run.sub from comment #1.

I see that the job steps have no large gap between them, so this part is solved:

$ sacct -j 29496 --format jobid,start,end,elapsed
JobID               Start                 End    Elapsed
------------ ------------------- ------------------- ----------
29496        2022-04-25T10:51:13 2022-04-25T10:51:24   00:00:11
29496.batch  2022-04-25T10:51:13 2022-04-25T10:51:24   00:00:11
29496.extern 2022-04-25T10:51:13 2022-04-25T10:51:24   00:00:11
29496.0      2022-04-25T10:51:13 2022-04-25T10:52:26   00:01:13
29496.1      2022-04-25T10:51:24 2022-04-25T10:51:24   00:00:00

But the 1-minute delay seems to have moved into node cleanup: node-1 now stays for 1+ minute in this state:

$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  29496    normal  run.sub fabecass CG       0:11      1 node-1

So while the job itself executed the job steps faster, I'm concerned that some nodes might not go back to idle as fast as they should.

Created attachment 24652 [details]
slurmd.node-1.log
Attached is the debug log for node-1 (the node getting the artificial delay).
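For completeness, one way to simulate a slow prolog on a single node in a multiple-slurmd test setup is a per-node delay in a prolog.d script, following this site's prolog.d convention. This is only an illustrative sketch (node name, path, and delay are assumptions), not necessarily the exact reproducer used for this log:

#!/bin/bash
# e.g. /etc/slurm/prolog.d/99-artificial-delay.sh (hypothetical path)
# Sleep only on the chosen node to widen the race between the prolog
# and the first step launch request on that node.
if [ "${SLURMD_NODENAME:-$(hostname -s)}" = "node-1" ]; then
    sleep 60
fi
exit 0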
Hi
In this case, the reverse tree fallback mechanism works properly. This extra timeout occurs due to a change in srun and can't be caused by the prolog.
Dominik

Okay, sounds good. You're right that with DeferBatch this issue is less likely to happen now; it would require a significant network problem to cause such a propagation delay. If we see problems like that in production, we will open a new bug then.

Hi
Both commits are in master and will be included in 22.05.
https://github.com/SchedMD/slurm/commit/b8231a3e73
https://github.com/SchedMD/slurm/commit/30bf69732e
I'm gonna go ahead and close the bug. Please reopen if there are further questions or concerns.
Dominik