Related to https://bugs.schedmd.com/show_bug.cgi?id=11446 (private).

On our cluster, for some jobs, we sometimes see many error messages like the following:

May 14 18:56:06 node-0022 slurmd[26142]: message repeated 31 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]

This only ever happens for job step 0. Using sacct shows that this job step has an anomalous end time, beyond the end time of the job itself:

$ sacct -j 1407764 --format jobid,start,end,elapsed
       JobID               Start                 End    Elapsed
------------ ------------------- ------------------- ----------
1407764      2021-05-14T18:55:11 2021-05-14T18:56:18   00:01:07
1407764.bat+ 2021-05-14T18:55:11 2021-05-14T18:56:18   00:01:07
1407764.0    2021-05-14T18:55:56 2021-05-14T18:57:46   00:01:50
1407764.1    2021-05-14T18:56:16 2021-05-14T18:56:18   00:00:02

Steps 1407764.0 and 1407764.1 were launched using the same srun command, which just prints to stdout on each node and then exits. The job above was using "--overcommit", so the two time ranges overlap even though the steps were run sequentially as foreground commands in the sbatch script. Without --overcommit, the time ranges no longer overlap, but the second step is blocked for a while:

srun: Job step creation temporarily disabled, retrying (Requested nodes are busy)

This is a waste of resources for short-lived, large multi-node jobs: the user task for job step 0 completed very quickly, but step 1 has to wait ~2 minutes to run.

In https://bugs.schedmd.com/show_bug.cgi?id=11446, I also found out that the slurmstepd process for step 0 was still running for a while, blocked in function stepd_wait_for_children_slurmstepd. I was then able to reproduce the error on a local 8-node cluster (with multiple slurmd, *not* front-end), by injecting an artificial 10s delay before sending the step launch request to the root of the tree (node-1):
https://github.com/SchedMD/slurm/blob/slurm-20-11-6-1/src/api/step_launch.c#L1735

	if (ctx->step_req->step_id.step_id == 0) {
		ret_list = slurm_send_recv_msgs("node-[2-8]", &msg, timeout);
		error("sleep 10s before sending to node-1");
		sleep(10);
		ret_list = slurm_send_recv_msgs("node-1", &msg, timeout);
	} else
		ret_list = slurm_send_recv_msgs(nodelist, &msg, timeout);

With this modification, we can reproduce the "No such file or directory" error message for node-1. I believe it happens because slurmd on node-1 receives the step completion message (REQUEST_STEP_COMPLETE) *before* it has received the REQUEST_LAUNCH_TASKS request. It looks like this message is simply discarded when stepd_connect fails, and then the slurmstepd process on node-1 gets stuck in stepd_wait_for_children_slurmstepd, waiting for completion messages from children that have already sent them.

On our cluster, this issue happens for very large jobs, which would indicate that we are hitting a natural propagation delay like the one simulated above.
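As a side note, one way to confirm where a lingering slurmstepd is blocked is to take a backtrace of the live process. A minimal sketch, assuming gdb is available on the node and that slurmstepd sets its process title in the "slurmstepd: [jobid.stepid]" form (that format is from memory and may differ between versions):

# Find the PID of the step 0 slurmstepd for the job (process title format is an assumption)
$ pgrep -af 'slurmstepd: \[1407764\.0\]'
# Dump backtraces of all threads; in our case the interesting frame was stepd_wait_for_children_slurmstepd
$ sudo gdb -p <PID> -batch -ex 'thread apply all bt'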
Note that blocking later job steps is a new behavior in 20.11. The error message can show up in 20.02 as well, but there the next job step executed immediately.

Using this batch script:

#!/bin/bash -eux
#SBATCH -t 5 --nodes=8 --exclusive
srun --mpi=none --ntasks="$((SLURM_JOB_NUM_NODES * 8))" --ntasks-per-node=8 -l true
srun --mpi=none --ntasks="$((SLURM_JOB_NUM_NODES * 8))" --ntasks-per-node=8 -l echo "done"

On Slurm 20.11:

$ sacct -j 4 --format jobid,start,end,elapsed
       JobID               Start                 End    Elapsed
------------ ------------------- ------------------- ----------
4            2021-05-17T13:31:09 2021-05-17T13:32:23   00:01:14
4.batch      2021-05-17T13:31:09 2021-05-17T13:32:23   00:01:14
4.0          2021-05-17T13:31:09 2021-05-17T13:32:22   00:01:13
4.1          2021-05-17T13:32:22 2021-05-17T13:32:23   00:00:01

The job took 01:14 to execute, as job step 4.1 was blocked on the termination of 4.0. Per bug #11446, it seems that 4.0 now holds on to the node resources for a long duration because of this bug.

On Slurm 20.02:

$ sacct -j 5 --format jobid,start,end,elapsed
       JobID               Start                 End    Elapsed
------------ ------------------- ------------------- ----------
5            2021-05-17T13:34:51 2021-05-17T13:35:02   00:00:11
5.batch      2021-05-17T13:34:51 2021-05-17T13:35:02   00:00:11
5.0          2021-05-17T13:34:51 2021-05-17T13:36:03   00:01:12
5.1          2021-05-17T13:35:01 2021-05-17T13:35:02   00:00:01

The whole job took 00:11 to execute (because of the artificial 10s delay). Job step 0 had the same duration, but it didn't prevent job step 1 from starting.
Hi

I can recreate this issue. I don't know if we should treat this as a bug. You can modify the reverse tree timeouts (lower REVERSE_TREE_CHILDREN_TIMEOUT or increase REVERSE_TREE_PARENT_RETRY).

Dominik
> I don't know if we should treat this as a bug.

If this is the decision, please explain the rationale.

Thanks,
Hi

It is not a decision; it is my understanding of this issue. This sixty-second timeout exists to handle the situation where a stepd finishes before its parent stepd in the reverse tree has started. This mechanism works fine. We can discuss if the current values of those timeouts are optimal.

A bigger concern is why such a situation occurs when you submit the job on your cluster. This is only speculation, but I suspect that this can be related to PrologFlags=Serial.

Could you send me slurmd.log with an affected job from the production cluster?

Dominik
> This mechanism works fine. We can discuss if the current values of those timeouts are optimal.

Got it. It was indeed harmless until Slurm 20.11. But now that the default behavior has changed (job steps no longer overlap), the job steps are delayed, and this is a problem because it is not an efficient use of resources. So it seems that it was not a bug in 20.02, but in 20.11 it is.

If you can provide us with a single reliable knob (not a user environment variable) to disable the new default behavior of 20.11, i.e. to always allow overlapping job steps, then the issue will still exist but will go back to being harmless like in 20.02.

> A bigger concern is why such a situation occurs when you submit the job on your cluster.

This is fair, we can focus on that for now.

> This is only speculation, but I suspect that this can be related to PrologFlags=Serial.

How so? "Serial" applies to the prolog and epilog; are you saying it can have an impact on the execution between the prolog and the job?
> Could you send me slurmd.log with an affected job from the production cluster?

Here is an extract of the log for node-0507, one of the deepest nodes in the job tree:

$ grep -A2 '1407764' node-0507.log
18:55:11 slurm-prolog: START user=fabecassis job=1407764
18:55:11 slurm-prolog: /etc/slurm/prolog.d/...
--
18:55:56 slurmd[25055]: launch task StepId=1407764.0 request from UID:10009 GID:10001 HOST:10.248.0.2 PORT:47232
18:55:56 slurmd[25055]: task/affinity: lllp_distribution: JobId=1407764 manual binding: none
18:55:56 slurmd[25055]: error: stepd_connect to StepId=1407764.0 failed: No such file or directory
18:55:57 slurmd[25055]: message repeated 5 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
18:55:57 slurm-prolog: NHC end
18:55:57 slurm-prolog: END user=fabecassis job=1407764
18:55:59 slurmstepd[3655936]: done with job
18:56:16 slurmd[25055]: launch task StepId=1407764.1 request from UID:10009 GID:10001 HOST:10.248.0.2 PORT:47362
18:56:16 slurmd[25055]: task/affinity: lllp_distribution: JobId=1407764 manual binding: none
18:56:17 slurmstepd[3655976]: done with job
18:56:18 slurm-epilog: START user=fabecassis job=1407764
18:56:19 slurm-epilog: /etc/slurm/epilog.d/30-tuned-profile.sh:
18:56:19 slurm-epilog: Restoring default settings
--
18:56:57 slurm-epilog: END user=fabecassis job=1407764

The step 0 launch request is concurrent with the prolog execution and with the stepd_connect failures.

And below is the log for node-0002, which should be close to the root of the tree if I understand correctly:

$ grep -A2 '1407764' node-0002.log.1
18:55:11 slurm-prolog: START user=fabecassis job=1407764
18:55:11 slurm-prolog: /etc/slurm/prolog.d/10-fans.sh:
18:55:12 slurm-prolog: /etc/slurm/prolog.d/...
--
18:55:56 slurmd[25976]: launch task StepId=1407764.0 request from UID:10009 GID:10001 HOST:10.248.0.2 PORT:60674
18:55:56 slurmd[25976]: task/affinity: lllp_distribution: JobId=1407764 manual binding: none
18:55:56 slurm-prolog: NHC end
18:55:56 slurm-prolog: END user=fabecassis job=1407764
18:56:16 slurmd[25976]: launch task StepId=1407764.1 request from UID:10009 GID:10001 HOST:10.248.0.2 PORT:60804
18:56:16 slurmd[25976]: task/affinity: lllp_distribution: JobId=1407764 manual binding: none
18:56:18 slurmstepd[3361654]: done with job
18:57:06 slurmstepd[3361628]: done with job
18:57:07 slurmd[25976]: error: stepd_connect to StepId=1407764.0 failed: No such file or directory
18:57:09 slurmd[25976]: message repeated 5 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
18:57:09 slurm-epilog: START user=fabecassis job=1407764
18:57:09 slurm-epilog: /etc/slurm/epilog.d/...
--
18:57:10 slurmd[25976]: error: stepd_connect to StepId=1407764.0 failed: No such file or directory
18:57:10 slurmd[25976]: message repeated 4 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
18:57:11 slurmd[25976]: error: stepd_connect to StepId=1407764.0 failed: No such file or directory
18:57:17 slurmd[25976]: message repeated 36 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
18:57:18 slurmd[25976]: error: stepd_connect to StepId=1407764.0 failed: No such file or directory
18:57:18 slurmd[25976]: message repeated 5 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
[... more stepd_connect failures ...]
18:57:34 slurmd[25976]: message repeated 5 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
18:57:38 slurm-epilog: NHC end
18:57:38 slurm-epilog: END user=fabecassis job=1407764

It looks like it could be the reverse problem on this node: REVERSE_TREE_CHILDREN_TIMEOUT is *too low*, so node-0002 has already given up on this job but keeps receiving events from its children?
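For reference, those two names look like compile-time constants in the slurmd/slurmstepd sources rather than slurm.conf options, so tuning them would mean rebuilding. A quick way to locate their definitions in a source checkout (the path is a guess on my part):

$ grep -rnE 'REVERSE_TREE_(CHILDREN_TIMEOUT|PARENT_RETRY)' src/slurmd/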
Hi

Could you send me the full slurmd.log from nodes node-0002 and node-0507? How was this job submitted, by sbatch or by salloc?

Dominik
The job was submitted through "sbatch". Unfortunately I cannot share the full slurmd.log. Please let me know if you are looking for something in particular in the log and then I can provide it.
Hi

Could you check whether this problem still exists after adding sbatch_wait_nodes to SchedulerParameters, or when using the "--wait-all-nodes" sbatch option? (Both are sketched below.)

Dominik
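A minimal sketch of the two variants, assuming any existing SchedulerParameters values in slurm.conf are preserved and the daemons are reconfigured afterwards:

# slurm.conf -- cluster-wide: wait for all allocated nodes to be ready before starting the batch script
SchedulerParameters=sbatch_wait_nodes

# or per job, on the command line (run.sub being the batch script):
$ sbatch --wait-all-nodes=1 run.sub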
Tested on the cluster, it doesn't seem to change the situation. With --wait-all-nodes=1 and *without* --overcommit, I still see:

srun: Job 1441883 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 1441883

$ sacct -j 1441883 --format jobid,start,end
       JobID               Start                 End
------------ ------------------- -------------------
1441883      2021-06-04T10:54:36 2021-06-04T10:56:32
1441883.bat+ 2021-06-04T10:54:36 2021-06-04T10:56:32
1441883.0    2021-06-04T10:55:17 2021-06-04T10:56:32
1441883.1    2021-06-04T10:56:32 2021-06-04T10:56:32
Hi

Just to be sure: this was not the cluster with the reproducer patch from the initial comment?

Dominik
No, the patch with the call to sleep() was only tested on a local machine with multiple slurmd. My latest test was on a production cluster, using sbatch --wait-all-nodes=1
Hi

We added a new DeferBatch flag to master. It forces the batch step to start only after the prolog is finished on all nodes.
https://github.com/SchedMD/slurm/compare/16cf1d3e76..0c35e45d35
(An example configuration line is sketched below.)

Let me know if we can close this ticket or if you have any related questions.

Dominik
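For reference, enabling it should just be a matter of adding DeferBatch to the PrologFlags line in slurm.conf; a minimal sketch (the other flags shown are only an example of an existing line, and a restart of the daemons is presumably required for PrologFlags changes):

# slurm.conf
PrologFlags=Alloc,Contain,DeferBatch,Serial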
Hi Dominik,

This patch seems to have the same bug as the patch provided by Michael in https://bugs.schedmd.com/show_bug.cgi?id=12801, so I have reopened https://bugs.schedmd.com/show_bug.cgi?id=13187 as it is not happening on the master branch.
I meant to say: > as it is *now* happening on the master branch.
Hi

I can reproduce this issue. I will close bug 13187 as a duplicate of this ticket. This bug is not directly caused by these patches, but they make this race easy to hit. I will let you know when I have a patch for this.

Dominik
*** Ticket 13187 has been marked as a duplicate of this ticket. ***
*** Ticket 12801 has been marked as a duplicate of this ticket. ***
Created attachment 24546 [details]
patch proposal

Hi

This patch should fix this race. Could you check if it works as we expect and correctly covers all your use cases?

Dominik
Hi Dominik,

The patch does not apply on the current master:

$ git am -3 patch
Applying: Allow other steps to start before the batch step
Using index info to reconstruct a base tree...
M	src/slurmd/slurmd/req.c
Falling back to patching base and 3-way merge...
Auto-merging src/slurmd/slurmd/req.c
CONFLICT (content): Merge conflict in src/slurmd/slurmd/req.c
error: Failed to merge in the changes.
Patch failed at 0001 Allow other steps to start before the batch step
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
Created attachment 24608 [details]
patch proposal

Hi

Generated on a fresh master; I hope it will apply cleanly.

Dominik
It didn't work:

$ curl -s https://bugs.schedmd.com/attachment.cgi?id=24608 | git am -3
Applying: Allow other steps to start before the batch step
Using index info to reconstruct a base tree...
M	src/slurmd/slurmd/req.c
Falling back to patching base and 3-way merge...
Auto-merging src/slurmd/slurmd/req.c
CONFLICT (content): Merge conflict in src/slurmd/slurmd/req.c
error: Failed to merge in the changes.
Patch failed at 0001 Allow other steps to start before the batch step
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

I'm doing a manual merge.
Created attachment 24623 [details]
slurmctld log for one job
I'm facing issues with the DeferBatch flag on master (without the patch attached to this bug). I'm running 8 slurmd on my local workstation and I disabled prologs and epilogs.

Without DeferBatch, a simple job starts and finishes immediately:

$ scontrol show config | grep PrologFlags
PrologFlags             = Alloc,Contain,Serial
$ sbatch -N8 --wrap="srun true; srun sleep 2s;"
Submitted batch job 29450
$ sacct -j 29450 --format jobid,start,end,elapsed
       JobID               Start                 End    Elapsed
------------ ------------------- ------------------- ----------
29450        2022-04-22T14:29:37 2022-04-22T14:29:39   00:00:02
29450.batch  2022-04-22T14:29:37 2022-04-22T14:29:39   00:00:02
29450.extern 2022-04-22T14:29:37 2022-04-22T14:29:39   00:00:02
29450.0      2022-04-22T14:29:37 2022-04-22T14:29:37   00:00:00
29450.1      2022-04-22T14:29:37 2022-04-22T14:29:39   00:00:02

But with DeferBatch, the batch script takes between 20 and 120s to start executing:

$ scontrol show config | grep PrologFlags
PrologFlags             = Alloc,Contain,DeferBatch,Serial
$ sbatch -N8 --wrap="srun true; srun sleep 2s;"
Submitted batch job 29453
$ sacct -j 29453 --format jobid,start,end,elapsed
       JobID               Start                 End    Elapsed
------------ ------------------- ------------------- ----------
29453        2022-04-22T14:33:10 2022-04-22T14:34:57   00:01:47
29453.batch  2022-04-22T14:33:10 2022-04-22T14:34:57   00:01:47
29453.extern 2022-04-22T14:33:10 2022-04-22T14:34:57   00:01:47
29453.0      2022-04-22T14:34:55 2022-04-22T14:34:55   00:00:00
29453.1      2022-04-22T14:34:55 2022-04-22T14:34:57   00:00:02

On the slurmctld side, I see that all 8 nodes sent REQUEST_COMPLETE_PROLOG at 14:33:10, but REQUEST_BATCH_JOB_LAUNCH happened at 14:34:55. See the partial slurmctld log attached.
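For anyone reproducing this, the relevant RPCs can be located in the slurmctld log with something along these lines (assuming the controller's debug level is high enough for RPC names to be printed; the log path is just an example, it depends on SlurmctldLogFile):

$ grep -E 'REQUEST_COMPLETE_PROLOG|REQUEST_BATCH_JOB_LAUNCH' /var/log/slurm/slurmctld.log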
Created attachment 24643 [details]
patch proposal v13

Hi

Thanks for catching this. I can recreate this issue; it can occur on synthetic/empty clusters. This patch forces the agent to check deferred RPCs after the prolog is complete, which should prevent this delay.

Dominik
There might still be a problem with this case. I reused the setup described in the original bug description, with an artificial delay for node-1 in a multiple-slurmd setup (did not try on a real cluster), and reused the run.sub from comment #1.

I see that the job steps have no large gap between them, so this part is solved:

$ sacct -j 29496 --format jobid,start,end,elapsed
       JobID               Start                 End    Elapsed
------------ ------------------- ------------------- ----------
29496        2022-04-25T10:51:13 2022-04-25T10:51:24   00:00:11
29496.batch  2022-04-25T10:51:13 2022-04-25T10:51:24   00:00:11
29496.extern 2022-04-25T10:51:13 2022-04-25T10:51:24   00:00:11
29496.0      2022-04-25T10:51:13 2022-04-25T10:52:26   00:01:13
29496.1      2022-04-25T10:51:24 2022-04-25T10:51:24   00:00:00

But the 1 minute delay seems to have moved into the node cleanup: node-1 now stays for 1+ minute in this state:

$ squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
  29496    normal  run.sub fabecass CG  0:11     1 node-1

So while the job itself executed the job steps faster, I'm concerned that some nodes might not go back to idle as fast as they should.
Created attachment 24652 [details]
slurmd.node-1.log

Attached is the debug log for node-1 (the node getting the artificial delay).
Hi

In this case, the reverse tree fallback mechanism works properly. This extra timeout occurs due to a change in srun and cannot be caused by the prolog.

Dominik
Okay, sounds good. You're right that with DeferBatch this issue is less likely to happen now; it would require a significant network problem to cause such a propagation delay. If we see problems like that in production, we will open a new bug then.
Hi

Both commits are in master and will be included in 22.05:
https://github.com/SchedMD/slurm/commit/b8231a3e73
https://github.com/SchedMD/slurm/commit/30bf69732e

I'm gonna go ahead and close the bug. Please reopen if there are further questions or concerns.

Dominik