Summary: | Discarded REQUEST_STEP_COMPLETE events leading to stalled job steps
---|---
Product: | Slurm
Reporter: | Felix Abecassis <fabecassis>
Component: | slurmstepd
Assignee: | Dominik Bartkiewicz <bart>
Status: | RESOLVED FIXED
Severity: | 3 - Medium Impact
Priority: | ---
CC: | albert.gil, jbernauer, lyeager
Version: | 20.11.6
Hardware: | Linux
OS: | Linux
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=12801
Site: | NVIDIA (PSLA)
Version Fixed: | 22.05.0pre1
Attachments: | patch proposal, patch proposal, slurmctld log for one job, patch proposal v13, slurmd.node-1.log
Description
Felix Abecassis
2021-05-17 13:01:24 MDT
Note that blocking later job steps is a new behavior in 20.11. The error message can show up in 20.02 as well, but there the next job step executed immediately.

Using this batch script:

#!/bin/bash -eux
#SBATCH -t 5 --nodes=8 --exclusive
srun --mpi=none --ntasks="$((SLURM_JOB_NUM_NODES * 8))" --ntasks-per-node=8 -l true
srun --mpi=none --ntasks="$((SLURM_JOB_NUM_NODES * 8))" --ntasks-per-node=8 -l echo "done"

On Slurm 20.11:

$ sacct -j 4 --format jobid,start,end,elapsed
JobID               Start                 End    Elapsed
------------ ------------------- ------------------- ----------
4            2021-05-17T13:31:09 2021-05-17T13:32:23   00:01:14
4.batch      2021-05-17T13:31:09 2021-05-17T13:32:23   00:01:14
4.0          2021-05-17T13:31:09 2021-05-17T13:32:22   00:01:13
4.1          2021-05-17T13:32:22 2021-05-17T13:32:23   00:00:01

The job took 01:14 to execute, as job step 4.1 was blocked on the termination of 4.0. Per bug#11446, it seems that 4.0 now holds on to the node resources for a long duration because of this bug.

On Slurm 20.02:

$ sacct -j 5 --format jobid,start,end,elapsed
JobID               Start                 End    Elapsed
------------ ------------------- ------------------- ----------
5            2021-05-17T13:34:51 2021-05-17T13:35:02   00:00:11
5.batch      2021-05-17T13:34:51 2021-05-17T13:35:02   00:00:11
5.0          2021-05-17T13:34:51 2021-05-17T13:36:03   00:01:12
5.1          2021-05-17T13:35:01 2021-05-17T13:35:02   00:00:01

The whole job took 00:11 to execute (because of the artificial 10s delay). Job step 0 had the same duration, but it didn't prevent job step 1 from starting.

Hi
I can recreate this issue. I don't know if we should treat this as a bug. You can modify the reverse tree timeouts (lower REVERSE_TREE_CHILDREN_TIMEOUT or increase REVERSE_TREE_PARENT_RETRY).
Dominik

> I don't know if we should treat this as a bug.
If this is the decision, please explain the rationale.
Thanks,
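For context on Dominik's suggestion: REVERSE_TREE_CHILDREN_TIMEOUT and REVERSE_TREE_PARENT_RETRY are compile-time constants in the slurmstepd source, not slurm.conf parameters, so adjusting them means patching and rebuilding Slurm. A minimal sketch of how one might locate them in a source checkout before editing; the exact file and default values vary by release and are not confirmed here:

$ # from the top of a Slurm source tree
$ grep -rn -e REVERSE_TREE_CHILDREN_TIMEOUT -e REVERSE_TREE_PARENT_RETRY src/
$ # edit the #define values reported above, then rebuild and redeploy slurmd/slurmstepd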
Hi
It is not a decision; it is my understanding of this issue. This sixty-second timeout exists to handle the situation where a stepd ends before its parent stepd in the reverse tree has started. This mechanism works fine. We can discuss whether the current values of those timeouts are optimal.
A bigger concern is why such a situation is occurring when you submit the job on your cluster. This is only speculation, but I suspect that this can be related to PrologFlags=Serial.
Could you send me slurmd.log with an affected job from the production cluster?
Dominik

> This mechanism works fine. We can discuss whether the current values of those timeouts are optimal.
Got it. It was indeed harmless until Slurm 20.11. But now that the default behavior has changed (no overlap of job steps), the job steps are delayed, and this is a problem as it is not an efficient use of resources. So it was not a bug in 20.02, but in 20.11 it is a bug.
If you can provide us with a single reliable knob (not through user environment variables) to disable the new 20.11 default and always allow overlapping job steps, then the issue will still exist but will go back to being harmless, like in 20.02.

> A bigger concern is why such a situation is occurring when you submit the job on your cluster.
This is fair, we can focus on that for now.

> This is only speculation, but I suspect that this can be related to PrologFlags=Serial.
How so? "Serial" applies to the prolog and epilog; are you saying it can have an impact on the execution between the prolog and the job?

> Could you send me slurmd.log with an affected job from the production cluster?
Here is an extract of the log for node-0507, which is one of the deepest nodes in the job tree:
$ grep -A2 '1407764' node-0507.log
18:55:11 slurm-prolog: START user=fabecassis job=1407764
18:55:11 slurm-prolog: /etc/slurm/prolog.d/...
--
18:55:56 slurmd[25055]: launch task StepId=1407764.0 request from UID:10009 GID:10001 HOST:10.248.0.2 PORT:47232
18:55:56 slurmd[25055]: task/affinity: lllp_distribution: JobId=1407764 manual binding: none
18:55:56 slurmd[25055]: error: stepd_connect to StepId=1407764.0 failed: No such file or directory
18:55:57 slurmd[25055]: message repeated 5 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
18:55:57 slurm-prolog: NHC end
18:55:57 slurm-prolog: END user=fabecassis job=1407764
18:55:59 slurmstepd[3655936]: done with job
18:56:16 slurmd[25055]: launch task StepId=1407764.1 request from UID:10009 GID:10001 HOST:10.248.0.2 PORT:47362
18:56:16 slurmd[25055]: task/affinity: lllp_distribution: JobId=1407764 manual binding: none
18:56:17 slurmstepd[3655976]: done with job
18:56:18 slurm-epilog: START user=fabecassis job=1407764
18:56:19 slurm-epilog: /etc/slurm/epilog.d/30-tuned-profile.sh:
18:56:19 slurm-epilog: Restoring default settings
--
18:56:57 slurm-epilog: END user=fabecassis job=1407764
The step 0 launch request is concurrent with the execution of the prolog and with the stepd_connect failures.
And below is the log for node-0002, which should be close to the root of the tree if I understand correctly:
$ grep -A2 '1407764' node-0002.log.1
18:55:11 slurm-prolog: START user=fabecassis job=1407764
18:55:11 slurm-prolog: /etc/slurm/prolog.d/10-fans.sh:
18:55:12 slurm-prolog: /etc/slurm/prolog.d/...
--
18:55:56 slurmd[25976]: launch task StepId=1407764.0 request from UID:10009 GID:10001 HOST:10.248.0.2 PORT:60674
18:55:56 slurmd[25976]: task/affinity: lllp_distribution: JobId=1407764 manual binding: none
18:55:56 slurm-prolog: NHC end
18:55:56 slurm-prolog: END user=fabecassis job=1407764
18:56:16 slurmd[25976]: launch task StepId=1407764.1 request from UID:10009 GID:10001 HOST:10.248.0.2 PORT:60804
18:56:16 slurmd[25976]: task/affinity: lllp_distribution: JobId=1407764 manual binding: none
18:56:18 slurmstepd[3361654]: done with job
18:57:06 slurmstepd[3361628]: done with job
18:57:07 slurmd[25976]: error: stepd_connect to StepId=1407764.0 failed: No such file or directory
18:57:09 slurmd[25976]: message repeated 5 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
18:57:09 slurm-epilog: START user=fabecassis job=1407764
18:57:09 slurm-epilog: /etc/slurm/epilog.d/...
--
18:57:10 slurmd[25976]: error: stepd_connect to StepId=1407764.0 failed: No such file or directory
18:57:10 slurmd[25976]: message repeated 4 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
18:57:11 slurmd[25976]: error: stepd_connect to StepId=1407764.0 failed: No such file or directory
18:57:17 slurmd[25976]: message repeated 36 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
18:57:18 slurmd[25976]: error: stepd_connect to StepId=1407764.0 failed: No such file or directory
18:57:18 slurmd[25976]: message repeated 5 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
[... more stepd_connect failures ...]
18:57:34 slurmd[25976]: message repeated 5 times: [ error: stepd_connect to StepId=1407764.0 failed: No such file or directory]
18:57:38 slurm-epilog: NHC end
18:57:38 slurm-epilog: END user=fabecassis job=1407764
It looks like it could be the reverse problem on this node: REVERSE_TREE_CHILDREN_TIMEOUT is *too low*, so node-0002 has already given up on this job but continues receiving events from its children?
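A side note on the overlap question raised earlier: starting with 20.11, srun has an --overlap option that lets a step share resources with other steps instead of waiting for them. It is a per-step flag rather than the single cluster-wide knob requested above, and it was not verified here whether it fully avoids the wait in this exact scenario; a sketch against the reproducer script from the description:

#!/bin/bash -eux
#SBATCH -t 5 --nodes=8 --exclusive
# --overlap (srun, 20.11+) asks each step to overlap with other steps,
# so step 1 should not be blocked waiting for step 0's stepd to finish.
srun --overlap --mpi=none --ntasks="$((SLURM_JOB_NUM_NODES * 8))" --ntasks-per-node=8 -l true
srun --overlap --mpi=none --ntasks="$((SLURM_JOB_NUM_NODES * 8))" --ntasks-per-node=8 -l echo "done"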
Hi
Could you send me the full slurmd.log from nodes node-0002 and node-0507? How was this job submitted, by sbatch or by salloc?
Dominik

The job was submitted through "sbatch". Unfortunately I cannot share the full slurmd.log. Please let me know if you are looking for something in particular in the log and then I can provide it.

Hi
Could you check if, after adding sbatch_wait_nodes to SchedulerParameters or using the sbatch "--wait-all-nodes" option, this problem still exists?
Dominik

Tested on the cluster, it doesn't seem to change the situation. With --wait-all-nodes=1 and *without* --overcommit, I still see:

srun: Job 1441883 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 1441883

$ sacct -j 1441883 --format jobid,start,end
JobID               Start                 End
------------ ------------------- -------------------
1441883      2021-06-04T10:54:36 2021-06-04T10:56:32
1441883.bat+ 2021-06-04T10:54:36 2021-06-04T10:56:32
1441883.0    2021-06-04T10:55:17 2021-06-04T10:56:32
1441883.1    2021-06-04T10:56:32 2021-06-04T10:56:32

Hi
To be sure: this was not run on the cluster with the reproducer patch from the initial report?
Dominik

No, the patch with the call to sleep() was only tested on a local machine with multiple slurmd. My latest test was on a production cluster, using sbatch --wait-all-nodes=1.

Hi
We added a new DeferBatch flag to master. It will force the batch step to start only after the prolog has finished on all nodes.
https://github.com/SchedMD/slurm/compare/16cf1d3e76..0c35e45d35
Let me know if we can close this ticket or if you have any related questions.
Dominik

Hi Dominik,
This patch seems to have the same bug as the patch provided by Michael in https://bugs.schedmd.com/show_bug.cgi?id=12801, so I have reopened https://bugs.schedmd.com/show_bug.cgi?id=13187, as it is now happening on the master branch.
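For reference, the DeferBatch flag mentioned above is enabled through PrologFlags in slurm.conf (the same setting shows up in the scontrol output later in this ticket). A minimal sketch, assuming a master/22.05-level controller and slurmds, with the other flags taken from this site's existing configuration:

# slurm.conf -- changing PrologFlags generally requires restarting the daemons.
# DeferBatch delays launching the batch step until the prolog has completed
# on all nodes of the allocation.
PrologFlags=Alloc,Contain,DeferBatch,Serial

$ scontrol show config | grep PrologFlags
PrologFlags             = Alloc,Contain,DeferBatch,Serial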
Hi
I can reproduce this issue. I will close bug 13187 as a duplicate of this ticket. This bug is not directly caused by these patches, but they make this race easier to hit. I will let you know when I have a patch for this.
Dominik

*** Ticket 13187 has been marked as a duplicate of this ticket. ***
*** Ticket 12801 has been marked as a duplicate of this ticket. ***

Created attachment 24546 [details]
patch proposal
Hi
This patch should fix this race.
Could you check if it works as we expect and correctly covers all your use cases?
Dominik
Hi Dominik,

The patch does not apply on the current master:

$ git am -3 patch
Applying: Allow other steps to start before the batch step
Using index info to reconstruct a base tree...
M	src/slurmd/slurmd/req.c
Falling back to patching base and 3-way merge...
Auto-merging src/slurmd/slurmd/req.c
CONFLICT (content): Merge conflict in src/slurmd/slurmd/req.c
error: Failed to merge in the changes.
Patch failed at 0001 Allow other steps to start before the batch step
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Created attachment 24608 [details]
patch proposal
Hi
Generated on a fresh master; I hope it will apply cleanly.
Dominik
It didn't work:

$ curl -s https://bugs.schedmd.com/attachment.cgi?id=24608 | git am -3
Applying: Allow other steps to start before the batch step
Using index info to reconstruct a base tree...
M	src/slurmd/slurmd/req.c
Falling back to patching base and 3-way merge...
Auto-merging src/slurmd/slurmd/req.c
CONFLICT (content): Merge conflict in src/slurmd/slurmd/req.c
error: Failed to merge in the changes.
Patch failed at 0001 Allow other steps to start before the batch step
hint: Use 'git am --show-current-patch=diff' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

I'm doing a manual merge.

Created attachment 24623 [details]
slurmctld log for one job
I'm facing issues with the DeferBatch flag on master (without the patch attached to this bug). I'm running 8 slurmd on my local workstation and I disabled prologs and epilogs.

Without DeferBatch, a simple job starts and finishes immediately:

$ scontrol show config | grep PrologFlags
PrologFlags             = Alloc,Contain,Serial
$ sbatch -N8 --wrap="srun true; srun sleep 2s;"
Submitted batch job 29450
$ sacct -j 29450 --format jobid,start,end,elapsed
JobID               Start                 End    Elapsed
------------ ------------------- ------------------- ----------
29450        2022-04-22T14:29:37 2022-04-22T14:29:39   00:00:02
29450.batch  2022-04-22T14:29:37 2022-04-22T14:29:39   00:00:02
29450.extern 2022-04-22T14:29:37 2022-04-22T14:29:39   00:00:02
29450.0      2022-04-22T14:29:37 2022-04-22T14:29:37   00:00:00
29450.1      2022-04-22T14:29:37 2022-04-22T14:29:39   00:00:02

But with DeferBatch, the batch script takes between 20 and 120s to start executing:

$ scontrol show config | grep PrologFlags
PrologFlags             = Alloc,Contain,DeferBatch,Serial
$ sbatch -N8 --wrap="srun true; srun sleep 2s;"
Submitted batch job 29453
$ sacct -j 29453 --format jobid,start,end,elapsed
JobID               Start                 End    Elapsed
------------ ------------------- ------------------- ----------
29453        2022-04-22T14:33:10 2022-04-22T14:34:57   00:01:47
29453.batch  2022-04-22T14:33:10 2022-04-22T14:34:57   00:01:47
29453.extern 2022-04-22T14:33:10 2022-04-22T14:34:57   00:01:47
29453.0      2022-04-22T14:34:55 2022-04-22T14:34:55   00:00:00
29453.1      2022-04-22T14:34:55 2022-04-22T14:34:57   00:00:02

On the slurmctld side, I see that all 8 nodes sent REQUEST_COMPLETE_PROLOG at 14:33:10 but REQUEST_BATCH_JOB_LAUNCH happened at 14:34:55. See partial slurmctld log attached.

Created attachment 24643 [details]
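The timestamps above were presumably pulled out of the controller log by RPC name. A hedged sketch of that kind of check, assuming the slurmctld log level is high enough to record RPC processing (exact message formats, and whether the job id appears on the same line, vary by version and debug flags):

$ # compare when the last REQUEST_COMPLETE_PROLOG arrived vs. when the batch launch went out
$ grep -E 'REQUEST_COMPLETE_PROLOG|REQUEST_BATCH_JOB_LAUNCH' slurmctld.log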
patch proposal v13
Hi
Thanks for catching this. I can recreate this issue; it can occur on synthetic/empty clusters.
This patch forces the agent to check deferred RPCs after the prolog is complete, which should prevent this delay.
Dominik
There might still be a problem with this case. I reused the setup described in the original bug description, with an artificial delay for node-1 in a multiple-slurmd setup (I did not try on a real cluster), and reused the run.sub from comment #1.

I see that the job steps have no large gap between them, so this part is solved:

$ sacct -j 29496 --format jobid,start,end,elapsed
JobID               Start                 End    Elapsed
------------ ------------------- ------------------- ----------
29496        2022-04-25T10:51:13 2022-04-25T10:51:24   00:00:11
29496.batch  2022-04-25T10:51:13 2022-04-25T10:51:24   00:00:11
29496.extern 2022-04-25T10:51:13 2022-04-25T10:51:24   00:00:11
29496.0      2022-04-25T10:51:13 2022-04-25T10:52:26   00:01:13
29496.1      2022-04-25T10:51:24 2022-04-25T10:51:24   00:00:00

But the 1-minute delay seems to have moved into node cleanup: node-1 now stays for 1+ minute in this state:

$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  29496    normal  run.sub fabecass CG       0:11      1 node-1

So while the job itself executed the job steps faster, I'm concerned that some nodes might not go back to idle as fast as they should.

Created attachment 24652 [details]
slurmd.node-1.log
Attached is the debug log for node-1 (the node getting the artificial delay).
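For completeness, one way to simulate a slow prolog on a single node in a multiple-slurmd test setup is a per-node delay in a prolog.d script, following this site's prolog.d convention. This is only an illustrative sketch (node name, path, and delay are assumptions), not necessarily the exact reproducer used for this log:

#!/bin/bash
# e.g. /etc/slurm/prolog.d/99-artificial-delay.sh (hypothetical path)
# Sleep only on the chosen node to widen the race between the prolog
# and the first step launch request on that node.
if [ "${SLURMD_NODENAME:-$(hostname -s)}" = "node-1" ]; then
    sleep 60
fi
exit 0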
Hi
In this case, the reverse tree fallback mechanism works properly. This extra timeout occurs due to a change in srun and can't be caused by the prolog.
Dominik

Okay, sounds good. You're right that with DeferBatch this issue is less likely to happen now; it would require a significant network problem to cause such a propagation delay. If we see problems like that in production, we will open a new bug then.

Hi
Both commits are in master and will be included in 22.05.
https://github.com/SchedMD/slurm/commit/b8231a3e73
https://github.com/SchedMD/slurm/commit/30bf69732e
I'm gonna go ahead and close the bug. Please reopen if there are further questions or concerns.
Dominik