Created attachment 29070 [details] slurmctld.log

Hi,

Since our last upgrade two weeks ago, we have been running a prolog/epilog stage. Today we noticed that a few jobs were stuck in the COMPLETING state during the prolog stage, even after these log messages:

"
slurmctld.log:2023-02-27T06:50:29.569176+01:00 maestro-sched slurmctld[265177]: slurmctld: Batch JobId=11794130_30(11794246) missing from batch node maestro-1032 (not found BatchStartTime after startup), Requeuing job
slurmctld.log:2023-02-27T06:50:29.569203+01:00 maestro-sched slurmctld[265177]: slurmctld: _job_complete: JobId=11794130_30(11794246) WTERMSIG 126
slurmctld.log:2023-02-27T06:50:29.569229+01:00 maestro-sched slurmctld[265177]: slurmctld: _job_complete: JobId=11794130_30(11794246) cancelled by node failure
slurmctld.log:2023-02-27T06:50:29.569295+01:00 maestro-sched slurmctld[265177]: slurmctld: _job_complete: requeue JobId=11794130_30(11794246) due to node failure
"

Example of stuck jobs:

"
[braffest@maestro-1032 ~]$ squeue -w maestro-1032
       JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 11794130_30    common gregret_ albarbie CG       0:00      1 maestro-1032
 11794130_29    common gregret_ albarbie CG       0:00      1 maestro-1032
 11794130_23    common gregret_ albarbie CG       0:00      1 maestro-1032
 11794130_22    common gregret_ albarbie CG       0:00      1 maestro-1032
 11794130_21    common gregret_ albarbie CG       0:00      1 maestro-1032
 11794130_10    common gregret_ albarbie CG       0:00      1 maestro-1032
  11794130_1    common gregret_ albarbie  R   10:49:16      1 maestro-1032

[braffest@maestro-1032 slurm]$ scontrol show jobs 11794130_30
JobId=11794246 ArrayJobId=11794130 ArrayTaskId=30 JobName=XXXXX
   UserId=XXXXX(11111) GroupId=YYYYY(22222) MCS_label=N/A
   Priority=1128 Nice=0 Account=dbc QOS=normal
   JobState=COMPLETING Reason=Prolog Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=2 Reboot=0 ExitCode=0:1
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2023-02-27T01:18:48 EligibleTime=2023-02-27T01:18:49
   AccrueTime=2023-02-27T01:18:49
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-02-27T01:18:49 Scheduler=Main
   Partition=common AllocNode:Sid=maestro-sched:1659160
"

We also see a few zombie prolog processes on three nodes (maestro-1032, maestro-1042 and maestro-2013), but I only see stuck jobs on maestro-1032. Our prolog/epilog script exists to account for our GPUs, and the node involved in this issue has no GPU, so the prolog should be very quick. Also, some jobs from the same array ran fine on the same node (11794130_1).

Bug 16126 seems to be the same issue (same slurmctld log / slurmd backtrace), but for the moment the stuck jobs have been requeued (maybe because of an action on my part) after more than 15 hours. I also saved the core file just in case.

I am attaching slurm.conf, slurmctld.log, the slurmd log from maestro-1032 and our prolog script. If you need more information, feel free to ask.

Thanks for your support.

Best,
Brice
Created attachment 29071 [details] slurm.conf
Created attachment 29072 [details] prolog script
Created attachment 29073 [details] slurmd log of maestro-1032
Hi, Thanks for your research and the logs and information you posted. I believe this is a duplicate of bug 16126. I posted a comment on that bug. Feel free to CC yourself on that ticket. I'll keep this ticket open until we are sure that it is a duplicate and you have a resolution.
Are you able to reliably reproduce this issue? I uploaded a patch to bug 16126 - would you be interested in trying it on one slurmd?
Hi,

I am not able to reproduce the issue in our dev environment. The issue seems to occur randomly in our production environment, so to confirm that the patch works I would need to patch the whole production environment, and at the moment I am reluctant to patch the entire cluster. I will keep watching the other ticket and, depending on how it turns out, decide whether or not to deploy the patch.

Thanks for your understanding.

Brice
Hi, I have not yet gotten feedback from the site in bug 16126, but I wanted to get some more information from you. Are you using any spank plugins?
Hi,

We don't use any spank plugins.

For your information, over the last three days we have had some jobs blocked as in the initial issue. This time they do not seem to be requeued.

Brice
Can you upload a backtrace of the stuck slurmd process?
Created attachment 30602 [details] Backtrace_slurmd

I don't know if it can help you, but all of these jobs are in the completed state:

25785880
25785870
25785865
25785864
25785863
25785862
25785861
25785859
25785858
25785857
25785855

Brice
Are there any slurmd processes that are a child of the main slurmd process? Can you get a backtrace of one of those, too?
That does not seem to be the case; the main slurmd process is 12800:

pstree -p 12800 -l
slurmd(12800)─┬─gpu_accounting_(722986)
              [...]
              ├─gpu_accounting_(723869)
              ├─{slurmd}(722105)
              [...]
              └─{slurmd}(847476)

Or am I missing something?

Brice
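For completeness, a small sketch of how defunct children of slurmd could be listed from `ps` output (the `find_defunct` helper name and the PID 12800 are just for illustration, taken from the pstree above):

```shell
# Hypothetical helper: reads "PID STAT COMM" lines on stdin and prints
# any entry whose process state starts with "Z" (zombie/defunct).
find_defunct() {
    awk '$2 ~ /^Z/ {print "defunct:", $1, $3}'
}

# On the compute node this could be fed from ps, e.g.:
#   ps --ppid 12800 -o pid=,stat=,comm= | find_defunct
```

This only inspects process state; it does not change anything on the node.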
I agree, it looks like those are just threads of the main slurmd process.

Are there any defunct processes on the compute node?

Would you be interested in setting PrologEpilogTimeout in slurm.conf? This should kill the prolog/epilog process when that timeout is reached, to prevent nodes from being stuck indefinitely. I would also like to see whether that actually cleans up all of the prolog/epilog processes.
All the "gpu_accounting_" processes are defunct (zombie state).

We have already set a value for this parameter:

>PrologEpilogTimeout = 65534

Brice
(In reply to Institut Pasteur HPC Admin from comment #14)
> All the "gpu_accounting_" processes are defunct and zombie state.

Considering this, would you be interested in trying a patch? This patch fixes a case where slurmd waits for already-defunct processes until they time out and are killed. With PrologEpilogTimeout set to 65534, it takes so long (about 18 hours) for them to time out that the setting might as well not exist. This 18-hour delay is similar to the >15 hours that you mentioned in comment 0. The patch makes slurmd immediately reap defunct prolog/epilog processes: attachment 29142 [details]

> We already set a value to this parameter :
>
> >PrologEpilogTimeout = 65534
>
> Brice

I meant: would you be interested in setting PrologEpilogTimeout such that stuck prologs/epilogs actually time out in a reasonable amount of time? Or, if you would rather try the above patch, apply it to all of the slurmd's.
Hi,

This time the completing state seems to never end.

sacct -j 25785880 --format=submit,start,end,state
             Submit               Start                 End      State
------------------- ------------------- ------------------- ----------
2023-05-30T02:28:19 2023-05-30T02:28:23 2023-05-31T10:49:24  CANCELLED
2023-05-30T02:28:23 2023-05-30T02:28:23 2023-06-03T11:16:49  CANCELLED
2023-05-30T02:28:23 2023-05-30T02:28:23 2023-06-03T11:16:49  CANCELLED

I had not checked this before: I looked at 'scontrol show config' and not at slurm.conf, but the value of PrologEpilogTimeout is actually the default one. The documentation says:

>The default behavior is to wait indefinitely

I will try setting a small value for PrologEpilogTimeout, such as 10 minutes, and check the new behavior before considering the patch.

Is a "scontrol reconfigure" enough to apply the new value properly?
(In reply to Institut Pasteur HPC Admin from comment #16)
> Hi,
>
> This time the completing state seems to never end.
>
> sacct -j 25785880 --format=submit,start,end,state
>              Submit               Start                 End      State
> ------------------- ------------------- ------------------- ----------
> 2023-05-30T02:28:19 2023-05-30T02:28:23 2023-05-31T10:49:24  CANCELLED
> 2023-05-30T02:28:23 2023-05-30T02:28:23 2023-06-03T11:16:49  CANCELLED
> 2023-05-30T02:28:23 2023-05-30T02:28:23 2023-06-03T11:16:49  CANCELLED
>
> I don't check it before, I used 'scontrol show config' and don't check in
> the slurm.conf, but the value of "PrologEpilogTimeout" is the default one.
> In the documentation, it says :
> >The default behavior is to wait indefinitely

My mistake: I misremembered the default value (65534). I thought it was 65535 or zero.

> I will just try to set a little value of PrologEpilogTimeout like 10
> minutes, and check what will be the new behavior before thinking about a
> patch.
>
> Do I only need to do a "scontrol reconfigure" to change the value properly?

scontrol reconfigure is sufficient.
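For reference, a minimal sketch of what that change could look like in slurm.conf (the 600-second value is just an example for the 10-minute test mentioned above, not a recommendation):

```
# slurm.conf excerpt (sketch): kill prolog/epilog scripts after 10 minutes
# instead of waiting for the default (indefinite) timeout.
PrologEpilogTimeout=600
```

followed by running `scontrol reconfigure` on the controller to apply it.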
We pushed commit 5a3f79271b upstream - it will be part of 23.02.3. It fixes waiting indefinitely for defunct script processes. This is the same patch as attachment 29142 [details], except it is written for Slurm 23.02. I think this will solve the defunct script process issue that you are seeing.

You can either apply this patch on 22.05 (attachment 29142 [details]) or upgrade to 23.02.3 after it is released. Are you interested in testing the patch on 22.05? If not, I can resolve this bug as fixed, assuming that the commit does fix the issue. You can always reopen the bug if it is not solved after upgrading.
Hi,

We will wait for the next Slurm upgrade to test the fix.

Thanks for your support,
Brice
We had to revert this change in commit 790b4a4738 ahead of 23.02.5. See bug 17481 for details. In short, Slurm has to wait for all processes in the script's process group to complete. This places an increased importance on setting timeout values for scripts, such as PrologEpilogTimeout (which by default has an unlimited timeout).
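To illustrate why the process group matters, here is a small demonstration sketch (the /tmp paths and the "fake_prolog" name are made up for this example): a script that backgrounds a child and exits leaves that child alive in the script's process group, which is the kind of process Slurm must now wait for.

```shell
# Write a fake prolog that backgrounds a long-running child and exits at once.
cat > /tmp/fake_prolog.sh <<'EOF'
#!/bin/sh
sleep 30 &                            # stray child in the script's process group
echo $! > /tmp/fake_prolog_child.pid  # record the child's PID for inspection
exit 0                                # the script itself finishes immediately
EOF
chmod +x /tmp/fake_prolog.sh

# Run it in its own session/process group, roughly as slurmd runs scripts.
setsid /tmp/fake_prolog.sh
sleep 1

# The script has exited, but its child is still alive in that group:
CHILD=$(cat /tmp/fake_prolog_child.pid)
kill -0 "$CHILD" && echo "child $CHILD is still running"
kill "$CHILD" 2>/dev/null || true     # clean up the stray child
```

With only a PID-based wait, the script appears finished as soon as it exits; waiting on the whole process group is what catches the stray `sleep`, and PrologEpilogTimeout is the safety net when such a child never exits.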
Hi,

I understand the reason for reverting the commit. However, I had not previously noticed that the initial state (before requeue) of the job was "NODE_FAILURE", and I am wondering whether you are still considering a fix for my initial ticket. I know that setting a small value for the PrologEpilogTimeout parameter will speed up the requeue, but it is still not clean for a job to end up in a "NODE_FAILURE" state. Do you recommend we do anything to avoid the race condition?

Thanks for your support,
Brice
(In reply to Institut Pasteur HPC Admin from comment #25)
> But I didn't notice the initial state(before requeue) of the job was
> "NODE_FAILURE", I am wondering if you are still thinking about a fix for my
> initial ticket?

I don't see anything about NODE_FAILURE in this ticket except for this comment. What ticket are you referring to? If the issue is separate from the "prolog stuck in completing state" issue in comment 0, please open a new ticket.