| Summary: | Prolog stuck in completing state | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Institut Pasteur HPC Admin <hpc> |
| Component: | slurmd | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 22.05.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=16126 https://bugs.schedmd.com/show_bug.cgi?id=17481 | | |
| Site: | Institut Pasteur | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 23.02.3 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | slurmctld.log, slurm.conf, prolog script, slurmd log of maestro-1032, Backtrace_slurmd | | |
Description
Institut Pasteur HPC Admin
2023-02-27 15:43:36 MST
Created attachment 29071 [details]
slurm.conf
Created attachment 29072 [details]
prolog script
Created attachment 29073 [details]
slurmd log of maestro-1032
Hi,

Thanks for your research and the logs and information you posted. I believe this is a duplicate of bug 16126; I posted a comment on that bug. Feel free to CC yourself on that ticket. I'll keep this ticket open until we are sure that it is a duplicate and you have a resolution.

Are you able to reliably reproduce this issue? I uploaded a patch to bug 16126 - would you be interested in trying it on one slurmd?

Hi,

I am not able to reproduce the issue in our dev environment. The issue seems to occur randomly in our production environment, so to see whether the patch is successful I would need to patch the entire production environment, and at the moment I am reluctant to patch the whole cluster. I will keep watching the other ticket; depending on how it turns out, I will decide whether or not to apply the patch.

Thanks for your understanding,
Brice

Hi,

I have not yet gotten feedback from the site in bug 16126, but I wanted to get some more information from you. Are you using any spank plugins?

Hi,

We don't use any spank plugins. For your information, for the past 3 days we have had some jobs blocked as in the initial issue. This time, it seems they are not requeued.

Brice

Can you upload a backtrace of the stuck slurmd process?

Created attachment 30602 [details]
Backtrace_slurmd
I don't know if it can help you, but all of those jobs are stuck in the completing state:
25785880
25785870
25785865
25785864
25785863
25785862
25785861
25785859
25785858
25785857
25785855
Brice
Are there any slurmd processes that are a child of the main slurmd process? Can you get a backtrace of one of those, too?

That does not seem to be the case; the main slurmd process is 12800:
pstree -p 12800 -l
slurmd(12800)─┬─gpu_accounting_(722986)
[...]
├─gpu_accounting_(723869)
├─{slurmd}(722105)
[...]
└─{slurmd}(847476)
Or am I missing something?
Brice
I agree, it looks like those are just threads of the main slurmd process. Are there any defunct processes on the compute node?

Would you be interested in setting PrologEpilogTimeout in slurm.conf? This should kill the prolog/epilog process when the timeout is reached, to prevent nodes from being stuck indefinitely. I would also like to see whether that actually cleans up all of the prolog/epilog processes.

All the "gpu_accounting_" processes are defunct, in zombie state.
We already set a value for this parameter:
>PrologEpilogTimeout = 65534
Brice
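As an aside, the "defunct" state described above is easy to reproduce outside of Slurm. The sketch below (illustrative only, not Slurm source) forks a child that exits immediately; until the parent calls waitpid(), the child lingers as a zombie, and a non-blocking waitpid(WNOHANG) reaps it at once — the behavior the patch discussed here gives slurmd for prolog/epilog children. It assumes a Linux /proc filesystem.

```python
import os
import time

# Illustrative sketch, not Slurm code: how a prolog/epilog child ends up
# "defunct", and how a non-blocking waitpid() reaps it immediately.
pid = os.fork()
if pid == 0:
    os._exit(0)          # child exits right away

time.sleep(0.2)          # parent has not wait()ed yet -> child is a zombie

with open(f"/proc/{pid}/stat") as f:
    # the state letter is the first field after the parenthesized comm
    state = f.read().rsplit(")", 1)[1].split()[0]
print("state before reaping:", state)       # 'Z' means defunct on Linux

reaped, _status = os.waitpid(pid, os.WNOHANG)   # returns without blocking
print("reaped:", reaped == pid)
```

Once every child has been reaped this way, no "defunct" entries remain in ps/pstree output.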
(In reply to Institut Pasteur HPC Admin from comment #14)
> All the "gpu_accounting_" processes are defunct and zombie state.

Considering this, would you be interested in trying a patch? This patch fixes a case where slurmd waits for already-defunct processes until they time out and are killed. Considering that you have PrologEpilogTimeout set to 65534, it will be such a long time (about 18 hours) until they time out that it might as well not be set at all. This 18-hour time is similar to the >15 hours that you mentioned in comment 0. This patch makes slurmd immediately reap defunct prolog/epilog processes: attachment 29142 [details]

> We already set a value to this parameter :
>
> >PrologEpilogTimeout = 65534
>
> Brice

I meant, would you be interested in setting PrologEpilogTimeout such that stuck prologs/epilogs will actually time out in a reasonable amount of time? Or, if you would rather try the above patch, then apply that patch to all of the slurmd's.

Hi,
This time the completing state seems to never end.
sacct -j 25785880 --format=submit,start,end,state
Submit Start End State
------------------- ------------------- ------------------- ----------
2023-05-30T02:28:19 2023-05-30T02:28:23 2023-05-31T10:49:24 CANCELLED
2023-05-30T02:28:23 2023-05-30T02:28:23 2023-06-03T11:16:49 CANCELLED
2023-05-30T02:28:23 2023-05-30T02:28:23 2023-06-03T11:16:49 CANCELLED
I didn't check this before: I used 'scontrol show config' and didn't check in the slurm.conf, but the value of "PrologEpilogTimeout" is the default one. The documentation says:
>The default behavior is to wait indefinitely
I will try setting a small value for PrologEpilogTimeout, like 10 minutes, and check what the new behavior is before thinking about a patch.
Do I only need to do a "scontrol reconfigure" to change the value properly?
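For reference, the 10-minute timeout being considered would look like this in slurm.conf (the parameter takes seconds; the value here is illustrative):

```
# slurm.conf -- kill stuck prolog/epilog scripts after 10 minutes
PrologEpilogTimeout=600
```

followed by `scontrol reconfigure` on the controller to pick up the change.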
(In reply to Institut Pasteur HPC Admin from comment #16)
> Hi,
>
> This time the completing state seems to never end.
>
> sacct -j 25785880 --format=submit,start,end,state
> Submit Start End State
> ------------------- ------------------- ------------------- ----------
> 2023-05-30T02:28:19 2023-05-30T02:28:23 2023-05-31T10:49:24 CANCELLED
> 2023-05-30T02:28:23 2023-05-30T02:28:23 2023-06-03T11:16:49 CANCELLED
> 2023-05-30T02:28:23 2023-05-30T02:28:23 2023-06-03T11:16:49 CANCELLED
>
> I didn't check this before: I used 'scontrol show config' and didn't check
> in the slurm.conf, but the value of "PrologEpilogTimeout" is the default
> one. The documentation says:
> >The default behavior is to wait indefinitely

My mistake: I misremembered the default value (65534). I thought it was 65535 or zero.

> I will try setting a small value for PrologEpilogTimeout, like 10 minutes,
> and check what the new behavior is before thinking about a patch.
>
> Do I only need to do a "scontrol reconfigure" to change the value properly?

scontrol reconfigure is sufficient.

We pushed commit 5a3f79271b upstream - it will be part of 23.02.3. It fixes waiting indefinitely for defunct script processes. This is the same patch as attachment 29142 [details], except it is written for Slurm 23.02. I think that this will solve the defunct-script-process issue that you are seeing. You can apply this patch in 22.05 (attachment 29142 [details]) or upgrade to 23.02.3 after it is released.

Are you interested in testing the patch in 22.05? If not, then I can resolve this bug as fixed, assuming that the commit does fix the issue. You can always reopen the bug if it is not solved after upgrading.

Hi,

We will wait for the next Slurm upgrade to test the fix.

Thanks for your support,
Brice

We had to revert this change in commit 790b4a4738 ahead of 23.02.5. See bug 17481 for details. In short, Slurm has to wait for all processes in the script's process group to complete. This places an increased importance on setting timeout values for scripts, such as PrologEpilogTimeout (which by default has an unlimited timeout).

Hi,

I understand the reason for reverting the commit. But I didn't notice that the initial state (before requeue) of the job was "NODE_FAILURE"; I am wondering if you are still thinking about a fix for my initial ticket? I know that setting a small value for the parameter "PrologEpilogTimeout" will speed up the requeue, but it is still not clean to get a "NODE_FAILURE" state for a job. Do you recommend we do something to avoid the race condition?

Thanks for your support,
Brice

(In reply to Institut Pasteur HPC Admin from comment #25)
> But I didn't notice that the initial state (before requeue) of the job was
> "NODE_FAILURE"; I am wondering if you are still thinking about a fix for my
> initial ticket?

I don't see anything about NODE_FAILURE in this ticket except for this comment. What ticket are you referring to? If the issue is separate from the "prolog stuck in completing state" issue in comment 0, open a new ticket.
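The revert rationale above — that Slurm must wait for every process in the script's process group, not just the direct child — can be sketched outside of Slurm as follows. This is an illustrative example, not Slurm source: a script launched in its own process group may leave grandchildren behind, and signaling the whole group with killpg() is what catches them.

```python
import os
import signal
import subprocess

# Illustrative sketch, not Slurm code: start a stand-in "prolog" in its
# own process group (os.setsid), then terminate the entire group at
# once. Reaping only the direct child would leave any grandchildren in
# the same group running, which is why the earlier patch was reverted.
proc = subprocess.Popen(["sleep", "30"], preexec_fn=os.setsid)  # new group
pgid = os.getpgid(proc.pid)

os.killpg(pgid, signal.SIGTERM)    # signal every process in the group
proc.wait()
print("terminated by SIGTERM:", proc.returncode == -signal.SIGTERM)
```

With PrologEpilogTimeout set, this group-wide kill is what Slurm performs when the timeout expires, so choosing a reasonable value keeps nodes from sitting in COMPLETING indefinitely.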