| Summary: | When slurmd is restarted during prolog phase, it can potentially hang the job with no output | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | moet |
| Component: | slurmd | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | marshall |
| Version: | 23.02.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NVIDIA HWinf-CS | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | Selene | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
|
Description

moet
2024-08-26 10:28:15 MDT

Hello,

What you are reporting sounds similar to an issue that we have seen and recently checked in a fix for. The approach was not to keep slurmd from restarting during a prolog, but to move the execution to a separate forked process so that a slurmd restart won't affect it. The fix will be in 24.11, so unfortunately it's not available immediately. You can see the bug where this was reported and worked on here: https://support.schedmd.com/show_bug.cgi?id=16126

The commit message for the relevant changes looks like this in the NEWS file for 24.11:

> -- Improve the way to run external commands and fork processes to avoid
>    non-async-signal safe calls between a fork and an exec. We fork
>    ourselves now and executes the commands in a safe environment. This
>    includes spank prolog/epilog executions.

Please let me know if you have any questions or comments about these changes and how they might affect your environment.

Thanks,
Ben

I wanted to follow up and see if you have any additional questions about the fix I mentioned. If not, I'll plan on closing this ticket as a duplicate of 16126, and you can confirm that the issue is resolved with the release of 24.11.

Thanks,
Ben

I haven't heard any follow-up questions about this, so I'll go ahead and close this as a duplicate of ticket 16126. Let us know if it's still an issue in 24.11.

Thanks,
Ben

*** This ticket has been marked as a duplicate of ticket 16126 ***

Hello,

The developer who worked on ticket 16126 looked at this ticket and thinks the changes he made in 24.11 might not fully address the situation you're describing. Is this something you can reproduce reliably? If so, I would like to have you choose a node you can test this on, enable debug logging on your controller, and cause the failure to happen. You can enable debug logging without restarting slurmctld by running this command:

```
scontrol setdebug debug2
```

Once you have triggered the failure, you can set the debug logging back down to the regular (info) level:

```
scontrol setdebug info
```

Then please send the relevant logs along with the output for the job and node that shows how you see the problem. I would also like you to confirm the version of Slurm you're using and describe the steps you take to reproduce the behavior.

Thanks,
Ben

I haven't heard any additional questions or input about this. I'll go ahead and close the ticket, but let us know if there is anything else we can do to help.

Thanks,
Ben