Ticket 20777 - When slurmd is restarted during prolog phase, it can potentially hang the job with no output
Summary: When slurmd is restarted during prolog phase, it can potentially hang the job...
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 23.02.5
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Ben Roberts
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2024-08-26 10:28 MDT by moet
Modified: 2024-10-10 14:04 MDT
1 user

See Also:
Site: NVIDIA HWinf-CS
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: Selene
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description moet 2024-08-26 10:28:15 MDT
We have encountered several cases where a job hung and did not produce any output after a slurmd restart during the prolog phase.

From what I understand, this could be because the controller is waiting for slurmd to complete the prolog, but since slurmd was restarted and lost its state, it failed to respond to the controller?

We would like to know whether there is a way to keep a slurmd restart during the prolog from causing a job hang or failure.
Comment 2 Ben Roberts 2024-08-27 09:06:28 MDT
Hello,

What you are reporting sounds similar to an issue that we have seen and recently checked in a fix for.  The approach was not to keep slurmd from restarting during a prolog, but to move the execution to a separate forked process so that a slurmd restart won't affect it.  The fix will be in 24.11, so unfortunately it's not available immediately.

You can see the bug where this was reported and worked on here:
https://support.schedmd.com/show_bug.cgi?id=16126

The commit message for the relevant changes looks like this in the NEWS file for 24.11:
 -- Improve the way to run external commands and fork processes to avoid 
    non-async-signal safe calls between a fork and an exec. We fork ourselves
    now and executes the commands in a safe environment. This includes spank 
    prolog/epilog executions.

Please let me know if you have any questions or comments about these changes and how they might affect your environment.

Thanks,
Ben
Comment 3 Ben Roberts 2024-09-12 13:55:49 MDT
I wanted to follow up and see if you have any additional questions about the fix I mentioned.  If not, I'll plan on closing this ticket as a duplicate of 16126, and you can confirm that the issue is resolved with the release of 24.11.

Thanks,
Ben
Comment 4 Ben Roberts 2024-10-01 12:04:59 MDT
I haven't heard any follow-up questions about this, so I'll go ahead and close this as a duplicate of ticket 16126.  Let us know if it's still an issue in 24.11.

Thanks,
Ben

*** This ticket has been marked as a duplicate of ticket 16126 ***
Comment 6 Ben Roberts 2024-10-02 11:16:27 MDT
Hello,

The developer that worked on ticket 16126 looked at this ticket and thinks the changes he made in 24.11 might not fully address the situation you're describing.  Is this something you can reproduce reliably?  If so, I would like to have you choose a node you can test this on, enable debug logging on your controller and cause the failure to happen.  You can enable debug logging without restarting slurmctld by running this command:
  scontrol setdebug debug2

Once you have triggered the failure you can set the debug logging back down to the regular (info) level.
  scontrol setdebug info
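Put together, the capture procedure might look something like the sketch below (the node name, job script, and log paths are placeholders for your own environment):

```shell
#!/bin/bash
# Sketch of the debug-capture procedure; node name, job script, and
# log paths are examples, not actual values from this cluster.
scontrol setdebug debug2             # raise slurmctld logging
sbatch --nodelist=node001 repro.sh   # submit the job that reproduces the hang
# ... restart slurmd on node001 during the prolog to trigger the failure ...
scontrol setdebug info               # restore normal (info) logging
# collect the relevant controller log lines to attach to the ticket
grep -i prolog /var/log/slurmctld.log > ticket-logs.txt
```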

Then please send the relevant logs, along with the output for the job and node, showing how you see the problem.  I would also like you to confirm the version of Slurm you're using and describe the steps you take to reproduce the behavior.

Thanks,
Ben
Comment 7 Ben Roberts 2024-10-10 14:04:25 MDT
I haven't heard any additional questions or input about this.  I'll go ahead and close the ticket, but let us know if there is anything else we can do to help.

Thanks,
Ben