| Summary: | When slurmd is restarted during prolog phase, it can potentially hang the job with no output | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | moet |
| Component: | slurmd | Assignee: | Ben Roberts <ben> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | marshall |
| Version: | 23.02.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NVIDIA HWinf-CS | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | Selene | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
|
Description

moet
2024-08-26 10:28:15 MDT

Hello,

What you are reporting sounds similar to an issue that we have seen and recently checked in a fix for. The approach was not to keep slurmd from restarting during a prolog, but to move the execution to a separate forked process so that a slurmd restart won't affect it. The fix will be in 24.11, so unfortunately it's not available immediately. You can see the bug where this was reported and worked on here: https://support.schedmd.com/show_bug.cgi?id=16126

The commit message for the relevant changes looks like this in the NEWS file for 24.11:

> -- Improve the way to run external commands and fork processes to avoid
>    non-async-signal safe calls between a fork and an exec. We fork
>    ourselves now and executes the commands in a safe environment. This
>    includes spank prolog/epilog executions.

Please let me know if you have any questions or comments about these changes and how they might affect your environment.

Thanks,
Ben

I wanted to follow up and see if you have any additional questions about the fix I mentioned. If not, I'll plan on closing this ticket as a duplicate of 16126, and you can confirm that the issue is resolved with the release of 24.11.

Thanks,
Ben

I haven't heard any follow-up questions about this, so I'll go ahead and close this as a duplicate of ticket 16126. Let us know if it's still an issue in 24.11.

Thanks,
Ben

*** This ticket has been marked as a duplicate of ticket 16126 ***

Hello,

The developer who worked on ticket 16126 looked at this ticket and thinks the changes he made in 24.11 might not fully address the situation you're describing. Is this something you can reproduce reliably? If so, I would like to have you choose a node you can test this on, enable debug logging on your controller, and cause the failure to happen. You can enable debug logging without restarting slurmctld by running this command:

```
scontrol setdebug debug2
```

Once you have triggered the failure, you can set the debug logging back down to the regular (info) level:

```
scontrol setdebug info
```

Then please send the relevant logs along with the output for the job and node that shows how you see the problem. I would also like you to confirm the version of Slurm you're using and describe the steps you take to reproduce the behavior.

Thanks,
Ben

I haven't heard any additional questions or input about this. I'll go ahead and close the ticket, but let us know if there is anything else we can do to help.

Thanks,
Ben