| Summary: | Jobs go to RUNNING state before node is booted and ready when PrologSlurmctld is set | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Rémi Palancher <remi-externe.palancher> |
| Component: | Scheduling | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | alex, dmjacobsen, dpaul |
| Version: | 15.08.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | EDF - Electricite de France | Slinky Site: | --- |
| CLE Version: | | Version Fixed: | 16.05.10 |
| Attachments: | Proposed fix (attachment 4041); Patch for version 15.08.13 (attachment 4049) |
||
|
Description
Rémi Palancher
2017-02-02 01:30:18 MST
Rémi - sorry for the late response. I can reproduce this by setting PrologSlurmctld, thanks for the suggestion. We're going to take a look at this and get back to you.

Created attachment 4041
Proposed fix

Alex,
If you could do some testing of this patch for me, that would be much appreciated.
I have tried this using quite a few different configurations:
Suspending and resuming nodes
Jobs with --reboot option
With and without PrologSlurmctld
BlueGene/Q (emulated)
Various timings
I have not yet run on a KNL (rebooting the node into various NUMA and MCDRAM modes) but hope to do that next week.

(In reply to Moe Jette from comment #12)

Moe,
I've been testing different configurations too:
- Suspend/Resume with/without PrologSlurmctld
- Suspend/Resume with/without --reboot
- Combinations of the two above
- Different timings
- Changing the timings while powering up and 'scontrol reconfigure', including a resumeprogram with a sleep 30 + slurmd start, and initially ResumeTimeout < 30, then readjusting it to > 30 and 'scontrol reconfigure', then job ends up transitioning from CF to R properly and node state changes correctly too.

They all work as expected for me, your patch looks good so far.

I've not tested:
- BlueGene/Q (emulated)
- KNL

Three of us have tested this patch on quite a few different systems with various configurations and found no problems.
https://github.com/SchedMD/slurm/commit/f6d42fdbb293ca89da609779db8d8c04a86a8d13.patch
This change will be in version 16.05.10 when released (no date set).

Hi Moe,
Thank you for this patch! Is there any chance you backport it to slurm 15.08?
Best,
Rémi

Created attachment 4049
Patch for version 15.08.13
I do not expect that we will have any more releases of Slurm version 15.08. Also note that the reboot logic in version 16.05 is very different from 15.08 (mostly due to changes required for support of Intel KNL and rebooting to change NUMA or MCDRAM configuration). The attached patch has been tested with version 15.08.13.
Also note that support for Slurm version 15.08 will end in May 2017, so upgrading soon is strongly recommended.

Hi Moe,
Thank you for this new version of the patch, I'm going to give it a try soon. Indeed, we will upgrade this cluster next summer, thank you for the reminder.
Best,
Rémi |
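
For anyone who wants to repeat the kind of check described in this thread (a job should stay in CF until its node is up, and only then move to R), the following is a minimal sketch, not part of the patch. It assumes `squeue` and `sinfo` are on the PATH and polls them with `-h -o %t` to get the compact state codes. The helper names `squeue_state`, `sinfo_state` and `watch` are hypothetical, and the polling counts are arbitrary.

```python
import subprocess
import time
from typing import Callable, List, Tuple


def squeue_state(job_id: str) -> str:
    """Compact job state (e.g. 'PD', 'CF', 'R') as printed by squeue."""
    out = subprocess.run(
        ["squeue", "-h", "-j", job_id, "-o", "%t"],
        capture_output=True, text=True, check=False,
    )
    return out.stdout.strip()


def sinfo_state(node: str) -> str:
    """Compact node state as printed by sinfo for a single node."""
    out = subprocess.run(
        ["sinfo", "-h", "-n", node, "-o", "%t"],
        capture_output=True, text=True, check=False,
    )
    return out.stdout.strip()


def watch(get_job_state: Callable[[], str],
          get_node_state: Callable[[], str],
          polls: int = 120,
          interval: float = 1.0) -> List[Tuple[str, str]]:
    """Record each distinct (job state, node state) pair until the job is R.

    The two callables are injected so the loop can be exercised without a
    running cluster (an assumption made for this sketch).
    """
    seen: List[Tuple[str, str]] = []
    for _ in range(polls):
        pair = (get_job_state(), get_node_state())
        if not seen or seen[-1] != pair:
            seen.append(pair)
        if pair[0] == "R":
            break
        time.sleep(interval)
    return seen
```

A call such as `watch(lambda: squeue_state("1234"), lambda: sinfo_state("node01"))`, with a placeholder job ID and node name, returns the observed sequence. The last entry shows what state the node was reporting at the moment the job was first seen in R, which is the condition this bug is about.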