| Summary: | Small failure window for srun: it fails when a job is queued before new nodes are added to slurm.conf, with "srun: error: fwd_tree_thread: can't find address for host" | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Chris <chris.harwell> |
| Component: | Scheduling | Assignee: | Unassigned Developer <dev-unassigned> |
| Status: | CONFIRMED | QA Contact: | |
| Severity: | 5 - Enhancement | ||
| Priority: | --- | CC: | taras.shapovalov |
| Version: | 14.11.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | D E Shaw Research | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 15.08.0pre2 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | srun reload conf patch | ||
|
Description

Chris
2014-12-19 00:54:18 MST

---

Brian:

This is fixed in 15.08.0pre2.
https://github.com/SchedMD/slurm/commit/abc435fd92b03f73dc7e2d0d67f48fec4736d81b

Please reopen if you have any questions.

Thanks,
Brian

---

Brian:

I apologize; we had to revert the fix. The issue is more complex than originally anticipated. I've updated the #add_nodes section in the FAQ with a note about this corner case. Please reopen the ticket if you feel this needs to be fixed for your environment; otherwise I would continue advocating the use of sbatch.

---

Chris:

Thank you for looking at this and supplying that patch and the FAQ note on adding nodes. Admittedly, this is still just a minor issue, but we do still see it in our environment. Sorry, I didn't try the patch because I saw it got reverted. I imagine various cloud/thin-provisioning environments could trip on it more often. What were the additional complexities? I am wondering whether they might not affect us, in which case we could apply and try that patch. Or does that patch not address the corner case?

---

Brian:

Created attachment 1590 [details]
srun reload conf patch
The patch that was reverted attempted to reload the configuration only if the slurm.conf had changed. The problem with this approach was that the slurm.conf can include other files and they wouldn't get reloaded if the slurm.conf was never modified.
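To illustrate the include-file pitfall: checking the mtime of slurm.conf alone misses edits to included files, so a correct check would have to walk every `Include` directive as well. A minimal Python sketch under that assumption (helper names are illustrative; the real srun code is C and the reverted patch looked only at slurm.conf itself):

```python
import os
import re

def config_mtimes(conf_path):
    """Collect modification times for slurm.conf AND every file pulled in
    via an Include directive (one level deep, for brevity of the sketch)."""
    mtimes = {conf_path: os.path.getmtime(conf_path)}
    with open(conf_path) as f:
        for line in f:
            m = re.match(r"\s*Include\s+(\S+)", line, re.IGNORECASE)
            if m and os.path.exists(m.group(1)):
                mtimes[m.group(1)] = os.path.getmtime(m.group(1))
    return mtimes

def needs_reload(conf_path, mtime_at_load):
    """True if any part of the configuration changed since it was loaded.
    Checking only conf_path's own mtime would miss included files."""
    return max(config_mtimes(conf_path).values()) > mtime_at_load
```

The failure mode Brian describes is exactly the case where an included file (say, a nodes file) is touched while slurm.conf itself keeps its old mtime: the naive check says "no reload needed" while `needs_reload` above correctly says yes.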
Another attempt was to have srun check the nodes that it received in the allocation against the nodes in the loaded conf file and if there were any that didn't exist in the conf, then reload the configuration. The concern with this approach was that it would be too expensive on large jobs.
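That second approach amounts to a set-difference between the hosts in the allocation and the hosts known to the locally loaded config. A sketch in Python (the reverted patch did this comparison in C inside srun; function names here are invented for illustration):

```python
def unknown_nodes(alloc_nodes, conf_nodes):
    """Return allocated hosts the loaded configuration knows nothing about.
    A non-empty result means srun cannot resolve their addresses and must
    reload the configuration first."""
    known = set(conf_nodes)  # set lookup keeps the scan O(len(alloc_nodes))
    return [n for n in alloc_nodes if n not in known]

def maybe_reload(alloc_nodes, conf_nodes, reload_config):
    """Reload only when the allocation references unknown nodes."""
    if unknown_nodes(alloc_nodes, conf_nodes):
        reload_config()
        return True
    return False
```

With a hash set the per-job cost is linear in the allocation size, which is the "too expensive on large jobs" concern: a job spanning many thousands of nodes pays that scan on every launch even though the config almost never changes.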
Other solutions would require a lot of additional work.
I've attached the patch that checks the srun allocation against the loaded configuration. If you don't mind the extra cost, you could use this one.
Let me know if you have any questions.
Thanks,
Brian
---

Chris:

Thanks for that. I wouldn't know how to do it, but how far away is the configuration hash from the srun context? I see it when I type `scontrol show config`:

    drdws0115:bin$ scontrol show config | grep HASH
    HASH_VAL                = Different Ours=0x15cd7c6d Slurmctld=0x7f8c4e0c

I guess one would only need to enter this logic if the srun start time is older than the slurmctld start time, though I'm not sure that information is available in the srun context either.

---

Brian:

The problem with the hash method is that the hash is generated line by line as the slurm.conf, and any included files, are processed. So in order to know the current hash of the slurm.conf (i.e. after the srun allocation is granted), you have to reinit the slurm.conf again.

Actually, thinking through that more, it just might work. The controller could send back the hash when the allocation is granted. Scenario:

1. The controller's hash is 1234.
2. srun submits the job; srun's hash is 1234.
3. slurm.conf is updated and the controller is restarted; the controller's hash is now 1235.
4. The controller grants the allocation and sends back the new hash, 1235.
5. srun receives the allocation and checks the controller's hash against its own hash.
6. If the hashes differ, srun reinits the config.

I'll look into it. Thanks for the idea.

---

David:

New feature request for 15.08.

---

Hi Brian,

We still see the issue on Slurm 22.05. Our scenario: Bright Cluster Manager dynamically adds nodes from the cloud when it sees pending jobs (including srun jobs!), but the srun jobs never get a chance to start because of this issue. Any chance to get it fixed? I see there was a feature request in 2015.
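The hash handshake Brian sketched earlier in the thread can be modeled in a few lines. This is a toy Python sketch of the client side only, with invented names; Slurm's actual RPC fields and C symbols differ:

```python
class SrunSide:
    """Toy model of the proposed handshake: the controller returns its
    config hash along with the allocation, and srun reinits its loaded
    configuration only on a mismatch."""

    def __init__(self, local_hash):
        self.local_hash = local_hash  # hash computed when slurm.conf was loaded
        self.reinit_count = 0

    def on_allocation(self, controller_hash):
        # Steps 5-6 of the scenario: compare hashes, reinit only if stale.
        if controller_hash != self.local_hash:
            self._reinit_config(controller_hash)
            return True
        return False

    def _reinit_config(self, new_hash):
        # Stand-in for re-reading slurm.conf and all of its Include files.
        self.local_hash = new_hash
        self.reinit_count += 1
```

Unlike the allocation-scan approach, the common case (hashes equal) costs a single integer comparison regardless of job size, which is what makes the idea attractive for large jobs.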