Ticket 1333

Summary:	small failure window for srun, fails when queued before nodes added to slurm.conf with srun: error: fwd_tree_thread: can't find address for host
Product:	Slurm	Reporter:	Chris <chris.harwell>
Component:	Scheduling	Assignee:	Unassigned Developer <dev-unassigned>
Status:	CONFIRMED ---	QA Contact:
Severity:	5 - Enhancement
Priority:	---	CC:	taras.shapovalov
Version:	14.11.2
Hardware:	Linux
OS:	Linux
Site:	D E Shaw Research	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	15.08.0pre2
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---
Attachments:	srun reload conf patch

Description Chris 2014-12-19 00:54:18 MST

Similar to suggested possible duplicate of BugID 1069, but I don't think it is identical. I think the component should be srun, but not seeing that in the list I'll just pick scheduling.

I believe there is a small window where an srun job will fail because it is using an out of date slurm.conf read on startup, rather than (re-reading) at allocation. 

So, if a new node is added to the file on the host where srun is already started, but not yet allocated, and then the job is allocated the task will fail. Am I stating this sequence correctly? That is, does this window for failure exist? 

Consider an otherwise empty partition name "8x8x8" or that the job won't be scheduled for some reason.
1) sbatch -p 8x8x8 ....
2) salloc -p 8x8x8 srun ....
3) srun -p 8x8x8 ....

A node is added to slurm.conf following 
http://slurm.schedmd.com/faq.html#add_nodes
such that the new node is in slurm.conf in all places including on the host where commands #1,2,3 were executed with the old version.

the new nodes slurmd chats with slurmctld and is marked available, an allocation occurs and the job task launch is attempted.

#1 works.
#2 works.
#3 fails.

For us there is some impact.  A majority of our jobs are submitted via sbatch and so I believe immune to this, but we also have a developer mode where srun is used. It is often the new or different node that the developer wants to test/troubleshoot and so they sometimes see this failure mode. (I've encouraged a number of other options: sbatch, not adding/removing nodes so often, but leaving them in, but for various reasons this is the setup being used now here). I've proposed they use #2 instead of #3 to protect against such windows - does that seem correct?

I wonder if other people have experienced this, but written it off to slow file distribution. I did the first three times I saw it ;>


Here is the output from commands for slurm version 2.6.7. I retried something similar with 14.11.2 and saw the same:

In my test the slurm.conf file on drdlogin0008 should now with the new version of the script be updated between  15:41:57 and 15:41:58, but the job starts AFTER that 2014-12-18T15:42:28 and fails.

This is a log of adding the new node:
[harwell@drdlogin0008 ~]$ tail -f /var/log/local6
Dec 18 15:41:43 drdlogin0008.en.desres.deshaw.com : 1. Generate new slurm-anton2.conf to replace /opt/slurman2/etc/slurm-anton2.conf.#015
Dec 18 15:41:43 drdlogin0008.en.desres.deshaw.com : Running AMS machine-list#015
Dec 18 15:41:46 drdlogin0008.en.desres.deshaw.com : Files slurm-anton2.conf and /opt/slurman2/etc/slurm-anton2.conf differ#015
Dec 18 15:41:47 drdlogin0008.en.desres.deshaw.com : ... Install new file#015
Dec 18 15:41:47 drdlogin0008.en.desres.deshaw.com : ... Pushing updated file to bcfg2.#015
Dec 18 15:41:55 drdlogin0008.en.desres.deshaw.com : 2. Restarting slurmctl.#015
Dec 18 15:41:56 drdlogin0008.en.desres.deshaw.com : stopping slurmctld: #033[60G[#033[0;32m  OK  #033[0;39m]#015#015
Dec 18 15:41:57 drdlogin0008.en.desres.deshaw.com : slurmctld is stopped (/var/run/slurman2ctld-anton2.pid)#015
Dec 18 15:41:57 drdlogin0008.en.desres.deshaw.com : slurmctld is stopped#015
Dec 18 15:41:57 drdlogin0008.en.desres.deshaw.com : ... /opt/slurman2/etc/slurm-anton2.conf update on budget.desres.deshaw.com,drdenws[01-03,05-06].en.desres.deshaw.com,drdlogin[0001-0005,0007-0009,0011,0013-0014,0017,0019-0023,0025-0027,0030-0032,0034-0037,0039-0047].en.desres.deshaw.com,drdmem[0002-0018,0021].en.desres.deshaw.com,drdmgmt0003.en.desres.deshaw.com,drdwiki0004.en.desres.deshaw.com,drdws[0004-0006,0008,0010-0012,0022,0025,0028-0029,0034,0041,0045-0046,0050-0051,0055-0058,0060,0066,0071,0077-0092,0094-0113,0115,0117-0119,0121,0123-0125,0127-0134,0136-0138].nyc.desres.deshaw.com with pdcp#015
Dec 18 15:41:58 drdlogin0008.en.desres.deshaw.com : starting slurmctld (/var/run/slurman2ctld-anton2.pid): #033[60G[#033[0;32m  OK  #033[0;39m]#015#015
Dec 18 15:42:00 drdlogin0008.en.desres.deshaw.com : 3. Triggering bcfg2 update of slurmd.#015
Dec 18 15:42:08 drdlogin0008.en.desres.deshaw.com : ... /opt/slurman2/etc/slurm-anton2.conf update and slurm restart on anton2val[0001-0008],en[206-207]-cm[01,03,05,07,09,11,13,15],en[208,212]-cm01,en[214,218]-cm[01,03,05,07],en[217,221]-cm[09,11,13,15],en[222-229]-cm[01,05],labcm[0001,0003-0005,0008-0012],nyc200-cm[00-15],nyc202-cm[09,11,13,15],nyc203-cm[01,03,07].#015
Dec 18 15:43:19 drdlogin0008.en.desres.deshaw.com : DONE#015


The node is present when I check after the failure:
[harwell@drdlogin0008 ~]$ grep -i en210-888 /opt/slurman2/etc/slurm-anton2.conf
NodeName=en210-888 NodeHostname=en212-cm01 Procs=512 Port=6819 Feature=torus
PartitionName=8x8x8 Nodes=en210-888
PartitionName=admin AllowGroups=dradadm Hidden=Yes Nodes=anton2val[0001-0008],en208-884m-0,en210-888,en206-441m-0,en206-441m-1,en206-441m-2,en206-441m-3,en206-441m-4,en206-441m-5,en206-441m-6,en206-441m-7,en207-441m-0,en207-441m-1,en207-441m-2,en207-441m-3,en207-441m-4,en207-441m-5,en207-441m-6,en207-441m-7,en214-444m-0,en214-444m-1,en214-444m-2,en214-444m-3,en214-444m-4,en214-444m-5,en214-444m-6,en214-444m-7,en218-444m-0,en218-444m-1,en218-444m-2,en218-444m-3,en218-444m-4,en218-444m-5,en218-444m-6,en218-444m-7,en222-444-0,en222-444-1,en223-444-0,en223-444-1,en224-444-0,en224-444-1,en225-444-0,en225-444-1,en226-444-0,en226-444-1,en227-444-0,en227-444-1,en228-444-0,en228-444-1,en229-444-0,en229-444-1,labcm0001,labcm0002,labcm0003,labcm0005,labcm0008,labcm0009,labcm0010,labcm0011,labcm0012,nyc200-111-0-0,nyc200-111-0-8,nyc200-111-1-0,nyc200-111-1-8,nyc200-111-2-0,nyc200-111-2-8,nyc200-111-3-0,nyc200-111-3-8,nyc200-111-4-0,nyc200-111-4-8,nyc200-111-5-0,nyc200-111-5-8,nyc200-111-
 6-0,nyc20
0-111-6-8,nyc200-111-7-0,nyc200-111-7-8,nyc201-444m-0,nyc201-444m-1,nyc201-444m-2,nyc201-444m-3,nyc203-444m-2,nyc203-444m-3,nyc203-844m-0

Showing the modification time of the file - also see the node add log above.
[harwell@drdlogin0008 ~]$ stat !$
stat /opt/slurman2/etc/slurm-anton2.conf
  File: `/opt/slurman2/etc/slurm-anton2.conf'
  Size: 17764           Blocks: 40         IO Block: 4096   regular file
Device: 901h/2305d      Inode: 556447      Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2014-12-18 15:43:02.227655705 -0500
Modify: 2014-12-18 15:41:57.242875884 -0500
Change: 2014-12-18 15:41:57.242875884 -0500

Note the start time is after the modification time shown above:
[harwell@drdlogin0008 ~]$ saa2 -j 1118189 -o Submit,Start,End
             Submit               Start                 End
------------------- ------------------- -------------------
2014-12-18T15:40:55 2014-12-18T15:42:28 2014-12-18T15:42:30

and yet it fails:
[harwell@drdlogin0008 ~]$ garden with -m desres-settings/anton2/1.0/ALL srun -p 8x8x8 /bin/hostname
srun: Required node not available (down or drained)
srun: job 1118189 queued and waiting for resources
srun: job 1118189 has been allocated resources
srun: error: fwd_tree_thread: can't find address for host en210-888, check slurm.conf
srun: error: Task launch for 1118189.0 failed on node en210-888: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

[harwell@drdlogin0008 ~]$ srun -V
slurm 2.6.7 

I did confirm by hand editing slurm-anton2.conf that message is generated when the lines are missing
srun: debug2: Called _file_writable
srun: debug:  Started IO server thread (140388276762368)
srun: debug:  Entering _launch_tasks
srun: launching 1118352.0 on host en210-888, 1 tasks: 0
srun: debug2: Tree head got back 0 looking for 1
srun: error: fwd_tree_thread: can't find address for host en210-888, check slurm.conf
srun: debug2: Tree head got back 1
srun: debug:  launch returned msg_rc=1001 err=1012 type=9001
srun: error: Task launch for 1118352.0 failed on node en210-888: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

Comment 1 Brian Christiansen 2014-12-30 07:30:47 MST

This is fixed in 15.08.0pre2.
https://github.com/SchedMD/slurm/commit/abc435fd92b03f73dc7e2d0d67f48fec4736d81b

Please reopen if you have any questions.

Thanks,
Brian

Comment 2 Brian Christiansen 2014-12-31 07:53:47 MST

I apologize. We had to revert the fix. The issue is more complex than originally anticipated. 

I've updated the #add_nodes section in the FAQ with a note about this corner case.

Please reopen the ticket if you feel that this needs to be fixed for your environment. Otherwise I would continue advocating the use of sbatch.

Comment 3 Chris 2015-01-29 01:14:23 MST

Thank you for looking at this and supplying that patch and the faq note on adding nodes.  Admittedly, this is still just a minor issue, but we do still see it in our environment.  Sorry I didn't try the patch because I saw it got reverted.  I imagine there might be various cloud/thin provisioning environments that could trip on it more often.  What were the additional complexities? I am wondering if they might not affect us and we could apply and try that patch - or does that patch not address the corner case?

Comment 4 Brian Christiansen 2015-01-29 04:59:15 MST

Created attachment 1590 [details]
srun reload conf patch

The patch that was reverted attempted to reload the configuration only if the slurm.conf had changed. The problem with this approach was that the slurm.conf can include other files and they wouldn't get reloaded if the slurm.conf was never  modified.

Another attempt was to have srun check the nodes that it received in the allocation against the nodes in the loaded conf file and if there were any that didn't exist in the conf, then reload the configuration. The concern with this approach was that it would be too expensive on large jobs.

Other solutions would require a lot of additional work.

I've attached the patch that checks the srun allocation against the loaded configuration. If you don't mind the extra cost, you could use this one.

Let me know if you have any questions.

Thanks,
Brian

Comment 5 Chris 2015-01-29 05:32:02 MST

Thanks for that.

I wouldn't know how to do it, but how far away is the configuration hash from the srun context?  I know I see it when I type scontrol show config:

drdws0115:bin$ scontrol show config | grep HASH
HASH_VAL                = Different Ours=0x15cd7c6d Slurmctld=0x7f8c4e0c

I guess one would only need to enter this logic if srun start time is older than slurmctld start time, though I'm not sure that info is available in the srun context either.



On 01/29/2015 01:59 PM, bugs@schedmd.com wrote:
> *Comment # 4 <http://bugs.schedmd.com/show_bug.cgi?id=1333#c4> on bug 1333 <http://bugs.schedmd.com/show_bug.cgi?id=1333> from Brian Christiansen <mailto:brian@schedmd.com> *
> 
> Created attachment 1590 [details] <attachment.cgi?id=1590&action=diff> [details] <attachment.cgi?id=1590&action=edit>
> srun reload conf patch
> 
> The patch that was reverted attempted to reload the configuration only if the
> slurm.conf had changed. The problem with this approach was that the slurm.conf
> can include other files and they wouldn't get reloaded if the slurm.conf was
> never  modified.
> 
> Another attempt was to have srun check the nodes that it received in the
> allocation against the nodes in the loaded conf file and if there were any that
> didn't exist in the conf, then reload the configuration. The concern with this
> approach was that it would be too expensive on large jobs.
> 
> Other solutions would require a lot of additional work.
> 
> I've attached the patch that checks the srun allocation against the loaded
> configuration. If you don't mind the extra cost, you could use this one.
> 
> Let me know if you have any questions.
> 
> Thanks,
> Brian
> 
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 --
> You are receiving this mail because:
> 
>   * You reported the bug.
>

Comment 6 Brian Christiansen 2015-01-29 09:18:37 MST

The problem with the hash method is that the hash is generated line by line as the slurm.conf, and included files, are processed. So in order to know the current hash of the slurm.conf (ie. after the srun allocation is granted) you have to reinit the slurm.conf again.

Comment 7 Brian Christiansen 2015-01-29 09:43:29 MST

Actually, thinking through that more, it just might work. The controller could send back the the hash when the allocation is granted.

scenario:
1. controller's hash is 1234
2. srun submits jobs. hash is 1234
3. slurm.conf is updated and controller is restarted. controller's hash is now 1235.
4. controller grants allocation and sends back new hash 1235.
5. srun receives allocation and checks controller's hash against it's hash.
6. If the hashes are different then reinit the config.

I'll look into it. Thanks for the idea.

Comment 8 David Bigagli 2015-02-11 04:17:11 MST

New feature request for 15.08.

David

Comment 9 Taras Shapovalov 2023-01-23 03:03:52 MST

Hi Brian,

We still see the issue on Slurm 22.05. Our scenario: Bright Cluster Manager dynamically adds nodes from cloud when it sees jobs are pending (including srun jobs!), but srun jobs does not have a chance to start because of that issue. Any chance to get it fixed? I see there was a feature request in 2015.