Similar to the suggested possible duplicate, BugID 1069, but I don't think it is identical. I think the component should be srun, but since that isn't in the list I'll just pick scheduling.

I believe there is a small window where an srun job will fail because it uses an out-of-date slurm.conf read at startup, rather than re-reading it at allocation. That is, if a new node is added to the file on a host where srun has already started but not yet been allocated, and the job is then allocated onto that node, the task launch will fail. Am I stating this sequence correctly? That is, does this window for failure exist?

Consider an otherwise empty partition named "8x8x8", or assume the job won't be scheduled for some other reason:

1) sbatch -p 8x8x8 ....
2) salloc -p 8x8x8 srun ....
3) srun -p 8x8x8 ....

A node is then added to slurm.conf following http://slurm.schedmd.com/faq.html#add_nodes, so that the new node is in slurm.conf everywhere, including on the host where commands #1, #2, and #3 were started with the old version of the file. The new node's slurmd chats with slurmctld and is marked available, an allocation occurs, and the job task launch is attempted. #1 works. #2 works. #3 fails.

For us there is some impact. A majority of our jobs are submitted via sbatch, and so I believe they are immune to this, but we also have a developer mode where srun is used. It is often the new or different node that the developer wants to test or troubleshoot, so they sometimes see this failure mode. (I've encouraged a number of other options: sbatch, or not adding/removing nodes so often but leaving them in; for various reasons, though, this is the setup being used here now.) I've proposed they use #2 instead of #3 to protect against such windows - does that seem correct?

I wonder if other people have experienced this but written it off to slow file distribution. I did the first three times I saw it ;>

Here is the output from commands for Slurm version 2.6.7.
I retried something similar with 14.11.2 and saw the same thing. In my test, with the new version of the script, the slurm.conf file on drdlogin0008 should be updated between 15:41:57 and 15:41:58, but the job starts AFTER that (2014-12-18T15:42:28) and fails.

This is a log of adding the new node:

[harwell@drdlogin0008 ~]$ tail -f /var/log/local6
Dec 18 15:41:43 drdlogin0008.en.desres.deshaw.com : 1. Generate new slurm-anton2.conf to replace /opt/slurman2/etc/slurm-anton2.conf.
Dec 18 15:41:43 drdlogin0008.en.desres.deshaw.com : Running AMS machine-list
Dec 18 15:41:46 drdlogin0008.en.desres.deshaw.com : Files slurm-anton2.conf and /opt/slurman2/etc/slurm-anton2.conf differ
Dec 18 15:41:47 drdlogin0008.en.desres.deshaw.com : ... Install new file
Dec 18 15:41:47 drdlogin0008.en.desres.deshaw.com : ... Pushing updated file to bcfg2.
Dec 18 15:41:55 drdlogin0008.en.desres.deshaw.com : 2. Restarting slurmctl.
Dec 18 15:41:56 drdlogin0008.en.desres.deshaw.com : stopping slurmctld: [ OK ]
Dec 18 15:41:57 drdlogin0008.en.desres.deshaw.com : slurmctld is stopped (/var/run/slurman2ctld-anton2.pid)
Dec 18 15:41:57 drdlogin0008.en.desres.deshaw.com : slurmctld is stopped
Dec 18 15:41:57 drdlogin0008.en.desres.deshaw.com : ... /opt/slurman2/etc/slurm-anton2.conf update on budget.desres.deshaw.com,drdenws[01-03,05-06].en.desres.deshaw.com,drdlogin[0001-0005,0007-0009,0011,0013-0014,0017,0019-0023,0025-0027,0030-0032,0034-0037,0039-0047].en.desres.deshaw.com,drdmem[0002-0018,0021].en.desres.deshaw.com,drdmgmt0003.en.desres.deshaw.com,drdwiki0004.en.desres.deshaw.com,drdws[0004-0006,0008,0010-0012,0022,0025,0028-0029,0034,0041,0045-0046,0050-0051,0055-0058,0060,0066,0071,0077-0092,0094-0113,0115,0117-0119,0121,0123-0125,0127-0134,0136-0138].nyc.desres.deshaw.com with pdcp
Dec 18 15:41:58 drdlogin0008.en.desres.deshaw.com : starting slurmctld (/var/run/slurman2ctld-anton2.pid): [ OK ]
Dec 18 15:42:00 drdlogin0008.en.desres.deshaw.com : 3. Triggering bcfg2 update of slurmd.
Dec 18 15:42:08 drdlogin0008.en.desres.deshaw.com : ... /opt/slurman2/etc/slurm-anton2.conf update and slurm restart on anton2val[0001-0008],en[206-207]-cm[01,03,05,07,09,11,13,15],en[208,212]-cm01,en[214,218]-cm[01,03,05,07],en[217,221]-cm[09,11,13,15],en[222-229]-cm[01,05],labcm[0001,0003-0005,0008-0012],nyc200-cm[00-15],nyc202-cm[09,11,13,15],nyc203-cm[01,03,07].
Dec 18 15:43:19 drdlogin0008.en.desres.deshaw.com : DONE

The node is present when I check after the failure:

[harwell@drdlogin0008 ~]$ grep -i en210-888 /opt/slurman2/etc/slurm-anton2.conf
NodeName=en210-888 NodeHostname=en212-cm01 Procs=512 Port=6819 Feature=torus
PartitionName=8x8x8 Nodes=en210-888
PartitionName=admin AllowGroups=dradadm Hidden=Yes Nodes=anton2val[0001-0008],en208-884m-0,en210-888,en206-441m-0,en206-441m-1,en206-441m-2,en206-441m-3,en206-441m-4,en206-441m-5,en206-441m-6,en206-441m-7,en207-441m-0,en207-441m-1,en207-441m-2,en207-441m-3,en207-441m-4,en207-441m-5,en207-441m-6,en207-441m-7,en214-444m-0,en214-444m-1,en214-444m-2,en214-444m-3,en214-444m-4,en214-444m-5,en214-444m-6,en214-444m-7,en218-444m-0,en218-444m-1,en218-444m-2,en218-444m-3,en218-444m-4,en218-444m-5,en218-444m-6,en218-444m-7,en222-444-0,en222-444-1,en223-444-0,en223-444-1,en224-444-0,en224-444-1,en225-444-0,en225-444-1,en226-444-0,en226-444-1,en227-444-0,en227-444-1,en228-444-0,en228-444-1,en229-444-0,en229-444-1,labcm0001,labcm0002,labcm0003,labcm0005,labcm0008,labcm0009,labcm0010,labcm0011,labcm0012,nyc200-111-0-0,nyc200-111-0-8,nyc200-111-1-0,nyc200-111-1-8,nyc200-111-2-0,nyc200-111-2-8,nyc200-111-3-0,nyc200-111-3-8,nyc200-111-4-0,nyc200-111-4-8,nyc200-111-5-0,nyc200-111-5-8,nyc200-111-6-0,nyc200-111-6-8,nyc200-111-7-0,nyc200-111-7-8,nyc201-444m-0,nyc201-444m-1,nyc201-444m-2,nyc201-444m-3,nyc203-444m-2,nyc203-444m-3,nyc203-844m-0

Showing the modification time of the file; also see the node-add log above.
[harwell@drdlogin0008 ~]$ stat !$
stat /opt/slurman2/etc/slurm-anton2.conf
  File: `/opt/slurman2/etc/slurm-anton2.conf'
  Size: 17764      Blocks: 40         IO Block: 4096   regular file
Device: 901h/2305d Inode: 556447      Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2014-12-18 15:43:02.227655705 -0500
Modify: 2014-12-18 15:41:57.242875884 -0500
Change: 2014-12-18 15:41:57.242875884 -0500

Note the start time is after the modification time shown above:

[harwell@drdlogin0008 ~]$ saa2 -j 1118189 -o Submit,Start,End
             Submit               Start                 End
------------------- ------------------- -------------------
2014-12-18T15:40:55 2014-12-18T15:42:28 2014-12-18T15:42:30

and yet it fails:

[harwell@drdlogin0008 ~]$ garden with -m desres-settings/anton2/1.0/ALL srun -p 8x8x8 /bin/hostname
srun: Required node not available (down or drained)
srun: job 1118189 queued and waiting for resources
srun: job 1118189 has been allocated resources
srun: error: fwd_tree_thread: can't find address for host en210-888, check slurm.conf
srun: error: Task launch for 1118189.0 failed on node en210-888: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

[harwell@drdlogin0008 ~]$ srun -V
slurm 2.6.7

I did confirm, by hand-editing slurm-anton2.conf, that this message is generated when the lines are missing:

srun: debug2: Called _file_writable
srun: debug: Started IO server thread (140388276762368)
srun: debug: Entering _launch_tasks
srun: launching 1118352.0 on host en210-888, 1 tasks: 0
srun: debug2: Tree head got back 0 looking for 1
srun: error: fwd_tree_thread: can't find address for host en210-888, check slurm.conf
srun: debug2: Tree head got back 1
srun: debug: launch returned msg_rc=1001 err=1012 type=9001
srun: error: Task launch for 1118352.0 failed on node en210-888: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
This is fixed in 15.08.0pre2:
https://github.com/SchedMD/slurm/commit/abc435fd92b03f73dc7e2d0d67f48fec4736d81b

Please reopen if you have any questions.

Thanks,
Brian
I apologize. We had to revert the fix. The issue is more complex than originally anticipated. I've updated the #add_nodes section in the FAQ with a note about this corner case. Please reopen the ticket if you feel that this needs to be fixed for your environment. Otherwise I would continue advocating the use of sbatch.
Thank you for looking at this and for supplying that patch and the FAQ note on adding nodes. Admittedly, this is still just a minor issue, but we do still see it in our environment. Sorry, I didn't try the patch because I saw it got reverted. I imagine there might be various cloud or thin-provisioning environments that could trip over it more often. What were the additional complexities? I am wondering whether they might not affect us, in which case we could apply and try that patch - or does the patch not address the corner case?
Created attachment 1590 [details]
srun reload conf patch

The patch that was reverted attempted to reload the configuration only if slurm.conf had changed. The problem with this approach was that slurm.conf can include other files, and those wouldn't get reloaded if slurm.conf itself was never modified.

Another attempt was to have srun check the nodes it received in the allocation against the nodes in the loaded conf file and, if any didn't exist in the conf, reload the configuration. The concern with this approach was that it would be too expensive for large jobs.

Other solutions would require a lot of additional work. I've attached the patch that checks the srun allocation against the loaded configuration. If you don't mind the extra cost, you could use this one.

Let me know if you have any questions.

Thanks,
Brian
Thanks for that. I wouldn't know how to do it myself, but how far away is the configuration hash from the srun context? I see it when I run scontrol show config:

drdws0115:bin$ scontrol show config | grep HASH
HASH_VAL                = Different Ours=0x15cd7c6d Slurmctld=0x7f8c4e0c

I guess one would only need to enter this logic if the srun start time is older than the slurmctld start time, though I'm not sure that information is available in the srun context either.
The problem with the hash method is that the hash is generated line by line as slurm.conf, and any included files, are processed. So in order to know the current hash of slurm.conf (i.e., after the srun allocation is granted), you would have to reinitialize the slurm.conf parsing anyway.
Actually, thinking through that more, it just might work. The controller could send back the hash when the allocation is granted.

Scenario:
1. The controller's hash is 1234.
2. srun submits the job; its hash is 1234.
3. slurm.conf is updated and the controller is restarted. The controller's hash is now 1235.
4. The controller grants the allocation and sends back the new hash, 1235.
5. srun receives the allocation and checks the controller's hash against its own.
6. If the hashes differ, srun reinitializes the config.

I'll look into it. Thanks for the idea.
New feature request for 15.08. David
Hi Brian,

We still see this issue on Slurm 22.05. Our scenario: Bright Cluster Manager dynamically adds nodes from the cloud when it sees jobs pending (including srun jobs!), but the srun jobs never get a chance to start because of this issue. Any chance of getting it fixed? I see there was a feature request in 2015.