Hello, I'm seeing salloc unblock long before an allocation is usable in the case that the allocation needs to configure first:

dmj@login:~> salloc -p debug_knl -C quad,flat -N 10 -t 1:00:00 /bin/bash
salloc: Granted job allocation 15
salloc: Waiting for resource configuration
salloc: Nodes nid00[320-329] are ready for job
dmj@login:~> squeue -j $SLURM_JOB_ID
     JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
        15 debug_knl  bash   dmj CF  0:29    10 nid00[320-329]
dmj@login:~>

I think it should block until the allocation is ready for the job, to reduce confusion about when the user can access the allocation.

As an aside, I think it would be nice if interactive allocations like this informed the user that node reconfiguration was happening, e.g.:

...
salloc: Granted job allocation 15.
salloc: Reconfiguring nodes nid00[320-329] to quad,flat
salloc: Waiting for resource configuration
<pause until configuration complete>
salloc: Nodes nid00[320-329] are ready for job
...

Thanks,
Doug
salloc is designed to continue while the nodes are booting. I will investigate the inconsistency between the salloc log showing the nodes as ready while squeue shows them still configuring, but I think you want to use this salloc option:

--wait-all-nodes=<value>
    Controls when the execution of the command begins. By default the job
    will begin execution as soon as the allocation is made.

    0   Begin execution as soon as allocation can be made. Do not wait
        for all nodes to be ready for use (i.e. booted).

    1   Do not begin execution until all nodes are ready for use.
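Applied to the invocation from the original report, the option would look like this (a sketch only; it requires a live Slurm cluster with the same partition and features):

```shell
# Same allocation as in the report, but with --wait-all-nodes=1 salloc
# blocks until all ten nodes have finished booting/reconfiguring before
# handing the shell to the user.
salloc -p debug_knl -C quad,flat -N 10 -t 1:00:00 --wait-all-nodes=1 /bin/bash
```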
Great, that makes sense. Is there a way to make `--wait-all-nodes=1` a default behavior from slurm.conf? I suppose we can always set SALLOC_WAIT_ALL_NODES=1 in the default environment.

Thanks!
Doug

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov
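Setting the variable in the default environment could be done from a site profile script; the /etc/profile.d path below is one common convention, not something Slurm itself mandates:

```shell
# /etc/profile.d/slurm_salloc.sh  (hypothetical site-wide default)
# Makes every salloc behave as if --wait-all-nodes=1 were passed;
# a user can still override this on the salloc command line.
export SALLOC_WAIT_ALL_NODES=1
```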
(In reply to Doug Jacobsen from comment #2)
> Great, that makes sense. Is there a way to make `--wait-all-nodes=1` a
> default behavior from slurm.conf? I suppose we can always set
> SALLOC_WAIT_ALL_NODES=1 in the default environment.

The command line option or an environment variable are your only options today, but it will be easy to add a configuration parameter to make that the default behaviour. I'll get that to you soon.
I just added an option for this capability in Slurm version 16.05.5. Once you install that (or get the patch if you are anxious), just add the "salloc_wait_nodes" option to the SchedulerParameters parameter in the slurm.conf and that will cause salloc to wait for node boot completion by default. The salloc option of "--wait-all-nodes=0" would override that. The commit is here: https://github.com/SchedMD/slurm/commit/2670edc47c9ed715f52fcf3144e301fc9ee6b4b5
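For reference, the slurm.conf change described above would be a one-line addition (appended to any existing SchedulerParameters value rather than added as a second line):

```conf
# slurm.conf excerpt: salloc waits for node boot completion by default;
# an individual user can still opt out with salloc --wait-all-nodes=0.
SchedulerParameters=salloc_wait_nodes
```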