Hi SchedMD!

The documentation for srun says, under `--exclusive`:

"""
The exclusive allocation of CPUs applies to job steps by default. In order to share the resources use the --overlap option.
"""

But experience seems to indicate that job steps are not exclusive by default. For instance:

$ sbatch -p test -N 1 -n 4 -c 2 --wrap="bash -c 'for i in {1..4}; do srun -n 1 -c 2 sleep $((10*i)) & done; wait'"

results in each step using the whole 8 CPUs allocated to the job:

$ sacct -j 38224507 -o jobid,jobname,alloccpus,state,exitcode
JobID           JobName  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- --------
38224507           wrap          8  COMPLETED      0:0
38224507.ba+      batch          8  COMPLETED      0:0
38224507.ex+     extern          8  COMPLETED      0:0
38224507.0        sleep          8  COMPLETED      0:0
38224507.1        sleep          8  COMPLETED      0:0
38224507.2        sleep          8  COMPLETED      0:0
38224507.3        sleep          8  COMPLETED      0:0

and step creation being delayed:

$ cat slurm-38224507.out
srun: Job 38224507 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 38224507 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 38224507
srun: Job 38224507 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 38224507 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 38224507 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 38224507
srun: Job 38224507 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 38224507

When explicitly adding `--exclusive` to the `srun` command, things work as expected (2 CPUs allocated to each step):

$ sbatch -p test -N 1 -n 4 -c 2 --wrap="bash -c 'for i in {1..4}; do srun -n 1 -c 2 --exclusive sleep $((10*i)) & done; wait'"
Submitted batch job 38224793

$ sacct -j 38224793 -o jobid,jobname,alloccpus,state,exitcode
JobID           JobName  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- --------
38224793           wrap          8  COMPLETED      0:0
38224793.ba+      batch          8  COMPLETED      0:0
38224793.ex+     extern          8  COMPLETED      0:0
38224793.0        sleep          2  COMPLETED      0:0
38224793.1        sleep          2  COMPLETED      0:0
38224793.2        sleep          2  COMPLETED      0:0
38224793.3        sleep          2  COMPLETED      0:0

Did I interpret the documentation incorrectly? I was expecting that `--exclusive` would be the default behavior, and that it would be necessary to add `--overlap` to get the behavior observed for job 38224507, which seems to be the current default.

Thanks!
-- Kilian
Hi Kilian,

This seems like the same question as bug 11824 (which you submitted). In short:

Exclusive access to CPUs is the default.

But, --exact is not the default. By default, the step has access to all the CPUs in the job on all nodes allocated to the step. So, what you are seeing is expected.

So, --exclusive is *not* the default for steps since --exclusive implies --exact.

With sacct, can you check the start and end times of those job steps?

sacct -j 38224507 -o jobid,start,end

Because of the message (step creation temporarily disabled) I suspect that the steps did not all run concurrently. When I run your example job, my steps run one at a time.

I realize that this behavior is confusing (as we discussed in bug 11824), but Tim Wickberg gave the reasoning for this change in bug 10383 comment 63 (in short, to not break MPI anymore).

And yes, I still have a documentation bug open (bug 11310) to improve the documentation. I haven't gotten to it yet.
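If the goal is simply to let those four steps run side by side inside the allocation, adding --exact to each srun should be enough. A rough sketch (fixed 30-second sleeps so any overlap is easy to spot in sacct; the partition name is just taken from your example, and I haven't run this on your cluster):

$ sbatch -p test -N 1 -n 4 -c 2 --wrap="bash -c 'for i in {1..4}; do srun -n 1 -c 2 --exact sleep 30 & done; wait'"
$ sacct -j <jobid> -o jobid,alloccpus,start,end

Each step should then show AllocCPUS=2 and overlapping start/end times.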
Hi Marshall,

(In reply to Marshall Garey from comment #1)
> This seems like the same question as bug 11824 (which you submitted).

Ah, darn, I knew it sounded familiar, somehow. :)

> In short:
>
> Exclusive access to CPUs is the default.
>
> But, --exact is not the default. By default, the step has access to all the
> CPUs in the job on all nodes allocated to the step. So, what you are seeing
> is expected.
>
> So, --exclusive is *not* the default for steps since --exclusive implies
> --exact.

Yes, the documentation states that "The exclusive allocation of CPUs applies to job steps by default". So I guess that's the part that needs addressing.

> With sacct, can you check the start and end times of those job steps?
>
> sacct -j 38224507 -o jobid,start,end

$ sacct -j 38224507 -o jobid,start,end
JobID                      Start                 End
------------ ------------------- -------------------
38224507     2021-11-12T15:15:35 2021-11-12T15:15:46
38224507.ba+ 2021-11-12T15:15:35 2021-11-12T15:15:46
38224507.ex+ 2021-11-12T15:15:35 2021-11-12T15:15:46
38224507.0   2021-11-12T15:15:44 2021-11-12T15:15:44
38224507.1   2021-11-12T15:15:44 2021-11-12T15:15:44
38224507.2   2021-11-12T15:15:44 2021-11-12T15:15:45
38224507.3   2021-11-12T15:15:46 2021-11-12T15:15:46

> Because of the message (step creation temporarily disabled) I suspect that
> the steps did not all run concurrently. When I run your example job, my
> steps run one at a time.

A little hard to see from the timings above since the task was really short, but yes, the steps are running one after the other, which is why the user who encountered that behavior contacted us in the first place.

> I realize that this behavior is confusing (as we discussed in bug 11824),
> but Tim Wickberg gave the reasoning for this change in bug 10383 comment 63
> (in short, to not break MPI anymore).

No worries here, I get the need for --exact and --overlap. But since the documentation says that "The exclusive allocation of CPUs applies to job steps by default" and that --exclusive implies --exact, one would expect that srun runs with `--exclusive` and `--exact` by default.

> And yes, I still have a documentation bug open (bug 11310) to improve the
> documentation. I haven't gotten to it yet.

Got it, yes. Sorry for the duplicate report here. I guess that documentation part is still *very* confusing to me. :D

Happy to mark this bug as a duplicate of #11310. I'll add myself to that bug to follow updates.

Thanks!
-- Kilian
Sounds good. Closing this as a dup of 11310.

Just a suggestion: If you want --exact to be the default, you can use a CliFilterPlugin[1] and set exact to true. Something like this in C:

extern int cli_filter_p_setup_defaults(slurm_opt_t *opt, bool early)
{
	/* srun_opt is only set when the command being filtered is srun */
	if (opt->srun_opt)
		opt->srun_opt->exact = true;

	return SLURM_SUCCESS;
}

Or, you can do it in Lua. Or, you could set the SLURM_EXACT environment variable in the job script. And of course you can do this with any other parameter as well.

[1] https://slurm.schedmd.com/cli_filter_plugins.html

*** This ticket has been marked as a duplicate of ticket 11310 ***
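For completeness, a rough sketch of the environment-variable route mentioned above (assuming SLURM_EXACT=1 is honored by srun the same way --exact is on your Slurm version; resource numbers are just copied from the earlier example):

#!/bin/bash
#SBATCH -p test
#SBATCH -N 1
#SBATCH -n 4
#SBATCH -c 2

# Make every srun in this script behave as if --exact had been passed.
export SLURM_EXACT=1

for i in {1..4}; do
    srun -n 1 -c 2 sleep 30 &
done
wait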