Hi Bjørn,
I reproduced it but I had to partially fill one node, because otherwise my task placement is different and the test works.
So basically I needed to force this to happen:
> SLURM_JOB_CPUS_PER_NODE=1,8
I am not sure this is related to the new behavior of --exact.
Still looking into that but looks more like something that still happened before.
Have you tried specifying --mem per each step?
According to the changes described in the RELEASE_NOTES, you're right that you must use --exact in place of --exclusive; otherwise the step will try to use all the resources in the allocation. So this is the right way (possibly also adding --mem):
srun -n4 --exact my-binary A &
srun -n3 --exact my-binary B &
srun -n1 --exact my-binary C &
srun -n1 --exact my-binary D &
I also suggest using "-v" with srun to see exactly what is being requested, and running "scontrol show steps". I am doing my tests inside an salloc, which is more interactive than working within an sbatch.
I am still looking into it; let me know if you see any oddity or if --mem fixes anything.
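For instance, a minimal interactive reproduction of this kind could look like the following (the task counts, memory and the sleep payload are placeholders, not commands taken from this ticket):

---- snip ----
salloc --ntasks=9 --mem-per-cpu=1G --time=10

# inside the allocation, submit the steps with verbose output
srun -v -n4 --exact sleep 30 &
srun -v -n3 --exact sleep 30 &
srun -v -n1 --exact sleep 30 &
srun -v -n1 --exact sleep 30 &

# while the steps are running (or stuck pending), inspect them
scontrol show steps
wait
---- snip ----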
------------
-- By default, a step started with srun will be granted exclusive (or non-
overlapping) access to the resources assigned to that step. No other
parallel step will be allowed to run on the same resources at the same
time. This replaces one facet of the '--exclusive' option's behavior, but
does not imply the '--exact' option described below. To get the previous
default behavior - which allowed parallel steps to share all resources -
use the new srun '--overlap' option.
-- In conjunction to this non-overlapping step allocation behavior being the
new default, there is an additional new option for step management
'--exact', which will allow a step access to only those resources requested
by the step. This is the second half of the '--exclusive' behavior.
Otherwise, by default all non-gres resources on each node in the allocation
will be used by the step, making it so no other parallel step will have
access to those resources unless both steps have specified '--overlap'.
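
As an illustration of the difference (a sketch only; the program names and task counts are placeholders, not part of the release notes), a job script could exercise the three behaviors like this:

---- snip ----
#!/bin/bash
#SBATCH --ntasks=8

# New default: each step is handed all non-gres resources on its nodes,
# so these two steps will not run on the same node at the same time.
srun -n4 prog1 &
srun -n4 prog2 &
wait

# --exact: each step only gets the resources it requested, so both
# steps can run side by side within the allocation.
srun -n4 --exact prog1 &
srun -n4 --exact prog2 &
wait

# --overlap: steps may share resources with other steps that also
# specify --overlap (the pre-20.11 default behavior).
srun -n4 --overlap prog1 &
srun -n4 --overlap prog2 &
wait
---- snip ----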
------------

(In reply to Felip Moll from comment #2)
> I reproduced it but I had to partially fill one node, because otherwise my
> task placement is different and the test works.
> So basically I needed to force this to happen:
>
> > SLURM_JOB_CPUS_PER_NODE=1,8

Yes, it does depend on the task placement for the job.

> I am not sure this is related to the new behavior of --exact.
>
> Still looking into that but looks more like something that still happened
> before.

I'm quite sure we were able to run examples like that (with the --exact) in
version 19.05 and earlier and get all tasks to start at the same time.

> Have you tried specifying --mem per each step?

I hadn't before, but now I've tried, and it did not help. In one case, it
actually delayed the start of steps even more than without --mem. Also, using
--mem without specifying the explicit distribution of tasks over nodes doesn't
seem like a good idea (our jobs are typically submitted with --mem-per-cpu).

Here is an excerpt of a run with "-v". It landed on five nodes:

SLURM_JOB_NODELIST=c5-[37,39,42-44]
SLURM_TASKS_PER_NODE=1(x3),3(x2)

# output from
# echo Submitting parallel steps, with --exact:
# srun -v -n4 --exact my-binary A &
# srun -v -n3 --exact my-binary B &
# srun -v -n1 --exact my-binary C &
# srun -v -n1 --exact my-binary D &

Submitting parallel steps, with --exact:
Done submitting. Waiting...
srun: Warning: can't run 4 processes on 5 nodes, setting nnodes to 4
srun: defined options
srun: -------------------- --------------------
srun: (null) : c5-[37,39,42-44]
srun: exact : set
srun: jobid : 4565098
srun: job-name : srun_parallel_from_man.sm
srun: mem-per-cpu : 1G
srun: nodes : 4
srun: ntasks : 4
srun: verbose : 1
srun: -------------------- --------------------
srun: end of defined options
srun: jobid 4565098: nodes(5):`c5-[37,39,42-44]', cpu counts: 1(x3),3(x2)
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: Warning: can't run 1 processes on 5 nodes, setting nnodes to 1
srun: defined options
srun: -------------------- --------------------
srun: (null) : c5-[37,39,42-44]
srun: Warning: can't run 3 processes on 5 nodes, setting nnodes to 3
srun: exact : set
srun: defined options
srun: jobid : 4565098
srun: -------------------- --------------------
srun: job-name : srun_parallel_from_man.sm
srun: (null) : c5-[37,39,42-44]
srun: mem-per-cpu : 1G
srun: exact : set
srun: nodes : 1
srun: jobid : 4565098
srun: ntasks : 1
srun: job-name : srun_parallel_from_man.sm
srun: verbose : 1
srun: mem-per-cpu : 1G
srun: -------------------- --------------------
srun: nodes : 3
srun: end of defined options
srun: ntasks : 3
srun: verbose : 1
srun: -------------------- --------------------
srun: end of defined options
srun: launching StepId=4565098.4 on host c5-37, 1 tasks: 0
srun: jobid 4565098: nodes(5):`c5-[37,39,42-44]', cpu counts: 1(x3),3(x2)
srun: launching StepId=4565098.4 on host c5-39, 1 tasks: 1
srun: launching StepId=4565098.4 on host c5-42, 1 tasks: 2
srun: jobid 4565098: nodes(5):`c5-[37,39,42-44]', cpu counts: 1(x3),3(x2)
srun: launching StepId=4565098.4 on host c5-43, 1 tasks: 3
srun: route/default: init: route default plugin loaded
srun: Warning: can't run 1 processes on 5 nodes, setting nnodes to 1
srun: defined options
srun: -------------------- --------------------
srun: (null) : c5-[37,39,42-44]
srun: exact : set
srun: jobid : 4565098
srun: job-name : srun_parallel_from_man.sm
srun: mem-per-cpu : 1G
srun: nodes : 1
srun: ntasks : 1
srun: verbose : 1
srun: -------------------- --------------------
srun: end of defined options
srun: jobid 4565098: nodes(5):`c5-[37,39,42-44]', cpu counts: 1(x3),3(x2)
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: launching StepId=4565098.5 on host c5-44, 1 tasks: 0
srun: route/default: init: route default plugin loaded
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: launching StepId=4565098.6 on host c5-43, 1 tasks: 0
srun: route/default: init: route default plugin loaded
srun: launch/slurm: _task_start: Node c5-42, 1 tasks started
srun: launch/slurm: _task_start: Node c5-43, 1 tasks started
srun: launch/slurm: _task_start: Node c5-37, 1 tasks started
srun: launch/slurm: _task_start: Node c5-39, 1 tasks started
srun: launch/slurm: _task_start: Node c5-44, 1 tasks started
srun: launch/slurm: _task_start: Node c5-43, 1 tasks started
2021-12-09T10:45:20 - Arg: A - Step ID: 4 - Host: c5-42 - CPUs on node: 1
2021-12-09T10:45:20 - Arg: A - Step ID: 4 - Host: c5-37 - CPUs on node: 1
2021-12-09T10:45:20 - Arg: A - Step ID: 4 - Host: c5-39 - CPUs on node: 1
2021-12-09T10:45:20 - Arg: C - Step ID: 5 - Host: c5-44 - CPUs on node: 1
2021-12-09T10:45:20 - Arg: A - Step ID: 4 - Host: c5-43 - CPUs on node: 1
2021-12-09T10:45:20 - Arg: D - Step ID: 6 - Host: c5-43 - CPUs on node: 1
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.4 (status=0x0000).
srun: launch/slurm: _task_finish: c5-42: task 2: Completed
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.4 (status=0x0000).
srun: launch/slurm: _task_finish: c5-39: task 1: Completed
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.5 (status=0x0000).
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.4 (status=0x0000).
srun: launch/slurm: _task_finish: c5-44: task 0: Completed
srun: launch/slurm: _task_finish: c5-37: task 0: Completed
srun: Job 4565098 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.4 (status=0x0000).
srun: launch/slurm: _task_finish: c5-43: task 3: Completed
srun: Job 4565098 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 4565098
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.6 (status=0x0000).
srun: launch/slurm: _task_finish: c5-43: task 0: Completed
srun: launching StepId=4565098.7 on host c5-37, 1 tasks: 0
srun: launching StepId=4565098.7 on host c5-39, 1 tasks: 1
srun: launching StepId=4565098.7 on host c5-42, 1 tasks: 2
srun: route/default: init: route default plugin loaded
srun: launch/slurm: _task_start: Node c5-42, 1 tasks started
srun: launch/slurm: _task_start: Node c5-37, 1 tasks started
srun: launch/slurm: _task_start: Node c5-39, 1 tasks started
2021-12-09T10:45:51 - Arg: B - Step ID: 7 - Host: c5-37 - CPUs on node: 1
2021-12-09T10:45:51 - Arg: B - Step ID: 7 - Host: c5-42 - CPUs on node: 1
2021-12-09T10:45:51 - Arg: B - Step ID: 7 - Host: c5-39 - CPUs on node: 1
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.7 (status=0x0000).
srun: launch/slurm: _task_finish: c5-42: task 2: Completed
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.7 (status=0x0000).
srun: launch/slurm: _task_finish: c5-39: task 1: Completed
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.7 (status=0x0000).
srun: launch/slurm: _task_finish: c5-37: task 0: Completed

# output from
# echo Submitting parallel steps, with --exact and --mem:
# srun -v -n4 --exact --mem=1G my-binary A &
# srun -v -n3 --exact --mem=1G my-binary B &
# srun -v -n1 --exact --mem=1G my-binary C &
# srun -v -n1 --exact --mem=1G my-binary D &

Submitting parallel steps, with --exact and --mem:
Done submitting. Waiting...
srun: Warning: can't run 3 processes on 5 nodes, setting nnodes to 3
srun: defined options
srun: -------------------- --------------------
srun: (null) : c5-[37,39,42-44]
srun: exact : set
srun: jobid : 4565098
srun: job-name : srun_parallel_from_man.sm
srun: mem : 1G
srun: nodes : 3
srun: ntasks : 3
srun: verbose : 1
srun: -------------------- --------------------
srun: end of defined options
srun: jobid 4565098: nodes(5):`c5-[37,39,42-44]', cpu counts: 1(x3),3(x2)
srun: Warning: can't run 1 processes on 5 nodes, setting nnodes to 1
srun: defined options
srun: -------------------- --------------------
srun: (null) : c5-[37,39,42-44]
srun: exact : set
srun: jobid : 4565098
srun: job-name : srun_parallel_from_man.sm
srun: mem : 1G
srun: nodes : 1
srun: ntasks : 1
srun: verbose : 1
srun: -------------------- --------------------
srun: Warning: can't run 4 processes on 5 nodes, setting nnodes to 4
srun: end of defined options
srun: defined options
srun: -------------------- --------------------
srun: (null) : c5-[37,39,42-44]
srun: exact : set
srun: jobid : 4565098
srun: Warning: can't run 1 processes on 5 nodes, setting nnodes to 1
srun: job-name : srun_parallel_from_man.sm
srun: defined options
srun: jobid 4565098: nodes(5):`c5-[37,39,42-44]', cpu counts: 1(x3),3(x2)
srun: mem : 1G
srun: -------------------- --------------------
srun: nodes : 4
srun: (null) : c5-[37,39,42-44]
srun: ntasks : 4
srun: exact : set
srun: verbose : 1
srun: jobid : 4565098
srun: -------------------- --------------------
srun: job-name : srun_parallel_from_man.sm
srun: end of defined options
srun: mem : 1G
srun: nodes : 1
srun: ntasks : 1
srun: verbose : 1
srun: -------------------- --------------------
srun: end of defined options
srun: jobid 4565098: nodes(5):`c5-[37,39,42-44]', cpu counts: 1(x3),3(x2)
srun: jobid 4565098: nodes(5):`c5-[37,39,42-44]', cpu counts: 1(x3),3(x2)
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: launching StepId=4565098.12 on host c5-37, 1 tasks: 0
srun: launching StepId=4565098.12 on host c5-39, 1 tasks: 1
srun: launching StepId=4565098.12 on host c5-42, 1 tasks: 2
srun: route/default: init: route default plugin loaded
srun: launching StepId=4565098.13 on host c5-43, 1 tasks: 0
srun: route/default: init: route default plugin loaded
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: launching StepId=4565098.14 on host c5-44, 1 tasks: 0
srun: route/default: init: route default plugin loaded
srun: launch/slurm: _task_start: Node c5-37, 1 tasks started
srun: launch/slurm: _task_start: Node c5-42, 1 tasks started
srun: launch/slurm: _task_start: Node c5-39, 1 tasks started
srun: launch/slurm: _task_start: Node c5-43, 1 tasks started
srun: launch/slurm: _task_start: Node c5-44, 1 tasks started
2021-12-09T10:46:55 - Arg: B - Step ID: 12 - Host: c5-37 - CPUs on node: 1
2021-12-09T10:46:55 - Arg: C - Step ID: 13 - Host: c5-43 - CPUs on node: 1
2021-12-09T10:46:55 - Arg: B - Step ID: 12 - Host: c5-42 - CPUs on node: 1
2021-12-09T10:46:55 - Arg: D - Step ID: 14 - Host: c5-44 - CPUs on node: 1
2021-12-09T10:46:55 - Arg: B - Step ID: 12 - Host: c5-39 - CPUs on node: 1
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.13 (status=0x0000).
srun: launch/slurm: _task_finish: c5-43: task 0: Completed
srun: Job 4565098 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.12 (status=0x0000).
srun: launch/slurm: _task_finish: c5-37: task 0: Completed
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.12 (status=0x0000).
srun: launch/slurm: _task_finish: c5-42: task 2: Completed
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.12 (status=0x0000).
srun: launch/slurm: _task_finish: c5-39: task 1: Completed
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.14 (status=0x0000).
srun: launch/slurm: _task_finish: c5-44: task 0: Completed
srun: Job 4565098 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 4565098
srun: launch/slurm: launch_p_step_launch: CpuBindType=(null type)
srun: launching StepId=4565098.15 on host c5-37, 1 tasks: 0
srun: launching StepId=4565098.15 on host c5-39, 1 tasks: 1
srun: launching StepId=4565098.15 on host c5-42, 1 tasks: 2
srun: launching StepId=4565098.15 on host c5-43, 1 tasks: 3
srun: route/default: init: route default plugin loaded
srun: launch/slurm: _task_start: Node c5-42, 1 tasks started
srun: launch/slurm: _task_start: Node c5-37, 1 tasks started
srun: launch/slurm: _task_start: Node c5-39, 1 tasks started
srun: launch/slurm: _task_start: Node c5-43, 1 tasks started
2021-12-09T10:47:26 - Arg: A - Step ID: 15 - Host: c5-37 - CPUs on node: 1
2021-12-09T10:47:26 - Arg: A - Step ID: 15 - Host: c5-39 - CPUs on node: 1
2021-12-09T10:47:26 - Arg: A - Step ID: 15 - Host: c5-42 - CPUs on node: 1
2021-12-09T10:47:26 - Arg: A - Step ID: 15 - Host: c5-43 - CPUs on node: 1
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.15 (status=0x0000).
srun: launch/slurm: _task_finish: c5-37: task 0: Completed
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.15 (status=0x0000).
srun: launch/slurm: _task_finish: c5-42: task 2: Completed
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.15 (status=0x0000).
srun: launch/slurm: _task_finish: c5-39: task 1: Completed
srun: launch/slurm: _task_finish: Received task exit notification for 1 task of StepId=4565098.15 (status=0x0000).
srun: launch/slurm: _task_finish: c5-43: task 3: Completed

Apart from the "srun: mem-per-cpu : 1G" versus "srun: mem : 1G" and the
ordering of the steps, I see no substantial difference. The last step to start
seems to refuse to start until it can start a single task per node.

------------

(In reply to Bjørn-Helge Mevik from comment #4)
> (In reply to Felip Moll from comment #2)
>
> > I am not sure this is related to the new behavior of --exact.
> >
> > Still looking into that but looks more like something that still happened
> > before.
>
> I'm quite sure we were able to run examples like that (with the --exact) in
> version 19.05 and earlier and get all tasks to start at the same time.
Sorry, that should have been "(with the --exclusive)".

------------

(In reply to Bjørn-Helge Mevik from comment #5)
> (In reply to Bjørn-Helge Mevik from comment #4)
> > (In reply to Felip Moll from comment #2)
> >
> > > I am not sure this is related to the new behavior of --exact.
> > >
> > > Still looking into that but looks more like something that still happened
> > > before.
> >
> > I'm quite sure we were able to run examples like that (with the --exact) in
> > version 19.05 and earlier and get all tasks to start at the same time.
>
> Sorry, that should have been "(with the --exclusive)".

Hi Bjørn,

Can you repeat the test and while the issue is appearing and the job
running, do:

'scontrol show jobs'
'scontrol show nodes'
'scontrol show steps'

and upload/paste it here?

Thanks

------------

(In reply to Felip Moll from comment #6)

Hi, Felip,

> Can you repeat the test and while the issue is appearing and the job
> running, do:
>
> 'scontrol show jobs'
> 'scontrol show nodes'
> 'scontrol show steps'
>
> and upload/paste it here?

This is on a production cluster, and I don't feel comfortable with uploading
info about every running job on the cluster to a public place like this. Is
there somewhere I can send the output instead?

------------

Ok Bjørn,

Please defer the tests for now. I will try to work more on my testbed, since I
have more ideas, and will get back to you soon.

------------

Bjørn,

I have finally figured out the issue, and it turns out to be expected behavior.

If you are inside an allocation, srun takes its default values from the
parameters of the allocation. In your case, since the example you showed needs
two nodes, the minimum number of nodes used by default for further sruns will
be 2. Setting --nodes=1-$SLURM_JOB_NUM_NODES makes it work because it lowers
the minimum number of nodes to 1; without it, the minimum would be two.

You can verify this by enabling the STEPS debugflag in slurmctld and looking
for a line similar to this one:

[2021-12-14T20:24:59.862] STEPS: _pick_step_nodes: step pick 2-2 nodes, avail:node2 idle: picked:NONE

This was noticed in bug 11589 too, and a solution was introduced in 21.08: pass
"--distribution=pack" to srun (I have checked that it works). You can also set
SelectTypeParameters=CR_PACK_NODES to make this the default. See the slurm.conf
man page for 21.08, or bug 11589 (commit e942cadb345), for more details.

In 20.11 you can use the workaround you found, --nodes=1-$SLURM_JOB_NUM_NODES.
Another option is to cherry-pick the patch from bug 11589 and apply it to
20.11, in case you can patch your Slurm installation.

In 20.02 this didn't happen because the exclusive flag was flawed; after the
fixes we made in 20.11, this is the new behavior. We are working to document
this situation better, including your case, in bug 11310.

Does it make sense?

------------

Ok, thanks for the info!

We are planning to upgrade to 21.08 in the near future, so in the meantime,
I'll simply document the workaround --nodes=1-$SLURM_JOB_NUM_NODES.

------------

(In reply to Bjørn-Helge Mevik from comment #18)
> Ok, thanks for the info!
>
> We are planning to upgrade to 21.08 in the near future, so in the meantime,
> I'll simply document the workaround --nodes=1-$SLURM_JOB_NUM_NODES.

Ok Bjørn,

Please reopen this bug when you upgrade if you still have issues.

Thanks for your patience.
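For reference, a sketch (not taken from this ticket) of what the 21.08 solution described above could look like, applied to the step submissions used in this report:

---- snip ----
srun -n4 --exact --distribution=pack my-binary A &
srun -n3 --exact --distribution=pack my-binary B &
srun -n1 --exact --distribution=pack my-binary C &
srun -n1 --exact --distribution=pack my-binary D &
wait
---- snip ----

Alternatively, packing can be made the site-wide default by adding CR_PACK_NODES to SelectTypeParameters in slurm.conf; CR_Core_Memory below is only an example base value, keep whatever the site already uses:

SelectTypeParameters=CR_Core_Memory,CR_PACK_NODES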
------------

Created attachment 22402
Main slurm config file.

After upgrading from 19.05.7 to 20.11.8, we've discovered that the way that
we've recommended for running tasks in parallel with srun does not work any
more. In 19.05.7 and earlier, we used "srun --exclusive", based on an example
in the srun man page:

> cat my.script
#!/bin/bash
srun --exclusive -n4 prog1 &
srun --exclusive -n3 prog2 &
srun --exclusive -n1 prog3 &
srun --exclusive -n1 prog4 &
wait

As I understand, the behaviour of srun changed with 20.11.x, and now the
example in the man page says

$ cat my.script
#!/bin/bash
srun -n4 prog1 &
srun -n3 prog2 &
srun -n1 prog3 &
srun -n1 prog4 &
wait

However, in some of our partitions, we hand out cpu and memory, not whole
nodes, and as I understand it, the default for srun is now that each run gets
access to the whole job allocation, which means that only one srun will run at
a time on each node.

We've verified this with the following job script:

---- snip ----
#!/bin/bash
#SBATCH -A nn9999k --time=10 --mem-per-cpu=1G
#SBATCH -o out/%x-%j.out
#SBATCH --ntasks=9

echo Starting.
echo
env | grep SLURM | sort
echo
echo Submitting parallel steps, default:
srun -n4 my-binary A &
srun -n3 my-binary B &
srun -n1 my-binary C &
srun -n1 my-binary D &
echo Done submitting. Waiting...
wait
echo
echo Submitting parallel steps, with --exact:
srun -n4 --exact my-binary A &
srun -n3 --exact my-binary B &
srun -n1 --exact my-binary C &
srun -n1 --exact my-binary D &
echo Done submitting. Waiting...
wait
---- snip ----

"my-binary" is just a small script printing the date, the command line
argument, the $SLURM_STEP_ID, hostname and $SLURM_CPUS_ON_NODE, and then
sleeping a little:

---- snip ----
#!/bin/bash
echo $(date +%FT%T) - Arg: $1 - Step ID: $SLURM_STEP_ID - Host: $(hostname) - CPUs on node: $SLURM_CPUS_ON_NODE
sleep 30
---- snip ----

With this, the (relevant) output is

---- snip ----
SLURM_JOB_CPUS_PER_NODE=1,8
[...]
SLURM_JOB_NODELIST=c5-[1,5]
SLURM_JOB_NUM_NODES=2
[...]
SLURM_TASKS_PER_NODE=1,8
[...]

Submitting parallel steps, default:
Done submitting. Waiting...
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
2021-11-23T09:42:02 - Arg: A - Step ID: 0 - Host: c5-1 - CPUs on node: 1
2021-11-23T09:42:02 - Arg: A - Step ID: 0 - Host: c5-5 - CPUs on node: 8
2021-11-23T09:42:02 - Arg: A - Step ID: 0 - Host: c5-5 - CPUs on node: 8
2021-11-23T09:42:02 - Arg: A - Step ID: 0 - Host: c5-5 - CPUs on node: 8
srun: Job 4448178 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 4448178 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 4448178 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 4448178
srun: Step created for job 4448178
2021-11-23T09:42:32 - Arg: C - Step ID: 2 - Host: c5-5 - CPUs on node: 8
2021-11-23T09:42:32 - Arg: D - Step ID: 1 - Host: c5-1 - CPUs on node: 1
srun: Job 4448178 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 4448178 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 4448178
2021-11-23T09:43:03 - Arg: B - Step ID: 3 - Host: c5-5 - CPUs on node: 8
2021-11-23T09:43:03 - Arg: B - Step ID: 3 - Host: c5-1 - CPUs on node: 1
2021-11-23T09:43:03 - Arg: B - Step ID: 3 - Host: c5-5 - CPUs on node: 8

Submitting parallel steps, with --exact:
Done submitting. Waiting...
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
2021-11-23T09:43:33 - Arg: A - Step ID: 4 - Host: c5-1 - CPUs on node: 1
2021-11-23T09:43:33 - Arg: A - Step ID: 4 - Host: c5-5 - CPUs on node: 3
2021-11-23T09:43:33 - Arg: C - Step ID: 5 - Host: c5-5 - CPUs on node: 1
2021-11-23T09:43:33 - Arg: D - Step ID: 6 - Host: c5-5 - CPUs on node: 1
2021-11-23T09:43:33 - Arg: A - Step ID: 4 - Host: c5-5 - CPUs on node: 3
2021-11-23T09:43:33 - Arg: A - Step ID: 4 - Host: c5-5 - CPUs on node: 3
srun: Job 4448178 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 4448178 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 4448178 step creation still disabled, retrying (Requested nodes are busy)
srun: Step created for job 4448178
2021-11-23T09:44:04 - Arg: B - Step ID: 7 - Host: c5-5 - CPUs on node: 2
2021-11-23T09:44:04 - Arg: B - Step ID: 7 - Host: c5-5 - CPUs on node: 2
2021-11-23T09:44:04 - Arg: B - Step ID: 7 - Host: c5-1 - CPUs on node: 1

Done.
---- snip ----

As can be seen, by default, each task of each step sees all the cpus on the
node it runs on, so only one step can run on each node at the same time.

Adding --exact to the srun command lines fixes that particular problem, but
still the last step (7, argument "B") refuses to use the three available CPUs
on c5-5, and instead waits until it can run on two nodes.

The only reliable way to hand out cpus to parallel sruns that I can find is
using "srun --exact --nodes=1-$SLURM_JOB_NUM_NODES ...", like this:

srun -n4 --exact --nodes=1-$SLURM_JOB_NUM_NODES my-binary A &
srun -n3 --exact --nodes=1-$SLURM_JOB_NUM_NODES my-binary B &
srun -n1 --exact --nodes=1-$SLURM_JOB_NUM_NODES my-binary C &
srun -n1 --exact --nodes=1-$SLURM_JOB_NUM_NODES my-binary D &

From a different but similar run (two nodes, 1 + 8 cpus) to the one above,
this gave:

---- snip ----
Submitting parallel steps, with --exact and --nodes range:
Done submitting. Waiting...
2021-11-25T10:25:36 - Arg: D - Step ID: 8 - Host: c11-7 - CPUs on node: 1
2021-11-25T10:25:36 - Arg: C - Step ID: 9 - Host: c11-60 - CPUs on node: 1
2021-11-25T10:25:36 - Arg: A - Step ID: 10 - Host: c11-7 - CPUs on node: 4
2021-11-25T10:25:36 - Arg: B - Step ID: 11 - Host: c11-7 - CPUs on node: 3
2021-11-25T10:25:36 - Arg: A - Step ID: 10 - Host: c11-7 - CPUs on node: 4
2021-11-25T10:25:36 - Arg: B - Step ID: 11 - Host: c11-7 - CPUs on node: 3
2021-11-25T10:25:36 - Arg: A - Step ID: 10 - Host: c11-7 - CPUs on node: 4
2021-11-25T10:25:36 - Arg: B - Step ID: 11 - Host: c11-7 - CPUs on node: 3
2021-11-25T10:25:36 - Arg: A - Step ID: 10 - Host: c11-7 - CPUs on node: 4
Done.
---- snip ----

(This also gets rid of the warnings about number of nodes.)

Is this how one is supposed to run parallel sruns when handing out memory and
cpus?

Regards,
Bjørn-Helge Mevik