Ticket 4996

Summary: Permit jobs to be restricted to a certain number of cores per socket
Product: Slurm Reporter: Christopher Samuel <chris>
Component: SchedulingAssignee: Dominik Bartkiewicz <bart>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: bart
Version: 17.11.5   
Hardware: Linux   
OS: Linux   
Site: Swinburne

Description Christopher Samuel 2018-03-27 19:16:10 MDT
Hi there,

The bulk of our cluster is comprised of dual socket, 18 core Skylake nodes with dual GPUs.  As we reserve 4 cores per node for GPU jobs only we effectively have 16 cores per socket available for non-GPU jobs.

We would really like a way (ideally in the Lua submit filter) to restrict jobs to 16 cores per socket. We would then apply this to all non-GPU jobs, rather than letting a job use all 18 cores on one socket and 14 on the other, which leaves all the GPU-specific cores on one socket.

I can see that you can request nodes with a certain minimum number of cores per socket, and you can say you want a certain number of tasks per socket, but I can't see a way to say you want a certain number of cores per socket (regardless of the number of tasks).

I guess the easiest way to express it would be for a request for "-c 32" to be required to allocate 16 cores on each socket, rather than 18 on one and 14 on the other.
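To make the desired behaviour concrete, here is a toy model in plain shell arithmetic (not Slurm code; the `split` function name is hypothetical): it allocates a total core request across two sockets, filling the first socket up to a limit before spilling onto the second, which shows the difference between default 18-core packing and the requested 16-core cap.

```shell
#!/bin/sh
# Toy model, not Slurm code: allocate "total" cores across two
# sockets, filling the first socket up to "limit" cores before
# spilling the remainder onto the second socket.
split() {
    total=$1; limit=$2
    if [ "$total" -lt "$limit" ]; then s0=$total; else s0=$limit; fi
    s1=$(( total - s0 ))
    echo "$s0 $s1"
}

split 32 18   # default packing on an 18-core socket: 18 14
split 32 16   # with a 16-core-per-socket cap:        16 16
```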

Is there something I'm missing, or is this something that could be added relatively easily?

All the best,
Chris
Comment 1 Christopher Samuel 2018-04-03 22:58:47 MDT
Hi there,

This has become more important, as I've realised that I misread the manual page: the --ntasks-per-socket option only sets the maximum number of tasks per socket, not the exact number.

So the work I've been doing to pack jobs onto 16 cores per node for a particular partition is not going to work: my testing shows that when the first node in the cluster is partially filled, the job puts some of its tasks there rather than running at 16 tasks per node on the many idle nodes.

LLN doesn't make sense for this, as we are happy for jobs smaller than 16 cores to be packed onto otherwise busy nodes (in fact we would prefer that, to keep other nodes free for large parallel workloads).

Any ideas?

All the best,
Chris
Comment 2 Dominik Bartkiewicz 2018-04-04 05:36:22 MDT
Hi

Sorry for not answering sooner. 
I hope bug 4985 solution will also solve this case.
We should fix bug 4995 soon.

Dominik
Comment 3 Christopher Samuel 2018-04-04 06:59:20 MDT
(In reply to Dominik Bartkiewicz from comment #2)

> Hi

Hi Dominik,

> Sorry for not answering sooner. 
> I hope bug 4985 solution will also solve this case.

Is that a typo for 4995?

> We should fix bug 4995 soon.

I'm wondering if it will, because --cores-per-socket is
for selecting hardware with at least that number of cores,
there is no guarantee that those cores are available on the
one socket for a job though is there?

All the best,
Chris
Comment 4 Dominik Bartkiewicz 2018-04-04 07:57:56 MDT
Hi

Yes, that was a typo.
It looks like the extra logic that covers bug 4995 will give us a chance to solve this too (as a combination of MaxCPUsPerNode and --cores-per-socket).

Dominik
Comment 5 Christopher Samuel 2018-04-04 18:23:06 MDT
On 04/04/18 23:57, bugs@schedmd.com wrote:

> Yes, that was a typo.

Not a worry, I do that a lot.. :-/

> It looks like the extra logic that covers bug 4995 will give us a chance to
> solve this too (as a combination of MaxCPUsPerNode and --cores-per-socket).

Great, that sounds really promising. Much obliged.

All the best,
Chris
Comment 7 Dominik Bartkiewicz 2018-06-29 06:13:44 MDT
Hi
The current Slurm version allows a similar configuration to be created, but with some limitations:
slurm.conf:
...
PartitionName=gpu    Nodes=test[01-09]  Default=no State=up  Priority=1 DefMemPerCPU=100 MinNodes=1 DefaultTime=60 MaxTime=INFINITE  MaxCPUsPerNode=2
PartitionName=nogpu  Nodes=test[01-09]  Default=no State=up  Priority=1 DefMemPerCPU=100 MinNodes=1 DefaultTime=60 MaxTime=INFINITE  MaxCPUsPerNode=6
...

srun -n 6 --cores-per-socket=3 -N 1 --mem=100  -p nogpu hostname
slurmctld.log:
...
slurmctld: ====================
slurmctld: job_id:30 nhosts:1 ncpus:6 node_req:1 nodes=test01
slurmctld: Node[0]:
slurmctld:   Mem(MB):100:0  Sockets:2  Cores:4  CPUs:6:0
slurmctld:   Socket[0] Core[0] is allocated
slurmctld:   Socket[0] Core[1] is allocated
slurmctld:   Socket[0] Core[2] is allocated
slurmctld:   Socket[1] Core[0] is allocated
slurmctld:   Socket[1] Core[1] is allocated
slurmctld:   Socket[1] Core[2] is allocated
slurmctld: --------------------
...

srun -n 3 --cores-per-socket=3 -N 1 --mem=100  -p nogpu sleep 100&
srun -n 3 --cores-per-socket=3 -N 1 --mem=100  -p nogpu sleep 100&
slurmctld.log:
...
slurmctld: ====================
slurmctld: job_id:32 nhosts:1 ncpus:3 node_req:1 nodes=test01
slurmctld: Node[0]:
slurmctld:   Mem(MB):100:0  Sockets:2  Cores:4  CPUs:3:0
slurmctld:   Socket[0] Core[0] is allocated
slurmctld:   Socket[0] Core[1] is allocated
slurmctld:   Socket[0] Core[2] is allocated
slurmctld: --------------------
...
slurmctld: ====================
slurmctld: job_id:33 nhosts:1 ncpus:3 node_req:1 nodes=test01
slurmctld: Node[0]:
slurmctld:   Mem(MB):100:0  Sockets:2  Cores:4  CPUs:3:0
slurmctld:   Socket[1] Core[0] is allocated
slurmctld:   Socket[1] Core[1] is allocated
slurmctld:   Socket[1] Core[2] is allocated
slurmctld: --------------------
...


This should work fine as long as the size of each job is a multiple of an arbitrary "minimum job size", and every job uses "--cores-per-socket" so that jobs bigger than the "minimum job size" are split across the two sockets.


Dominik
Comment 8 Dominik Bartkiewicz 2018-07-09 04:55:43 MDT
Hi

Did you have a chance to test this?

Dominik
Comment 9 Christopher Samuel 2018-07-10 17:59:39 MDT
On 09/07/18 20:55, bugs@schedmd.com wrote:

> Did you have a chance to test this?

Sorry I completely missed your reply that you are prodding me on! :-(

I'll try and look at it soon, very busy here at the moment, especially
with Danny coming to do training here next week (which will cover all
my work days).

cheers,
Chris
Comment 10 Christopher Samuel 2018-08-10 05:00:53 MDT
(In reply to Christopher Samuel from comment #9)

> On 09/07/18 20:55, bugs@schedmd.com wrote:
> 
> > Did you have a chance to test this?
> 
> Sorry I completely missed your reply that you are prodding me on! :-(
> 
> I'll try and look at it soon, very busy here at the moment, especially
> with Danny coming to do training here next week (which will cover all
> my work days).

It's looking good, thank you. I made some changes to the submit filter to do:

                -- For jobs that fit on a single socket request that
                if ( job_desc.min_cpus < 17 ) then
                        -- set cores per socket to what it requests
                        job_desc.cores_per_socket=job_desc.min_cpus
                        slurm.log_info("slurm_job_submit (lua): Setting cores-per-socket to be %d when job requested %d for user %d", job_desc.cores_per_socket, job_desc.min_cpus, submit_uid )
                end
                -- For jobs that will fit on a whole node request 16 cores per socket
                if ( job_desc.min_cpus > 31 ) then
                        job_desc.cores_per_socket=16
                        slurm.log_info("slurm_job_submit (lua): Setting cores-per-socket to be %d when job requested %d for user %d", job_desc.cores_per_socket, job_desc.min_cpus, submit_uid )
                end


and that does indeed seem to do what we want.
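For reference, the filter's decision logic can be restated as a standalone toy function (plain shell, hypothetical helper name; the real logic is the Lua above): jobs under 17 CPUs get cores_per_socket set to their CPU count, jobs over 31 CPUs get 16, and jobs in between are left untouched.

```shell
#!/bin/sh
# Toy restatement of the Lua submit-filter thresholds above
# (hypothetical helper name, not part of Slurm).
cores_per_socket_for() {
    min_cpus=$1
    if [ "$min_cpus" -lt 17 ]; then
        echo "$min_cpus"        # fits on one socket: pin it there
    elif [ "$min_cpus" -gt 31 ]; then
        echo 16                 # whole-node job: 16 cores per socket
    else
        echo unset              # 17-31 CPUs: the filter leaves these alone
    fi
}

cores_per_socket_for 8    # -> 8
cores_per_socket_for 32   # -> 16
cores_per_socket_for 24   # -> unset
```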

Sorry for taking so long to reply on this.

All the best!
Chris
Comment 11 Dominik Bartkiewicz 2018-08-10 05:27:54 MDT
Hi

I am glad to hear that.
I would like to close this as resolved/infogiven.

Dominik
Comment 12 Christopher Samuel 2018-08-10 05:40:49 MDT
On Friday, 10 August 2018 9:27:54 PM AEST bugs@schedmd.com wrote:

> Hi

Hi Dominik,

> I am glad to hear that.
> I would like to close this as resolved/infogiven.

Not a problem, sounds good to me.

All the best,
Chris
Comment 13 Dominik Bartkiewicz 2018-08-10 05:47:22 MDT
Closing as resolved/infogiven.
As always, please feel free to reopen if you have
additional questions.

Dominik