Ticket 14624

Summary: requesting the total number of GRES
Product: Slurm Reporter: yitp.support
Component: Scheduling    Assignee: Scott Hilton <scott>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: --- CC: cinek
Version: 21.08.8   
Hardware: Linux   
OS: Linux   
Site: Kyoto University

Description yitp.support 2022-07-27 01:27:43 MDT
Customer wants to divide a physical node into several virtual nodes and then use each of them exclusively.
Currently I'm thinking of using a GRES with a "Cores=" specification.

For example,
#####################
in gres.conf:
NodeName=node[01-04] AutoDetect=off Name=vnode File=/tmp/dummy0 Cores=0-3
NodeName=node[01-04] AutoDetect=off Name=vnode File=/tmp/dummy1 Cores=4-7
NodeName=node[01-04] AutoDetect=off Name=vnode File=/tmp/dummy2 Cores=8-11
NodeName=node[01-04] AutoDetect=off Name=vnode File=/tmp/dummy3 Cores=12-15

in slurm.conf:
GresTypes=vnode

then,
$ sbatch --gres=vnode:2 --gres-flags=enforce-binding --wrap "sleep 100"
#####################

But "--gres" doesn't mean the total number of GRES, but the number of GRES on each node.
I'd like to specify the "total" in order to minimize the number of the nodes.

On the other hand, the "--gpus" option is available for the GPU resource, and with it we can request the total number of GPUs.
Is it possible to define a dummy GPU resource and use it for our purpose?

Or is there any other way that satisfies the customer's request?
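
For reference, a minimal sketch of the difference between the two request styles (the --gpus form assumes real GPU GRES are defined; this is an illustration, not a tested recommendation):
#####################
# per-node request: 2 vnode GRES on every allocated node
$ sbatch --gres=vnode:2 --wrap "sleep 100"
# total request: 2 GPUs across the whole job (only the gpu GRES has such a total-count option)
$ sbatch --gpus=2 --wrap "sleep 100"
#####################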
Comment 3 Scott Hilton 2022-07-28 16:51:05 MDT
Cores does not force separation of jobs. It just makes a GRES prefer to be bound to those cores. I don't think this is what you are looking for.

Why does the customer want to divide a physical node into virtual nodes? What is the use case? Perhaps I can suggest a better workaround if I know exactly what they are trying to accomplish.
Comment 4 yitp.support 2022-07-28 17:41:29 MDT
Currently the customer is using nodes exclusively.
But recently the number of cores per CPU has been growing, and not every user needs that many cores for a job.
So the customer is thinking of treating each NUMA domain as a virtual node and assigning it to a job exclusively.
This way, users get the full memory bandwidth and the penalty of inter-NUMA memory access is minimized.
At the same time, the customer also wants small jobs to share the remaining resources in order to increase system utilization.

For example:
 node[01-02]: 16 cores, 4 cores/NUMA

 Job1: -n 28 -c 1 => Use node01 and NUMA[0-2] in node02 exclusively
 Job2: -n 1 -c 1  => Run on NUMA3 in node02
 Job3: -n 1 -c 2  => Run on NUMA3 in node02
Comment 5 Scott Hilton 2022-07-29 09:53:40 MDT
Perhaps you could use this parameter along with the socket-based options.
SlurmdParameters=numa_node_as_socket
https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdParameters

Specifically, look at --ntasks-per-socket, --cpu-bind=sockets, --distribution, and --extra-node-info (though there are more socket-based options that may be useful):
https://slurm.schedmd.com/srun.html
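
For illustration, a rough sketch of how those options might be combined (the task count and the binary name ./my_app are placeholders, not a tested recommendation):

# keep all 4 tasks of the step on one NUMA domain (exposed as a socket via numa_node_as_socket)
$ srun -n 4 --ntasks-per-socket=4 --cpu-bind=sockets --distribution=block:block ./my_app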

Let me know if this is a workable solution and if you have questions about any socket options.

-Scott
Comment 6 yitp.support 2022-07-29 13:10:01 MDT
Could you show me the parameters that would achieve the behavior of the example shown in comment #4?
SelectType=
SelectTypeParameters=
NodeName=
PartitionName=
Comment 7 Scott Hilton 2022-08-01 10:22:36 MDT
You will need hwloc v2 for numa_node_as_socket to work, as mentioned in the documentation.

I think something like this should work. I am assuming you are using hyper-threading with ThreadsPerCore=2.

SlurmdParameters=numa_node_as_socket
SelectType=select/cons_tres
SelectTypeParameters=CR_Socket  # Only one job can use any given "socket" (NUMA domain) at a time
NodeName=node[1-2] SocketsPerBoard=4 CoresPerSocket=4 ThreadsPerCore=2 CPUs=32 RealMemory=<mem> 
PartitionName=general Nodes=ALL
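
As a rough usage sketch against that configuration, mirroring the jobs from comment #4 (the --wrap payloads are placeholders):

$ sbatch -n 28 -c 1 --wrap "sleep 100"
$ sbatch -n 1 -c 1 --wrap "sleep 100"
$ sbatch -n 1 -c 2 --wrap "sleep 100"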

When testing this, please send me the output of "scontrol show node" to confirm the configuration.

-Scott
Comment 8 yitp.support 2022-08-03 01:10:13 MDT
I will set up a test environment.
I'll let you know when I get the results.
Comment 9 yitp.support 2022-08-28 03:54:44 MDT
I set up a test environment with one node and checked the behavior of CR_Socket.

 node01: 2 Sockets, 20 cores/Socket
 Job1: -n 10 -c 2 => Ran on Socket 0
 Job2: -n 1 -c 2  => Ran on Socket 1
 Job3: -n 1 -c 2  => PENDING due to no available resources

I'd like to run job2 and job3 on the same socket.
Comment 10 Scott Hilton 2022-08-29 12:25:04 MDT
Ahh, I see. I think you want CR_CORE_DEFAULT_DIST_BLOCK instead of CR_Socket then. It allocates by core but tries to minimize the number of sockets used.
It is equivalent to CR_Core with each job using --distribution=*:block.
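
In slurm.conf that would look roughly like the following (building on the settings from comment #7; a sketch, not tested here):

SlurmdParameters=numa_node_as_socket
SelectType=select/cons_tres
SelectTypeParameters=CR_Core,CR_CORE_DEFAULT_DIST_BLOCK  # allocate by core, pack each job onto as few sockets as possible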
Comment 11 Scott Hilton 2022-09-16 11:18:34 MDT
Did that solve your problem?
Comment 12 yitp.support 2022-09-21 04:45:28 MDT
Thank you for your advice.
It was very helpful toward solving the problem.
We will discuss the best implementation approach with our customer.
Comment 13 Scott Hilton 2022-09-23 10:09:28 MDT
Glad I could help.

-Scott