The customer wants to divide a physical node into several virtual nodes and then use them exclusively. Currently I'm thinking of using GRES with a "Cores=" specification. For example:

#####################
in gres.conf:

NodeName=node[01-04] AutoDetect=off Name=vnode File=/tmp/dummy0 Cores=0-3
NodeName=node[01-04] AutoDetect=off Name=vnode File=/tmp/dummy1 Cores=4-7
NodeName=node[01-04] AutoDetect=off Name=vnode File=/tmp/dummy2 Cores=8-11
NodeName=node[01-04] AutoDetect=off Name=vnode File=/tmp/dummy3 Cores=12-15

in slurm.conf:

GresTypes=vnode

then,

$ sbatch --gres=vnode:2 --gres-flags=enforce-binding --wrap "sleep 100"
#####################

But "--gres" does not specify the total number of GRES; it specifies the number of GRES on each node. I'd like to specify the total in order to minimize the number of nodes. On the other hand, the "--gpus" option is available for GPU resources, and with it we can request the total number of GPUs. Is it possible to define a dummy GPU resource and use it for our purpose? Or is there any other way that satisfies the customer's request?
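To illustrate the distinction in question (a sketch only; the counts and job scripts are illustrative, not from a tested configuration):

```
# --gres counts are PER NODE: this requests 2 "vnode" gres on EACH allocated node
$ sbatch --gres=vnode:2 --wrap "sleep 100"

# --gpus requests a TOTAL across the whole allocation, but it (and the related
# --gpus-per-* options) apply only to gres of type "gpu"
$ sbatch --gpus=2 --wrap "sleep 100"
```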
Cores does not force separation of jobs; it only makes a gres prefer to be bound to the listed cores. I don't think this is what you are looking for. Why does the customer want to divide a physical node into virtual nodes? What is the use case? Perhaps I can suggest a better workaround if I know exactly what they are trying to accomplish.
Currently the customer is using nodes exclusively. But recently the number of cores per CPU has been growing, and not all users need such a large number of cores for their jobs. So the customer is thinking of treating each NUMA domain as a virtual node and assigning it to a job exclusively. That way, users get the full memory bandwidth of the domain and the penalty of inter-NUMA memory access is minimized. At the same time, the customer also wants small jobs to share the remaining resources to increase system utilization. For example:

node[01-02]: 16 cores, 4 cores/NUMA

Job1: -n 28 -c 1 => uses node01 and NUMA[0-2] in node02 exclusively
Job2: -n 1 -c 1 => runs on NUMA3 in node02
Job3: -n 1 -c 2 => runs on NUMA3 in node02
Perhaps you could use this parameter along with the socket-based options:

SlurmdParameters=numa_node_as_socket
https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdParameters

Specifically, look at --ntasks-per-socket, --cpu-bind=socket, --distribution, and --extra-node-info (though there are more socket-based options that may be useful):
https://slurm.schedmd.com/srun.html

Let me know if this is a workable solution and if you have questions about any socket options.

-Scott
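A minimal sketch of how these pieces might fit together (the task counts and binary name are hypothetical, and this has not been tested here):

```
# slurm.conf: present each NUMA domain to Slurm as if it were a socket
SlurmdParameters=numa_node_as_socket

# Example submission binding a job's tasks to whole "sockets" (NUMA domains);
# the exact option mix depends on the job shape
$ srun -n 4 --ntasks-per-socket=4 --cpu-bind=sockets --distribution=block:block ./app
```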
Could you show me the parameter values that would achieve the behavior of the example shown in comment #4?

SelectType=
SelectTypeParameters=
NodeName=
PartitionName=
You will need hwloc v2 for numa_node_as_socket to work, as mentioned in the documentation. I think something like this should work. I am assuming you are using hyper-threading with ThreadsPerCore=2:

SlurmdParameters=numa_node_as_socket
SelectType=select/cons_tres
# Only one job can use any given "socket" (NUMA domain) at a time
SelectTypeParameters=CR_Socket
NodeName=node[1-2] SocketsPerBoard=4 CoresPerSocket=4 ThreadsPerCore=2 CPUs=32 RealMemory=<mem>
PartitionName=general Nodes=ALL

When testing this, please send me the output of "scontrol show node" to confirm the configuration.

-Scott
I will set up a test environment. I'll let you know when I get the result.
I set up a test environment with one node and checked the behavior of CR_Socket.

node01: 2 sockets, 20 cores/socket

Job1: -n 10 -c 2 => ran on socket 0
Job2: -n 1 -c 2 => ran on socket 1
Job3: -n 1 -c 2 => PENDING due to no available resources

I'd like to run Job2 and Job3 on the same socket.
Ahh, I see. I think you want CR_CORE_DEFAULT_DIST_BLOCK instead of CR_Socket then. It allocates by core but tries to minimize the number of sockets used. It is equivalent to CR_Core with each job using --distribution=*:block.
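If I read the SelectTypeParameters documentation correctly, that would look something like this in slurm.conf (a sketch; CR_CORE_DEFAULT_DIST_BLOCK is combined with a core-based consumable-resource setting):

```
SelectType=select/cons_tres
# Allocate by core, and by default distribute each job's allocated cores
# block-wise across sockets so that as few sockets as possible are used
SelectTypeParameters=CR_Core,CR_CORE_DEFAULT_DIST_BLOCK
```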
Did that solve your problem?
Thank you for your advice. It was very helpful in solving the problem. We will discuss better implementation approaches with our customer.
Glad I could help. -Scott