Ticket 6771 - Only allocate cores for the requested number of tasks
Summary: Only allocate cores for the requested number of tasks
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 17.11.8
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Nate Rini
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-03-28 07:40 MDT by hpc-ops
Modified: 2019-04-15 12:33 MDT

See Also:
Site: Ghent
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (5.94 KB, text/plain)
2019-03-28 15:35 MDT, hpc-ops
Details

Description hpc-ops 2019-03-28 07:40:03 MDT
Hi,


We are migrating from Moab/Torque to Slurm. Under the former, a user
requesting N nodes with one core per node essentially simply requested N
cores and Moab could assign these however it saw fit (taking the memory
requirements into consideration). 

For example: qsub -l nodes=50:ppn=1


Under Slurm, we are currently using the (modified) qsub wrapper to ease the
transition for our users, calling sbatch with --ntasks-per-node=P in
case they specify ppn=P. This obviously has the undesired effect that
the allocation is then spread across N nodes, with P cores per node.
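The translation step can be sketched as follows (a minimal illustration, not the actual wrapper; it assumes the resource string has the exact form `nodes=N:ppn=P`):

```shell
#!/bin/bash
# Minimal sketch of the qsub-to-sbatch option translation (hypothetical;
# the real wrapper handles many more cases). Assumes the resource string
# has the exact form "nodes=N:ppn=P".
translate_nodes_spec() {
  local spec="$1"                      # e.g. "nodes=50:ppn=1"
  local nodes="${spec#nodes=}"         # strip leading "nodes="
  nodes="${nodes%%:*}"                 # keep N, drop ":ppn=P"
  local ppn="${spec##*ppn=}"           # keep P
  echo "--nodes=${nodes} --ntasks-per-node=${ppn}"
}

translate_nodes_spec "nodes=50:ppn=1"   # prints: --nodes=50 --ntasks-per-node=1
```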

I get why, we're asking for exactly this. However, we'd like to compact
such jobs to use fewer nodes. According to the documentation, I'd expect,
for example,

salloc --nodes=1-2 --ntasks=40 --ntasks-per-node=36

to allocate 40 tasks, 1 task per core, so 40 cores in total, spread
across two nodes (our nodes have 36 cores). When looking at the cpuset
in the cgroup however, I notice that on both nodes, I get 36 cores
assigned to the job and scontrol for my job shows

NumNodes=2 NumCPUs=72 NumTasks=40 CPUs/Task=1

Is there a way to get NumCPUs to also be 40 in this case? This would be preferred, especially if I need more than the per-core memory for my 
job -- then I could get 2 cores per node, set mem-per-cpu to what 
I need, and keep the remaining cores (and memory) available for other jobs.

Of course, I could set ntasks-per-node=20, but that would again result in potentially occupying more nodes than strictly needed (note that we do not have a single-node policy), thereby tying up complete nodes' resources that other jobs in the queue could use.


Kind regards,
-- Andy
Comment 1 Nate Rini 2019-03-28 15:00:14 MDT
(In reply to hpc-admin from comment #0)
> We are migrating from Moab/Torque to Slurm. Under the former, a user
> requesting N nodes with one core per node essentially simply requested N
> cores and Moab could assign these however it saw fit (taking the memory
> requirements into consideration). 
Can you please attach your slurm.conf? Settings for DefMemPerCPU and DefMemPerNode change how Slurm will respond.

> For example: qsub -l nodes=50:ppn=1
Please try using "salloc --nodes=50 --ntasks-per-node=1".

> Under Slurm, we are currently using the (modified) qsub wrapper to ease the
> transition for our users, calling sbatch with --ntasks-per-node=P in
> case they specify ppn=P. This obviously has the undesired effect that
> the allocation is then spread across N nodes, with P cores per node.
> 
> I get why, we're asking for exactly this. However, we'd like to compact
> such jobs to use less nodes. According to the documentation, I'd expect
> for example
> 
> salloc --nodes=1-2 --ntasks=40 --ntasks-per-node=36

This request allows the scheduler to choose the number of nodes, placing anywhere from 1 up to 36 tasks per node. If the node count is specified, then it will honor that request, per the Slurm srun man page:
> If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the --nodes option.
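As a concrete check of that precedence, this is the node-count range the request `--ntasks=40 --ntasks-per-node=36` permits (a sketch assuming 36-core nodes, as on this cluster):

```shell
# Range of node counts the scheduler may pick for
# --ntasks=40 --ntasks-per-node=36 (ntasks-per-node acts as a per-node maximum).
ntasks=40
per_node_max=36
min_nodes=$(( (ntasks + per_node_max - 1) / per_node_max ))  # ceiling division
max_nodes=$ntasks                                            # one task per node
echo "$min_nodes-$max_nodes nodes"   # prints: 2-40 nodes
```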

> to allocate 40 tasks, 1 task per core, so 40 cores in total, spread
> across two nodes (our nodes have 36 cores). When looking at the cpuset
> in the cgroup however, I notice that on both nodes, I get 36 cores
> assigned to the job and scontrol for my job shows
> 
> NumNodes=2 NumCPUs=72 NumTasks=40 CPUs/Task=1
>
> Is there a way to get NumCPUs to also be 40 in this case? This would be
> preferred, especially if I need more than the per core memory for my 
> job -- then I could get 2 cores per node, the mem-per-cpu set to what 
> I need and keep the remaining cores (and memory) available for other jobs.
> 
> Of course, I could set ntasks-per-node=20, but that would again result in
> potentially occupying more nodes than strictly needed (note that we do not
> have a single node policy) thereby occupying complete node resources for
> other jobs in the queue.
Is hyperthreading enabled?
Comment 2 hpc-ops 2019-03-28 15:35:09 MDT
Dear Nate,

Thanks for the swift response.

--nodes=50 --ntasks-per-node=1

is what we're using right now in the qsub wrapper as arguments to sbatch. This causes 50 nodes to be allocated, 1 core per task, 1 task per node. 

vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc --nodes=10 --ntasks-per-node=1 --time=10
salloc: Granted job allocation 7903969
salloc: Waiting for resource configuration
salloc: Nodes node3216.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.os,node3236.victini.os,node3262.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os are ready for job

At least a number of the nodes could easily accommodate more than one of the tasks:

vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> scontrol show node=node3216.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.os,node3236.victini.os,node3262.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os | grep AllocT
   AllocTRES=cpu=35,mem=88594M
   AllocTRES=cpu=29,mem=73462M
   AllocTRES=cpu=29,mem=73462M
   AllocTRES=cpu=31,mem=78536M
   AllocTRES=cpu=31,mem=78536M
   AllocTRES=cpu=31,mem=78536M
   AllocTRES=cpu=31,mem=78536M
   AllocTRES=cpu=22,mem=55711M
   AllocTRES=cpu=31,mem=78536M
   AllocTRES=cpu=31,mem=65654M

For this cluster, we have:

MaxMemPerNode=91341
DefMemPerCPU=2450

No other limits are specified afaik (slurm.conf attached).

So that certainly does not seem to be the solution. The following works better:

vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc --nodes=1-10 --ntasks=10 --ntasks-per-node=10 --time=10
salloc: Granted job allocation 7903973
salloc: Waiting for resource configuration
salloc: Nodes node3250.victini.os,node3254.victini.os,node3256.victini.os,node3257.victini.os,node3264.victini.os,node3268.victini.os,node3270.victini.os are ready for job
vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> scontrol show node=node3250.victini.os,node3254.victini.os,node3256.victini.os,node3257.victini.os,node3264.victini.os,node3268.victini.os,node3270.victini.os | grep AllocT
   AllocTRES=cpu=31,mem=77761M
   AllocTRES=cpu=31,mem=77761M
   AllocTRES=cpu=21,mem=91014M
   AllocTRES=cpu=30,mem=75224M
   AllocTRES=cpu=31,mem=77761M
   AllocTRES=cpu=31,mem=77761M
   AllocTRES=cpu=19,mem=85940M

But then I see

   NumNodes=7 NumCPUs=70 NumTasks=10 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

Which is also not what I want: NumCPUs is too high.

I am probably misunderstanding something about the combination of ntasks and ntasks-per-node, although the documentation states

"--ntasks-per-node=<ntasks>
Request that ntasks be invoked on each node. If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the --nodes option. This is related to --cpus-per-task=ncpus, but does not require knowledge of the actual number of cpus on each node."

So I am slightly confused as to the exact meaning.

As per your comment about srun, would I need to not specify --nodes and only use --ntasks and --ntasks-per-node?

This seems to work in some cases, if ntasks is smaller than or equal to ntasks-per-node, but not if the former is larger than the latter. I do have a number of nodes in mix state with free cores, so in theory Slurm could assign cores, like it did for the other argument combinations I tried, but for

vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc  --ntasks=40 --ntasks-per-node=36 --time=10 --mem-per-cpu=100m
salloc: Pending job allocation 7903981
salloc: job 7903981 queued and waiting for resources

I see

[root@master29 ~]# scontrol show job=7903981
JobId=7903981 JobName=bash
   UserId=vsc40075(2540075) GroupId=vsc40075(2540075) MCS_label=N/A
   Priority=23756 Nice=0 Account=gvo00002 QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2019-03-28T22:33:36 EligibleTime=2019-03-28T22:33:36
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-03-28T22:33:37
   Partition=victini AllocNode:Sid=gligar05:9509
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=2 NumCPUs=40 NumTasks=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=40,mem=4000M,node=1
   Socks/Node=* NtasksPerN:B:S:C=36:0:*:* CoreSpec=*
   MinCPUsNode=36 MinMemoryCPU=100M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/kyukon/home/gent/400/vsc40075
   Comment=stdout=/kyukon/home/gent/400/vsc40075/slurm-7903981.out
   Power=

I am not sure if these resources can ever be met.

-- Andy
Comment 3 hpc-ops 2019-03-28 15:35:45 MDT
Created attachment 9731 [details]
slurm.conf
Comment 4 Nate Rini 2019-03-28 16:02:22 MDT
(In reply to hpc-admin from comment #2)
> --nodes=50 --ntasks-per-node=1

> is what we're using right now in the qsub wrapper as arguments to sbatch.
> This causes 50 nodes to be allocated, 1 core per task, 1 task per node.

Please try (although it should not be needed):
> --nodes=50-50 --ntasks-per-node=1

Can you please provide the output of this:
> slurmd -C
> scontrol show job $JOBID
Please provide that for all the jobs mentioned.
 
> vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc --nodes=10
> --ntasks-per-node=1 --time=10
> salloc: Granted job allocation 7903969
> salloc: Waiting for resource configuration
> salloc: Nodes
> node3216.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.
> os,node3236.victini.os,node3262.victini.os,node3263.victini.os,node3264.
> victini.os,node3265.victini.os,node3266.victini.os are ready for job
Looks like 10 nodes as requested.

> At least a number of nodes could easily accomodate more than one of the
> tasks:
Are you looking to pack the tasks onto as few nodes as possible?

> MaxMemPerNode=91341
The value of "RealMemory=91341" on the node should be sufficient unless you want that max to be less than the actual.

> DefMemPerCPU=2450

91341 / 2450 = 37.28 CPUs max

> vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc --nodes=1-10 --ntasks=10
> --ntasks-per-node=10 --time=10
This is requesting 1 to 10 nodes with at most 10 tasks per node with a total of 10 tasks.

>    NumNodes=7 NumCPUs=70 NumTasks=10 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

There are CPUs=36 per node per your slurm.conf:
70/36 = 1.94 nodes min with 7 CPUs per Task

Odd, that should be 1 by default per the srun man page:
> -c, --cpus-per-task=<ncpus>
> The default is one CPU per process.

Is SLURM_CPUS_PER_TASK set in your env? Can you please provide this output:
> salloc -vv --nodes=10 --ntasks-per-node=1 --time=10
 
> Which is also not what I want: NumCPUs is too high

It can be set by adding "--cpus-per-task=1" to your command, but it should default to 1. I'm thinking something else may be changing the request.

> I am probably misunderstanding something about the combination ntasks and
> ntasks-per-node, although the documentation indicates
> 
> "--ntasks-per-node=<ntasks>
> Request that ntasks be invoked on each node. If used with the --ntasks
> option, the --ntasks option will take precedence and the --ntasks-per-node
> will be treated as a maximum count of tasks per node. Meant to be used with
> the --nodes option. This is related to --cpus-per-task=ncpus, but does not
> require knowledge of the actual number of cpus on each node."
> 
> So I am slightly confused as to the exact meaning.
> 
> As per your comment about srun, I would need to no specify the --nodes and
> only use --ntasks and --ntasks-per-node?
Not specifying them leaves Slurm the freedom to find any placement it can, potentially running more jobs.

> This seems to work in some case, if ntasks is smaller of equal to
> ntasks-per-node, not if the former is larger than the latter. I do have a
> number of nodes in mix state free, so in theory slurm could assign cores,
> like it did for the other argument combinations I tried, but for
> 
> vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc  --ntasks=40
> --ntasks-per-node=36 --time=10 --mem-per-cpu=100m
>
> [root@master29 ~]# scontrol show job=7903981
>    NumNodes=2 NumCPUs=40 NumTasks=40 CPUs/Task=1
>    NtasksPerN:B:S:C=36:0:*:* CoreSpec=*
>    MinCPUsNode=36 MinMemoryCPU=100M MinTmpDiskNode=0
>
> I am not sure if these resources can ever be met.

That request should never run: MinCPUsNode=36 < 40 / 2
Comment 5 hpc-ops 2019-03-29 03:58:40 MDT
Hi Nate,

(In reply to Nate Rini from comment #4)
> (In reply to hpc-admin from comment #2)
> > --nodes=50 --ntasks-per-node=1
> 
> > is what we're using right now in the qsub wrapper as arguments to sbatch.
> > This causes 50 nodes to be allocated, 1 core per task, 1 task per node.
> 
> Please try (although it should not be needed):
> > --nodes=50:50 --ntasks-per-node=1

I used 40, since I do not have 50 nodes in an idle or mix state atm

vsc40075@gligar04 (SLURM_NOT_TORQUE_PBS) ~> salloc --nodes=40-40 --ntasks-per-node=1 --mem-per-cpu=100m
salloc: Pending job allocation 7904659
salloc: job 7904659 queued and waiting for resources
salloc: job 7904659 has been allocated resources
salloc: Granted job allocation 7904659
salloc: Waiting for resource configuration
salloc: Nodes node3200.victini.os,node3205.victini.os,node3206.victini.os,node3207.victini.os,node3209.victini.os,node3210.victini.os,node3216.victini.os,node3217.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.os,node3236.victini.os,node3238.victini.os,node3240.victini.os,node3249.victini.os,node3250.victini.os,node3252.victini.os,node3253.victini.os,node3254.victini.os,node3256.victini.os,node3257.victini.os,node3260.victini.os,node3262.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os,node3268.victini.os,node3269.victini.os,node3270.victini.os,node3271.victini.os,node3273.victini.os,node3274.victini.os,node3275.victini.os,node3276.victini.os,node3278.victini.os,node3279.victini.os,node3290.victini.os,node3291.victini.os,node3292.victini.os are ready for job


> Can you please provide the output of this:
> > slurmd -C

vsc40075@gligar04 (SLURM_NOT_TORQUE_PBS) ~> srun slurmd -C
NodeName=node3290 CPUs=36 Boards=1 SocketsPerBoard=4 CoresPerSocket=9 ThreadsPerCore=1 RealMemory=94976
UpTime=77-21:39:28
slurmd: Considering each NUMA node as a socket
NodeName=node3268 CPUs=36 Boards=1 SocketsPerBoard=4 CoresPerSocket=9 ThreadsPerCore=1 RealMemory=94976
UpTime=76-00:26:03
slurmd: Considering each NUMA node as a socket
slurmd: Considering each NUMA node as a socket
NodeName=node3263 CPUs=36 Boards=1 SocketsPerBoard=4 CoresPerSocket=9 ThreadsPerCore=1 RealMemory=94976
UpTime=77-22:09:18

<snip>

> > scontrol show job $JOBID

[ageorges@master29 ~]$ sudo scontrol show job=7904659
JobId=7904659 JobName=bash
   UserId=vsc40075(2540075) GroupId=vsc40075(2540075) MCS_label=N/A
   Priority=27290 Nice=0 Account=gvo00002 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:01:27 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2019-03-29T10:18:12 EligibleTime=2019-03-29T10:18:12
   StartTime=2019-03-29T10:18:13 EndTime=2019-03-29T11:18:13 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-03-29T10:18:13
   Partition=victini AllocNode:Sid=gligar04:27220
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node3200.victini.os,node3205.victini.os,node3206.victini.os,node3207.victini.os,node3209.victini.os,node3210.victini.os,node3216.victini.os,node3217.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.os,node3236.victini.os,node3238.victini.os,node3240.victini.os,node3249.victini.os,node3250.victini.os,node3252.victini.os,node3253.victini.os,node3254.victini.os,node3256.victini.os,node3257.victini.os,node3260.victini.os,node3262.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os,node3268.victini.os,node3269.victini.os,node3270.victini.os,node3271.victini.os,node3273.victini.os,node3274.victini.os,node3275.victini.os,node3276.victini.os,node3278.victini.os,node3279.victini.os,node3290.victini.os,node3291.victini.os,node3292.victini.os
   BatchHost=node3200.victini.os
   NumNodes=40 NumCPUs=40 NumTasks=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=40,mem=4000M,node=40,billing=40
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=100M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/kyukon/home/gent/400/vsc40075



> Please provide the all the jobs mentioned.

?

>  
> > vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc --nodes=10
> > --ntasks-per-node=1 --time=10
> > salloc: Granted job allocation 7903969
> > salloc: Waiting for resource configuration
> > salloc: Nodes
> > node3216.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.
> > os,node3236.victini.os,node3262.victini.os,node3263.victini.os,node3264.
> > victini.os,node3265.victini.os,node3266.victini.os are ready for job
> Looks like 10 nodes as requested.
> 
> > At least a number of nodes could easily accomodate more than one of the
> > tasks:
> Are you looking to pack the number of tasks into a node?

I am looking to replicate the behaviour Moab exhibited (to some extent), so yes, if possible, packing tasks onto one node, or onto multiple nodes or parts of multiple nodes. 

If this is not possible, I fully understand as it seems to me that this makes scheduling much more complex (to consider possible scenarios etc.)

> 
> > MaxMemPerNode=91341
> The value of "RealMemory=91341" on the node should be sufficient unless you
> want that max to be less than the actual.

Yes, max should be a bit less, since there is the GPFS pagepool and such to take into account.

> 
> > DefMemPerCPU=2450
> 
> 91341 / 2450 = 37.28 CPUs max
> 
> > vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc --nodes=1-10 --ntasks=10
> > --ntasks-per-node=10 --time=10
> This is requesting 1 to 10 nodes with at most 10 tasks per node with a total
> of 10 tasks.

That is what I meant and understood from the documentation as well. The idea here is that I need room for 10 tasks, and Slurm can decide how to reserve those resources, using from 1 to at most 10 nodes, but reserving no more than 10 cores (as the default is 1 task per core). 

> 
> >    NumNodes=7 NumCPUs=70 NumTasks=10 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> 
> There are CPUs=36 per node per your slurm.conf:
> 70/36 = 1.94 nodes min with 7 CPUs per Task
> 
> Odd, that should 1 by default per the srun man page:
> > -c, --cpus-per-task=<ncpus>
> > The default is one CPU per process.

Uhu. That is what I also read. Maybe I should specify it explicitly, but CPUs/Task=1 seems to suggest that is not needed.

> 
> Is SLURM_CPUS_PER_TASK set in your env? Can you please provide this output:
> > salloc -vv --nodes=10 --ntasks-per-node=1 --time=10

vsc40075@gligar04 (SLURM_NOT_TORQUE_PBS) ~> salloc -vv --nodes=10 --ntasks-per-node=1 --time=10
salloc: defined options for program `salloc'
salloc: --------------- ---------------------
salloc: user           : `vsc40075'
salloc: uid            : 2540075
salloc: gid            : 2540075
salloc: ntasks         : 10 (set)
salloc: cpus_per_task  : 0 (default)
salloc: nodes          : 10-10
salloc: partition      : default
salloc: job name       : `bash'
salloc: reservation    : `(null)'
salloc: wckey          : `(null)'
salloc: distribution   : unknown
salloc: verbose        : 2
salloc: immediate      : false
salloc: overcommit     : false
salloc: time_limit     : 10
salloc: nice           : -2
salloc: account        : (null)
salloc: comment        : (null)
salloc: dependency     : (null)
salloc: network        : (null)
salloc: power          :
salloc: profile        : `NotSet'
salloc: qos            : (null)
salloc: constraints    :
salloc: geometry       : (null)
salloc: reboot         : yes
salloc: rotate         : no
salloc: mail_type      : NONE
salloc: mail_user      : (null)
salloc: sockets-per-node  : -2
salloc: cores-per-socket  : -2
salloc: threads-per-core  : -2
salloc: ntasks-per-node   : 1
salloc: ntasks-per-socket : -2
salloc: ntasks-per-core   : -2
salloc: plane_size        : 4294967294
salloc: mem-bind          : default
salloc: user command   : `/bin/bash'
salloc: cpu_freq_min   : 4294967294
salloc: cpu_freq_max   : 4294967294
salloc: cpu_freq_gov   : 4294967294
salloc: switches          : -1
salloc: wait-for-switches : -1
salloc: core-spec         : NA
salloc: burst_buffer      : `(null)'
salloc: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
salloc: debug:  Munge authentication plugin loaded
salloc: debug:  slurmdbd: Sent PersistInit msg
salloc: Serial Job Resource Selection plugin loaded with argument 20
salloc: Consumable Resources (CR) Node Selection plugin loaded with argument 20
salloc: Cray node selection plugin loaded
salloc: Linear node selection plugin loaded with argument 20
salloc: debug:  slurmdbd: Sent fini msg
salloc: debug:  Entering slurm_allocation_msg_thr_create()
salloc: debug:  port from net_stream_listen is 30595
salloc: debug:  Entering _msg_thr_internal
salloc: debug:  _is_port_ok: bind() failed port 30595 sock 7 Address already in use
salloc: Pending job allocation 7904695
salloc: job 7904695 queued and waiting for resources
salloc: Nodes node3216.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.os,node3236.victini.os,node3262.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os are ready for job
salloc: debug:  laying out the 10 tasks on 10 hosts node3216.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.os,node3236.victini.os,node3262.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os dist 8192

[ageorges@master29 ~]$ sudo scontrol show job=7904695
JobId=7904695 JobName=bash
   UserId=vsc40075(2540075) GroupId=vsc40075(2540075) MCS_label=N/A
   Priority=24493 Nice=0 Account=gvo00002 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:32 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2019-03-29T10:38:50 EligibleTime=2019-03-29T10:38:50
   StartTime=2019-03-29T10:39:11 EndTime=2019-03-29T10:49:11 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-03-29T10:39:11
   Partition=victini AllocNode:Sid=gligar04:27220
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node3216.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.os,node3236.victini.os,node3262.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os
   BatchHost=node3216.victini.os
   NumNodes=10 NumCPUs=10 NumTasks=10 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=10,mem=24500M,node=10,billing=10
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2450M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/kyukon/home/gent/400/vsc40075
   Comment=stdout=/kyukon/home/gent/400/vsc40075/slurm-7904695.out
   Power=


>  
> > Which is also not what I want: NumCPUs is too high
> 
> It can be set by adding "--cpus-per-task=1" to your command, but it should
> default to 1. I'm thinking something else may be changing the request.
> 
> > I am probably misunderstanding something about the combination ntasks and
> > ntasks-per-node, although the documentation indicates
> > 
> > "--ntasks-per-node=<ntasks>
> > Request that ntasks be invoked on each node. If used with the --ntasks
> > option, the --ntasks option will take precedence and the --ntasks-per-node
> > will be treated as a maximum count of tasks per node. Meant to be used with
> > the --nodes option. This is related to --cpus-per-task=ncpus, but does not
> > require knowledge of the actual number of cpus on each node."
> > 
> > So I am slightly confused as to the exact meaning.
> > 
> > As per your comment about srun, I would need to no specify the --nodes and
> > only use --ntasks and --ntasks-per-node?
> Not specifying them leaves Slurm the freedom to find any placement it can,
> potentially running more jobs.

Right. This works in some cases, for example 

salloc -vv --ntasks=10 --ntasks-per-node=10 --time=10

However, 

salloc -vv --ntasks=40 --ntasks-per-node=6 --time=10

vsc40075@gligar04 (SLURM_NOT_TORQUE_PBS) ~> salloc -vv --ntasks=40 --ntasks-per-node=6 --time=10
salloc: defined options for program `salloc'
salloc: --------------- ---------------------
salloc: user           : `vsc40075'
salloc: uid            : 2540075
salloc: gid            : 2540075
salloc: ntasks         : 40 (set)
salloc: cpus_per_task  : 0 (default)
salloc: nodes          : 1 (default)
salloc: partition      : default
salloc: job name       : `bash'
salloc: reservation    : `(null)'
salloc: wckey          : `(null)'
salloc: distribution   : unknown
salloc: verbose        : 2
salloc: immediate      : false
salloc: overcommit     : false
salloc: time_limit     : 10
salloc: nice           : -2
salloc: account        : (null)
salloc: comment        : (null)
salloc: dependency     : (null)
salloc: network        : (null)
salloc: power          :
salloc: profile        : `NotSet'
salloc: qos            : (null)
salloc: constraints    :
salloc: geometry       : (null)
salloc: reboot         : yes
salloc: rotate         : no
salloc: mail_type      : NONE
salloc: mail_user      : (null)
salloc: sockets-per-node  : -2
salloc: cores-per-socket  : -2
salloc: threads-per-core  : -2
salloc: ntasks-per-node   : 6
salloc: ntasks-per-socket : -2
salloc: ntasks-per-core   : -2
salloc: plane_size        : 4294967294
salloc: mem-bind          : default
salloc: user command   : `/bin/bash'
salloc: cpu_freq_min   : 4294967294
salloc: cpu_freq_max   : 4294967294
salloc: cpu_freq_gov   : 4294967294
salloc: switches          : -1
salloc: wait-for-switches : -1
salloc: core-spec         : NA
salloc: burst_buffer      : `(null)'
salloc: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
salloc: debug:  Munge authentication plugin loaded
salloc: debug:  slurmdbd: Sent PersistInit msg
salloc: Serial Job Resource Selection plugin loaded with argument 20
salloc: Consumable Resources (CR) Node Selection plugin loaded with argument 20
salloc: Cray node selection plugin loaded
salloc: Linear node selection plugin loaded with argument 20
salloc: debug:  slurmdbd: Sent fini msg
salloc: debug:  Entering slurm_allocation_msg_thr_create()
salloc: debug:  port from net_stream_listen is 32720
salloc: debug:  Entering _msg_thr_internal
salloc: debug:  _is_port_ok: bind() failed port 32720 sock 7 Address already in use
salloc: Pending job allocation 7904700
salloc: job 7904700 queued and waiting for resources
salloc: job 7904700 has been allocated resources
salloc: Granted job allocation 7904700
salloc: debug:  laying out the 40 tasks on 7 hosts node3205.victini.os,node3206.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os,node3268.victini.os dist 8192


and

[ageorges@master29 ~]$ sudo scontrol show job=7904700
JobId=7904700 JobName=bash
   UserId=vsc40075(2540075) GroupId=vsc40075(2540075) MCS_label=N/A
   Priority=23749 Nice=0 Account=gvo00002 QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2019-03-29T10:45:42 EligibleTime=2019-03-29T10:45:42
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-03-29T10:45:43
   Partition=victini AllocNode:Sid=gligar04:27220
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=7 NumCPUs=40 NumTasks=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=40,mem=98000M,node=1
   Socks/Node=* NtasksPerN:B:S:C=6:0:*:* CoreSpec=*
   MinCPUsNode=6 MinMemoryCPU=2450M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/kyukon/home/gent/400/vsc40075
   Comment=stdout=/kyukon/home/gent/400/vsc40075/slurm-7904700.out
   Power=

So this seems to work, sort of. On each of the seven nodes, the job occupied 6 cores, for a total of 42 cores instead of 40.

[ageorges@gligar04 ~]$ for node in `echo node3205.victini.os,node3206.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os,node3268.victini.os | tr "," " "`; do ssh $node "cat /sys/fs/cgroup/cpuset/slurm/uid_2540075/job_7904700/cpuset.cpus"; done
23,25,27,29,31,35
3,11,23,31,33,35
12,16,24,28,30,32
15,19,23,27,31,35
10,15,26-27,31,34
1,19,23,27,31,35
0,4,12,16,18,30

 
> > This seems to work in some case, if ntasks is smaller of equal to
> > ntasks-per-node, not if the former is larger than the latter. I do have a
> > number of nodes in mix state free, so in theory slurm could assign cores,
> > like it did for the other argument combinations I tried, but for
> > 
> > vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc  --ntasks=40
> > --ntasks-per-node=36 --time=10 --mem-per-cpu=100m
> >
> > [root@master29 ~]# scontrol show job=7903981
> >    NumNodes=2 NumCPUs=40 NumTasks=40 CPUs/Task=1
> >    NtasksPerN:B:S:C=36:0:*:* CoreSpec=*
> >    MinCPUsNode=36 MinMemoryCPU=100M MinTmpDiskNode=0
> >
> > I am not sure if these resources can ever be met.
> 
> That request should never run: MinCPUsNode=36 < 40 / 2

But ntasks-per-node, according to the documentation, is the _maximum_, not the minimum? And you mean 36 > 40/2, right?

So to ensure that we use the max available cores on a node, would it not make sense to set --ntasks-per-node=36 (i.e., the number of cores each node has)?

Considering the above, it does not seem there is a straightforward way to pack jobs, as the number of nodes will always be (afaict)

floor(ntasks / ntasks-per-node) + (ntasks `mod` ntasks-per-node == 0 ? 0 : 1), i.e. ceil(ntasks / ntasks-per-node), so I might be wasting most of a node, e.g., when asking for 40 tasks on nodes with 36 cores, I'd get two full nodes ...
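That node-count formula can be checked with a quick shell sketch (hypothetical helper, using this cluster's numbers):

```shell
# ceil(ntasks / ntasks_per_node), written out as in the formula above.
nodes_needed() {
  local ntasks="$1" per_node="$2"
  echo $(( ntasks / per_node + (ntasks % per_node == 0 ? 0 : 1) ))
}

nodes_needed 40 36   # prints 2: two nodes for 40 tasks on 36-core nodes
nodes_needed 40 6    # prints 7: matches the 7-host allocation seen earlier
nodes_needed 36 36   # prints 1
```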

This will enhance the throughput of our system compared to spreading out the request over 40 nodes like we have now, but it will also waste resources that could be used by single-core jobs. 


Thanks for the answers, I'm guessing I'm not really making this easy :)


Kind regards,
-- Andy
Comment 6 Nate Rini 2019-04-01 13:47:11 MDT
(In reply to hpc-admin from comment #5)
> I used 40, since I do not have 50 nodes in an idle or mix state atm
> 
> vsc40075@gligar04 (SLURM_NOT_TORQUE_PBS) ~> salloc --nodes=40-40
> --ntasks-per-node=1 --mem-per-cpu=100m
Looks like that worked as expected, 40 unique nodes listed.
 
> > Can you please provide the output of this:
> > > slurmd -C
> 
> vsc40075@gligar04 (SLURM_NOT_TORQUE_PBS) ~> srun slurmd -C
First time I have seen that done via srun.

Actual:
> CPUs=36 Boards=1 SocketsPerBoard=4 CoresPerSocket=9  ThreadsPerCore=1 RealMemory=94976
Configured:
> CPUs=36                            CoresPerSocket=18 ThreadsPerCore=1 RealMemory=91341 Sockets=2

I think the slurm.conf nodes should be this instead (leaving memory slightly less) before we continue:
> CPUs=36 Boards=1 SocketsPerBoard=4 CoresPerSocket=9 ThreadsPerCore=1 RealMemory=91341

Having the node config incorrect can cause Slurm to do surprising things.

--Nate
Comment 7 Nate Rini 2019-04-15 12:33:29 MDT
Andy

Any updates on fixing the configuration?

We haven't received an update since last month. I'm going to time this ticket out. It will automatically re-open if you reply.

--Nate