Hi,

We are migrating from Moab/Torque to Slurm. Under the former, a user requesting N nodes with one core per node essentially just requested N cores, and Moab could assign these however it saw fit (taking the memory requirements into consideration). For example:

qsub -l nodes=50:ppn=1

Under Slurm, we are currently using the (modified) qsub wrapper to ease the transition for our users, calling sbatch with --ntasks-per-node=P in case they specify ppn=P. This obviously has the undesired effect that the allocation is then spread across N nodes, with P cores per node.

I get why; we're asking for exactly this. However, we'd like to compact such jobs to use fewer nodes. According to the documentation, I'd expect for example

salloc --nodes=1-2 --ntasks=40 --ntasks-per-node=36

to allocate 40 tasks, 1 task per core, so 40 cores in total, spread across two nodes (our nodes have 36 cores). When looking at the cpuset in the cgroup, however, I notice that on both nodes I get 36 cores assigned to the job, and scontrol for my job shows

NumNodes=2 NumCPUs=72 NumTasks=40 CPUs/Task=1

Is there a way to get NumCPUs to also be 40 in this case? This would be preferred, especially if I need more than the per-core memory for my job -- then I could get 2 cores per node, set mem-per-cpu to what I need, and keep the remaining cores (and memory) available for other jobs.

Of course, I could set ntasks-per-node=20, but that would again result in potentially occupying more nodes than strictly needed (note that we do not have a single-node policy), thereby occupying complete node resources for other jobs in the queue.

Kind regards,
-- Andy
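[Editor's note] The Torque-to-Slurm option mapping the wrapper performs can be sketched roughly as below. This is a minimal, hypothetical sketch: `translate_nodes_spec` and its regex are illustrative only, and the real wrapper handles far more of the qsub option space.

```python
import re

def translate_nodes_spec(spec):
    """Translate a Torque resource spec like 'nodes=50:ppn=1' into sbatch
    options. Hypothetical sketch of the wrapper's mapping, not its code."""
    m = re.fullmatch(r"nodes=(\d+)(?::ppn=(\d+))?", spec)
    if not m:
        raise ValueError(f"unsupported resource spec: {spec}")
    # ppn defaults to 1 when not given, matching Torque's behaviour
    nodes, ppn = int(m.group(1)), int(m.group(2) or 1)
    return [f"--nodes={nodes}", f"--ntasks-per-node={ppn}"]

print(translate_nodes_spec("nodes=50:ppn=1"))
# ['--nodes=50', '--ntasks-per-node=1']
```

As the thread below discusses, this literal mapping pins the allocation to exactly N nodes, which is what prevents Slurm from packing the tasks onto fewer nodes.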
(In reply to hpc-admin from comment #0)
> We are migrating from Moab/Torque to Slurm. Under the former, a user
> requesting N nodes with one core per node essentially simply requested N
> cores and Moab could assign these however it saw fit (taking the memory
> requirements into consideration).

Can you please attach your slurm.conf. Settings for DefMemPerCPU, MaxMemPerCPU and DefMemPerNode change how Slurm will respond.

> For example: qsub -l nodes=50:ppn=1

Please try using "salloc --nodes=50 --ntasks-per-node=1".

> Under Slurm, we are currently using the (modified) qsub wrapper to ease the
> transition for our users, calling sbatch with --ntasks-per-node=P in
> case they specify ppn=P. This obviously has the undesired effect that
> the allocation is then spread across N nodes, with P cores per node.
>
> I get why, we're asking for exactly this. However, we'd like to compact
> such jobs to use less nodes. According to the documentation, I'd expect
> for example
>
> salloc --nodes=1-2 --ntasks=40 --ntasks-per-node=36

This request allows the scheduler to choose the number of nodes, placing anywhere from 1 up to 36 tasks per node. If the node count is specified, then it will honor that request, per the Slurm srun man page:

> If used with the --ntasks option, the --ntasks option will take precedence
> and the --ntasks-per-node will be treated as a maximum count of tasks per
> node. Meant to be used with the --nodes option.

> to allocate 40 tasks, 1 task per core, so 40 cores in total, spread
> across two nodes (our nodes have 36 cores). When looking at the cpuset
> in the cgroup however, I notice that on both nodes, I get 36 cores
> assigned to the job and scontrol for my job shows
>
> NumNodes=2 NumCPUs=72 NumTasks=40 CPUs/Task=1
>
> Is there a way to get NumCPUs to also be 40 in this case?
> This would be preferred, especially if I need more than the per core
> memory for my job -- then I could get 2 cores per node, the mem-per-cpu
> set to what I need and keep the remaining cores (and memory) available
> for other jobs.
>
> Of course, I could set ntasks-per-node=20, but that would again result in
> potentially occupying more nodes than strictly needed (note that we do not
> have a single node policy) thereby occupying complete node resources for
> other jobs in the queue.

Is hyperthreading enabled?
Dear Nate,

Thanks for the swift response.

--nodes=50 --ntasks-per-node=1

is what we're using right now in the qsub wrapper as arguments to sbatch. This causes 50 nodes to be allocated, 1 core per task, 1 task per node.

vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc --nodes=10 --ntasks-per-node=1 --time=10
salloc: Granted job allocation 7903969
salloc: Waiting for resource configuration
salloc: Nodes node3216.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.os,node3236.victini.os,node3262.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os are ready for job

At least a number of the nodes could easily accommodate more than one of the tasks:

vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> scontrol show node=node3216.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.os,node3236.victini.os,node3262.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os | grep AllocT
AllocTRES=cpu=35,mem=88594M
AllocTRES=cpu=29,mem=73462M
AllocTRES=cpu=29,mem=73462M
AllocTRES=cpu=31,mem=78536M
AllocTRES=cpu=31,mem=78536M
AllocTRES=cpu=31,mem=78536M
AllocTRES=cpu=31,mem=78536M
AllocTRES=cpu=22,mem=55711M
AllocTRES=cpu=31,mem=78536M
AllocTRES=cpu=31,mem=65654M

For this cluster, we have:

MaxMemPerNode=91341
DefMemPerCPU=2450

No other limits are specified afaik (slurm.conf attached). So that certainly does not seem to be the solution.
The following works better:

vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc --nodes=1-10 --ntasks=10 --ntasks-per-node=10 --time=10
salloc: Granted job allocation 7903973
salloc: Waiting for resource configuration
salloc: Nodes node3250.victini.os,node3254.victini.os,node3256.victini.os,node3257.victini.os,node3264.victini.os,node3268.victini.os,node3270.victini.os are ready for job

vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> scontrol show node=node3250.victini.os,node3254.victini.os,node3256.victini.os,node3257.victini.os,node3264.victini.os,node3268.victini.os,node3270.victini.os | grep AllocT
AllocTRES=cpu=31,mem=77761M
AllocTRES=cpu=31,mem=77761M
AllocTRES=cpu=21,mem=91014M
AllocTRES=cpu=30,mem=75224M
AllocTRES=cpu=31,mem=77761M
AllocTRES=cpu=31,mem=77761M
AllocTRES=cpu=19,mem=85940M

But then I see

NumNodes=7 NumCPUs=70 NumTasks=10 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

which is also not what I want: NumCPUs is too high.

I am probably misunderstanding something about the combination of ntasks and ntasks-per-node, although the documentation indicates

"--ntasks-per-node=<ntasks>
Request that ntasks be invoked on each node. If used with the --ntasks option, the --ntasks option will take precedence and the --ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the --nodes option. This is related to --cpus-per-task=ncpus, but does not require knowledge of the actual number of cpus on each node."

So I am slightly confused as to the exact meaning. As per your comment about srun, I would need to not specify --nodes and only use --ntasks and --ntasks-per-node? This seems to work in some cases, if ntasks is less than or equal to ntasks-per-node, but not if the former is larger than the latter.
I do have a number of nodes in mix state free, so in theory slurm could assign cores, like it did for the other argument combinations I tried, but for

vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc --ntasks=40 --ntasks-per-node=36 --time=10 --mem-per-cpu=100m
salloc: Pending job allocation 7903981
salloc: job 7903981 queued and waiting for resources

I see

[root@master29 ~]# scontrol show job=7903981
JobId=7903981 JobName=bash
UserId=vsc40075(2540075) GroupId=vsc40075(2540075) MCS_label=N/A
Priority=23756 Nice=0 Account=gvo00002 QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
SubmitTime=2019-03-28T22:33:36 EligibleTime=2019-03-28T22:33:36
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-03-28T22:33:37
Partition=victini AllocNode:Sid=gligar05:9509
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=2 NumCPUs=40 NumTasks=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=40,mem=4000M,node=1
Socks/Node=* NtasksPerN:B:S:C=36:0:*:* CoreSpec=*
MinCPUsNode=36 MinMemoryCPU=100M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/kyukon/home/gent/400/vsc40075
Comment=stdout=/kyukon/home/gent/400/vsc40075/slurm-7903981.out
Power=

I am not sure if these resources can ever be met.

-- Andy
Created attachment 9731 [details] slurm.conf
(In reply to hpc-admin from comment #2)
> --nodes=50 --ntasks-per-node=1
>
> is what we're using right now in the qsub wrapper as arguments to sbatch.
> This causes 50 nodes to be allocated, 1 core per task, 1 task per node.

Please try (although it should not be needed):
> --nodes=50-50 --ntasks-per-node=1

Can you please provide the output of this:
> slurmd -C
> scontrol show job $JOBID

Please provide this for all the jobs mentioned.

> vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc --nodes=10
> --ntasks-per-node=1 --time=10
> salloc: Granted job allocation 7903969
> salloc: Waiting for resource configuration
> salloc: Nodes
> node3216.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.
> os,node3236.victini.os,node3262.victini.os,node3263.victini.os,node3264.
> victini.os,node3265.victini.os,node3266.victini.os are ready for job

Looks like 10 nodes as requested.

> At least a number of nodes could easily accomodate more than one of the
> tasks:

Are you looking to pack the number of tasks into a node?

> MaxMemPerNode=91341

The value of "RealMemory=91341" on the node should be sufficient unless you want that max to be less than the actual.

> DefMemPerCPU=2450

91341 / 2450 = 37.28 CPUs max

> vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc --nodes=1-10 --ntasks=10
> --ntasks-per-node=10 --time=10

This is requesting 1 to 10 nodes, with at most 10 tasks per node and a total of 10 tasks.

> NumNodes=7 NumCPUs=70 NumTasks=10 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

There are CPUs=36 per node per your slurm.conf:
70/36 = 1.94 nodes min with 7 CPUs per Task

Odd, that should be 1 by default per the srun man page:
> -c, --cpus-per-task=<ncpus>
> The default is one CPU per process.

Is SLURM_CPUS_PER_TASK set in your env? Can you please provide this output:
> salloc -vv --nodes=10 --ntasks-per-node=1 --time=10

> Which is also not what I want: NumCPUs is too high

It can be set by adding "--cpus-per-task=1" to your command, but it should default to 1.
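[Editor's note] Nate's memory arithmetic above (91341 / 2450 = 37.28) can be checked directly: with DefMemPerCPU=2450, only 37 whole CPUs' worth of default memory fits under MaxMemPerNode, so a memory-constrained node can never schedule all 36+ cores at a higher per-CPU default. A minimal sketch (variable names are mine; values are from the slurm.conf quoted in this ticket):

```python
# Back-of-envelope check of the per-node memory constraint.
max_mem_per_node = 91341  # MB, MaxMemPerNode from slurm.conf
def_mem_per_cpu = 2450    # MB, DefMemPerCPU from slurm.conf

# How many CPUs' worth of *default* memory fit on one node:
cpus_by_memory = max_mem_per_node // def_mem_per_cpu
print(cpus_by_memory)  # 37 -- the 37.28 Nate quotes, rounded down
```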
I'm thinking something else may be changing the request.

> I am probably misunderstanding something about the combination ntasks and
> ntasks-per-node, although the documentation indicates
>
> "--ntasks-per-node=<ntasks>
> Request that ntasks be invoked on each node. If used with the --ntasks
> option, the --ntasks option will take precedence and the --ntasks-per-node
> will be treated as a maximum count of tasks per node. Meant to be used with
> the --nodes option. This is related to --cpus-per-task=ncpus, but does not
> require knowledge of the actual number of cpus on each node."
>
> So I am slightly confused as to the exact meaning.
>
> As per your comment about srun, I would need to no specify the --nodes and
> only use --ntasks and --ntasks-per-node?

Not specifying them leaves Slurm the freedom to find any placement it can, potentially running more jobs.

> This seems to work in some case, if ntasks is smaller of equal to
> ntasks-per-node, not if the former is larger than the latter. I do have a
> number of nodes in mix state free, so in theory slurm could assign cores,
> like it did for the other argument combinations I tried, but for
>
> vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc --ntasks=40
> --ntasks-per-node=36 --time=10 --mem-per-cpu=100m
>
> [root@master29 ~]# scontrol show job=7903981
> NumNodes=2 NumCPUs=40 NumTasks=40 CPUs/Task=1
> NtasksPerN:B:S:C=36:0:*:* CoreSpec=*
> MinCPUsNode=36 MinMemoryCPU=100M MinTmpDiskNode=0
>
> I am not sure if these resources can ever be met.

That request should never run: MinCPUsNode=36 < 40 / 2
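[Editor's note] The infeasibility of the pending 40-task job can be spelled out numerically: with 40 tasks and a 36-task minimum per allocated node (scontrol shows MinCPUsNode=36 at one CPU per task), at least two nodes are required, but two nodes each hosting at least 36 tasks would need 72 tasks in total. A minimal sketch of that reasoning (variable names are mine):

```python
import math

# Numbers from the pending job: 40 tasks requested, MinCPUsNode=36.
ntasks = 40
min_per_node = 36

# At most 36 tasks fit on a 36-core node, so at least two nodes are needed...
nodes_needed = math.ceil(ntasks / min_per_node)
# ...but if every allocated node must host at least 36 tasks, the job
# would have to contain this many tasks:
min_tasks_required = nodes_needed * min_per_node

print(nodes_needed, min_tasks_required)  # 2 72 -- 40 < 72, so the job pends forever
```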
Hi Nate, (In reply to Nate Rini from comment #4) > (In reply to hpc-admin from comment #2) > > --nodes=50 --ntasks-per-node=1 > > > is what we're using right now in the qsub wrapper as arguments to sbatch. > > This causes 50 nodes to be allocated, 1 core per task, 1 task per node. > > Please try (although it should not be needed): > > --nodes=50:50 --ntasks-per-node=1 I used 40, since I do not have 50 nodes in an idle or mix state atm vsc40075@gligar04 (SLURM_NOT_TORQUE_PBS) ~> salloc --nodes=40-40 --ntasks-per-node=1 --mem-per-cpu=100m salloc: Pending job allocation 7904659 salloc: job 7904659 queued and waiting for resources salloc: job 7904659 has been allocated resources salloc: Granted job allocation 7904659 salloc: Waiting for resource configuration salloc: Nodes node3200.victini.os,node3205.victini.os,node3206.victini.os,node3207.victini.os,node3209.victini.os,node3210.victini.os,node3216.victini.os,node3217.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.os,node3236.victini.os,node3238.victini.os,node3240.victini.os,node3249.victini.os,node3250.victini.os,node3252.victini.os,node3253.victini.os,node3254.victini.os,node3256.victini.os,node3257.victini.os,node3260.victini.os,node3262.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os,node3268.victini.os,node3269.victini.os,node3270.victini.os,node3271.victini.os,node3273.victini.os,node3274.victini.os,node3275.victini.os,node3276.victini.os,node3278.victini.os,node3279.victini.os,node3290.victini.os,node3291.victini.os,node3292.victini.os are ready for job > Can you please provide the output of this: > > slurmd -C vsc40075@gligar04 (SLURM_NOT_TORQUE_PBS) ~> srun slurmd -C NodeName=node3290 CPUs=36 Boards=1 SocketsPerBoard=4 CoresPerSocket=9 ThreadsPerCore=1 RealMemory=94976 UpTime=77-21:39:28 slurmd: Considering each NUMA node as a socket NodeName=node3268 CPUs=36 Boards=1 SocketsPerBoard=4 CoresPerSocket=9 ThreadsPerCore=1 RealMemory=94976 
UpTime=76-00:26:03 slurmd: Considering each NUMA node as a socket slurmd: Considering each NUMA node as a socket NodeName=node3263 CPUs=36 Boards=1 SocketsPerBoard=4 CoresPerSocket=9 ThreadsPerCore=1 RealMemory=94976 UpTime=77-22:09:18 <snip> > > scontrol show job $JOBID [ageorges@master29 ~]$ sudo scontrol show job=7904659 JobId=7904659 JobName=bash UserId=vsc40075(2540075) GroupId=vsc40075(2540075) MCS_label=N/A Priority=27290 Nice=0 Account=gvo00002 QOS=normal JobState=RUNNING Reason=None Dependency=(null) Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0 RunTime=00:01:27 TimeLimit=01:00:00 TimeMin=N/A SubmitTime=2019-03-29T10:18:12 EligibleTime=2019-03-29T10:18:12 StartTime=2019-03-29T10:18:13 EndTime=2019-03-29T11:18:13 Deadline=N/A PreemptTime=None SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-03-29T10:18:13 Partition=victini AllocNode:Sid=gligar04:27220 ReqNodeList=(null) ExcNodeList=(null) NodeList=node3200.victini.os,node3205.victini.os,node3206.victini.os,node3207.victini.os,node3209.victini.os,node3210.victini.os,node3216.victini.os,node3217.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.os,node3236.victini.os,node3238.victini.os,node3240.victini.os,node3249.victini.os,node3250.victini.os,node3252.victini.os,node3253.victini.os,node3254.victini.os,node3256.victini.os,node3257.victini.os,node3260.victini.os,node3262.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os,node3268.victini.os,node3269.victini.os,node3270.victini.os,node3271.victini.os,node3273.victini.os,node3274.victini.os,node3275.victini.os,node3276.victini.os,node3278.victini.os,node3279.victini.os,node3290.victini.os,node3291.victini.os,node3292.victini.os BatchHost=node3200.victini.os NumNodes=40 NumCPUs=40 NumTasks=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=40,mem=4000M,node=40,billing=40 Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=* MinCPUsNode=1 MinMemoryCPU=100M MinTmpDiskNode=0 Features=(null) 
DelayBoot=00:00:00 Gres=(null) Reservation=(null) OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) Command=(null) WorkDir=/kyukon/home/gent/400/vsc40075 > Please provide the all the jobs mentioned. ? > > > vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc --nodes=10 > > --ntasks-per-node=1 --time=10 > > salloc: Granted job allocation 7903969 > > salloc: Waiting for resource configuration > > salloc: Nodes > > node3216.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini. > > os,node3236.victini.os,node3262.victini.os,node3263.victini.os,node3264. > > victini.os,node3265.victini.os,node3266.victini.os are ready for job > Looks like 10 nodes as requested. > > > At least a number of nodes could easily accomodate more than one of the > > tasks: > Are you looking to pack the number of tasks into a node? I am looking to replicate the behaviour Moab exhibited (to some extent), so yes, if possible packing things on a node or on multiple nodes or parts of multiple nodes. If this is not possible, I fully understand as it seems to me that this makes scheduling much more complex (to consider possible scenarios etc.) > > > MaxMemPerNode=91341 > The value of "RealMemory=91341" on the node should be sufficient unless you > want that max to be less than the actual. Yes, max should be a bit less, since there is the GPFS pagepool and such to take into account. > > > DefMemPerCPU=2450 > > 91341 / 2450 = 37.28 CPUs max > > > vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc --nodes=1-10 --ntasks=10 > > --ntasks-per-node=10 --time=10 > This is requesting 1 to 10 nodes with at most 10 tasks per node with a total > of 10 tasks. That is what I meant and understood from the documentation as well. The idea here is that I meant that I need place for 10 tasks, and slurm can consider how to reserve those resources, going from 1 to at most 10 nodes, but reserve no more than 10 cores (as the default is 1 task/core). 
> > > NumNodes=7 NumCPUs=70 NumTasks=10 CPUs/Task=1 ReqB:S:C:T=0:0:*:* > > There are CPUs=36 per node per your slurm.conf: > 70/36 = 1.94 nodes min with 7 CPUs per Task > > Odd, that should 1 by default per the srun man page: > > -c, --cpus-per-task=<ncpus> > > The default is one CPU per process. Uhu. That is what I also read. Maybe I should specify it explicitly, but CPUs/Task=1 seems to suggest that is not needed. > > Is SLURM_CPUS_PER_TASK set in your env? Can you please provide this output: > > salloc -vv --nodes=10 --ntasks-per-node=1 --time=10 vsc40075@gligar04 (SLURM_NOT_TORQUE_PBS) ~> salloc -vv --nodes=10 --ntasks-per-node=1 --time=10 salloc: defined options for program `salloc' salloc: --------------- --------------------- salloc: user : `vsc40075' salloc: uid : 2540075 salloc: gid : 2540075 salloc: ntasks : 10 (set) salloc: cpus_per_task : 0 (default) salloc: nodes : 10-10 salloc: partition : default salloc: job name : `bash' salloc: reservation : `(null)' salloc: wckey : `(null)' salloc: distribution : unknown salloc: verbose : 2 salloc: immediate : false salloc: overcommit : false salloc: time_limit : 10 salloc: nice : -2 salloc: account : (null) salloc: comment : (null) salloc: dependency : (null) salloc: network : (null) salloc: power : salloc: profile : `NotSet' salloc: qos : (null) salloc: constraints : salloc: geometry : (null) salloc: reboot : yes salloc: rotate : no salloc: mail_type : NONE salloc: mail_user : (null) salloc: sockets-per-node : -2 salloc: cores-per-socket : -2 salloc: threads-per-core : -2 salloc: ntasks-per-node : 1 salloc: ntasks-per-socket : -2 salloc: ntasks-per-core : -2 salloc: plane_size : 4294967294 salloc: mem-bind : default salloc: user command : `/bin/bash' salloc: cpu_freq_min : 4294967294 salloc: cpu_freq_max : 4294967294 salloc: cpu_freq_gov : 4294967294 salloc: switches : -1 salloc: wait-for-switches : -1 salloc: core-spec : NA salloc: burst_buffer : `(null)' salloc: Accounting storage SLURMDBD plugin loaded with 
AuthInfo=(null) salloc: debug: Munge authentication plugin loaded salloc: debug: slurmdbd: Sent PersistInit msg salloc: Serial Job Resource Selection plugin loaded with argument 20 salloc: Consumable Resources (CR) Node Selection plugin loaded with argument 20 salloc: Cray node selection plugin loaded salloc: Linear node selection plugin loaded with argument 20 salloc: debug: slurmdbd: Sent fini msg salloc: debug: Entering slurm_allocation_msg_thr_create() salloc: debug: port from net_stream_listen is 30595 salloc: debug: Entering _msg_thr_internal salloc: debug: _is_port_ok: bind() failed port 30595 sock 7 Address already in use salloc: Pending job allocation 7904695 salloc: job 7904695 queued and waiting for resources salloc: Nodes node3216.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.os,node3236.victini.os,node3262.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os are ready for job salloc: debug: laying out the 10 tasks on 10 hosts node3216.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.os,node3236.victini.os,node3262.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os dist 8192 [ageorges@master29 ~]$ sudo scontrol show job=7904695 JobId=7904695 JobName=bash UserId=vsc40075(2540075) GroupId=vsc40075(2540075) MCS_label=N/A Priority=24493 Nice=0 Account=gvo00002 QOS=normal JobState=RUNNING Reason=None Dependency=(null) Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0 RunTime=00:00:32 TimeLimit=00:10:00 TimeMin=N/A SubmitTime=2019-03-29T10:38:50 EligibleTime=2019-03-29T10:38:50 StartTime=2019-03-29T10:39:11 EndTime=2019-03-29T10:49:11 Deadline=N/A PreemptTime=None SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-03-29T10:39:11 Partition=victini AllocNode:Sid=gligar04:27220 ReqNodeList=(null) ExcNodeList=(null) 
NodeList=node3216.victini.os,node3233.victini.os,node3234.victini.os,node3235.victini.os,node3236.victini.os,node3262.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os BatchHost=node3216.victini.os NumNodes=10 NumCPUs=10 NumTasks=10 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=10,mem=24500M,node=10,billing=10 Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=* MinCPUsNode=1 MinMemoryCPU=2450M MinTmpDiskNode=0 Features=(null) DelayBoot=00:00:00 Gres=(null) Reservation=(null) OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) Command=(null) WorkDir=/kyukon/home/gent/400/vsc40075 Comment=stdout=/kyukon/home/gent/400/vsc40075/slurm-7904695.out Power= > > > Which is also not what I want: NumCPUs is too high > > It can be set by adding "--cpus-per-task=1" to your command, but it should > default to 1. I'm thinking something else may be changing the request. > > > I am probably misunderstanding something about the combination ntasks and > > ntasks-per-node, although the documentation indicates > > > > "--ntasks-per-node=<ntasks> > > Request that ntasks be invoked on each node. If used with the --ntasks > > option, the --ntasks option will take precedence and the --ntasks-per-node > > will be treated as a maximum count of tasks per node. Meant to be used with > > the --nodes option. This is related to --cpus-per-task=ncpus, but does not > > require knowledge of the actual number of cpus on each node." > > > > So I am slightly confused as to the exact meaning. > > > > As per your comment about srun, I would need to no specify the --nodes and > > only use --ntasks and --ntasks-per-node? > Not specifying them leaves Slurm the freedom to find any placement it can, > potentially running more jobs. Right. 
This works in some cases, for example salloc -vv --ntasks=10 --ntasks-per-node=10 --time=10 However, salloc -vv --ntasks=40 --ntasks-per-node=10 --time=10 vsc40075@gligar04 (SLURM_NOT_TORQUE_PBS) ~> salloc -vv --ntasks=40 --ntasks-per-node=6 --time=10 salloc: defined options for program `salloc' salloc: --------------- --------------------- salloc: user : `vsc40075' salloc: uid : 2540075 salloc: gid : 2540075 salloc: ntasks : 40 (set) salloc: cpus_per_task : 0 (default) salloc: nodes : 1 (default) salloc: partition : default salloc: job name : `bash' salloc: reservation : `(null)' salloc: wckey : `(null)' salloc: distribution : unknown salloc: verbose : 2 salloc: immediate : false salloc: overcommit : false salloc: time_limit : 10 salloc: nice : -2 salloc: account : (null) salloc: comment : (null) salloc: dependency : (null) salloc: network : (null) salloc: power : salloc: profile : `NotSet' salloc: qos : (null) salloc: constraints : salloc: geometry : (null) salloc: reboot : yes salloc: rotate : no salloc: mail_type : NONE salloc: mail_user : (null) salloc: sockets-per-node : -2 salloc: cores-per-socket : -2 salloc: threads-per-core : -2 salloc: ntasks-per-node : 6 salloc: ntasks-per-socket : -2 salloc: ntasks-per-core : -2 salloc: plane_size : 4294967294 salloc: mem-bind : default salloc: user command : `/bin/bash' salloc: cpu_freq_min : 4294967294 salloc: cpu_freq_max : 4294967294 salloc: cpu_freq_gov : 4294967294 salloc: switches : -1 salloc: wait-for-switches : -1 salloc: core-spec : NA salloc: burst_buffer : `(null)' salloc: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null) salloc: debug: Munge authentication plugin loaded salloc: debug: slurmdbd: Sent PersistInit msg salloc: Serial Job Resource Selection plugin loaded with argument 20 salloc: Consumable Resources (CR) Node Selection plugin loaded with argument 20 salloc: Cray node selection plugin loaded salloc: Linear node selection plugin loaded with argument 20 salloc: debug: slurmdbd: Sent 
fini msg salloc: debug: Entering slurm_allocation_msg_thr_create() salloc: debug: port from net_stream_listen is 32720 salloc: debug: Entering _msg_thr_internal salloc: debug: _is_port_ok: bind() failed port 32720 sock 7 Address already in use salloc: Pending job allocation 7904700 salloc: job 7904700 queued and waiting for resources salloc: job 7904700 has been allocated resources salloc: Granted job allocation 7904700 salloc: debug: laying out the 40 tasks on 7 hosts node3205.victini.os,node3206.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os,node3268.victini.os dist 8192 and [ageorges@master29 ~]$ sudo scontrol show job=7904700 JobId=7904700 JobName=bash UserId=vsc40075(2540075) GroupId=vsc40075(2540075) MCS_label=N/A Priority=23749 Nice=0 Account=gvo00002 QOS=normal JobState=PENDING Reason=Priority Dependency=(null) Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0 RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A SubmitTime=2019-03-29T10:45:42 EligibleTime=2019-03-29T10:45:42 StartTime=Unknown EndTime=Unknown Deadline=N/A PreemptTime=None SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-03-29T10:45:43 Partition=victini AllocNode:Sid=gligar04:27220 ReqNodeList=(null) ExcNodeList=(null) NodeList=(null) NumNodes=7 NumCPUs=40 NumTasks=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:* TRES=cpu=40,mem=98000M,node=1 Socks/Node=* NtasksPerN:B:S:C=6:0:*:* CoreSpec=* MinCPUsNode=6 MinMemoryCPU=2450M MinTmpDiskNode=0 Features=(null) DelayBoot=00:00:00 Gres=(null) Reservation=(null) OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null) Command=(null) WorkDir=/kyukon/home/gent/400/vsc40075 Comment=stdout=/kyukon/home/gent/400/vsc40075/slurm-7904700.out Power= So this seems to work. Sort of. On each of the nodes, the job occupied 6 cores, for a total of 42 cores instead of 40. 
[ageorges@gligar04 ~]$ for node in `echo node3205.victini.os,node3206.victini.os,node3263.victini.os,node3264.victini.os,node3265.victini.os,node3266.victini.os,node3268.victini.os | tr "," " "`; do ssh $node "cat /sys/fs/cgroup/cpuset/slurm/uid_2540075/job_7904700/cpuset.cpus"; done
23,25,27,29,31,35
3,11,23,31,33,35
12,16,24,28,30,32
15,19,23,27,31,35
10,15,26-27,31,34
1,19,23,27,31,35
0,4,12,16,18,30

> > This seems to work in some case, if ntasks is smaller of equal to
> > ntasks-per-node, not if the former is larger than the latter. I do have a
> > number of nodes in mix state free, so in theory slurm could assign cores,
> > like it did for the other argument combinations I tried, but for
> >
> > vsc40075@gligar05 (SLURM_NOT_TORQUE_PBS) ~> salloc --ntasks=40
> > --ntasks-per-node=36 --time=10 --mem-per-cpu=100m
> >
> > [root@master29 ~]# scontrol show job=7903981
> > NumNodes=2 NumCPUs=40 NumTasks=40 CPUs/Task=1
> > NtasksPerN:B:S:C=36:0:*:* CoreSpec=*
> > MinCPUsNode=36 MinMemoryCPU=100M MinTmpDiskNode=0
> >
> > I am not sure if these resources can ever be met.
>
> That request should never run: MinCPUsNode=36 < 40 / 2

But ntasks-per-node, according to the documentation, is the _maximum_, not the minimum? And you mean 36 > 40/2, right?

So to ensure that we use the maximum available cores on a node, would it not make sense to set --ntasks-per-node=36 (i.e., the number of cores each node has)?

Considering the above, it does not seem there is a straightforward way to pack jobs, as the number of nodes will always be (afaict) floor(ntasks / ntasks-per-node) + (ntasks `mod` ntasks-per-node == 0 ? 0 : 1), so I might be wasting most of a node, e.g., when asking for 40 tasks on nodes with 36 cores, I'd get two full nodes ...

This will enhance the throughput of our system compared to spreading out the request over 40 nodes like we have now, but it will also waste resources that could be used by single-core jobs.
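[Editor's note] Andy's node-count formula above can be written as a small helper. It reproduces both the two-node result for 40 tasks at up to 36 per node and the 7-host layout seen earlier for 40 tasks at up to 6 per node (a sketch; the name `nodes_for` is made up for illustration):

```python
def nodes_for(ntasks, ntasks_per_node):
    """Node count per the formula in the comment above:
    floor(ntasks / ntasks-per-node) + (1 if there is a remainder else 0)."""
    return ntasks // ntasks_per_node + (1 if ntasks % ntasks_per_node else 0)

# 40 tasks at up to 36 per node -> 2 nodes (36*2 - 40 = 32 cores potentially idle)
print(nodes_for(40, 36))  # 2
# 40 tasks at up to 6 per node -> 7 nodes, matching the 7-host cpuset listing
print(nodes_for(40, 6))   # 7
```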
Thanks for the answers; I'm guessing I'm not really making this easy :)

Kind regards,
-- Andy
(In reply to hpc-admin from comment #5)
> I used 40, since I do not have 50 nodes in an idle or mix state atm
>
> vsc40075@gligar04 (SLURM_NOT_TORQUE_PBS) ~> salloc --nodes=40-40
> --ntasks-per-node=1 --mem-per-cpu=100m

Looks like that worked as expected, 40 unique nodes listed.

> > Can you please provide the output of this:
> > slurmd -C
>
> vsc40075@gligar04 (SLURM_NOT_TORQUE_PBS) ~> srun slurmd -C

First time I have seen that done via srun.

Actual:
> CPUs=36 Boards=1 SocketsPerBoard=4 CoresPerSocket=9 ThreadsPerCore=1 RealMemory=94976

Configured:
> CPUs=36 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=91341 Sockets=2

I think the slurm.conf nodes should be this instead (leaving memory slightly less) before we continue:
> CPUs=36 Boards=1 SocketsPerBoard=4 CoresPerSocket=9 ThreadsPerCore=1 RealMemory=91341

Having the node config incorrect can cause Slurm to do surprising things.

--Nate
Andy,

Any updates on fixing the configuration? We haven't received an update since last month. I'm going to time this ticket out. It will automatically re-open if you reply.

--Nate