| Summary: | MPI users only want to schedule on physical cores | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Meyer <dameyer> |
| Component: | Configuration | Assignee: | Albert Gil <albert.gil> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | ben, felip.moll |
| Version: | 18.08.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Raytheon Missile, Space and Airborne | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | RHEL |
| Machine Name: | slurm02 | CLE Version: | |
| Version Fixed: | 7.6 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, test error message, job results, good results, sample array submission | ||
Description

Doug Meyer 2019-05-15 14:58:14 MDT

Hi Doug, a couple of questions first:

1. I have in my database that you are using CR_ONE_TASK_PER_CORE. Is that still true? Could you please attach your up-to-date slurm.conf?

2. I guess you want an 'MPI user' to exclusively allocate cores, but 'non-MPI' ones to allocate threads. Is that it? I understand you don't want a non-MPI user to allocate unused threads of a core used by an MPI user, right?

3. Do you want it to be possible for users to override this, or is it a hard enforcement?

Created attachment 10244 [details]
slurm.conf
slurm.conf is attached. We are using CR_CPU_Memory. When non-MPI jobs are submitted we don't care whether they run on cores or threads. For MPI we want jobs to run only on physical cores, without having to "block" or be charged for the HT threads. I believe we could then ignore the cores per node and simply ask Slurm to provide the requested number of cores, filling nodes with tasks until the requirement is satisfied.

> I understand you don't want a non-MPI user to allocate unused threads of a core used by an MPI user, right?

Correct.

> 3. Do you want it to be possible for users to override this, or is it a hard enforcement?

I don't quite follow, but we never want a core used for MPI to be available to a non-MPI logical thread. Thank you.

Doug,

Given your configuration I checked a couple of options, and I think you will get the desired behavior just by adding the '--ntasks-per-core' option to the request. For example:

srun --mem 10 --ntasks-per-core=1 --ntasks=56 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n

will allocate two entire nodes for the job and run 56 tasks on them. In the output you will see that every task is bound to both threads of one core. For example, a job requesting 28 tasks will get 56 threads:

[slurm@moll0 18.08]$ srun --mem 10 --ntasks-per-core=1 --ntasks=28 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n
0,1
2,3
4,5
6,7
8,9
10,11
12,13
14,15
16,17
18,19
20,21
22,23
24,25
26,27
28,29
30,31
32,33
34,35
36,37
38,39
40,41
42,43
44,45
46,47
48,49
50,51
52,53
54,55

Running fewer tasks leaves some cores free for other jobs (CPUs 20-55 in this example):

[slurm@moll0 18.08]$ srun --mem 10 --ntasks-per-core=1 --ntasks=10 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n
0,1
2,3
4,5
6,7
8,9
10,11
12,13
14,15
16,17
18,19

And asking for more than 28 tasks will grant you more than one node.
The same can be achieved by asking for cpus-per-task instead:

]$ srun --mem 10 --cpus-per-task=2 --ntasks=10 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n

In that case, note that sruns run inside an allocation should use the --exclusive flag, since otherwise the bind mask is inherited from the parent. For example, this allocates two nodes, and inside the allocation the srun must run with --exclusive:

[slurm@moll0 18.08]$ salloc --mem 10 --cpus-per-task=2 --ntasks=56
[slurm@moll1 18.08]$ srun -n10 --exclusive bash -c 'taskset -pc $$'

I recommend reading carefully about the --cpus-per-task, --ntasks-per-core and --cpu-bind options of sbatch/salloc/srun. All of these suggestions also depend on how the user programs are run afterwards, so if this doesn't work for you, please show me a concrete example of a batch script. I'll be waiting for your feedback.

Hi,

When I attempt the command

srun --mem 10 --ntasks-per-core=1 --ntasks=28 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n

I get a complaint about an undeclared variable. I am missing something...

Thank you

(In reply to Doug Meyer from comment #5)
> When I attempt the command [...] I get a complaint about an undeclared variable.

A copy-paste of this command works for me. The command runs 28 tasks; each of them executes

bash -c "taskset -cp $$"

and the output is then truncated (cut) and sorted. Can you please check whether your copy-paste mangled the \$\$ or similar? I tried again with a copy-paste of your comment and it works for me. If it is not working, please paste here the exact output of your terminal.

Created attachment 10289 [details]
test error message
I should state that the server is RHEL 7.6; submission was tried from both RHEL 6 and RHEL 7 hosts. The target is RHEL 6.
Created attachment 10293 [details]
job results
Put command in a script and ran srun. Voila!
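The behavior here hinges on which shell expands `$$` before srun ever sees the command. A minimal bash-only illustration (no Slurm involved; a sketch, not the cluster command itself):

```shell
#!/bin/sh
# Who expands $$ decides which PID taskset/echo would report.

# Single quotes: the outer shell passes $$ through literally, so the
# inner bash expands it to the inner bash's own PID.
inner=$(bash -c 'echo $$')

# In bash, \$\$ inside double quotes reaches the inner shell as a
# literal $$, so the result is equivalent. (csh treats \$ differently,
# which is why the same command line can fail from a csh prompt.)
also_inner=$(bash -c "echo \$\$")

echo "single-quoted:     $inner"
echo "backslash-escaped: $also_inner"
```

Both variants print a numeric PID when run from bash; only the single-quoted form is portable to csh.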
That's because you are using csh, and csh doesn't like the \$\$. Escapes in csh are done differently; from a csh prompt:

]$ bash -c "echo \$\$"
Variable name must contain alphanumeric characters.
]$ bash -c "echo '$$'"
5277
]$ bash -c 'echo $$'
5348

so the right command for csh is the single-quoted form:

srun --partition=hpc3a --mem 10 --ntasks-per-core=1 --ntasks=28 bash -c 'taskset -cp $$' | cut -d ":" -f 2 | sort -n

In any case, from the results in comment 8 I cannot tell whether it is assigning one core or one thread to each task. I really do need the output of taskset.

[2019-05-20T11:14:51.144] JobId=3902209 StepId=0
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[0] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[1] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[2] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[3] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[4] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[5] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[6] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[7] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[8] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[9] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[10] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[11] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[12] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[13] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[0] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[1] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[2] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[3] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[4] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[5] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[6] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[7] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[8] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[9] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[10] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[11] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[12] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[13] is allocated

So:

a) Repeat the test with proper escaping and paste the output here:

srun --partition=hpc3a --mem 10 --ntasks-per-core=1 --ntasks=28 bash -c 'taskset -cp $$' | cut -d ":" -f 2 | sort -n

or just attach the output of the script which ran correctly in your comment 8. I need the output of taskset to see where it effectively binds the processes.

b) Set the slurmd debug level in slurm.conf to 'debug', restart the slurmd daemons, and attach the full slurmd log; I need to see everything from the start of the daemon until you run the job.

c) Let's work in this bug for now; we will continue on 7033 later, to avoid repeating things in both places.

Hi,

I should have noted the shell. It still blows up in bash but is fine in csh.

session01:csh-NIL: srun --partition=hpc3a --mem 10 --ntasks-per-core=1 --ntasks=28 bash -c 'taskset -cp $$' | cut -d ":" -f 2 | sort -n
0,28
.
.
27,55

Before I push the change to slurm.conf I would like to make sure I am clear on what you want. I looked for an exact match for setting the slurmd debug level in slurm.conf but did not see one. I will set SlurmctldDebug to "debug", restart slurmctld, push the conf to all nodes, and test. Thank you for your patience.

(In reply to Doug Meyer from comment #10)
> Should have noted the shell. Still blows up in bash but fine in csh.
> session01:csh-NIL: srun --partition=hpc3a --mem 10 --ntasks-per-core=1 --ntasks=28 bash -c 'taskset -cp $$' | cut -d ":" -f 2 | sort -n
> 0,28
> .
> .
> 27,55

If your output looks like the one you show here, everything is already working correctly. Slurm is assigning the two threads of one core to each task, so --ntasks-per-core=1 effectively schedules tasks on physical cores:

0,28  -> core 0  <-- assigned to task 1
.
.
27,55 -> core 27 <-- assigned to task 28

The only thing that can happen here is that if you don't enable ConstrainCores in cgroup.conf, this is a 'soft' binding that an application can change, so I encourage you to set what I already suggested: task/cgroup,task/affinity in slurm.conf, plus ConstrainCores=yes and TaskAffinity=no in cgroup.conf.

> Before I push the change to the slurm.conf I would like to make sure I am
> clear on what you would like. Looked for an exact match for setting slurmd
> debug level in the slurm.conf but did not see that.

No need to do that now, since I see everything is fine. This also confirms that hyperthreading is working correctly, so bug 7033 should be fine too. Tell me if you see it differently or if something is not OK for your use case.

Created attachment 10311 [details]
good results
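The hard-binding setup recommended above (task plugins in slurm.conf plus the core constraint in cgroup.conf) might look like the following sketch; treat it as an outline to adapt to the local config, not a drop-in file:

```ini
# slurm.conf (fragment) -- enable both task plugins
TaskPlugin=task/affinity,task/cgroup

# cgroup.conf (fragment) -- make the core binding a hard constraint,
# and let task/affinity (rather than the cgroup plugin) do the pinning
ConstrainCores=yes
TaskAffinity=no
```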
Please see the attached notes. I believe we have a win.

I understand all situations are fine then? Is your case resolved and the behavior expected? Using --ntasks-per-core seems to give what you want, right? Thanks :)

This ties back to 7033. Using #SBATCH --ntasks-per-core=2 (or 1) has no impact on the single-threaded array jobs. I will ask an MPI user to dispatch a test job this morning and see what the result is. Thank you.

*** Ticket 7033 has been marked as a duplicate of this ticket. ***

Any updates?

(In reply to Doug Meyer from comment #15)
> I will ask an MPI user to dispatch a test job this morning and see what the result is.

Doug, I think we were each waiting for the other. I was waiting for your feedback on the comment above. I will review the latest comments and configs again and come up with a response later.

I dropped the ball. The engineer was able to run successfully: 40 cores requested from a partition with 28-core systems ran with 28 cores on one host and 12 on another. I asked the engineer to review planar vs. cyclic distribution. His next concern was the possibility of other jobs utilizing the unused second logical thread on each core; I shared the documentation showing this will not happen. The remaining concern is "what if other jobs interfere with my job". Our nodes have a single fabric port, so we need to restrict nodes to one MPI job at a time. I can think of only two options: use exclusive and waste the unused cores, or create a gres for the fabric port with a count of one so that only a single MPI job is supported at a time, though we would then need to figure out how to enforce use of the gres. I would like to pursue the problem of arrays not using the logical threads next.

Good, I will think a bit more later about:

> Use exclusive and waste the unused cores or create a gres for the fabric port with a count of one so that only a single MPI job would be supported at a time, but we need to figure out how to enforce use of the gres.
but I think you're on the right path. As you note, enforcing use of the gres is not trivial. You could create a prolog that sets iptables rules depending on how the user submitted the job (requested the gres or not) and makes the interface usable only by that user. The MPI program would then have to use that interface explicitly, while other software should never use it. This only works for TCP/UDP traffic; I don't think it is possible with RDMA. It's just a vague idea and I haven't thought seriously about the implications. Another way could be the cgroups device interface; we could study that possibility. The easiest and most reliable way is, as you say, to use the --exclusive flag.

> Would like to pursue the problem of arrays not using the logical threads next.

I will look into that too. Can you give me an exact run and output? Thanks

Created attachment 10498 [details]
sample array submission
Attachment has the sbatch and the sleep job. On a partition where we only have the CPU declaration, CPUs=56, we run on all cores/threads. If I run on a node where we use the detailed description, CPUs=56 Sockets=2 CoresPerSocket=14 ThreadsPerCore=2, we only run on the physical cores.
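Side by side, the two declaration styles at issue here look like this in slurm.conf; the comments summarize the behavior described in this thread (a sketch, reusing the node ranges from the report):

```ini
# Flat declaration: Slurm sees 56 schedulable CPUs with no core/thread
# topology, so array jobs can land on every hardware thread, but there
# is no notion of "physical core" to pin MPI tasks to.
NodeName=ta1l-h[1001-1044] CPUs=56 ThreadsPerCore=2 RealMemory=256000

# Full topology: --ntasks-per-core=1 can pin MPI tasks to physical
# cores, but by default distinct jobs (including array elements) do
# not share the two threads of one core.
NodeName=ta1l-h[1045-1088] CPUs=56 Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000
```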
(In reply to Doug Meyer from comment #21)
> Attachment has the sbatch and the sleep job. On a partition where we only
> have the CPU declaration, CPUs=56, we run on all cores/threads. If I run on
> a node where we use the detailed description, CPUs=56 Sockets=2
> CoresPerSocket=14 ThreadsPerCore=2, we only run on the physical cores.

Doug,

On nodes with only CPUs declared, Slurm doesn't use the real topology, so if you want a full core for each task you should try the following:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2

In my tests, with cpus-per-task=2:

moll3: 12274 0,1
moll3: 12277 2,3
moll3: 12279 4,5

and with cpus-per-task=1:

moll3: 12558 0
moll3: 12559 2
moll3: 12562 1
moll3: 12566 3
moll3: 12568 4
moll3: 12572 5
moll3: 12574 6

For the opposite case, on a node with the full topology defined (CPUs + sockets + cores + threads), if you want to run tasks on a single thread:

- With srun you can use --cpu-bind=thread or --cpu-bind=core.
- With sbatch, the allocation itself is not bound to threads but gets all the necessary resources, which means a script like yours runs with all the resources granted to the allocation. If you want to restrict each task to a single thread, you must run a task inside the allocation, for example:

]$ cat run-serial.sh
#!/bin/bash
#SBATCH --job-name=serial
#SBATCH --array=0-300
#SBATCH --ntasks-per-core=1
#SBATCH --cpus-per-task=1
srun --cpu-bind=thread /bin/sleep $(($RANDOM % 100))

Then you get:

moll1: 25893 0,1 <--- Part of allocation 1
moll1: 25915 0,1 <--- Part of allocation 1
moll1: 25916 2,3 <--- Part of allocation 2
moll1: 25931 2,3 <--- Part of allocation 2
moll1: 25888 4,5 <--- Part of allocation 3
moll1: 25899 4,5 <--- Part of allocation 3
moll1: 25954 6,7 <--- Part of allocation 4
moll1: 25946 6,7 <--- Part of allocation 4
moll1: 25886 8,9 <--- Part of allocation 5
moll1: 25890 8,9 <--- Part of allocation 5
moll1: 25953 0 <-- Task 1
moll1: 25974 2 <-- Task 2
moll1: 25948 4 <-- Task 3
moll1: 25976 6 <-- Task 4
moll1: 25945 8 <-- Task 5
...

That's because the binding is per task or process, not per allocation. Using mpirun inside the sbatch script will also give per-task binding.

With srun, note these differences:

]$ srun -w moll1 --cpu-bind=core --mem 10 --ntasks-per-core=1 --ntasks=6 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n
### Here we get the full core, but still run 1 task per core
0,1
2,3
4,5
6,7
8,9
10,11

]$ srun -w moll1 --cpu-bind=thread --mem 10 --ntasks-per-core=1 --ntasks=6 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n
### Here we get just one thread, and at most one task in each core
0
2
4
6
8
10

[slurm@moll0 ~]$ srun -w moll1 --cpu-bind=thread --mem 10 --ntasks-per-core=2 --ntasks=6 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n
### Here we get one task per thread
0
1
2
3
4
5

My lstopo looks like:

Machine (985MB)
  Package L#0
    L2 L#0 (4096KB) + Core L#0
      L1d L#0 (32KB) + L1i L#0 (32KB) + PU L#0 (P#0)
      L1d L#1 (32KB) + L1i L#1 (32KB) + PU L#1 (P#1)
    L2 L#1 (4096KB) + Core L#1
      L1d L#2 (32KB) + L1i L#2 (32KB) + PU L#2 (P#2)
      L1d L#3 (32KB) + L1i L#3 (32KB) + PU L#3 (P#3)
    L2 L#2 (4096KB) + Core L#2
      L1d L#4 (32KB) + L1i L#4 (32KB) + PU L#4 (P#4)
      L1d L#5 (32KB) + L1i L#5 (32KB) + PU L#5 (P#5)
    L2 L#3 (4096KB) + Core L#3
      L1d L#6 (32KB) + L1i L#6 (32KB) + PU L#6 (P#6)
      L1d L#7 (32KB) + L1i L#7 (32KB) + PU L#7 (P#7)
    L2 L#4 (4096KB) + Core L#4
      L1d L#8 (32KB) + L1i L#8 (32KB) + PU L#8 (P#8)
      L1d L#9 (32KB) + L1i L#9 (32KB) + PU L#9 (P#9)
    L2 L#5 (4096KB) + Core L#5
      L1d L#10 (32KB) + L1i L#10 (32KB) + PU L#10 (P#10)
      L1d L#11 (32KB) + L1i L#11 (32KB) + PU L#11 (P#11)

See if this makes sense to you; if not, I can do more testing and go a bit deeper if needed.

Hi Doug,

Since the behavior you experienced is OK for you, and given that I see everything working correctly and have provided further info, this bug will now be marked as resolved/infogiven. Please don't hesitate to mark it OPEN again if you still have questions, or just open a new one.

Best regards,
Felip

Please do not close this ticket until we address the array question that was introduced in 7033; that ticket was closed as a duplicate of 7029. Summary: in order to make HT available to single-threaded (ST) array jobs we had to set CPUs=<HT thread count> in slurm.conf. However, in that config the MPI users could not request one task per physical core. We moved to the recommended slurm.conf config and the sbatch parameter for one task per core, and MPI works. However, we are back to seeing the HT logical cores as unavailable to ST array jobs. A sample job was shared. Either reopen 7033 or reopen this case.
I appreciate that the queries are similar, but we have not addressed the inability to utilize the HT logical threads.

Hi Doug,

Felip won't be available for some days, so I will assist you instead. I've gone through this bug and 7033 and I'm not totally sure I understand the problem.

> In order to get HT available for ST array jobs we had to set CPUS=<HT thread count> in the slurm.conf.

OK. I've set up a cluster equivalent to yours:
- HT enabled
- CPUs set to match threads (not cores)
- Using CR_CPU_MEMORY
- NOT using CR_ONE_TASK_PER_CORE
- Using cgroups to constrain cores

$ slurmd -C
NodeName=agildell CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=2 RealMemory=15911

slurm.conf:

NodeName=DEFAULT Sockets=1 CPUS=4 CoresPerSocket=2 ThreadsPerCore=2 RealMemory=4096

$ scontrol show config | grep "TaskPlugin\|Select\|ConstrainCores"
SelectType = select/cons_res
SelectTypeParameters = CR_CPU_MEMORY
TaskPlugin = task/affinity,task/cgroup
TaskPluginParam = (null type)
ConstrainCores = yes

My test cluster is only 4 nodes of 2 cores * 2 threads per core.

> However, in this config the MPI users could not request one task per physical core.

As Felip suggested in comment 4, I'm able to do it just by using --ntasks-per-core=1 (passed to sbatch for job arrays):

$ sbatch --array=0-7 --ntasks-per-core=1 --wrap "srun bash -c 'printenv SLURMD_NODENAME; taskset -cp \$\$'"
Submitted batch job 520
$ tail slurm-520_*
==> slurm-520_0.out <==
c1
pid 25834's current affinity list: 0,2
==> slurm-520_1.out <==
c1
pid 25857's current affinity list: 1,3
==> slurm-520_2.out <==
c2
pid 25917's current affinity list: 0,2
==> slurm-520_3.out <==
c2
pid 25955's current affinity list: 1,3
==> slurm-520_4.out <==
c3
pid 25936's current affinity list: 0,2
==> slurm-520_5.out <==
c3
pid 25942's current affinity list: 1,3
==> slurm-520_6.out <==
c4
pid 25962's current affinity list: 0,2
==> slurm-520_7.out <==
c4
pid 25975's current affinity list: 1,3

Are you getting different results? Although you were not using job arrays, from your comment 8 and comment 10 I guess this is also working for you? I don't see any problem here; please let me know if I'm missing something important.

> We went to the recommended slurm.conf config and the sbatch parameter for one task per core and MPI works. However, we are back to see the HT logical cores as unavailable to ST array jobs.

Here I think I understand your problem. Following the recommendations in comment 22 I get:

$ sbatch --array=0-15 --ntasks-per-core=1 --cpus-per-task=1 --wrap "srun --cpu-bind=thread bash -c 'printenv SLURMD_NODENAME; taskset -cp \$\$; sleep 60'"
Submitted batch job 683
$ tail slurm-683_*
==> slurm-683_0.out <==
c1
pid 10174's current affinity list: 0
==> slurm-683_10.out <==
c2
pid 11242's current affinity list: 0
==> slurm-683_11.out <==
c2
pid 11171's current affinity list: 1
==> slurm-683_12.out <==
c3
pid 11232's current affinity list: 0
==> slurm-683_13.out <==
c3
pid 11245's current affinity list: 1
==> slurm-683_14.out <==
c1
pid 11372's current affinity list: 1
==> slurm-683_15.out <==
c4
pid 11373's current affinity list: 1
==> slurm-683_1.out <==
c1
pid 10336's current affinity list: 1
==> slurm-683_2.out <==
c2
pid 10202's current affinity list: 0
==> slurm-683_3.out <==
c2
pid 10262's current affinity list: 1
==> slurm-683_4.out <==
c3
pid 10299's current affinity list: 0
==> slurm-683_5.out <==
c3
pid 10248's current affinity list: 1
==> slurm-683_6.out <==
c4
pid 10305's current affinity list: 0
==> slurm-683_7.out <==
c4
pid 10373's current affinity list: 1
==> slurm-683_8.out <==
c1
pid 11216's current affinity list: 0
==> slurm-683_9.out <==
c4
pid 11186's current affinity list: 0

Meaning that half of the 16 available threads are not used:

$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
683_[8-15] batch wrap agil PD 0:00 1 (Resources)
683_0 batch wrap agil R 0:10 1 c1
683_1 batch wrap agil R 0:10 1 c1
683_2 batch wrap agil R 0:10 1 c2
683_3 batch wrap agil R 0:10 1 c2
683_4 batch wrap agil R 0:10 1 c3
683_5 batch wrap agil R 0:10 1 c3
683_6 batch wrap agil R 0:10 1 c4
683_7 batch wrap agil R 0:10 1 c4

Is that your problem? Or have I misunderstood something important?

Regards,
Albert

Hi,

Please see the sample array submission in the attachments and let me know if this works for you. Two identical sets of servers:

NodeName=ta1l-h[1001-1044] CPUs=56 ThreadsPerCore=2 RealMemory=256000
NodeName=ta1l-h[1045-1088] CPUs=56 Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000

The array job works perfectly on 1001-1044 but fails to use the HT logical threads on 1045-1088. I believe you have captured the situation perfectly. I am certain we have a misstep, I just cannot figure out where. Thank you

Hi Doug,

Please check out this couple of links about CR_CPU and CPUs configuration with and without Cores and ThreadsPerCore:

https://slurm.schedmd.com/slurm.conf.html#OPT_CR_CPU
https://slurm.schedmd.com/faq.html#cpu_count

There you can read that, although you use CPUs as threads, if you also specify Cores and ThreadsPerCore, by default Slurm won't share cores between jobs (and job arrays are N jobs in that sense). Slurm does share cores between different tasks of the same job, as we have seen before, but not between the jobs of an array, which I think was your concern in bug 7033 and lastly here. If you only specify CPUs, and not Cores and ThreadsPerCore, then cores are shared, but you are not able to work at core level (which you also need, right?).

Anyway, as usual in Slurm, although the default may not behave as you want, there is a way to set it up to obtain pretty much what you want. In our case, I think the partition OverSubscribe parameter is the way to go. Please check it out at https://slurm.schedmd.com/cons_res_share.html

The following example should illustrate how to obtain what you are looking for.
My entire cluster contains 16 threads on 8 cores, and I have oversubscription enabled (with YES:2, but maybe FORCE:2 is also fine for you?):

$ sinfo --Format partition,nodelist,sockets,cores,threads,cpus,oversubscribe
PARTITION NODELIST SOCKETS CORES THREADS CPUS OVERSUBSCRIBE
debug c[1-4] 1 2 2 4 YES:2
batch* c[1-4] 1 2 2 4 YES:2

With this setup, and because we have YES instead of FORCE, I still need to use --oversubscribe on the client side to actually tell Slurm to share cores between the jobs of my job array:

$ sbatch --oversubscribe --array=0-15 --wrap "srun bash -c 'printenv SLURMD_NODENAME; taskset -cp \$\$; sleep 60'"
Submitted batch job 825

Now, unlike in comment 26, all 16 threads of my cluster are used by the array (no job is PD):

$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
825_0 batch wrap agil R 0:02 1 c1
825_1 batch wrap agil R 0:02 1 c1
825_2 batch wrap agil R 0:02 1 c2
825_3 batch wrap agil R 0:02 1 c2
825_4 batch wrap agil R 0:02 1 c3
825_5 batch wrap agil R 0:02 1 c3
825_6 batch wrap agil R 0:02 1 c4
825_7 batch wrap agil R 0:02 1 c4
825_8 batch wrap agil R 0:02 1 c1
825_9 batch wrap agil R 0:02 1 c1
825_10 batch wrap agil R 0:02 1 c2
825_11 batch wrap agil R 0:02 1 c2
825_12 batch wrap agil R 0:02 1 c3
825_13 batch wrap agil R 0:02 1 c3
825_14 batch wrap agil R 0:02 1 c4
825_15 batch wrap agil R 0:02 1 c4

# cat slurm-825*
c1 pid 8978's current affinity list: 0,2
c2 pid 9340's current affinity list: 0,2
c2 pid 9396's current affinity list: 1,3
c3 pid 9375's current affinity list: 0,2
c3 pid 9415's current affinity list: 1,3
c4 pid 9399's current affinity list: 0,2
c4 pid 9424's current affinity list: 1,3
c1 pid 9064's current affinity list: 1,3
c2 pid 8992's current affinity list: 0,2
c2 pid 9091's current affinity list: 1,3
c3 pid 9085's current affinity list: 0,2
c3 pid 9163's current affinity list: 1,3
c4 pid 9169's current affinity list: 0,2
c4 pid 9260's current affinity list: 1,3
c1 pid 9224's current affinity list: 0,2
c1 pid 9282's current affinity list: 1,3

Is that what you are looking for?

Hope that helps,
Albert

Thank you. Will read and test what you have shared. Great detail in the response!

Hi Doug,

> Thank you.

You are welcome!

> Will read and test what you have shared.

Have you tried it? How did it go? Can we close the ticket as infogiven?

> Great detail in the response!

Hope that helped!

Albert

Hi Doug, I'm closing this as infogiven. Please reopen if you have further questions.

Regards,
Albert

That is completely fair. It will be two weeks, I am afraid, before I can get a test done. But your solution looks great. Thank you for your support.