| Summary: | MPI users only want to schedule on physical cores | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Meyer <dameyer> |
| Component: | Configuration | Assignee: | Albert Gil <albert.gil> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | ben, felip.moll |
| Version: | 18.08.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Raytheon Missile, Space and Airborne | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | RHEL |
| Machine Name: | slurm02 | CLE Version: | |
| Version Fixed: | 7.6 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, test error message, job results, good results, sample array submission | ||
Description

Doug Meyer 2019-05-15 14:58:14 MDT

Hi Doug, a couple of questions first:

1. I have in my database that you are using CR_ONE_TASK_PER_CORE. Is that still true? Could you please attach your up-to-date slurm.conf?

2. I guess you want an 'MPI user' to exclusively allocate cores, but 'non-MPI' ones to allocate threads. Is that it? I understand you don't want a non-MPI user to allocate unused threads of a core used by an MPI user, right?

3. Do you want it to be possible for users to override this, or is it a hard enforcement?

Created attachment 10244 [details]
slurm.conf
slurm.conf is attached. We are using CR_CPU_Memory. When non-MPI jobs are submitted we don't care whether they run on cores or threads. For MPI we want jobs to run only on physical cores, without having to "block" or be charged for the HT threads. I believe we could then ignore the cores per node and simply ask Slurm to provide the requested number of cores, filling nodes with tasks until the requirement is satisfied.

> I understand you don't want a non-MPI user to allocate unused threads of a core used by an MPI user, right?

Correct.

> 3. Do you want it to be possible for users to override this, or is it a hard enforcement?

I don't quite follow, but we never want a core used for MPI to be available to a non-MPI logical thread. Thank you.

Doug,

Given your configuration I checked a couple of options, and I think you will get the desired behavior just by adding the '--ntasks-per-core' option to the request. For example:

srun --mem 10 --ntasks-per-core=1 --ntasks=56 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n

will allocate two entire nodes for the job and run 56 tasks on them. In the output you will see that every task is bound to both threads of one core. For example, a job requesting 28 tasks will get 56 threads:

[slurm@moll0 18.08]$ srun --mem 10 --ntasks-per-core=1 --ntasks=28 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n
0,1
2,3
4,5
6,7
8,9
10,11
12,13
14,15
16,17
18,19
20,21
22,23
24,25
26,27
28,29
30,31
32,33
34,35
36,37
38,39
40,41
42,43
44,45
46,47
48,49
50,51
52,53
54,55

Running fewer tasks leaves some cores free for other jobs (CPUs 20-55 in this example):

[slurm@moll0 18.08]$ srun --mem 10 --ntasks-per-core=1 --ntasks=10 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n
0,1
2,3
4,5
6,7
8,9
10,11
12,13
14,15
16,17
18,19

And asking for more than 28 tasks will grant you more than one node.
The same can be achieved by asking for cpus-per-task instead:

]$ srun --mem 10 --cpus-per-task=2 --ntasks=10 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n

In that case, note that sruns run inside an allocation should use the --exclusive flag, since otherwise the bind mask is inherited from the parent. For example, this allocates two nodes, and inside the allocation the srun must run with --exclusive:

[slurm@moll0 18.08]$ salloc --mem 10 --cpus-per-task=2 --ntasks=56
[slurm@moll1 18.08]$ srun -n10 --exclusive bash -c 'taskset -pc $$'

I recommend reading carefully about the --cpus-per-task, --ntasks-per-core and --cpu-bind options of sbatch/salloc/srun. All of these suggestions also depend on how the user programs are run afterwards, so if this doesn't work for you, please show me a concrete example of a batch script. I'll be waiting for your feedback.

Hi,

When I attempt the command

srun --mem 10 --ntasks-per-core=1 --ntasks=28 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n

I get a complaint about an undeclared variable. I am missing something...

Thank you

(In reply to Doug Meyer from comment #5)
> When I attempt the command [...] I get a complaint about an undeclared variable.

A copy-paste of this command works for me. The command runs 28 tasks; each of them executes

bash -c "taskset -cp $$"

and the output is then truncated (cut) and sorted. Can you please check whether your copy-paste mangled the \$\$ or similar? I tried again with a copy-paste of your comment and it works for me. If it is not working, please paste here the exact output of your terminal.

Created attachment 10289 [details]
test error message
I should state that the server is RHEL 7.6; submission was tried from both RHEL 6 and RHEL 7 hosts. The target is RHEL 6.
Created attachment 10293 [details]
job results
Put command in a script and ran srun. Voila!
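The behavior here hinges on which shell expands `$$` before srun ever sees the command. A minimal bash-only illustration (no Slurm involved; a sketch, not the cluster command itself):

```shell
#!/bin/sh
# Who expands $$ decides which PID taskset/echo would report.

# Single quotes: the outer shell passes $$ through literally, so the
# inner bash expands it to the inner bash's own PID.
inner=$(bash -c 'echo $$')

# In bash, \$\$ inside double quotes reaches the inner shell as a
# literal $$, so the result is equivalent. (csh treats \$ differently,
# which is why the same command line can fail from a csh prompt.)
also_inner=$(bash -c "echo \$\$")

echo "single-quoted:     $inner"
echo "backslash-escaped: $also_inner"
```

Both variants print a numeric PID when run from bash; only the single-quoted form is portable to csh.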
That's because you are using csh, and csh doesn't like the \$\$. Escapes in csh are done differently; from a csh prompt:

]$ bash -c "echo \$\$"
Variable name must contain alphanumeric characters.
]$ bash -c "echo '$$'"
5277
]$ bash -c 'echo $$'
5348

so the right command for csh is the single-quoted form:

srun --partition=hpc3a --mem 10 --ntasks-per-core=1 --ntasks=28 bash -c 'taskset -cp $$' | cut -d ":" -f 2 | sort -n

In any case, from the results in comment 8 I cannot tell whether it is assigning one core or one thread to each task. I really do need the output of taskset.

[2019-05-20T11:14:51.144] JobId=3902209 StepId=0
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[0] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[1] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[2] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[3] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[4] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[5] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[6] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[7] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[8] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[9] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[10] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[11] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[12] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[0] Core[13] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[0] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[1] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[2] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[3] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[4] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[5] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[6] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[7] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[8] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[9] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[10] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[11] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[12] is allocated
[2019-05-20T11:14:51.144] JobNode[0] Socket[1] Core[13] is allocated

So:

a) Repeat the test with proper escaping and paste the output here:

srun --partition=hpc3a --mem 10 --ntasks-per-core=1 --ntasks=28 bash -c 'taskset -cp $$' | cut -d ":" -f 2 | sort -n

or just attach the output of the script which ran correctly in your comment 8. I need the output of taskset to see where it effectively binds the processes.

b) Set the slurmd debug level in slurm.conf to 'debug', restart the slurmd daemons, and attach the full slurmd log; I need to see everything from the start of the daemon until you run the job.

c) Let's work in this bug for now; we will continue on 7033 later, to avoid repeating things in both places.

Hi,

I should have noted the shell. It still blows up in bash but is fine in csh.

session01:csh-NIL: srun --partition=hpc3a --mem 10 --ntasks-per-core=1 --ntasks=28 bash -c 'taskset -cp $$' | cut -d ":" -f 2 | sort -n
0,28
.
.
27,55

Before I push the change to slurm.conf I would like to make sure I am clear on what you want. I looked for an exact match for setting the slurmd debug level in slurm.conf but did not see one. I will set SlurmctldDebug to "debug", restart slurmctld, push the conf to all nodes, and test. Thank you for your patience.

(In reply to Doug Meyer from comment #10)
> Should have noted the shell. Still blows up in bash but fine in csh.
> session01:csh-NIL: srun --partition=hpc3a --mem 10 --ntasks-per-core=1 --ntasks=28 bash -c 'taskset -cp $$' | cut -d ":" -f 2 | sort -n
> 0,28
> .
> .
> 27,55

If your output looks like the one you show here, everything is already working correctly. Slurm is assigning the two threads of one core to each task, so --ntasks-per-core=1 effectively schedules tasks on physical cores:

0,28  -> core 0  <-- assigned to task 1
.
.
27,55 -> core 27 <-- assigned to task 28

The only thing that can happen here is that if you don't enable ConstrainCores in cgroup.conf, this is a 'soft' binding that an application can change, so I encourage you to set what I already suggested: task/cgroup,task/affinity in slurm.conf, plus ConstrainCores=yes and TaskAffinity=no in cgroup.conf.

> Before I push the change to the slurm.conf I would like to make sure I am
> clear on what you would like. Looked for an exact match for setting slurmd
> debug level in the slurm.conf but did not see that.

No need to do that now, since I see everything is fine. This also confirms that hyperthreading is working correctly, so bug 7033 should be fine too. Tell me if you see it differently or if something is not OK for your use case.

Created attachment 10311 [details]
good results
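The hard-binding setup recommended above (task plugins in slurm.conf plus the core constraint in cgroup.conf) might look like the following sketch; treat it as an outline to adapt to the local config, not a drop-in file:

```ini
# slurm.conf (fragment) -- enable both task plugins
TaskPlugin=task/affinity,task/cgroup

# cgroup.conf (fragment) -- make the core binding a hard constraint,
# and let task/affinity (rather than the cgroup plugin) do the pinning
ConstrainCores=yes
TaskAffinity=no
```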
Please see the attached notes. I believe we have a win.

I understand all situations are fine then? Is your case resolved and the behavior expected? Using --ntasks-per-core seems to give what you want, right? Thanks :)

This ties back to 7033. Using #SBATCH --ntasks-per-core=2 (or 1) has no impact on the single-threaded array jobs. I will ask an MPI user to dispatch a test job this morning and see what the result is. Thank you.

*** Ticket 7033 has been marked as a duplicate of this ticket. ***

Any updates?

(In reply to Doug Meyer from comment #15)
> I will ask an MPI user to dispatch a test job this morning and see what the result is.

Doug, I think we were each waiting for the other. I was waiting for your feedback on the comment above. I will review the latest comments and configs again and come up with a response later.

I dropped the ball. The engineer was able to run successfully: 40 cores requested from a partition with 28-core systems ran with 28 cores on one host and 12 on another. I asked the engineer to review planar vs. cyclic distribution. His next concern was the possibility of other jobs utilizing the unused second logical thread on each core; I shared the documentation showing this will not happen. The remaining concern is "what if other jobs interfere with my job". Our nodes have a single fabric port, so we need to restrict nodes to one MPI job at a time. I can think of only two options: use exclusive and waste the unused cores, or create a gres for the fabric port with a count of one so that only a single MPI job is supported at a time, though we would then need to figure out how to enforce use of the gres. I would like to pursue the problem of arrays not using the logical threads next.

Good, I will think a bit more later about:

> Use exclusive and waste the unused cores or create a gres for the fabric port with a count of one so that only a single MPI job would be supported at a time, but we need to figure out how to enforce use of the gres.
but I think you're on the right path. As you note, enforcing use of the gres is not trivial. You could create a prolog that sets iptables rules depending on how the user submitted the job (requested the gres or not) and makes the interface usable only by that user. The MPI program would then have to use that interface explicitly, while other software should never use it. This only works for TCP/UDP traffic; I don't think it is possible with RDMA. It's just a vague idea and I haven't thought seriously about the implications. Another way could be the cgroups device interface; we could study that possibility. The easiest and most reliable way is, as you say, to use the --exclusive flag.

> Would like to pursue the problem of arrays not using the logical threads next.

I will look into that too. Can you give me an exact run and output? Thanks

Created attachment 10498 [details]
sample array submission
Attachment has the sbatch and the sleep job. On a partition where we only have the CPU declaration, CPUs=56, we run on all cores/threads. If I run on a node where we use the detailed description, CPUs=56 Sockets=2 CoresPerSocket=14 ThreadsPerCore=2, we only run on the physical cores.
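Side by side, the two declaration styles at issue here look like this in slurm.conf; the comments summarize the behavior described in this thread (a sketch, reusing the node ranges from the report):

```ini
# Flat declaration: Slurm sees 56 schedulable CPUs with no core/thread
# topology, so array jobs can land on every hardware thread, but there
# is no notion of "physical core" to pin MPI tasks to.
NodeName=ta1l-h[1001-1044] CPUs=56 ThreadsPerCore=2 RealMemory=256000

# Full topology: --ntasks-per-core=1 can pin MPI tasks to physical
# cores, but by default distinct jobs (including array elements) do
# not share the two threads of one core.
NodeName=ta1l-h[1045-1088] CPUs=56 Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000
```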
(In reply to Doug Meyer from comment #21)
> Attachment has the sbatch and the sleep job. On a partition where we only
> have the CPU declaration, CPUs=56, we run on all cores/threads. If I run on
> a node where we use the detailed description, CPUs=56 Sockets=2
> CoresPerSocket=14 ThreadsPerCore=2, we only run on the physical cores.

Doug,

On nodes with only CPUs declared, Slurm doesn't use the real topology, so if you want a full core for each task you should try the following:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2

In my tests, with cpus-per-task=2:

moll3: 12274 0,1
moll3: 12277 2,3
moll3: 12279 4,5

and with cpus-per-task=1:

moll3: 12558 0
moll3: 12559 2
moll3: 12562 1
moll3: 12566 3
moll3: 12568 4
moll3: 12572 5
moll3: 12574 6

For the opposite case, on a node with the full topology defined (CPUs + sockets + cores + threads), if you want to run tasks on a single thread:

- With srun you can use --cpu-bind=thread or --cpu-bind=core.
- With sbatch, the allocation itself is not bound to threads but gets all the necessary resources, which means a script like yours runs with all the resources granted to the allocation. If you want to restrict each task to a single thread, you must run a task inside the allocation, for example:

]$ cat run-serial.sh
#!/bin/bash
#SBATCH --job-name=serial
#SBATCH --array=0-300
#SBATCH --ntasks-per-core=1
#SBATCH --cpus-per-task=1
srun --cpu-bind=thread /bin/sleep $(($RANDOM % 100))

Then you get:

moll1: 25893 0,1 <--- Part of allocation 1
moll1: 25915 0,1 <--- Part of allocation 1
moll1: 25916 2,3 <--- Part of allocation 2
moll1: 25931 2,3 <--- Part of allocation 2
moll1: 25888 4,5 <--- Part of allocation 3
moll1: 25899 4,5 <--- Part of allocation 3
moll1: 25954 6,7 <--- Part of allocation 4
moll1: 25946 6,7 <--- Part of allocation 4
moll1: 25886 8,9 <--- Part of allocation 5
moll1: 25890 8,9 <--- Part of allocation 5
moll1: 25953 0 <-- Task 1
moll1: 25974 2 <-- Task 2
moll1: 25948 4 <-- Task 3
moll1: 25976 6 <-- Task 4
moll1: 25945 8 <-- Task 5
...

That's because the binding is per task or process, not per allocation. Using mpirun inside the sbatch script will also give per-task binding.

With srun, note these differences:

]$ srun -w moll1 --cpu-bind=core --mem 10 --ntasks-per-core=1 --ntasks=6 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n
### Here we get the full core, but still run 1 task per core
0,1
2,3
4,5
6,7
8,9
10,11

]$ srun -w moll1 --cpu-bind=thread --mem 10 --ntasks-per-core=1 --ntasks=6 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n
### Here we get just one thread, and at most one task in each core
0
2
4
6
8
10

[slurm@moll0 ~]$ srun -w moll1 --cpu-bind=thread --mem 10 --ntasks-per-core=2 --ntasks=6 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n
### Here we get one task per thread
0
1
2
3
4
5

My lstopo looks like:

Machine (985MB)
  Package L#0
    L2 L#0 (4096KB) + Core L#0
      L1d L#0 (32KB) + L1i L#0 (32KB) + PU L#0 (P#0)
      L1d L#1 (32KB) + L1i L#1 (32KB) + PU L#1 (P#1)
    L2 L#1 (4096KB) + Core L#1
      L1d L#2 (32KB) + L1i L#2 (32KB) + PU L#2 (P#2)
      L1d L#3 (32KB) + L1i L#3 (32KB) + PU L#3 (P#3)
    L2 L#2 (4096KB) + Core L#2
      L1d L#4 (32KB) + L1i L#4 (32KB) + PU L#4 (P#4)
      L1d L#5 (32KB) + L1i L#5 (32KB) + PU L#5 (P#5)
    L2 L#3 (4096KB) + Core L#3
      L1d L#6 (32KB) + L1i L#6 (32KB) + PU L#6 (P#6)
      L1d L#7 (32KB) + L1i L#7 (32KB) + PU L#7 (P#7)
    L2 L#4 (4096KB) + Core L#4
      L1d L#8 (32KB) + L1i L#8 (32KB) + PU L#8 (P#8)
      L1d L#9 (32KB) + L1i L#9 (32KB) + PU L#9 (P#9)
    L2 L#5 (4096KB) + Core L#5
      L1d L#10 (32KB) + L1i L#10 (32KB) + PU L#10 (P#10)
      L1d L#11 (32KB) + L1i L#11 (32KB) + PU L#11 (P#11)

See if this makes sense to you; if not, I can do more testing and go a bit deeper if needed.

Hi Doug,

Since the behavior you experienced is OK for you, and given that I see everything working correctly and have provided further info, this bug will now be marked as resolved/infogiven. Please don't hesitate to mark it OPEN again if you still have questions, or just open a new one.

Best regards,
Felip

Please do not close this ticket until we address the array question that was introduced in 7033; that ticket was closed as a duplicate of 7029. Summary: in order to make HT available to single-threaded (ST) array jobs we had to set CPUs=<HT thread count> in slurm.conf. However, in that config the MPI users could not request one task per physical core. We moved to the recommended slurm.conf config and the sbatch parameter for one task per core, and MPI works. However, we are back to seeing the HT logical cores as unavailable to ST array jobs. A sample job was shared. Either reopen 7033 or reopen this case.
I appreciate that the queries are similar, but we have not addressed the inability to utilize the HT logical threads.

Hi Doug,

Felip won't be available for some days, so I will assist you instead. I've gone through this bug and 7033 and I'm not totally sure I understand the problem.

> In order to get HT available for ST array jobs we had to set CPUS=<HT thread count> in the slurm.conf.

OK. I've set up a cluster equivalent to yours:
- HT enabled
- CPUs set to match threads (not cores)
- Using CR_CPU_MEMORY
- NOT using CR_ONE_TASK_PER_CORE
- Using cgroups to constrain cores

$ slurmd -C
NodeName=agildell CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=2 RealMemory=15911

slurm.conf:

NodeName=DEFAULT Sockets=1 CPUS=4 CoresPerSocket=2 ThreadsPerCore=2 RealMemory=4096

$ scontrol show config | grep "TaskPlugin\|Select\|ConstrainCores"
SelectType = select/cons_res
SelectTypeParameters = CR_CPU_MEMORY
TaskPlugin = task/affinity,task/cgroup
TaskPluginParam = (null type)
ConstrainCores = yes

My test cluster is only 4 nodes of 2 cores * 2 threads per core.

> However, in this config the MPI users could not request one task per physical core.

As Felip suggested in comment 4, I'm able to do it just by using --ntasks-per-core=1 (passed to sbatch for job arrays):

$ sbatch --array=0-7 --ntasks-per-core=1 --wrap "srun bash -c 'printenv SLURMD_NODENAME; taskset -cp \$\$'"
Submitted batch job 520
$ tail slurm-520_*
==> slurm-520_0.out <==
c1
pid 25834's current affinity list: 0,2
==> slurm-520_1.out <==
c1
pid 25857's current affinity list: 1,3
==> slurm-520_2.out <==
c2
pid 25917's current affinity list: 0,2
==> slurm-520_3.out <==
c2
pid 25955's current affinity list: 1,3
==> slurm-520_4.out <==
c3
pid 25936's current affinity list: 0,2
==> slurm-520_5.out <==
c3
pid 25942's current affinity list: 1,3
==> slurm-520_6.out <==
c4
pid 25962's current affinity list: 0,2
==> slurm-520_7.out <==
c4
pid 25975's current affinity list: 1,3

Are you getting different results? Although you were not using job arrays, from your comment 8 and comment 10 I guess this is also working for you? I don't see any problem here; please let me know if I'm missing something important.

> We went to the recommended slurm.conf config and the sbatch parameter for one task per core and MPI works. However, we are back to see the HT logical cores as unavailable to ST array jobs.

Here I think I understand your problem. Following the recommendations in comment 22 I get:

$ sbatch --array=0-15 --ntasks-per-core=1 --cpus-per-task=1 --wrap "srun --cpu-bind=thread bash -c 'printenv SLURMD_NODENAME; taskset -cp \$\$; sleep 60'"
Submitted batch job 683
$ tail slurm-683_*
==> slurm-683_0.out <==
c1
pid 10174's current affinity list: 0
==> slurm-683_10.out <==
c2
pid 11242's current affinity list: 0
==> slurm-683_11.out <==
c2
pid 11171's current affinity list: 1
==> slurm-683_12.out <==
c3
pid 11232's current affinity list: 0
==> slurm-683_13.out <==
c3
pid 11245's current affinity list: 1
==> slurm-683_14.out <==
c1
pid 11372's current affinity list: 1
==> slurm-683_15.out <==
c4
pid 11373's current affinity list: 1
==> slurm-683_1.out <==
c1
pid 10336's current affinity list: 1
==> slurm-683_2.out <==
c2
pid 10202's current affinity list: 0
==> slurm-683_3.out <==
c2
pid 10262's current affinity list: 1
==> slurm-683_4.out <==
c3
pid 10299's current affinity list: 0
==> slurm-683_5.out <==
c3
pid 10248's current affinity list: 1
==> slurm-683_6.out <==
c4
pid 10305's current affinity list: 0
==> slurm-683_7.out <==
c4
pid 10373's current affinity list: 1
==> slurm-683_8.out <==
c1
pid 11216's current affinity list: 0
==> slurm-683_9.out <==
c4
pid 11186's current affinity list: 0

Meaning that half of the 16 available threads are not used:

$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
683_[8-15] batch wrap agil PD 0:00 1 (Resources)
683_0 batch wrap agil R 0:10 1 c1
683_1 batch wrap agil R 0:10 1 c1
683_2 batch wrap agil R 0:10 1 c2
683_3 batch wrap agil R 0:10 1 c2
683_4 batch wrap agil R 0:10 1 c3
683_5 batch wrap agil R 0:10 1 c3
683_6 batch wrap agil R 0:10 1 c4
683_7 batch wrap agil R 0:10 1 c4

Is that your problem? Or have I misunderstood something important?

Regards,
Albert

Hi,

Please see the sample array submission in the attachments and let me know if this works for you. Two identical sets of servers:

NodeName=ta1l-h[1001-1044] CPUs=56 ThreadsPerCore=2 RealMemory=256000
NodeName=ta1l-h[1045-1088] CPUs=56 Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000

The array job works perfectly on 1001-1044 but fails to use the HT logical threads on 1045-1088. I believe you have captured the situation perfectly. I am certain we have a misstep, I just cannot figure out where. Thank you

Hi Doug,

Please check out this couple of links about CR_CPU and CPUs configuration with and without Cores and ThreadsPerCore:

https://slurm.schedmd.com/slurm.conf.html#OPT_CR_CPU
https://slurm.schedmd.com/faq.html#cpu_count

There you can read that, although you use CPUs as threads, if you also specify Cores and ThreadsPerCore, by default Slurm won't share cores between jobs (and job arrays are N jobs in that sense). Slurm does share cores between different tasks of the same job, as we have seen before, but not between the jobs of an array, which I think was your concern in bug 7033 and lastly here. If you only specify CPUs, and not Cores and ThreadsPerCore, then cores are shared, but you are not able to work at core level (which you also need, right?).

Anyway, as usual in Slurm, although the default may not behave as you want, there is a way to set it up to obtain pretty much what you want. In our case, I think the partition OverSubscribe parameter is the way to go. Please check it out at https://slurm.schedmd.com/cons_res_share.html

The following example should illustrate how to obtain what you are looking for.
My entire cluster contains 16 threads on 8 cores, and I have oversubscription enabled (with YES:2, but maybe FORCE:2 is also fine for you?):

$ sinfo --Format partition,nodelist,sockets,cores,threads,cpus,oversubscribe
PARTITION NODELIST SOCKETS CORES THREADS CPUS OVERSUBSCRIBE
debug c[1-4] 1 2 2 4 YES:2
batch* c[1-4] 1 2 2 4 YES:2

With this setup, and because we have YES instead of FORCE, I still need to use --oversubscribe on the client side to actually tell Slurm to share cores between the jobs of my job array:

$ sbatch --oversubscribe --array=0-15 --wrap "srun bash -c 'printenv SLURMD_NODENAME; taskset -cp \$\$; sleep 60'"
Submitted batch job 825

Now, unlike in comment 26, all 16 threads of my cluster are used by the array (no job is PD):

$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
825_0 batch wrap agil R 0:02 1 c1
825_1 batch wrap agil R 0:02 1 c1
825_2 batch wrap agil R 0:02 1 c2
825_3 batch wrap agil R 0:02 1 c2
825_4 batch wrap agil R 0:02 1 c3
825_5 batch wrap agil R 0:02 1 c3
825_6 batch wrap agil R 0:02 1 c4
825_7 batch wrap agil R 0:02 1 c4
825_8 batch wrap agil R 0:02 1 c1
825_9 batch wrap agil R 0:02 1 c1
825_10 batch wrap agil R 0:02 1 c2
825_11 batch wrap agil R 0:02 1 c2
825_12 batch wrap agil R 0:02 1 c3
825_13 batch wrap agil R 0:02 1 c3
825_14 batch wrap agil R 0:02 1 c4
825_15 batch wrap agil R 0:02 1 c4

# cat slurm-825*
c1 pid 8978's current affinity list: 0,2
c2 pid 9340's current affinity list: 0,2
c2 pid 9396's current affinity list: 1,3
c3 pid 9375's current affinity list: 0,2
c3 pid 9415's current affinity list: 1,3
c4 pid 9399's current affinity list: 0,2
c4 pid 9424's current affinity list: 1,3
c1 pid 9064's current affinity list: 1,3
c2 pid 8992's current affinity list: 0,2
c2 pid 9091's current affinity list: 1,3
c3 pid 9085's current affinity list: 0,2
c3 pid 9163's current affinity list: 1,3
c4 pid 9169's current affinity list: 0,2
c4 pid 9260's current affinity list: 1,3
c1 pid 9224's current affinity list: 0,2
c1 pid 9282's current affinity list: 1,3

Is that what you are looking for?

Hope that helps,
Albert

Thank you. Will read and test what you have shared. Great detail in the response!

Hi Doug,

> Thank you.

You are welcome!

> Will read and test what you have shared.

Have you tried it? How did it go? Can we close the ticket as infogiven?

> Great detail in the response!

Hope that helped!

Albert

Hi Doug, I'm closing this as infogiven. Please reopen if you have further questions.

Regards,
Albert

That is completely fair. It will be two weeks, I am afraid, before I can get a test done. But your solution looks great. Thank you for your support.