Ticket 15120 - pack_serial_at_end doesn't seem to work for cpu-only jobs submitted to gpu nodes
Summary: pack_serial_at_end doesn't seem to work for cpu-only jobs submitted to gpu nodes
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 22.05.2
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Scott Hilton
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-10-07 09:03 MDT by Renata Dart
Modified: 2022-10-31 16:17 MDT

See Also:
Site: SLAC
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (10.55 KB, text/plain)
2022-10-07 09:03 MDT, Renata Dart

Description Renata Dart 2022-10-07 09:03:14 MDT
Created attachment 27173 [details]
slurm.conf

Hi SchedMD, we have a cluster that is a mix of cpu-only nodes and gpu nodes.
We run with pack_serial_at_end and it seems to work fine for the cpu-only nodes, but cpu-only jobs submitted to the gpu nodes spread out across nodes instead of packing. I tested this by submitting 8 jobs in quick succession to a gpu partition; the jobs only sleep and print the date:

[renata@sdf-login03 mpi]$ squeue -u renata
        JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      5875326  neutrino run.n.sh   renata PD       0:00      1 (Resources)
      5875319  neutrino run.n.sh   renata  R       0:11      1 tur026
      5875320  neutrino run.n.sh   renata  R       0:11      1 tur026
      5875321  neutrino run.n.sh   renata  R       0:11      1 tur026
      5875322  neutrino run.n.sh   renata  R       0:11      1 tur025
      5875323  neutrino run.n.sh   renata  R       0:11      1 tur022
      5875324  neutrino run.n.sh   renata  R       0:11      1 tur022
      5875325  neutrino run.n.sh   renata  R       0:11      1 tur022

[renata@sdf-login03 mpi]$ squeue -w tur022 -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R %c %m"
        JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) MIN_CPUS MIN_MEMORY
      5875323  neutrino run.n.sh   renata  R       0:32      1 tur022 1 4000M
      5875324  neutrino run.n.sh   renata  R       0:32      1 tur022 1 4000M
      5875325  neutrino run.n.sh   renata  R       0:32      1 tur022 1 4000M
      5874771  neutrino reviewED zhulcher  R    1:29:04      1 tur022 1 20G
      5874461  neutrino reviewED zhulcher  R    2:25:24      1 tur022 1 20G
      5874291  neutrino reviewED zhulcher  R    2:54:34      1 tur022 1 20G
      5873579  neutrino reviewED zhulcher  R    5:06:26      1 tur022 1 20G

[renata@sdf-login03 mpi]$ squeue -w tur025 -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R %c %m"
        JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) MIN_CPUS MIN_MEMORY
      5861593    cryoem sys/dash rkretsch  R 1-05:43:57      1 tur025 44 125G
      5875322  neutrino run.n.sh   renata  R       0:40      1 tur025 1 4000M
      5873866  neutrino reviewED zhulcher  R    4:11:10      1 tur025 1 20G

[renata@sdf-login03 mpi]$ squeue -w tur026 -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R %c %m"
        JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) MIN_CPUS MIN_MEMORY
      5875319  neutrino run.n.sh   renata  R       0:48      1 tur026 1 4000M
      5875320  neutrino run.n.sh   renata  R       0:48      1 tur026 1 4000M
      5875321  neutrino run.n.sh   renata  R       0:48      1 tur026 1 4000M
      5874912  neutrino reviewED zhulcher  R    1:04:11      1 tur026 1 20G
      5874737  neutrino reviewED zhulcher  R    1:37:23      1 tur026 1 20G
      5874681  neutrino reviewED zhulcher  R    1:45:25      1 tur026 1 20G
      5875256  neutrino reviewED zhulcher  R       9:51      1 tur026 1 20G

The tur nodes look like this:

NodeName=tur[000-026]   CPUs=48 RealMemory=191552 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=gpu:geforce_rtx_2080_ti:10 Features=CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5       Weight=56117   State=UNKNOWN

I have included our slurm.conf.

Thanks,
Renata
Comment 1 Scott Hilton 2022-10-07 15:36:46 MDT
Renata,

Can I get your gres.conf for the node in question as well?

-Scott
Comment 2 Renata Dart 2022-10-07 16:50:04 MDT
Hi Scott, this is the gres.conf:

################################################################################
## slurm gres conf
################################################################################

#AutoDetect=nvml

###
# hep nodes
###
#NodeName=hep-gpu01           Name=gpu   Type=geforce_gtx_1080_ti   Count=8   File=/dev/nvidia[0,2-4,6-9]
#NodeName=hep-gpu01           Name=gpu   Type=titan_xp              Count=2   File=/dev/nvidia[1,5]

###
# pascal 1080ti nodes
###
NodeName=psc[000-009]        Name=gpu   Type=geforce_gtx_1080_ti   Count=10  File=/dev/nvidia[0-9]

###
# turing 2080 nodes
###
NodeName=tur[000-026]         Name=gpu   Type=geforce_rtx_2080_ti   Count=10  File=/dev/nvidia[0-9]

###
# volta v100 nodes
###
NodeName=volt[000-005]         Name=gpu   Type=v100                  Count=4  File=/dev/nvidia[0-3]

###
# ampere a100 nodes
###
NodeName=ampt[000-020]         Name=gpu  Type=a100                   Count=4 File=/dev/nvidia[0-3]


Renata

Comment 3 Renata Dart 2022-10-11 08:27:11 MDT
Hi Scott, just wondering if there is any update on this issue?

Thanks,
Renata
Comment 4 Scott Hilton 2022-10-11 14:38:28 MDT
Renata,

Sorry for the delay, I still need more information to understand what is going on.

Please send the command and script you used to launch gpu jobs.

Also send the output of this command:
>sacct -po jobid,nodelist,reqtres,alloctres -j 5875319,5875320,5875321,5875322,5875323,5875324,5875325,5875326,5874912,5874737,5874681,5875256,5861593,5873866

-Scott
Comment 5 Renata Dart 2022-10-11 15:17:20 MDT
Hi Scott, here is the output of the command requested:

[renata@sdf-login03 bin]$ sacct -po jobid,nodelist,reqtres,alloctres -j 5875319,5875320,5875321,5875322,5875323,5875324,5875325,5875326,5874912,5874737,5874681,5875256,5861593,5873866
JobID|NodeList|ReqTRES|AllocTRES|
5861593|tur025|billing=44,cpu=44,gres/gpu:geforce_rtx_2080_ti=8,gres/gpu=8,mem=125G,node=1|billing=44,cpu=44,gres/gpu:geforce_rtx_2080_ti=8,gres/gpu=8,mem=125G,node=1|
5861593.batch|tur025||cpu=44,gres/gpu:geforce_rtx_2080_ti=8,gres/gpu=8,mem=125G,node=1|
5861593.extern|tur025||billing=44,cpu=44,gres/gpu:geforce_rtx_2080_ti=8,gres/gpu=8,mem=125G,node=1|
5873866|tur025|billing=1,cpu=1,mem=20G,node=1|billing=2,cpu=2,mem=40G,node=1|
5873866.batch|tur025||cpu=2,mem=40G,node=1|
5873866.extern|tur025||billing=2,cpu=2,mem=40G,node=1|
5874681|tur026|billing=1,cpu=1,mem=20G,node=1|billing=2,cpu=2,mem=40G,node=1|
5874681.batch|tur026||cpu=2,mem=40G,node=1|
5874681.extern|tur026||billing=2,cpu=2,mem=40G,node=1|
5874737|tur026|billing=1,cpu=1,mem=20G,node=1|billing=2,cpu=2,mem=40G,node=1|
5874737.batch|tur026||cpu=2,mem=40G,node=1|
5874737.extern|tur026||billing=2,cpu=2,mem=40G,node=1|
5874912|tur026|billing=1,cpu=1,mem=20G,node=1|billing=2,cpu=2,mem=40G,node=1|
5874912.batch|tur026||cpu=2,mem=40G,node=1|
5874912.extern|tur026||billing=2,cpu=2,mem=40G,node=1|
5875256|tur026|billing=1,cpu=1,mem=20G,node=1|billing=2,cpu=2,mem=40G,node=1|
5875256.batch|tur026||cpu=2,mem=40G,node=1|
5875256.extern|tur026||billing=2,cpu=2,mem=40G,node=1|
5875319|tur026|billing=1,cpu=1,mem=4000M,node=1|billing=2,cpu=2,mem=8000M,node=1|
5875319.batch|tur026||cpu=2,mem=8000M,node=1|
5875319.extern|tur026||billing=2,cpu=2,mem=8000M,node=1|
5875320|tur026|billing=1,cpu=1,mem=4000M,node=1|billing=2,cpu=2,mem=8000M,node=1|
5875320.batch|tur026||cpu=2,mem=8000M,node=1|
5875320.extern|tur026||billing=2,cpu=2,mem=8000M,node=1|
5875321|tur026|billing=1,cpu=1,mem=4000M,node=1|billing=2,cpu=2,mem=8000M,node=1|
5875321.batch|tur026||cpu=2,mem=8000M,node=1|
5875321.extern|tur026||billing=2,cpu=2,mem=8000M,node=1|
5875322|tur025|billing=1,cpu=1,mem=4000M,node=1|billing=2,cpu=2,mem=8000M,node=1|
5875322.batch|tur025||cpu=2,mem=8000M,node=1|
5875322.extern|tur025||billing=2,cpu=2,mem=8000M,node=1|
5875323|tur022|billing=1,cpu=1,mem=4000M,node=1|billing=2,cpu=2,mem=8000M,node=1|
5875323.batch|tur022||cpu=2,mem=8000M,node=1|
5875323.extern|tur022||billing=2,cpu=2,mem=8000M,node=1|
5875324|tur022|billing=1,cpu=1,mem=4000M,node=1|billing=2,cpu=2,mem=8000M,node=1|
5875324.batch|tur022||cpu=2,mem=8000M,node=1|
5875324.extern|tur022||billing=2,cpu=2,mem=8000M,node=1|
5875325|tur022|billing=1,cpu=1,mem=4000M,node=1|billing=2,cpu=2,mem=8000M,node=1|
5875325.batch|tur022||cpu=2,mem=8000M,node=1|
5875325.extern|tur022||billing=2,cpu=2,mem=8000M,node=1|
5875326|tur026|billing=1,cpu=1,mem=4000M,node=1|billing=2,cpu=2,mem=8000M,node=1|
5875326.batch|tur026||cpu=2,mem=8000M,node=1|
5875326.extern|tur026||billing=2,cpu=2,mem=8000M,node=1|


I'll have to investigate what command/script may have been used.

Renata

Comment 6 Renata Dart 2022-10-11 15:48:42 MDT
Hi Scott, actually I was submitting non-gpu jobs to the neutrino partition as a test, because zhulcher was doing the same and we noticed that his jobs were spreading rather than packing.  My job is simple:

#!/bin/sh

#SBATCH --partition=neutrino
#SBATCH --ntasks-per-node=1
#
sleep 60
date

Thanks,
Renata
Comment 7 Renata Dart 2022-10-11 15:56:16 MDT
Hi again, just to be clear, the neutrino partition only has gpu hosts, but they also have cpu available on them, so some users like zhulcher send cpu-only jobs there.

Renata
Comment 8 Scott Hilton 2022-10-11 16:13:00 MDT
Renata,

It looks like tur026 ran out of memory according to AllocTRES (4 jobs using 40GB each and 3 jobs using 8GB each).

So the algorithm started scheduling on tur025 until it ran out of cpus. Because 5861593 was using 44/48 cpus, there was only room for two single-core jobs.

I would presume tur024 and tur023 were also busy, because the rest were scheduled on tur022.

This seems to be behaving properly to me. pack_serial_at_end means that slurm will start at the end of the node list and work its way down, which appears to be tur026 in this case.
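For reference, a minimal sketch of how this option is set (your attached slurm.conf may combine it with other SchedulerParameters values):
>SchedulerParameters=pack_serial_at_end
You can confirm what the controller is actually using with:
>scontrol show config | grep SchedulerParameters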

Let me know if you have any questions or if I am misunderstanding the issue.

-Scott
Comment 9 Renata Dart 2022-10-12 09:32:54 MDT
Hi Scott, thanks for the pointer to check both reqtres and alloctres.
I didn't realize that hyperthreading appears to be turned on on the
gpu nodes.  It looks like packing is working just fine once I check
alloctres against what I thought I submitted.  I'll pass this back
to the admin who manages the gpu nodes.

Thanks,

Renata

Comment 10 Renata Dart 2022-10-12 09:39:31 MDT
Hi again Scott, what is the way to request one core and 20GB memory
in this situation?

Renata

Comment 11 Renata Dart 2022-10-12 10:31:09 MDT
Hi Scott, in case it is needed:  

[renata@sdf-login03 mpi]$ sinfo -p neutrino -o %z
S:C:T
2:8+:2


Renata

Comment 12 Scott Hilton 2022-10-12 10:56:31 MDT
Renata,

This should do it:
>srun --mem=20G hostname
Because you use CR_Core_Memory you will always get allocations of whole cores. This next command would also give you only 1 core, because each core has 2 threads or "cpus":
>srun --mem=20G -n2 hostname
If you want 1 core per task you could use -c2:
>srun --mem=20G -n2 -c2 hostname
If you want 20G per core you could do this:
>srun --mem-per-cpu=10G hostname
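Translated into the sbatch style used earlier in this ticket, a minimal sketch of the one-core/20G request would be (under CR_Core_Memory this still allocates a whole core, i.e. 2 cpus on the tur nodes, but only the 20G of memory requested):
>#SBATCH --partition=neutrino
>#SBATCH --ntasks=1
>#SBATCH --mem=20G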
-Scott
Comment 13 Renata Dart 2022-10-12 14:24:15 MDT
Hi Scott, it looks like I cannot get just 1 cpu allocated:

[renata@sdf-login03 mpi]$ cat testcpu.sh
#!/bin/sh

#SBATCH --partition=neutrino
#SBATCH --mem=4G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#
sleep 60
date
[renata@sdf-login03 mpi]$ 
[renata@sdf-login03 mpi]$ 
[renata@sdf-login03 mpi]$ 
[renata@sdf-login03 mpi]$ sbatch testcpu.sh
Submitted batch job 5949412
[renata@sdf-login03 mpi]$ 
[renata@sdf-login03 mpi]$ 
[renata@sdf-login03 mpi]$ sacct -Xpo jobid,user,nodelist,reqtres,alloctres -j 5949412
JobID|User|NodeList|ReqTRES|AllocTRES|
5949412|renata|tur026|billing=1,cpu=1,mem=4G,node=1|billing=2,cpu=2,mem=4G,node=1|

Since we have DefMemPerCpu=4000, it looks like specifying --mem=4G does 
restrict the memory.  Without specifying --mem=4G, it wants to allocate 8000M,
I guess because it cannot give me just 1 cpu?

Renata


Comment 14 Scott Hilton 2022-10-12 15:10:36 MDT
Renata,

You cannot allocate a single cpu because you use CR_Core_Memory. You will always get allocations of whole cores, which means 2, 4, 6, etc. cpus. In this case, "cpu" means thread: each core has 2 threads due to hyperthreading.

Yes, since you get 2 cpus by default, you will by default get 8000 MB when DefMemPerCpu=4000.
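A quick sketch for double checking both settings and the resulting allocation on your side (the job id is a placeholder):
>scontrol show config | grep -E 'SelectTypeParameters|DefMemPerCPU'
>sacct -Xpo jobid,reqtres,alloctres -j <jobid>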

-Scott
Comment 15 Renata Dart 2022-10-12 15:59:31 MDT
Hi Scott, if we switched to

SelectTypeParameters=CR_ONE_TASK_PER_CORE

keeping 

SelectType=select/cons_tres
DefMemPerCpu=4000

would that then provide 1 cpu for jobs submitted to
the hyperthreaded hosts that request 1 cpu?  And if so,
would gpu jobs, or jobs submitted to cpu-only hosts, see a
change in their default cpu allocation?

Thanks,
Renata

Comment 16 Scott Hilton 2022-10-12 16:47:23 MDT
Renata,

No, CR_ONE_TASK_PER_CORE means that each task will request a core. With hyperthreading that means that a job with 4 tasks will get 4 cores, which is 8 cpus (threads).
Here is an example from my test cluster:
>$ srun -n4 hostname
>$ sacct --start=now-20minutes -o jobid,nodelist,reqtres%40,alloctres%40
>JobID               NodeList                                  ReqTRES                                AllocTRES 
>------------ --------------- ---------------------------------------- ---------------------------------------- 
>1957                   node0          billing=4,cpu=4,mem=400M,node=1          billing=8,cpu=8,mem=800M,node=1 
>1957.0                 node0                                                             cpu=8,mem=800M,node=1 
Also, the proper config for CR_ONE_TASK_PER_CORE is this:
>SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE

-Scott
Comment 17 Renata Dart 2022-10-12 21:06:35 MDT
Hi Scott, thanks for that clarification.  Is there anything I could
change so that a user could be allocated just one core
on the hyperthreaded systems?  I guess I could change the entry for
the NodeName in slurm.conf from

ThreadsPerCore=2

to 

ThreadsPerCore=1

but is there any other way?  Just want to make sure I understand
all of our options.
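
Something like this hypothetical tur entry is what I have in mind (just a sketch, not tested; CPUs would then be Sockets x CoresPerSocket):

NodeName=tur[000-026]   CPUs=24 RealMemory=191552 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 Gres=gpu:geforce_rtx_2080_ti:10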

Thanks,
Renata

Comment 18 Renata Dart 2022-10-13 09:26:23 MDT
Hi Scott, I am a bit confused about this.  Here is the user's sbatch script:

#SBATCH --job-name=reviewEDEP
#SBATCH --output=output/output-%j.txt
#SBATCH --error=error/error-%j.txt
#SBATCH --partition=neutrino
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=20g
#SBATCH --time=3:00:00
#SBATCH --exclude=ampt010

He has a few cpu jobs running now on different types of gpu hosts, all of which are 
hyperthreaded.  This shows that hyperthreading is turned on for an ampt node and a tur node:

[renata@sdf-login03 test]$ ssh ampt020 cat /sys/devices/system/cpu/smt/active
1
[renata@sdf-login03 test]$ ssh tur024 cat /sys/devices/system/cpu/smt/active
1



These are the entries in slurm.conf for them:

NodeName=ampt[000-020]   CPUs=128 RealMemory=1029344 Sockets=2 CoresPerSocket=64  ThreadsPerCore=2 Gres=gpu:a100:4
NodeName=tur[000-026]   CPUs=48 RealMemory=191552 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 Gres=gpu:geforce_rtx_2080_ti:10 



But the job running on the ampt node registers an allocation of just
1 cpu and 20G of memory:

[renata@sdf-login03 test]$ sacct -Xpo jobid,user,nodelist,reqtres,alloctres -j 5956961
JobID|User|NodeList|ReqTRES|AllocTRES|
5956961|zhulcher|ampt020|billing=1,cpu=1,mem=20G,node=1|billing=1,cpu=1,mem=20G,node=1|

while the one on the tur node shows the double allocation:

[renata@sdf-login03 test]$ sacct -Xpo jobid,user,nodelist,reqtres,alloctres -j 5956762
JobID|User|NodeList|ReqTRES|AllocTRES|
5956762|zhulcher|tur024|billing=1,cpu=1,mem=20G,node=1|billing=2,cpu=2,mem=40G,node=1|

Renata
Comment 19 Scott Hilton 2022-10-13 15:04:50 MDT
(In reply to Renata Dart from comment #18)
> But the job running on the ampt node registers an allocation of just
> 1 cpu and 20G of memory:
>
>... 
>
>while the one on the tur node shows the double allocation:

The ampt nodes have a "CPU" count equal to the core count: 2*64=128 (Sockets * CoresPerSocket = CPUs).
>NodeName=ampt[000-020]   CPUs=128 RealMemory=1029344 Sockets=2 CoresPerSocket=64  ThreadsPerCore=2 ...

The tur nodes have a "CPU" count equal to the thread count: 2*12*2=48 (Sockets * CoresPerSocket * ThreadsPerCore = CPUs).
>NodeName=tur[000-026]   CPUs=48 RealMemory=191552 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 ...

Slurm supports both situations. In the first, a "CPU" is a core; in the second, a "CPU" is a thread. See the documentation:
https://slurm.schedmd.com/slurm.conf.html#OPT_CPUs
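As a sketch, you can see which interpretation a given node ended up with by comparing the CPU total Slurm reports against the core/thread geometry:
>scontrol show node tur024 | grep -E 'CPUTot|CoresPerSocket|ThreadsPerCore'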

-Scott
Comment 20 Scott Hilton 2022-10-13 15:43:33 MDT
Renata,

What is your use case for this setup? 

Are you saying you want jobs to be allowed to allocate just 1 thread on a core instead of both, i.e. allocating half a core at a time?

Do your users want to choose with each job whether it runs multithreaded or non-multithreaded?

Do you want the "CPU" count to match the core count for accounting purposes?

-Scott
Comment 21 Renata Dart 2022-10-13 15:58:17 MDT
Hi Scott, aha, I see.  Let's say that the admin who defined the gpu
host entries in slurm.conf made a mistake and really wanted the tur
nodes to be set up like the ampt nodes (I don't know if that is the
case but want to be prepared in case that is what happened).  Would
changing the tur nodes to be CoresPerSocket=24 have an impact on the
gpu job submissions?  That is, would the gpu users see any difference
in the way their jobs were allocated resources?  And in order to 
change the tur node entry in slurm.conf could I just make that change
and then scontrol reconfig?  Or would I need to restart all of the
slurmds?

Renata

Comment 22 Renata Dart 2022-10-13 16:14:52 MDT
Hi Scott, all fair questions to which I don't know the answers.  I'm
giving feedback to the gpu admin now and will see if he needs any
further help in understanding how to configure the entries.   If the
decision is to change the turs to be like the ampt hosts, will that
require more than a restart of slurmctld and an scontrol reconfig?

Renata

Comment 23 Scott Hilton 2022-10-13 16:57:26 MDT
(In reply to Renata Dart from comment #22)
> If the
> decision is to change the turs to be like the ampt hosts, will that
> require more than a restart of slurmctld and an scontrol reconfig?
I think that should work.
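As a rough sketch of that sequence, assuming the edited slurm.conf has already been copied to every node:
>systemctl restart slurmctld
>scontrol reconfigure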
Comment 24 Scott Hilton 2022-10-31 16:17:12 MDT
Renata,

I am closing this ticket as info given. If you have questions about this specific issue feel free to reopen this ticket.

-Scott