Ticket 16495

Summary: Too many cores allocated on a node
Product: Slurm Reporter: BASC Admins <basc>
Component: Configuration Assignee: Oscar Hernández <oscar.hernandez>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: oscar.hernandez
Version: 21.08.8   
Hardware: Linux   
OS: Linux   
Site: ECODEV Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: output of scontrol show job <jobid>

Description BASC Admins 2023-04-10 20:25:22 MDT
Dear SchedMD,

I think we might have a misconfiguration.

root@mgt4:~# sinfo -V
slurm 21.08.8-2

root@mgt4:~# scontrol show node comp105
NodeName=comp105 Arch=x86_64 CoresPerSocket=24 
   CPUAlloc=48 CPUTot=48 CPULoad=16.95
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:v100:2
   NodeAddr=comp105 NodeHostName=comp105 Version=21.08.8-2
   OS=Linux 3.10.0-1160.88.1.el7.x86_64 #1 SMP Tue Mar 7 15:41:52 UTC 2023 
   RealMemory=757760 AllocMem=584704 FreeMem=396771 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=8 Owner=N/A MCS_label=N/A
   Partitions=asreml,gpu,debug 
   BootTime=2023-03-16T15:53:33 SlurmdStartTime=2023-03-16T15:56:55
   LastBusyTime=2023-04-10T18:59:53
   CfgTRES=cpu=48,mem=740G,billing=48,gres/gpu=2
   AllocTRES=cpu=48,mem=571G,gres/gpu=2
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

root@mgt4:~# squeue -w comp105 -o %i,%c
JOBID,MIN_CPUS
13819314,16
13819313,16
13819312,16
13819975,48

As per the above, 96 cores are allocated on comp105, yet the node only has 48 cores.

The node has the following config lines:

gres.conf
NodeName=comp105 Name=gpu Type=v100 File=/dev/nvidia[0-1]

nodes.conf
NodeName=comp105 Sockets=2 RealMemory=757760 Weight=8 CoresPerSocket=24 State=UNKNOWN Gres=gpu:v100:2

What might be wrong? (Note we're planning to upgrade to 23.02.1 soon.)

Thank you,
Ben
Comment 1 Oscar Hernández 2023-04-11 04:51:17 MDT
Hi Ben,

If the jobs are still running, could you share the output of the following for each of them:

$ scontrol show job $JOBID

This situation can happen when a partition is configured with OverSubscribe=YES[1] (or FORCE) in slurm.conf. Could this be your case? Could you share your partition configuration?

Do you have any kind of job preemption configured[2] in slurm.conf?

Kind regards,
Oscar

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_OverSubscribe
[2]https://slurm.schedmd.com/slurm.conf.html#OPT_PreemptMode
Comment 2 BASC Admins 2023-04-11 18:54:38 MDT
Created attachment 29791 [details]
output of scontrol show job <jobid>
Comment 3 BASC Admins 2023-04-11 19:05:02 MDT
(In reply to Oscar Hernández from comment #1)
> If jobs are still running, could you share the output of each job:
> 
> $ scontrol show job $JOBID

Hi Oscar, please find the output of the above in the attached file 'jobs_comp105.txt'.

> This situation could happen when having partition configured with
> Oversubscribe=YES[1](or force) in slurm.conf. Could this be your case? Could
> you share your partition configuration?

We don't seem to have the Oversubscribe parameter set.

> Do you have any kind of job preemption configured[2] in slurm.conf?

It appears we do - but perhaps we shouldn't.

# scontrol show config | grep -i preemp
PreemptMode             = GANG,SUSPEND
PreemptType             = preempt/partition_prio
PreemptExemptTime       = 00:00:00

Cheers,
Ben
Comment 4 Oscar Hernández 2023-04-12 05:05:36 MDT
Hi Ben,

I have been able to reproduce the behavior you are observing with preemption settings similar to yours.

It happens consistently when the following is configured:

PreemptMode             = GANG,SUSPEND
PreemptType             = preempt/partition_prio

With that configuration, when a node is shared between partitions (as comp105 is), Slurm allocates the node's full resources independently to each partition. In your case:

Jobs 13819312, 13819313, and 13819314 were submitted to partition "asreml", and 13819975 was submitted to partition "gpu".
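As a side note, you can see the partition of each job on the node directly with squeue (the %P format field prints the partition):

```
# Job ID, partition, and allocated CPU count for jobs on comp105
squeue -w comp105 -o "%i %P %C"
```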

I still need to evaluate whether this is a limitation or whether a fix is possible, as the slurm.conf man page states the following:

>NOTE: Gang scheduling is performed independently for each partition, so if you
>only want time-slicing by OverSubscribe, without any preemption, then
>configuring partitions with overlapping nodes is not recommended.  

In any case, given that you are currently on 21.08, let's try to adjust your settings to prevent this from happening. First, some questions to understand your needs:

Why do you have gang scheduling enabled? Are you interested in time-slicing or in partition preemption?

- Time-slicing lets a node run several jobs concurrently, suspending and resuming them at each configured time slice. The total time to finish the jobs is roughly the same as running them sequentially, but they all progress at a similar pace.

- Preemption directly suspends a running job so that a higher-priority one can execute. Once the higher-priority job ends, the suspended job is resumed.

Both features are quite specialized, and I would not recommend enabling them unless you already have a specific use case, especially because suspending some codes can make them crash.
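For reference, here is a minimal sketch of what each mode looks like in slurm.conf (the partition names and node ranges are illustrative, not taken from your config):

```
# Time-slicing only (gang scheduling, no preemption):
PreemptMode=GANG
PreemptType=preempt/none
PartitionName=shared Nodes=comp[001-010] OverSubscribe=FORCE:2

# Partition-based preemption (higher PriorityTier suspends lower):
PreemptMode=SUSPEND,GANG
PreemptType=preempt/partition_prio
PartitionName=low  Nodes=comp[001-010] PriorityTier=1
PartitionName=high Nodes=comp[001-010] PriorityTier=2
```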

The "preempt/partition_prio" setting suggests you were interested in preemption; however, I would then also expect an OverSubscribe setting to be configured, as well as PriorityTier values assigned to the partitions.

Could you share your current partition configurations?

Knowing those details will help us reach a configuration in which the reported over-allocation does not happen.

Kind regards,
Oscar
Comment 5 BASC Admins 2023-04-12 19:50:03 MDT
Hi Oscar,

Thank you for your extremely helpful response!

Sadly, the PreemptMode and PreemptType values were set by whoever configured the cluster when it was first installed. I don't know why those values were chosen, and they haven't been revised since. It looks like the current settings enable gang scheduling for jobs in partitions with a higher PriorityJobFactor; currently we have two such partitions, asreml and shortrun.

$ cat partitions.conf 
# Note that DefMemPerCPU is also set in slurm.conf
# default partition
PartitionName=DEFAULT	Nodes=comp[001-110,501,503-523] DefMemPerCPU=8192 DefaultTime=24:00:00 MaxTime=365-0 DisableRootJobs=yes
PartitionName=batch	Nodes=comp[001-104] Default=YES State=DOWN PriorityJobFactor=100
PartitionName=asreml	Nodes=comp[001-110] State=DOWN PriorityJobFactor=150
PartitionName=bulls1k	Nodes=comp[001-104,107-110,501,503-523] State=DOWN PriorityJobFactor=100
PartitionName=gydle	Nodes=comp[001-062] State=DOWN PriorityJobFactor=100
PartitionName=shortrun	Nodes=comp[001-104] State=DOWN DefaultTime=2:00:00 MaxTime=6:00:00 PriorityJobFactor=500
PartitionName=gpu Nodes=comp105 State=DOWN PriorityJobFactor=100
PartitionName=shortgpu Nodes=comp106 State=DOWN DefaultTime=2:00:00 MaxTime=5-0 PriorityJobFactor=500
PartitionName=epyc	Nodes=comp[107-110] State=DOWN PriorityJobFactor=100
PartitionName=haswell	Nodes=comp[501,503-523] State=DOWN PriorityJobFactor=100
PartitionName=debug	Nodes=comp[001-110,501,503-523] State=UP AllowGroups=admin_g PriorityJobFactor=10000

Reading the man page for slurm.conf, it looks like we should switch to 

PreemptMode = OFF
PreemptType = preempt/none

We'll also edit the asreml partition as it shouldn't contain comp105 or comp106 (this was a mistake made in a recent change).
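For the record, this is roughly how we plan to apply the change (assuming a systemd-managed slurmctld; since changing PreemptType swaps a plugin, a controller restart may be required rather than just a reconfigure):

```
# In slurm.conf:
#   PreemptMode=OFF
#   PreemptType=preempt/none
# Then, on the controller:
scontrol reconfigure          # picks up most slurm.conf changes
systemctl restart slurmctld   # needed when the preemption plugin changes
```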

Thanks again for all your help,
Ben
Comment 6 Oscar Hernández 2023-04-13 04:00:24 MDT
Hi Ben,
 
> Thank you for your extremely helpful response!
You are welcome!

>It looks like the current settings enable gang scheduling for jobs in a 
>partition with a higher PriorityJobFactor. Currently we have two such 
>partitions, asreml & shortrun.
Well, I am not sure this has ever worked as intended. The relevant parameter for partition-based preemption is PriorityTier[1]. PriorityJobFactor is only used for job prioritization and has no effect on preemption.
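For illustration, had partition preemption been the goal, the partitions would have needed explicit PriorityTier values, e.g. (partition names taken from your config, tier values hypothetical):

```
PartitionName=shortrun Nodes=comp[001-104] PriorityTier=2 ...
PartitionName=batch    Nodes=comp[001-104] PriorityTier=1 ...
```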

> Reading the man page for slurm.conf, it looks like we should switch to 
> 
> PreemptMode = OFF
> PreemptType = preempt/none
Since none of your partitions are configured with PriorityTier or OverSubscribe, I do not think you were actually benefiting from the preemption configuration. So yes, I agree with disabling it; that should make your config more coherent.

> We'll also edit the asreml partition as it shouldn't contain comp105 or
> comp106 (this was a mistake made in a recent change).
Great. I am fairly confident that with these fixes you will not experience the over-allocation again.

I will leave the ticket open until you are able to test the suggested changes; we can close it once you confirm things work as intended.

Cheers,
Oscar

[1]https://slurm.schedmd.com/slurm.conf.html#OPT_PriorityTier
Comment 7 BASC Admins 2023-04-16 19:06:33 MDT
Hi Oscar,

We are no longer experiencing this problem. Thank you for all your help!

Cheers,
Ben
Comment 8 Oscar Hernández 2023-04-17 02:32:53 MDT
Great! 
Closing ticket then :)