| Summary: | Too many cores allocated on a node | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | BASC Admins <basc> |
| Component: | Configuration | Assignee: | Oscar Hernández <oscar.hernandez> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | oscar.hernandez |
| Version: | 21.08.8 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | ECODEV | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | output of scontrol show job <jobid> | ||
|
Description
BASC Admins
2023-04-10 20:25:22 MDT
Hi Ben,

If jobs are still running, could you share the output of each job:

$ scontrol show job $JOBID

This situation could happen when having a partition configured with Oversubscribe=YES[1] (or FORCE) in slurm.conf. Could this be your case? Could you share your partition configuration?

Do you have any kind of job preemption configured[2] in slurm.conf?

Kind regards,
Oscar

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_OverSubscribe
[2] https://slurm.schedmd.com/slurm.conf.html#OPT_PreemptMode

Created attachment 29791 [details]
output of scontrol show job <jobid>
(In reply to Oscar Hernández from comment #1)
> If jobs are still running, could you share the output of each job:
>
> $ scontrol show job $JOBID

Hi Oscar, please find the output of the above in the attached file 'jobs_comp105.txt'.

> This situation could happen when having a partition configured with
> Oversubscribe=YES[1] (or FORCE) in slurm.conf. Could this be your case? Could
> you share your partition configuration?

We don't seem to have the Oversubscribe parameter set.

> Do you have any kind of job preemption configured[2] in slurm.conf?

It appears we do - but perhaps we shouldn't.

# scontrol show config | grep -i preemp
PreemptMode             = GANG,SUSPEND
PreemptType             = preempt/partition_prio
PreemptExemptTime       = 00:00:00

Cheers,
Ben

Hi Ben,
I have been able to reproduce the behavior you are observing with preemption settings similar to yours.
It happens consistently with the following configured:
PreemptMode = GANG,SUSPEND
PreemptType = preempt/partition_prio
Then, when a node is shared between partitions (the case of comp105), Slurm allocates its resources to each partition independently. In your case:
Jobs 13819312, 13819313, 13819314 were submitted to partition "asreml". And 13819977 was submitted to partition "gpu".
I still need to evaluate whether this is a limitation or whether a fix is possible, since the slurm.conf man page states the following:
>NOTE: Gang scheduling is performed independently for each partition, so if you
>only want time-slicing by OverSubscribe, without any preemption, then
>configuring partitions with overlapping nodes is not recommended.
In any case, given that you are currently on 21.08, let's try to adjust the settings to prevent this from happening. A few questions first, to understand your needs:
Why do you have gang scheduling enabled? Are you interested in time-slicing or in partition preemption?
- Time slicing allows a node to execute many jobs simultaneously, running and suspending each in turn every configured time slice. The total time to complete both jobs is the same as running them sequentially, but they progress at a similar pace.
- Preemption is used to directly suspend a job while a higher-priority one executes. Once the higher-priority job ends, the suspended job is resumed.
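For context, here is a minimal sketch of what a pure time-slicing setup might look like in slurm.conf (the partition name "shared" and node range are hypothetical, not from your cluster):

```conf
# Time-slicing via gang scheduling: jobs sharing a node alternate
# (run/suspend) every SchedulerTimeSlice seconds, with no preemption
# between partitions.
PreemptMode=GANG
PreemptType=preempt/none
SchedulerTimeSlice=30
PartitionName=shared Nodes=comp[001-010] OverSubscribe=FORCE:2
```

OverSubscribe=FORCE:2 allows up to two jobs per resource, which gang scheduling then time-slices.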
Both settings are quite specific, and I wouldn't recommend enabling them unless you have a concrete use case, especially because suspending some codes can make them crash.
The "preempt/partition_prio" setting suggests you were interested in preemption. However, I would then also expect an oversubscription setting to be configured, as well as PriorityTier values assigned to the partitions.
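To illustrate the kind of settings I would expect alongside preempt/partition_prio (again with hypothetical partition names "low" and "high"), something along these lines:

```conf
# Partition-priority preemption: jobs in "high" (PriorityTier=2) can
# suspend jobs in "low" (PriorityTier=1) on the nodes they share.
# SUSPEND requires GANG in PreemptMode.
PreemptMode=SUSPEND,GANG
PreemptType=preempt/partition_prio
PartitionName=low  Nodes=comp[001-010] PriorityTier=1
PartitionName=high Nodes=comp[001-010] PriorityTier=2
```

Without PriorityTier differences between overlapping partitions, preempt/partition_prio has nothing to act on.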
Could you share your current partition configurations?
Knowing those details will help us reach a configuration in which the reported resource over-allocation does not happen.
Kind regards,
Oscar
Hi Oscar,

Thank you for your extremely helpful response! Sadly, the settings we had for PreemptMode and PreemptType were set by someone configuring the cluster when it was first installed. I don't know why those values were chosen, and they haven't been revised since.

It looks like the current settings enable gang scheduling for jobs in a partition with a higher PriorityJobFactor. Currently we have two such partitions, asreml & shortrun.

$ cat partitions.conf
# Note that DefMemPerCPU is also set in slurm.conf
# default partition
PartitionName=DEFAULT Nodes=comp[001-110,501,503-523] DefMemPerCPU=8192 DefaultTime=24:00:00 MaxTime=365-0 DisableRootJobs=yes
PartitionName=batch Nodes=comp[001-104] Default=YES State=DOWN PriorityJobFactor=100
PartitionName=asreml Nodes=comp[001-110] State=DOWN PriorityJobFactor=150
PartitionName=bulls1k Nodes=comp[001-104,107-110,501,503-523] State=DOWN PriorityJobFactor=100
PartitionName=gydle Nodes=comp[001-062] State=DOWN PriorityJobFactor=100
PartitionName=shortrun Nodes=comp[001-104] State=DOWN DefaultTime=2:00:00 MaxTime=6:00:00 PriorityJobFactor=500
PartitionName=gpu Nodes=comp105 State=DOWN PriorityJobFactor=100
PartitionName=shortgpu Nodes=comp106 State=DOWN DefaultTime=2:00:00 MaxTime=5-0 PriorityJobFactor=500
PartitionName=epyc Nodes=comp[107-110] State=DOWN PriorityJobFactor=100
PartitionName=haswell Nodes=comp[501,503-523] State=DOWN PriorityJobFactor=100
PartitionName=debug Nodes=comp[001-110,501,503-523] State=UP AllowGroups=admin_g PriorityJobFactor=10000

Reading the man page for slurm.conf, it looks like we should switch to

PreemptMode = OFF
PreemptType = preempt/none

We'll also edit the asreml partition as it shouldn't contain comp105 or comp106 (this was a mistake made in a recent change).

Thanks again for all your help,
Ben

Hi Ben,

> Thank you for your extremely helpful response!

You are welcome!

> It looks like the current settings enable gang scheduling for jobs in a
> partition with a higher PriorityJobFactor. Currently we have two such
> partitions, asreml & shortrun.

Well, I am not sure this has ever worked as intended. The relevant priority parameter for job preemption is PriorityTier[1]. PriorityJobFactor is only used for job prioritization and has no effect on preemption.

> Reading the man page for slurm.conf, it looks like we should switch to
>
> PreemptMode = OFF
> PreemptType = preempt/none

Since you do not seem to have any partition configured with PriorityTier or oversubscription, I do not think you were benefiting from any preemption configuration. So yes, I agree on disabling it; that should make your config more coherent.

> We'll also edit the asreml partition as it shouldn't contain comp105 or
> comp106 (this was a mistake made in a recent change).

Great then. I am pretty sure that with these fixes you will not experience the over-allocation again. I will leave the ticket open until you are able to test the suggested changes; we can close it once you confirm things work as intended.

Cheers,
Oscar

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_PriorityTier

Hi Oscar,

We are no longer experiencing this problem. Thank you for all your help!

Cheers,
Ben

Great! Closing ticket then :)