Ticket 10391

Summary: Packing nodes
Product: Slurm
Reporter: Tom Wurgler <twurgl>
Component: Configuration
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
CC: mcmullan, thu-ha.tran
Version: 20.02.4
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=12605
Site: Goodyear
Version Fixed: 21.08.0pre1
Attachments: log after setting the variables

Description Tom Wurgler 2020-12-08 09:18:32 MST
When we started working with Slurm, all our tests showed that Slurm would "pack" nodes.  In other words, if our nodes have 24 cores and I submitted a 32-core job, it would use all of one node and 8 cores on a second node.

Now either we changed something or Slurm changed.  If I submit the 32-core job now, it splits it across 2 nodes with 16 cores each.

What we'd like is the former behavior, so that if another job comes along it can use the remaining cores on that second node.
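For illustration, the packed layout described above can be worked out with plain shell arithmetic (a sketch; only the 24-core node size comes from this cluster):

```shell
# Packed placement for a 32-task job on 24-core nodes.
# Pure arithmetic; no Slurm commands are run here.
ntasks=32
cores_per_node=24
full_nodes=$((ntasks / cores_per_node))   # nodes filled completely
remainder=$((ntasks % cores_per_node))    # cores used on one more node
echo "packed: ${full_nodes} full node(s) + ${remainder} cores on another node"
```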

Here is our slurm-common.conf (included from slurm.conf):
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.

# TmpFS file system (default /tmp)
TmpFS=/scratch

# Partition limits always enforced at submission time
EnforcePartLimits=ALL

# added by pqd
# Sets the amount of time backup will wait before switching
SlurmctldTimeout=30

# added by tcw
PluginDir=/usr/local/slurm/lib/slurm

# Activate the Multi-factor Job Priority Plugin with decay
PriorityType=priority/multifactor
#PriorityFlags=FAIR_TREE # default since 19.05

# 2 week half-life
#PriorityDecayHalfLife=14-0  # DEFAULT: 7 days

# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO

# The job's age factor reaches 1.0 after waiting in the queue for 2 weeks.
PriorityMaxAge=14-0

# This next group determines the weighting of each of the
# components of the Multi-factor Job Priority Plugin.
# Using examples from https://slurm.schedmd.com/priority_multifactor.html
PriorityWeightAge=1000         # x Job Age factor (0-1)
PriorityWeightFairshare=10000  # x FairshareTree factor
PriorityWeightJobSize=1000     # x Job size    
PriorityWeightPartition=1000   # x normalized partition prio (configured on partition)
PriorityWeightQOS=0            # don't use qos factor

AccountingStorageEnforce=limits,qos

#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#

# Danny from SchedMD says set this to cgroup
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup

ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/apps/local/slurm/spool
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=Cores
TaskProlog=/u/rds4020/t901353/set_slurm_display
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
#SchedulerType=sched/backfill  # needs users to specify runtime!!!!
SchedulerType=sched/builtin
SelectType=select/cons_tres # for gpu
SelectTypeParameters=CR_Core,CR_Pack_Nodes

# LICENSES
#Licenses=acfd_3:3,abaqus:390,standard:390
Licenses=catia2acis:1
# You can view the following with: sacctmgr show tres
# You can view the following with: sreport -T license/abaqus@rdsabalic 
#               sreport: cluster AccountUtilizationByUser
#NAH# AccountingStorageTRES=license/abaqus@gisabalic,license/abaqus@rdsabalic,license/acfd_3@ansyslic,license/anshpc_pack@ansyslic,license/catia2acis
# This is for tracking abaqus licenses per user within slurm itself (in theory :-)
AccountingStorageTRES=license/abaqus@rdsabalic

# trying to fix locked mem issues
PropagateResourceLimitsExcept=ALL

#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.%n.log

# This plugin keeps linlarge from spanning physical clusters
TopologyPlugin=topology/tree

# Generic Resources
GresTypes=gpu,fv

# Plugins
#JobSubmitPlugins=lua
# out for 8.2....CliFilterPlugins=lua

#Include /apps/local/slurm/conf/slurm_nodes_and_partitions


Thanks for any guidance on how to pack nodes effectively.
-tom
Comment 1 Tim McMullan 2020-12-09 09:50:20 MST
Hey Tom,

Can you provide an example of a job submission that doesn't behave as expected?  I'm interested in what options are getting passed in so I can properly replicate the situation!

I did notice "CR_Pack_Nodes" which should (and so far in my testing seems to work) compact tasks like that if the tasks being launched don't consume the whole allocation.

Thanks!
--Tim
Comment 2 Tom Wurgler 2020-12-09 13:42:24 MST
We run an FEA code (that I can't share).
But I think if you use our slurm-common.conf with a slurm.conf that has the partitions etc. in it and includes the common file,
then submit a job requiring more than one node's worth of cores, you should see the problem.
It just did the same kind of thing to another user here on another code.  He asked for 960 cores and we have 24 cores per node.
So ideally he should have gotten 40 nodes.  He actually got 41; it used most of the cores, left some unused, and just added another node.

The ideal situation would be that a job uses whole nodes if available, and is only split across partial nodes when it has to be (i.e., to avoid pending).
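The 960-core example works out as follows (a quick sketch, using ceiling division in pure shell):

```shell
# Minimum node count for a 960-task job on 24-core nodes:
# a perfectly packed allocation needs ceil(960 / 24) nodes.
ntasks=960
cores_per_node=24
nodes=$(( (ntasks + cores_per_node - 1) / cores_per_node ))
echo "$nodes"
```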

We can arrange a call if I am not very clear.

Thanks
tom

Comment 3 Tom Wurgler 2020-12-11 06:26:53 MST
We found our issue.  Last September we changed

SelectType=select/cons_res

to

SelectType=select/cons_tres

And cons_tres doesn't pack the way cons_res did.  We put it back to cons_res and it works as before.

Is this expected behavior?

Thanks!



Comment 4 Tom Wurgler 2020-12-14 09:06:02 MST
Hi

OK, we really want cons_tres for use with GPUs.

If we have cons_tres, it splits jobs evenly across nodes.
If we have cons_res, it fills a node, then uses part of another node if needed.
Then the next job fills that unused portion of the node.  This cons_res behavior is what we want, but we need cons_tres for the GPUs.

What combination of settings can give us cons_res-style packing with cons_tres?

This is a pretty big thing for us, so I raised the priority level.

Thanks!
tom

We can have a Teams meeting if we need to.
Thanks
Comment 5 Tim McMullan 2020-12-14 10:00:32 MST
Hey Tom,

Sorry about the delay but I was out of the office at the end of the week so didn't make much progress there.

I've been able to replicate it, and I'm currently examining the logic around this choice to see exactly how it's making this decision, at which point I should have a better handle on it.

Thanks!
--Tim
Comment 7 Tim McMullan 2020-12-15 13:08:51 MST
Hey Tom,

Just wanted to update you on this - after looking through the code and chatting we've determined this is not the desired behavior.  Right now, it doesn't appear that there is a good way to get the old allocation method back without code changes.  I'm looking into some possible workarounds until a patch can be made.  I'll update you with more when I can!

Thanks!
--Tim
Comment 9 Tim McMullan 2020-12-22 08:04:45 MST
Hey Tom,

I've been looking at this a lot and unfortunately there doesn't seem to be a clear way out here.  I have a suggestion though that might get you similar behavior until a proper fix can be implemented.

Can you try using "plane" distribution for now?  E.g., a job submission like "srun -m plane=12 --cpus-per-task=2 --ntasks=18", when allocating on empty 24-core nodes, should result in 1 full node and 1 half-full node, as it would with cons_res.  It's worth noting that the tasks might not be organized the same way as they would have been under cons_res.  You can also just specify something like "-m plane=24" so that, regardless of the CPU/task count, nodes are filled before moving on to a different node.
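As a simplified model of the plane distribution described above (tasks are dealt out in blocks of `plane` tasks per node per pass), the expected layout for the example can be computed in pure shell; no srun is invoked, and the node names are hypothetical:

```shell
# Expected layout for: srun -m plane=12 --cpus-per-task=2 --ntasks=18
# on empty 24-core nodes (the allocation spans 2 nodes).
ntasks=18
cpus_per_task=2
plane=12
# First plane of 12 tasks lands on node 1; the remaining 6 go to node 2.
node1_tasks=$(( ntasks < plane ? ntasks : plane ))
node2_tasks=$(( ntasks - node1_tasks ))
echo "node1: $((node1_tasks * cpus_per_task)) cores, node2: $((node2_tasks * cpus_per_task)) cores"
```

This matches the "1 full node and 1 half-full node" outcome described in the comment.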

Can you try this out for a few of your jobs and let me know if it is a functional workaround for you?

Thanks,
--Tim
Comment 11 Tim McMullan 2021-01-05 09:00:39 MST
Hey Tom,
 
I just wanted to check in and see if you had a chance to look at the workaround I suggested!

Thanks,
--Tim
Comment 12 Tom Wurgler 2021-01-05 09:02:26 MST
I'm sorry, but not yet. Been off for 3 weeks and just ramping up again...
Will this be fixed in a future version?
If so, any idea when that version might come out?


Comment 13 Tim McMullan 2021-01-05 09:12:30 MST
No worries, I hope you had a good vacation!

We are looking at fixing this in a future release.  I can't confirm when a fix would come out at the moment, but I will try to get that figured out for you.

Thanks!
--Tim
Comment 15 Tom Wurgler 2021-02-03 11:45:04 MST
I am back at looking at this.  Sorry for the delay.

Using -m plane=24, I submitted 3 jobs.  Note our cluster has 2 sockets, 12 cores per socket.
Job 1 - 8 core job
Job 2 - 8 core job
Job 3 - 32 core job

and we get this:
 JOBID      USER   STATE     PARTITION   EXEC_HOST       SUBMIT_TIME     JOB_NAME    
256840   t901353  RUNNING     linlarge   8*rdsxen129     Feb 03 13:26    batch_script_7942.sh
256841   t901353  RUNNING     linlarge   8*rdsxen129     Feb 03 13:28    batch_script_8799.sh
256842   t901353  RUNNING     linlarge   7*rdsxen129:24*rdsxen130:1*rdsxen131
                                          (  32 cpus)    Feb 03 13:29    batch_script_9127.sh


Note that desired behavior is 8 on xen129, 8 more on xen129 and 8 more on xen129 before moving to xen130.

So no, -m plane isn't giving us what we want.  Should I have put a different number than 24 in plane=24?

And not to mention we have other clusters with different numbers of cores per node.

Any more thoughts here?

Is a potential fix in the source code being planned?

So I am not sure what we need out of cons_tres for using GPUs.  Can we not take advantage of GPUs if we don't have cons_tres set?  Can any GPU advantages be used via the command line, or possibly on a partition of GPU nodes, with those advantages applied to the partition or per job?

Just trying to find a way around this issue.
thanks
tom
Comment 17 Tim McMullan 2021-02-10 12:47:36 MST
(In reply to Tom Wurgler from comment #15)
> I am back at looking at this.  Sorry for the delay.

No worries, I'm glad you have been able to test the workaround!

> Using -m plane=24, I submitted 3 jobs.  Note our cluster has 2 sockets, 12
> cores per socket.
> Job 1 - 8 core job
> Job 2 - 8 core job
> Job 3 - 32 core job
> 
> and we get this:
>  JOBID      USER   STATE     PARTITION   EXEC_HOST       SUBMIT_TIME    
> JOB_NAME    
> 256840   t901353  RUNNING     linlarge   8*rdsxen129     Feb 03 13:26   
> batch_script_7942.sh
> 256841   t901353  RUNNING     linlarge   8*rdsxen129     Feb 03 13:28   
> batch_script_8799.sh
> 256842   t901353  RUNNING     linlarge   7*rdsxen129:24*rdsxen130:1*rdsxen131
>                                           (  32 cpus)    Feb 03 13:29   
> batch_script_9127.sh

That's a strange result.  I haven't been able to replicate that behavior locally; it seems to assign correctly for me.  I'll try to see if there is something else going on.

> Note that desired behavior is 8 on xen129, 8 more on xen129 and 8 more on
> xen129 before moving to xen130.

Totally understand that :)

> So no, -m plane isn't giving us what we want.  So should I have put a
> different number than 24 in the plane=24?

> And not to mention we have other clusters with different numbers of cores
> per node.

You can put an arbitrarily large number in plane=$count and it will split based on the actual job/node size, so you don't have to worry about it as much.  The count could just be the largest per-node CPU count in the cluster.

> Any more thoughts here?

If your above test is repeatable, it would be useful to have the slurmctld log with "SlurmctldDebug=debug2" and "DebugFlags=SelectType" set in slurm.conf, followed by an "scontrol reconfigure".  You can also enable those with:

scontrol setdebug debug2
scontrol setdebugflags +SelectType

We would need the logs from just before the jobs were submitted until some time after they were scheduled.

> Is a potential fix in the source code being planned?

The ticket has been re-assigned to Dominik, who is already working in this code.  He is looking at possible fixes.

> So I am not sure what we need out of con_tres for using gpus.  Can we not
> take advantage of gpus if we dont' have cons_tres set?  Can any gpu
> advanatages be used via the command line or possibly on a partition of gpu
> nodes and those advantages applied to the partition or per job?

With cons_res, you can still schedule with GPUs, but it's not as easy at job submit time.  You can do something like "--gres=gpu:1" for a job and it will schedule it, and attempt to be smart about NUMA awareness if you've given it that information.

--cpus-per-gpu, --gpus, --gpu-bind, --gpu-freq, --gpus-per-node, --gpus-per-socket, --gpus-per-task, and --mem-per-gpu are all cons_tres specific.  I'm not sure what your jobs' GPU requirements are or how much of an issue this would be, but it's an option.
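As a sketch of the cons_res-compatible approach described above, a batch script could request a GPU with --gres alone, avoiding the cons_tres-only flags (the script contents and application name are hypothetical):

```shell
#!/bin/bash
# Hypothetical job script: a GPU request that works under select/cons_res.
#SBATCH --ntasks=8
#SBATCH --gres=gpu:1   # generic-resource request; does not require cons_tres
srun ./gpu_app         # placeholder application
```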

Thanks!
--Tim
Comment 20 Tom Wurgler 2021-02-12 07:48:49 MST
I redid my tests with "-m plane=40".
Note I just made up this number.

And thus far it is doing the correct thing.

I am going to install the change to our standard script and watch how the packing goes.

We are continuing to test.
thanks 
tom
Comment 21 Tom Wurgler 2021-02-15 08:17:39 MST
Adding the -m plane to the standard scripts still did not pack efficiently.
thanks
tom
Comment 22 Dominik Bartkiewicz 2021-02-17 02:56:50 MST
Hi

Could you send us the logs requested in comment 17?

Dominik
Comment 23 Tom Wurgler 2021-02-22 07:23:43 MST
Created attachment 18048 [details]
log after setting the variables
Comment 24 Tom Wurgler 2021-02-22 07:25:18 MST
scontrol setdebug debug2
scontrol setdebugflags +SelectType

Not exactly reproducible

l:t901353@rds4020:slurm5 > sjobs
 JOBID      USER   STATE     PARTITION   EXEC_HOST       SUBMIT_TIME     JOB_NAME    
257920   t901353  RUNNING     linlarge   8*rdsxen65      Feb 22 08:56    sierra_batch_script_8363.sh
257921   t901353  RUNNING     linlarge   8*rdsxen65      Feb 22 08:56    sierra_batch_script_8743.sh
257922   t901353  RUNNING     linlarge   8*rdsxen65:24*rdsxen66
                                          (  32 cpus)    Feb 22 08:57    sierra_batch_script_8976.sh
257923   t901353  RUNNING     linlarge   24*rdsxen259:8*rdsxen260
                                          (  32 cpus)    Feb 22 08:58    sierra_batch_script_9371.sh
257924   t901353  RUNNING     linlarge   4*rdsxen324:4*rdsxen325:4*rdsxen326:4*rdsxen327:16*rdsxen328
                                          (  32 cpus)    Feb 22 08:58    sierra_batch_script_9612.sh
l:t901353@rds4020:slurm5 >

Still weird splitting
thanks
Comment 26 Dominik Bartkiewicz 2021-02-25 07:23:27 MST
Hi

Slurmctld spread job 257924 over five nodes because it fills up the partially allocated nodes rdsxen[324-327] first.
It looks strange, but this is correct behavior.

I also have some updates on the primary issue of this bug.
Unfortunately, the fix will only be available in 21.08.
It requires a significant change in cons_tres, which we can't make in 20.11.

Could we drop the severity of this bug to 3?

Dominik
Comment 27 Tom Wurgler 2021-03-01 09:11:56 MST
It may be correct behavior, but overall it is not desirable behavior.

This is still a big deal for us, even without data to back it up.  We think having jobs spread out like this will hurt performance.

So I am going to run a series of tests to try to judge the effect of the current packing.

As for lowering the severity: does the work on fixing it slow down then?



Comment 28 Dominik Bartkiewicz 2021-03-02 08:55:53 MST
Hi

Topology-aware resource allocation minimizes the number of switches used and simply takes nodes in order of bitmap position.
Other flavors of allocation (the _eval_nodes_*() functions) work differently.
But generally, by default, Slurm tries to use partially allocated nodes first.  This behavior can be changed with the LLN option.
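For reference, LLN is a per-partition setting; a hypothetical partition line (the node list is invented) would look like the following.  Note that it does the opposite of packing, preferring the least-loaded nodes:

```
# Hypothetical example only: LLN=YES spreads jobs to the least-loaded nodes
PartitionName=linlarge Nodes=rdsxen[1-400] LLN=YES
```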

This fix will definitely be included in 21.08, and the severity level of this bug doesn't change that.

Dominik
Comment 29 Tom Wurgler 2021-03-08 05:31:10 MST
Thanks for working on this.  We look forward to the 21.08 release.
I am lowering the severity back to 3.
-tom
Comment 38 Dominik Bartkiewicz 2021-07-28 09:00:15 MDT
Hi

In 21.08 cons_tres should "pack" nodes as expected.
https://github.com/SchedMD/slurm/compare/a3ff3d71cc20f...04f94c6a2229e6

Please reopen if anything else is found on this.

Dominik
Comment 39 Marcin Stolarek 2021-10-20 07:19:33 MDT
*** Ticket 12605 has been marked as a duplicate of this ticket. ***