When we started working with Slurm, all our tests showed that Slurm would "pack" a node. In other words, if our nodes have 24 cores and I submit a 32-core job, it uses all of one node and 8 cores on a second node. Now either we changed something or Slurm changed: if I submit the 32-core job now, it splits it across 2 nodes with 16 cores each. What we'd like is the former behavior, so that if another job comes along it can use the remaining cores on that second node.

Here is our slurm-common.conf (included from slurm.conf):

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.

# TmpFS file system (default /tmp)
TmpFS=/scratch

# Partition limits always enforced at submission time
EnforcePartLimits=ALL

# added by pqd
# Sets the amount of time backup will wait before switching
SlurmctldTimeout=30

# added by tcw
PluginDir=/usr/local/slurm/lib/slurm

# Activate the Multi-factor Job Priority Plugin with decay
PriorityType=priority/multifactor
#PriorityFlags=FAIR_TREE       # default since 19.05
# 2 week half-life
#PriorityDecayHalfLife=14-0    # DEFAULT: 7 days
# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO
# The job's age factor reaches 1.0 after waiting in the queue for 2 weeks.
PriorityMaxAge=14-0

# This next group determines the weighting of each of the
# components of the Multi-factor Job Priority Plugin.
# Using examples from https://slurm.schedmd.com/priority_multifactor.html
PriorityWeightAge=1000        # x Job Age factor (0-1)
PriorityWeightFairshare=10000 # x Fairshare Tree factor
PriorityWeightJobSize=1000    # x Job size
PriorityWeightPartition=1000  # x normalized partition prio (configured on partition)
PriorityWeightQOS=0           # don't use qos factor

AccountingStorageEnforce=limits,qos
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
# Danny from SchedMD says set this to cgroup
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/apps/local/slurm/spool
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=Cores
TaskProlog=/u/rds4020/t901353/set_slurm_display
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
# SCHEDULING
#SchedulerType=sched/backfill  # needs users to specify runtime!!!!
SchedulerType=sched/builtin
SelectType=select/cons_tres   # for gpu
SelectTypeParameters=CR_Core,CR_Pack_Nodes

# LICENSES
#Licenses=acfd_3:3,abaqus:390,standard:390
Licenses=catia2acis:1
# You can view the following with: sacctmgr show tres
# You can view the following with: sreport -T license/abaqus@rdsabalic
#   sreport: cluster AccountUtilizationByUser
#NAH# AccountingStorageTRES=license/abaqus@gisabalic,license/abaqus@rdsabalic,license/acfd_3@ansyslic,license/anshpc_pack@ansyslic,license/catia2acis
# This is for tracking abaqus licenses per user within slurm itself (in theory :-)
AccountingStorageTRES=license/abaqus@rdsabalic

# trying to fix locked mem issues
PropagateResourceLimitsExcept=ALL

#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.%n.log

# This plugin keeps linlarge from spanning physical clusters
TopologyPlugin=topology/tree

# Generic Resources
GresTypes=gpu,fv

# Plugins
#JobSubmitPlugins=lua
# out for 8.2....CliFilterPlugins=lua

#Include /apps/local/slurm/conf/slurm_nodes_and_partitions

Thanks for any guidance on how to pack nodes effectively.
-tom
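For reference, a minimal reproduction along the lines Tom describes might look like the following batch script (a sketch only; the job name and application are hypothetical, and it assumes a cluster of idle 24-core nodes in the linlarge partition):

```shell
#!/bin/bash
#SBATCH --job-name=pack-test    # hypothetical job name
#SBATCH --ntasks=32             # 32 cores on a cluster of 24-core nodes
#SBATCH --partition=linlarge

# Print how many tasks landed on each node
srun hostname | sort | uniq -c
```

With the old cons_res packing behavior this should report 24 tasks on one node and 8 on a second; the behavior reported here is a roughly even split (e.g. 16/16) across two nodes.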
Hey Tom,

Can you provide an example of a job submission that doesn't behave as expected? I'm interested in what options are being passed in so I can properly replicate the situation. I did notice "CR_Pack_Nodes", which should compact tasks like that (and so far in my testing it seems to) as long as the tasks being launched don't consume the whole allocation.

Thanks!
--Tim
We run an FEA code (that I can't share). But I think if you use our slurm-common.conf with a slurm.conf that has the partitions etc. in it and includes the common file, then submit a job requiring more than one node's worth of cores, you should see the problem.

It just did the same kind of thing to another user here with another code. He asked for 960 cores, and we have 24 cores per node, so ideally he should have gotten 40 nodes. He actually got 41: it used most of the cores, left some unused, and just added another node.

The ideal situation would be that a job uses whole nodes when they are available, and only splits up when it has to (i.e., to avoid pending). We can arrange a call if I am not being clear.

Thanks
tom
We found our issue. Last September we changed

SelectType=select/cons_res

to

SelectType=select/cons_tres

and cons_tres doesn't do the packing the way cons_res did. Putting it back to cons_res makes it work as it did before. Is this expected behavior?

Thanks!
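For anyone comparing the two plugins on a running cluster, the active selection plugin and its parameters can be checked with scontrol (sketch; exact output formatting varies by Slurm version):

```shell
# Show which node-selection plugin and parameters slurmctld is using
scontrol show config | grep -Ei 'SelectType'
```

This matches both the SelectType and SelectTypeParameters lines, which is enough to confirm whether cons_res or cons_tres is in effect and whether CR_Pack_Nodes is set.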
Hi,

OK, we really want cons_tres for use with GPUs. With cons_tres, Slurm splits jobs evenly across nodes. With cons_res, it fills a node and then uses part of another node if needed; the next job then fills the unused portion of that node. This cons_res behavior is what we want, but we need cons_tres for the GPUs. What combination of settings can give us cons_res-style packing with cons_tres?

This is a pretty big thing for us, so I raised the priority level. We can have a Teams meeting if we need to.

Thanks!
tom
Hey Tom,

Sorry about the delay; I was out of the office at the end of the week, so I didn't make much progress. I've been able to replicate the behavior, and I'm currently examining the logic around this choice to see exactly how it's making this decision, at which point I should have a better handle on it.

Thanks!
--Tim
Hey Tom,

Just wanted to update you on this: after looking through the code and chatting, we've determined this is not the desired behavior. Right now, it doesn't appear that there is a good way to get the old allocation method back without code changes. I'm looking into some possible workarounds until a patch can be made. I'll update you with more when I can!

Thanks!
--Tim
Hey Tom,

I've been looking at this a lot, and unfortunately there doesn't seem to be a clear way out here. I do have a suggestion that might get you similar behavior until a proper fix can be implemented. Can you try using "plane" distribution for now? E.g., a job submission like "srun -m plane=12 --cpus-per-task=2 --ntasks=18", when allocating on empty 24-core nodes, should result in 1 full node and 1 half-full node, like it would with cons_res. It's worth noting that the tasks might not be organized the same way as they would have been with cons_res. You can also just specify something like "-m plane=24" so that, regardless of the cpu/task count, nodes should be filled before moving on to a different node.

Can you try this out for a few of your jobs and let me know if it is a functional workaround for you?

Thanks,
--Tim
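In batch-script form, the suggested workaround might look like this (a sketch assuming 24-core nodes; "-m" is the short form of "--distribution" per the sbatch/srun man pages, and the application name is hypothetical):

```shell
#!/bin/bash
#SBATCH --ntasks=36
#SBATCH --distribution=plane=24   # fill 24 tasks on a node before moving to the next

srun ./my_app    # hypothetical application
```

With an empty cluster this should place 24 tasks on the first node and 12 on a second, rather than splitting 18/18.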
Hey Tom, I just wanted to check in and see if you had a chance to look at the workaround I suggested! Thanks, --Tim
I'm sorry, but not yet. I've been off for 3 weeks and am just ramping up again... Will this be fixed in a future version? If so, any idea when that version might come out?
No worries, I hope you had a good vacation! We are looking at fixing this in a future release. I can't confirm when a fix would come out at the moment, but I will try to get that figured out for you. Thanks! --Tim
I am back at looking at this. Sorry for the delay.

Using -m plane=24, I submitted 3 jobs. Note our cluster has 2 sockets, 12 cores per socket.

Job 1 - 8 core job
Job 2 - 8 core job
Job 3 - 32 core job

and we get this:

JOBID   USER     STATE    PARTITION  EXEC_HOST                              SUBMIT_TIME   JOB_NAME
256840  t901353  RUNNING  linlarge   8*rdsxen129                            Feb 03 13:26  batch_script_7942.sh
256841  t901353  RUNNING  linlarge   8*rdsxen129                            Feb 03 13:28  batch_script_8799.sh
256842  t901353  RUNNING  linlarge   7*rdsxen129:24*rdsxen130:1*rdsxen131 (32 cpus)  Feb 03 13:29  batch_script_9127.sh

Note that the desired behavior is 8 cores on xen129, 8 more on xen129, and then 8 more on xen129 before moving to xen130. So no, -m plane isn't giving us what we want. Should I have put a different number than 24 in plane=24? Not to mention we have other clusters with different numbers of cores per node. Any more thoughts here? Is a potential fix in the source code being planned?

Also, I am not sure what we need out of cons_tres for using GPUs. Can we not take advantage of GPUs if we don't have cons_tres set? Can any GPU advantages be used via the command line, or possibly on a partition of GPU nodes, with those advantages applied to the partition or per job? Just trying to find a way around this issue.

thanks
tom
(In reply to Tom Wurgler from comment #15)
> I am back at looking at this. Sorry for the delay.

No worries, I'm glad you have been able to test the workaround!

> Using -m plane=24, I submitted 3 jobs. Note our cluster has 2 sockets, 12
> cores per socket.
> Job 1 - 8 core job
> Job 2 - 8 core job
> Job 3 - 32 core job
>
> and we get this:
> JOBID USER STATE PARTITION EXEC_HOST SUBMIT_TIME JOB_NAME
> 256840 t901353 RUNNING linlarge 8*rdsxen129 Feb 03 13:26 batch_script_7942.sh
> 256841 t901353 RUNNING linlarge 8*rdsxen129 Feb 03 13:28 batch_script_8799.sh
> 256842 t901353 RUNNING linlarge 7*rdsxen129:24*rdsxen130:1*rdsxen131 (32 cpus) Feb 03 13:29 batch_script_9127.sh

That's a strange result. I haven't been able to replicate that behavior locally; it seems to assign correctly for me. I'll try to see if there is something else going on.

> Note that desired behavior is 8 on xen129, 8 more on xen129 and 8 more on
> xen129 before moving to xen130.

Totally understand that :)

> So no, -m plane isn't giving us what we want. So should I have put a
> different number than 24 in the plane=24?
> And not to mention we have other clusters with different numbers of cores
> per node.

You can put an arbitrarily large number in plane=$count, and it will split up based on the actual job/node size, so you don't have to worry about it as much. The count could just be the largest number of CPUs on any node in the cluster.

> Any more thoughts here?

If your above test is repeatable, it would be useful to have the slurmctld log with "SlurmctldDebug=debug2" and "DebugFlags=SelectType" in the slurm.conf, followed by an "scontrol reconfigure". You can also enable those with:

scontrol setdebug debug2
scontrol setdebugflags +SelectType

We would need the logs from just before the jobs were submitted through any time after they were scheduled.

> Is a potential fix in the source code being planned?

The ticket has been re-assigned to Dominik, who is already working in this code. He is looking at the possible fixes.

> So I am not sure what we need out of cons_tres for using GPUs. Can we not
> take advantage of GPUs if we don't have cons_tres set? Can any GPU
> advantages be used via the command line or possibly on a partition of GPU
> nodes and those advantages applied to the partition or per job?

With cons_res, you can still schedule with GPUs, but it's not as easy at job submit time. You can do something like "--gres=gpu:1" for a job, and it will schedule it and attempt to be smart about NUMA awareness if you've given it that information. --cpus-per-gpu, --gpus, --gpu-bind, --gpu-freq, --gpus-per-node, --gpus-per-socket, --gpus-per-task, and --mem-per-gpu are all cons_tres-specific. I'm not sure what your jobs' GPU requirements are and how much of an issue this would or would not be, but it's an option.

Thanks!
--Tim
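As a concrete sketch of the GRES-style request that works under either plugin (the partition name and wrapper script are hypothetical):

```shell
# Request one GPU per node via generic resources; valid with cons_res and cons_tres
sbatch --gres=gpu:1 --ntasks=4 --partition=gpu wrapper.sh

# By contrast, the per-GPU conveniences are cons_tres-only, e.g.:
#   sbatch --gpus-per-task=1 --cpus-per-gpu=4 ...   # fails under cons_res
```

The trade-off described above is that with cons_res you express GPU needs per node via --gres, while cons_tres adds the finer-grained per-task/per-socket GPU options.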
I redid my tests with "-m plane=40" (note I just made up this number), and thus far it is doing the correct thing. I am going to put the change into our standard script and watch how the packing goes. We are continuing to test.

thanks
tom
Adding the -m plane option to the standard scripts still did not pack efficiently.

thanks
tom
Hi,

Could you send us the logs requested in comment 17?

Dominik
Created attachment 18048 [details] log after setting the variables
scontrol setdebug debug2
scontrol setdebugflags +SelectType

Not exactly reproducible:

l:t901353@rds4020:slurm5 > sjobs
JOBID   USER     STATE    PARTITION  EXEC_HOST                              SUBMIT_TIME   JOB_NAME
257920  t901353  RUNNING  linlarge   8*rdsxen65                             Feb 22 08:56  sierra_batch_script_8363.sh
257921  t901353  RUNNING  linlarge   8*rdsxen65                             Feb 22 08:56  sierra_batch_script_8743.sh
257922  t901353  RUNNING  linlarge   8*rdsxen65:24*rdsxen66 (32 cpus)       Feb 22 08:57  sierra_batch_script_8976.sh
257923  t901353  RUNNING  linlarge   24*rdsxen259:8*rdsxen260 (32 cpus)     Feb 22 08:58  sierra_batch_script_9371.sh
257924  t901353  RUNNING  linlarge   4*rdsxen324:4*rdsxen325:4*rdsxen326:4*rdsxen327:16*rdsxen328 (32 cpus)  Feb 22 08:58  sierra_batch_script_9612.sh
l:t901353@rds4020:slurm5 >

Still weird splitting.

thanks
Hi,

Slurmctld spread job 257924 over five nodes because it fills up the partially allocated nodes rdsxen[324-327] first. It looks strange, but this is the intended behavior.

I also have an update on the primary issue of this bug. Unfortunately, the fix will only be available in 21.08; it requires a significant change in cons_tres, which we can't make in 20.11. Could we drop the severity of this bug to 3?

Dominik
It may be correct behavior, but overall it is not desirable behavior. This is still a big deal to us, even without data to back that up: we think having jobs spread out like this will hurt performance, so I am going to run a series of tests to try to judge the effect of the current packing. As far as lowering the severity goes, does the work on fixing it slow down if we do?
Hi,

Topology-aware resource allocation minimizes the count of switches used and simply takes nodes in order of bitmap position. The other flavors of allocation, _eval_nodes_*(), work differently, but generally, by default, Slurm tries to use partially allocated nodes first. This behavior can be changed with the LLN option.

The fix will definitely be included in 21.08, and the severity level of this bug doesn't change that.

Dominik
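For reference, LLN ("least loaded nodes") is configured per partition in slurm.conf; it steers new jobs toward the nodes with the most idle CPUs, i.e. it spreads rather than packs, so it is not a fix for this ticket, but it is the knob mentioned above (a fragment; partition and node names are made up):

```
# slurm.conf fragment (hypothetical names)
PartitionName=spread Nodes=node[01-10] LLN=YES Default=NO
```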
Thanks for working on this. We look forward to the 21.08 version. I am lowering the severity back to 3.
-tom
Hi In 21.08 cons_tres should "pack" nodes as expected. https://github.com/SchedMD/slurm/compare/a3ff3d71cc20f...04f94c6a2229e6 Please reopen if anything else is found on this. Dominik
*** Ticket 12605 has been marked as a duplicate of this ticket. ***