Is there a way of overriding the cgroup limits on a per-partition basis? We have a number of partitions and QOSes set per department, all globally sharing one slurm.conf file. I want to allow one of the groups/departments to run on their nodes bypassing the cgroup limit, so they can use their resources without limitations - they want to oversubscribe memory.

scontrol show config | egrep -i "cgroup|params|sched"
SelectTypeParameters    = CR_CPU_MEMORY
ProctrackType           = proctrack/cgroup
TaskPlugin              = task/cgroup

I can include the slurm.conf file if it is needed.
(In reply to Jacqueline Scoggins from comment #0)
> I want to allow for one of the groups/departments to be able to run on
> their nodes bypassing the cgroup limit ... they want to oversubscribe memory.

Hi Jacqueline,

You may try setting Oversubscribe, SelectTypeParameters, and DefMemPerCPU on the partition in question, i.e.:

PartitionName=xxx Nodes=xxxx Default=NO Oversubscribe=FORCE SelectTypeParameters=CR_CORE DefMemPerCPU=0

In that case, for each job, the cgroup memory limit will be set to the maximum of the node.

Let me know if this works for you.
> PartitionName=xxx Nodes=xxxx Default=NO Oversubscribe=FORCE
> SelectTypeParameters=CR_CORE DefMemPerCPU=0

Note that the FORCE setting also allows cores to be oversubscribed; I don't know whether you want that or not.
We tried the partition-level setting yesterday and it brought slurmd down on the nodes. I'll send you the message from a node when I get to my computer.

Thanks

Jackie Scoggins
I set the parameters you requested, and now I am seeing the following message in the slurmctld log file:

[2018-07-24T10:01:06.579] cons_res: Can't use Partition SelectType unless using CR_Socket or CR_Core
[2018-07-24T10:01:06.579] cons_res: Can't use Partition SelectType unless using CR_Socket or CR_Core

Does this mean that the global SelectType variable needs to be changed? Current values:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory

PartitionName=alice Nodes=n000[0-3].alice[0] Shared=Yes SelectTypeParameters=CR_CORE DefMemPerCPU=0 DefMemPerNode=260000 OverSubscribe=FORCE

All of the other partitions are set up without the SelectTypeParameters setting, so they should be assuming the global one, correct?
We don't want to oversubscribe the cores, just the memory. So should we use CR_CPU or CR_Memory instead for the SelectTypeParameters in the partition settings?
Hi, sorry for the delay; I was checking before giving you a response.

> Does this mean that the global SelectType variable needs to be changed?

Yes. SelectTypeParameters at the partition level only works with CR_Core_* or CR_Socket_* set globally.

Are you using hyper-threading on the nodes? If not, it should be safe to move to CR_Core_*. Even with it enabled, you may still want CR_Core_* if you want one job per physical core rather than one per hyper-thread. Do you have any particular reason to use CR_CPU_*?

> All of the other partitions are set up without the SelectTypeParameters
> setting, so they should be assuming the global one, correct?

Yes, they are assuming the global value.

> We don't want to oversubscribe the cores, just the memory.
> So should we use CR_CPU or CR_Memory instead for the SelectTypeParameters in the partition settings?

You should use:

PartitionName=xxx Nodes=xxxx ... SelectTypeParameters=CR_Core DefMemPerCPU=0

This way only cores will be constrained for this partition, but not memory. Doing this you are "removing" the *_Memory part that is set in the global value (SelectTypeParameters=CR_Core_Memory), which means memory is not controlled.
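Putting the global and partition pieces side by side, the intended end state is a slurm.conf fragment along these lines (a sketch using the xxx placeholders from this comment; comments and layout are illustrative, not an exact file):

```
# Global: cores and memory are consumable resources cluster-wide
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# Partition override: keep constraining cores, but drop the _Memory part
# so memory is no longer enforced for jobs in this partition
PartitionName=xxx Nodes=xxxx SelectTypeParameters=CR_Core DefMemPerCPU=0
```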
Hi Jacque,

I hope you are doing well. I wanted to follow up on this ticket to see if you needed any further clarification about what Felip has proposed. In his last response, he mentioned that you would need to change the global "SelectTypeParameters=CR_CPU_Memory" to one of the "CR_Core_*" values for this to work properly and override the DefMemPerCPU on the partition. For example:

SelectTypeParameters=CR_Core_Memory

PartitionName=alice Nodes=n000[0-3].alice[0] Shared=Yes SelectTypeParameters=CR_CORE DefMemPerCPU=0

We are also curious to know whether you are using hyper-threading on the nodes.

Best regards,
Jason
Jason,

We tried the parameter changes and it did not work as expected. Do you have time to talk now? I have a follow-up meeting with the user on Monday, and I want to get it squared up before then.

Thanks

Jackie Scoggins
(In reply to Jacqueline Scoggins from comment #8)
> We tried the parameter changes and it did not work as expected.

Hi Jacqueline,

Can you please tell me what exactly did not work? Changing the global to CR_Core_Memory plus the partition to SelectTypeParameters=CR_Core DefMemPerCPU=0 should work; I tested it in my environment before answering you and had no problems.

Thanks,
Felip
The jobs are getting queued even when we believe there are enough resources to start more of them with oversubscription set. We did a test right after the change and it did not behave as we expected. I'll send you some stats tomorrow if they're still on my computer. I have a follow-up meeting with the user tomorrow and I can provide more examples.

Thanks

Jackie Scoggins
One additional issue: changing the global parameter could affect our other customers' configurations. This is a single Slurm configuration managing about 10+ clusters. That's why I'm trying to do it only at the partition level.

Thanks

Jackie Scoggins
(In reply to Jacqueline Scoggins from comment #11)
> One additional issue: changing the global parameter could affect our other
> customers' configurations. This is a single Slurm configuration managing
> about 10+ clusters. That's why I'm trying to do it only at the partition
> level.

This was the reason I asked whether you were using hyper-threading. If you are not, or if you are scheduling at the core level, this shouldn't change how things work.

Can you attach your current slurm.conf (I have an old one) and I'll take a look too?
Created attachment 7451 [details] attachment-20858-0.html Here’s the slurm.conf file.
Created attachment 7452 [details] slurm.conf
I am not 100% sure, but I think it is working as expected. The user changed his memory request from 6G to 4G per job and now he is seeing 128G of allocated memory. I will check in with him today to see if everything is working as expected.

Thanks

Jackie
(In reply to Jacqueline Scoggins from comment #16)
> I am not 100% sure, but I think it is working as expected. The user changed
> his memory request from 6G to 4G per job and now he is seeing 128G of
> allocated memory.

With DefMemPerCPU=0 this is expected; it just changes the amount of memory requested by the user to "infinite". Without changing the global to CR_Core_Memory and the partition SelectTypeParameters to CR_Core, the maximum memory that can be in use at a time on the node is still 128G, so no memory overcommit can happen.

Changing the global from CR_CPU_Memory to CR_Core_Memory shouldn't make any noticeable difference for you, and it is needed for the partition-level select type parameter to work. Before telling you to change it, I will test twice and ensure that all works as expected.
I spoke too soon. Here is what the user wants, and he is not getting it.

He wants the number of jobs scheduled per node to be 75% of the number of CPUs (28 cores; with HT it's 56), so he's looking for 42 jobs to run. He is requesting the jobs in his script as follows:

#SBATCH --qos=alice_normal --partition=alice --ntasks=1 --cpus-per-task=1 --mem=4000M --time=48:00:00 --account=alice

The qos alice_normal:

Name|Priority|GraceTime|Preempt|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|
alice_normal|0|00:00:00||cluster|||1.000000|||||||||||||||140|||

The partition alice:

PartitionName=alice Nodes=n000[0-3].alice[0] Shared=Yes SelectTypeParameters=CR_CORE DefMemPerCPU=0 OverSubscribe=FORCE:56

PartitionName=alice AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL AllocNodes=ALL Default=NO QoS=N/A DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED Nodes=n000[0-3].alice[0] PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:56 OverTimeLimit=NONE PreemptMode=REQUEUE State=UP TotalCPUs=224 TotalNodes=4 SelectTypeParameters=CR_CORE DefMemPerCPU=UNLIMITED MaxMemPerNode=UNLIMITED

Node information:

sinfo -leN --partition=alice
Mon Jul 30 14:44:45 2018
NODELIST      NODES PARTITION STATE CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
n0000.alice0  1     alice     mixed 56   2:28:1 128824 503836   1      alice    none
n0001.alice0  1     alice     mixed 56   2:28:1 128824 503836   1      alice    none
n0002.alice0  1     alice     mixed 56   2:28:1 128824 503836   1      alice    none
n0003.alice0  1     alice     mixed 56   2:28:1 128824 503836   1      alice    none

They also want to make sure that if they request 6GB, the job does not run over 6GB and will be killed. If you have time to talk this over, please let me know.
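On the "kill at 6GB" requirement: with TaskPlugin=task/cgroup already in place (as shown in the original scontrol output), that enforcement comes from cgroup.conf rather than from the partition definition. A minimal sketch of the relevant settings follows - the values shown are illustrative assumptions, not this site's actual file:

```
# cgroup.conf (lives next to slurm.conf)
ConstrainRAMSpace=yes    # cap each job at its requested memory (e.g. --mem=6000M)
ConstrainSwapSpace=yes   # also cap RAM+swap so the job cannot spill into swap
AllowedSwapSpace=0       # percent of extra swap allowed on top of the RAM limit
```

With constraints like these, a job that pushes past its requested memory plus the allowed swap is terminated by the kernel's cgroup OOM handling rather than dragging the whole node down.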
We can set up a Zoom conference and I can share my screen with you.

Thanks

Jackie
(In reply to Jacqueline Scoggins from comment #18)
> I spoke too soon. Here is what the user wants, and he is not getting it.
>
> He wants the number of jobs scheduled per node to be 75% of the number of
> CPUs (28 cores; with HT it's 56), so he's looking for 42 jobs to run.

Judging from your slurm.conf file, you *are not* using hyper-threading:

NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 Sockets=2 CoresPerSocket=28 Feature=alice Weight=1 # C6320 28 cores 128G RAM

This seems to be a Dell PowerEdge C6320 with 2 sockets and 28 cores per socket, so you have 56 cores. Am I right?

> The partition alice -
> PartitionName=alice Nodes=n000[0-3].alice[0] Shared=Yes SelectTypeParameters=CR_CORE DefMemPerCPU=0 OverSubscribe=FORCE:56

In your configuration, FORCE:56 means one core can run up to 56 jobs, so you could theoretically have 56*56 = 3136 jobs on the node. That maximum doesn't make much sense here.

> NODELIST      NODES PARTITION STATE CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> n0000.alice0  1     alice     mixed 56   2:28:1 128824 503836   1      alice    none
> n0001.alice0  1     alice     mixed 56   2:28:1 128824 503836   1      alice    none
> n0002.alice0  1     alice     mixed 56   2:28:1 128824 503836   1      alice    none
> n0003.alice0  1     alice     mixed 56   2:28:1 128824 503836   1      alice    none

This listing shows that the nodes have only 1 thread per core (the S:C:T column), so no hyper-threading is enabled.

> They also want to make sure that if they request 6GB, the job does not run
> over 6GB and will be killed.

I am confused: in the initial comment and in comment 5 you say you want to oversubscribe memory.

Let me understand. What the user wants is to launch 42 jobs, each one on one single core, with each job able to reach a 6GB RAM limit, maybe exceeding the total system memory. Is that it?
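The arithmetic in the comment above can be checked with a quick sketch (plain Python, just re-deriving the numbers stated in this thread):

```python
# Numbers from the thread: 2 sockets x 28 cores per socket, no hyper-threading.
cores_per_node = 2 * 28

# OverSubscribe=FORCE:56 lets each core run up to 56 jobs at once...
theoretical_max_jobs = cores_per_node * 56

# ...while the user's actual target is 75% of the cores as job slots.
target_jobs = cores_per_node * 75 // 100

print(theoretical_max_jobs)  # 3136 - far beyond what the user wants
print(target_jobs)           # 42
```

The gap between 3136 and 42 is why the FORCE:56 setting "doesn't make much sense" for this request.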
Hello,

> What the user wants is to launch 42 jobs, each one on one single core, with
> each job able to reach a 6GB RAM limit, maybe exceeding the total system
> memory. Is that it?

Yes.

Here is what I have done so far. After reviewing the setup and seeing that HT was not properly set on the nodes, I have fixed it. I have done the following:

SelectTypeParameters is set to CR_CPU_MEMORY because when the user was requesting --cpus-per-task=1, I saw that he was being allocated 2 CPUs instead of 1. Changing it to CR_CPU_MEMORY, the jobs are now being allocated 1 CPU. If I have it set to CR_CORE_MEMORY, would you recommend they use --cores-per-task instead?

I added ThreadsPerCore=2 Sockets=2 CoresPerSocket=14, which gives a total of 56 CPUs. I also had to add LLN because I noticed that all the jobs were being stacked on a single node first and then the next node, and they don't want it to behave that way. I.e.:

NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 ThreadsPerCore=2 CoresPerSocket=14 Sockets=2 Feature=alice Weight=1
PartitionName=alice Nodes=n000[0-3].alice[0] Shared=Yes SelectTypeParameters=CR_CORE DefMemPerCPU=0 OverSubscribe=FORCE LLN=Yes

I am watching the jobs in the queue and the system using the following commands:

squeue --user=rjporter --state=R -o "%T|%h| %r |%A | %C| %u| %j| %m| %Y|%N"
sinfo -lNe | grep alice0
scontrol show node n000[0-3].alice[0] | egrep "NodeAddr|Mem"

And I'm watching the slurmctld log file to see when a job is started. Is there anything else you would recommend I run to verify that it is behaving as expected - that all cores will be allocated a job, expecting at least 42+ jobs to run per node?
Thanks

Jackie
Jacque,

Now you have configured hyper-threading, and using CR_CPU_Memory makes your initial request impossible.

I am currently looking at alternatives for you. One of them is to make each "alice" node look like it has more memory than it actually does, so your user will be able to submit more jobs to the nodes. Note that in any case this can result in OOM events, although it does allow some amount of extra memory for scheduling. Is this user aware that he can receive OOMs?
> when the user was requesting --cpus-per-task=1 I saw that he was being
> allocated 2 CPUs instead of 1. Changing it to CR_CPU_MEMORY the jobs are
> now being allocated 1 CPU.

This is working as expected.

With CR_CPU_* each task is bound to a "hyper-thread".
With CR_Core_* each task is bound to a physical core.

Depending on the application, it may be preferable to schedule to cores instead of hyper-threads. Are you sure you really want to bind processes to hyper-threads and not to physical cores? One hyper-thread shares resources with the others on its core.

> If I have it set to CR_CORE_MEMORY would you recommend
> they use --cores-per-task instead?

In that case, --cores-per-task=1 will bind 1 task to 1 core. The granularity will be the core, not the hyper-thread.

> I also had to add LLN because I noticed that all the jobs
> were being stacked on a single node first and then the next node and they
> don't want it to behave that way.

That's fine.

> i.e.
> NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 ThreadsPerCore=2
> CoresPerSocket=14 Sockets=2 Feature=alice Weight=1
> PartitionName=alice Nodes=n000[0-3].alice[0] Shared=Yes
> SelectTypeParameters=CR_CORE DefMemPerCPU=0 OverSubscribe=FORCE LLN=Yes

What's not good here is SelectTypeParameters=CR_CORE at the partition level when you have SelectTypeParameters=CR_CPU_Memory at the global level. This won't work.

> Is there anything else you would recommend I run to verify that it is
> behaving as expected - that all cores will be allocated a job, expecting at
> least 42+ jobs to run per node?

With your current setup, memory will still be constrained and there will be no possibility to overcommit. I.e., if each job requests 6GB and 1 CPU, you will not be able to run more than 21 jobs per node (128GB RAM / 6GB per job).

See my previous comment for more info. I am looking at alternatives for you.
Jacque,

I am going to suggest you try:
---------------------
FastSchedule=2

SelectTypeParameters=CR_CPU_Memory
NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 Sockets=2 CoresPerSocket=28 Feature=alice Weight=1 RealMemory=258048 # C6320 28 cores 128G RAM
PartitionName=alice Nodes=n000[0-3].alice[0] DefMemPerCPU=2000
---------------------

Explanation:

FastSchedule is currently set to 0 in your configuration. That means each node reports its memory to slurmctld, and that reported value is what defines the maximum memory available on the node; a slurm.conf node definition with more memory than the real memory does not set the node to drain, but the inflated value is not used either. Are you sure you want FastSchedule to be 0 and not 1?

For your situation, where you want to oversubscribe memory, FastSchedule must be set to 2. This implies that CPUs and memory must be correctly set in slurm.conf for each node; '2' allows you to define more memory than what is actually available in the node. E.g., my laptop has 8G of real memory, but with FastSchedule=2 I faked it to have 256GB.

Thanks to this parameter, and to the NodeName definition where I fake RealMemory to be 42*6GB (258048 MB), you will be able to oversubscribe memory on these nodes while still constraining memory per job.

I also removed the Shared=Yes parameter; there is no need for it if you don't want to oversubscribe cores (moreover, it is deprecated in favor of OverSubscribe=FORCE).

Let me know if this works for you.
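The RealMemory figure in the suggestion above is just the target job count times the per-job request, expressed in MB (a quick re-derivation of the number from this comment, nothing more):

```python
jobs_per_node = 42          # 75% of the 56 CPUs
mem_per_job_mb = 6 * 1024   # each job requests 6GB, expressed in MB

fake_real_memory = jobs_per_node * mem_per_job_mb
print(fake_real_memory)  # 258048 - the RealMemory value in the NodeName line
```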
(In reply to Felip Moll from comment #22)
> Are you sure you really want to bind processes to hyper-threads and not to
> physical cores?

We want to bind to cores and not the hyper-threads, so I will change it back to CR_Core, which means the user needs to request --cores-per-task instead of --cpus-per-task.

> What's not good here is SelectTypeParameters=CR_CORE at the partition level
> when you have SelectTypeParameters=CR_CPU_Memory at the global level. This
> won't work.

Will keep this once the global changes are made.
> With your current setup, memory will still be constrained and there will be
> no possibility to overcommit.

Please advise how to make it so that memory is not constrained. They want to use their cluster to the fullest and not have memory constrain the use of the cores. If a job uses more than 6GB they want it to be killed, and they are OK with that. They are submitting 2-day jobs via a program that just keeps spawning them. The jobs are all 1 CPU, 1 task, with 6GB of RAM requested, but they vary in what they actually do. Most run under 4GB and some maybe above, but they don't know in advance which will behave which way, so they set all of them to the highest memory they expect a job to use. If it goes over, they want it to die.

> I.e., if each job requests 6GB and 1 CPU, you will not be able to run more
> than 21 jobs per node (128GB RAM / 6GB per job).
>
> See my previous comment for more info. I am looking at alternatives for you.

Any help here would be greatly appreciated.
The only concern I have with setting FastSchedule to 2 is that this is not the only cluster in our slurm.conf, and I don't want to impact the entire set of clusters with this change. The other clusters do want to have their memory constrained, and we are using cgroup as intended for the rest of the clusters. What effect will this global setting have on the other clusters, where users don't request memory for their jobs and some nodes are exclusive while others are shared? Have you reviewed the slurm.conf file to verify that this setting will not impact the other clusters? Please advise before I make this change.

Jackie
Hi Jacque,

Jacob mentioned that you asked for a phone call about this issue. We work exclusively through Bugzilla, so unfortunately this will not be possible. I have reviewed your ticket, 5467, and I see that Felip has given valid responses to your requests. Please note that the requirements you have specified have changed over the course of the ticket (such as in comment #18 and comment #20), which has added to the time it takes to respond with meaningful information. Also, the request is not entirely clear to us and has the potential to have serious repercussions for those nodes. We understand the user wants to oversubscribe; however, it is not clear why they wish to do this.

1) Based on the information you have given, do you want to keep using cgroups yet disregard the memory limits enforced by the cgroup?
2) Are the user's tasks ballooning in size to consume all the RAM and then decreasing in size over the lifecycle of the job?
3) Does the user just want to use the entire node (exclusively)?
4) What type of jobs are these (Matlab)?

Please ask the user why they want to do this, because if two jobs land on a node and each consumes all the memory, the OOM killer would be invoked, which is never good, as it can stomp on other processes.

In answer to your last question about FastSchedule=2: this should not cause any issues, since this option makes Slurm look at slurm.conf and honor the configured node attributes over the detected ones:

"Consider the configuration of each node to be that specified in the slurm.conf configuration file and any node with less than the configured resources will not be set DRAIN."

Regarding your other concerns about testing these parameters - setting and unsetting options in the conf files - have you considered trying them out in a test cluster before deploying to production?

Best regards,
Jason
Director of Support
This is why email is not the most efficient way to communicate. We have about 13 clusters configured under our slurm.conf file. We have multiple configurations per cluster because they are for different users/departments. Cgroups have been set up on all of the clusters, and we want to use them for all of them except this one cluster (alice). Here is what the user requested for his cluster (see below); what I don't want is for changes made just for this partition to affect the global configuration we have for all of the other 12 clusters.

"Hey Karen & John, I mentioned briefly to John & Gary that I'd like to optimize the number of concurrent jobs run on the cluster. Using hyperthreaded slots, we found we can get about a 30% boost by doubling the # of slots relative to cores, but then we also hit occasional memory problems. So what we do at ORNL is push hyperthreaded slots more cautiously. I'd like to do the same at the HPCS cluster. Here are the conditions:

1. ALICE requires sites provide 2GB/job slot plus swap (~2-3GB/slot).
2. (Most) ALICE sites do not impose a memory limit request per job.
3. The average ALICE job, weighted by length (e.g. ignoring all the little jobs), is ~2.1GB/job.
4. That average includes a long tail out to ~6 or 7 GB.
5. Internally, each ALICE jobagent monitors its payload and kills the job if it uses 8GB of memory.
6. Sites that implement #1 & #2 report on occasion (a few times per year) that nodes need to be rebooted because of memory exhaustion.

What we do at ORNL is implement #1 and #2 but at 3GB/slot. The ORNL folks report that they have not had to reboot a node due to memory exhaustion, but in checking the logs they find a handful of single-node OOM events per month that go into swap but are able to self-recover within a couple of hours. This is a tiny loss of processing and does not impact cluster operations. To implement this on HPCS, we should first allow for 35 slots/node = 28 x 4GB/3GB.
Then either: do not have memory as a consumable resource, or tell the scheduler that the node has more memory than physical RAM. E.g., say we define the ALICE job request at 6.5GB and then tell the scheduler the node has 228GB of RAM. Then the scheduler will allow 35 such jobs on each node. What do you think? thanks, Jeff"

Keep in mind, looking at the slurm.conf file, that we have multiple partitions and we use QOSes to control and manage limits. We have cgroups set up on all clusters, and this user wants to use his cluster, under the same configuration, differently than the global configuration settings we have for the other cluster users. And by the way, if I had a test cluster I would definitely be performing these changes there; at this time there is not one set up. I hope this clears up the request from the user's email to my team members.

Thanks
Jackie
Jacqueline,

It is a bit more clear now, but I still don't fully understand some parts, like:

> we should first allow for 35 slots/node = 28 x 4GB/3GB

Or this:

> ALICE sites do not impose a memory limit request per job. <--- I guess this is the Slurm job request limit?

> we define the alice job request at 6.5GB <--- I guess this is an internal ALICE limit?

-----------

In any case, the easiest and most flexible way to implement what you want is to set:

FastSchedule=2
SelectTypeParameters=CR_CPU_Memory
NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 Feature=alice Weight=1 RealMemory=258048 # C6320 28 cores 128G RAM, faked RealMemory
PartitionName=alice Nodes=n000[0-3].alice[0] DefMemPerCPU=3000

(Adjust RealMemory to NumJobsYouWant*SizePerJobYouWant, i.e. 35 jobs x 6.5GB/job = 232960.)
(Adjust DefMemPerCPU to be less than or equal to SizePerJobYouWant, i.e. 6.5GB = 6656.)

*BE AWARE*: Setting FastSchedule=2 implies that your node definitions must be correctly and carefully specified in slurm.conf. Automatic info gathered from slurmd when a node registers will *not* be honored; i.e., if you don't define RealMemory for a node, it will start with RealMemory=1. Any parameter not defined in the NodeName definition in slurm.conf will get the default of 1. So your work here would be to check that RealMemory/Sockets/CoresPerSocket/ThreadsPerCore/CPUs are correctly set for all nodes before applying the FastSchedule change. Then apply the proposed change and restart slurmctld.
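To make the two "Adjust ..." notes above concrete, here is a small sketch of the arithmetic. The 35-jobs and 6.5GB/job figures come from this thread; slurm.conf expresses memory values in MB:

```python
# Sketch of the RealMemory / DefMemPerCPU arithmetic for the faked-memory
# approach above. slurm.conf memory values are in MB.

def faked_real_memory(num_jobs, mb_per_job):
    """RealMemory to advertise so the scheduler fits num_jobs of mb_per_job each."""
    return num_jobs * mb_per_job

mb_per_job = int(6.5 * 1024)            # 6.5 GB/job -> 6656 MB
real_memory = faked_real_memory(35, mb_per_job)
print(real_memory)                      # 232960, the value quoted above
```

The DefMemPerCPU value would then just need to be at or below `mb_per_job` so each 1-CPU job fits within its slot.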
Ok, I will look at this today and see if we want to make this change, which will affect all of our node configuration settings.

Thanks
Jackie Scoggins
Hi Jacque, have you reached a decision on this matter? Please keep me informed.
We have not reached a decision on this. My concern is the number of changes that need to occur in the slurm.conf file just because we need to change FastSchedule from 0 to 2 and then add memory resource limits for each node. We currently have 63 NodeName lines that would now need to be updated, and I really don't want to have to do this just to allow one cluster to use a majority of its resources without limitations on memory or the number of cores; that cluster is only a set of 4 nodes and 1 line of change. I feel that we have a complex setup and that this change is more work than it should be.

Currently alice is set up as follows:

sinfo -leN --partition=alice
Thu Aug 9 17:54:38 2018
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
n0000.alice0 1 alice allocated 56 2:14:2 128824 503836 1 alice none
n0001.alice0 1 alice allocated 56 2:14:2 128824 503836 1 alice none
n0002.alice0 1 alice allocated 56 2:14:2 128824 503836 1 alice none
n0003.alice0 1 alice allocated 56 2:14:2 128824 503836 1 alice none

[root]# grep -i alice /etc/slurm/slurm.conf
## ALICE nodes
NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 ThreadsPerCore=2 CoresPerSocket=14 Sockets=2 Feature=alice Weight=1 # C6320 28 cores 128G RAM
PartitionName=alice Nodes=n000[0-3].alice[0] Shared=Yes SelectTypeParameters=CR_CORE DefMemPerCPU=0 OverSubscribe=FORCE LLN=Yes

Number of running jobs per node:
32 n0000.alice0
32 n0001.alice0
32 n0002.alice0
32 n0003.alice0

Number of pending jobs and reasons:
9 (Priority)
1 (Resources)

Memory allocated per node:
RealMemory=128824 AllocMem=128000 FreeMem=42331 Sockets=2 Boards=1

The pending jobs are requesting the same parameters as the running ones:

STATE|OVER_SUBSCRIBE|REASON|JOBID|CPUS|USER|NAME|MIN_MEMORY|SCHEDNODES|NODELIST
PENDING|YES|Priority|13838925|1|rjporter|AliEn.8300.155|4000M|n0003.alice0|
PENDING|YES|Priority|13838920|1|rjporter|AliEn.8300.155|4000M|n0000.alice0|
PENDING|YES|Priority|13838909|1|rjporter|AliEn.8300.155|4000M|n0002.alice0|
PENDING|YES|Resources|13838903|1|rjporter|AliEn.8300.155|4000M|n0001.alice0|
PENDING|YES|Priority|13838929|1|rjporter|AliEn.8300.155|4000M|(null)|
PENDING|YES|Priority|13838933|1|rjporter|AliEn.8300.155|4000M|(null)|
PENDING|YES|Priority|13839005|1|rjporter|AliEn.8300.155|4000M|(null)|
PENDING|YES|Priority|13839658|1|rjporter|AliEn.8300.156|4000M|(null)|
PENDING|YES|Priority|13839682|1|rjporter|AliEn.8300.156|4000M|(null)|
PENDING|YES|Priority|13840022|1|rjporter|AliEn.8300.156|4000M|(null)|

If there is another option that would be better, I would love to hear it.
(In reply to Jacqueline Scoggins from comment #34)
> We have not taken a decision on this. My concern is the amount of changes
> that need to occur in the slurm.conf file just because we need to change
> FastSchedule from 0 to 2 and then add for each node memory resource limits.

I talked with the other engineers here; this is the most reliable way to do it. To get the RealMemory of the nodes, you can just do:

]$ scontrol show node | grep -i "nodename\|realmemory"

and you will have each node with its current RealMemory. This is the value that you must set in slurm.conf.

> If there was another option that would be better I would love to hear it.

Here are 3 more options. You may want to try option 1, which doesn't involve changes to other partitions. Let me know if it works for you:

1. Keep CR_CPU_Memory globally and FastSchedule=0 as you currently have. Set only "CPUs=56" in the node definition, skipping the other parameters (no ThreadsPerCore, no Sockets, no Boards, no Cores). Set DefMemPerCPU=1 MaxMemPerCPU=1 on the partition definition. On the 'alice' nodes, change cgroup.conf individually and set ConstrainRAMSpace=no + ConstrainSwapSpace=no. I.e.:

PartitionName=alice Nodes=n000[0-3].alice[0] DefMemPerCPU=1 MaxMemPerCPU=1
NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 Feature=alice Weight=1 # C6320 28 cores 128G RAM

Set globally:

MemLimitEnforce=no
JobAcctGatherParams=NoOverMemoryKill

2. I can try to code a patch to let you set CR_CPU in the partition specification when you have CR_CPU_* set globally. You would have to recompile Slurm with this patch, deploy it, and maintain the patch locally, because it won't be part of any standard release; it won't go into 18.08 either, since this is considered a new feature.

3.
You can use Federated Cluster Support. This would mean starting another slurmctld instance with its own configuration: https://slurm.schedmd.com/SLUG16/FederatedScheduling.pdf

Let me know if 1) works for you.
> You may want to try option 1, that doesn't involve changes to other
> partitions. Let me know if it works for you.

Jacqueline,

Have you been able to try option 1 from comment 35?

Thanks
Felip
Not yet; we have to discuss it, and this weekend we have a power outage. It might be implemented after that, on Monday or Tuesday. I will try option 1, but my concern is what the NoOverMemoryKill global setting will do to the other partitions that have cgroups enforcing limits.

Thanks
Jackie Scoggins
(In reply to Jacqueline Scoggins from comment #37)
> I will try option 1 but my concern is what will the NoOverMemorykill global
> setting do to the other partitions that have cgroup enforcing limits.

From the start of this bug I assumed you had the following settings:

cgroup.conf:
ConstrainRAMSpace=yes

slurm.conf:
TaskPlugin=task/cgroup

If you really have these two settings enabled, you MUST disable the other enforcement mechanism, which uses jobacctgather to kill jobs and steps. If you have both mechanisms enabled, there can be conflicts when one kills a step or job while the other is still checking the limit; there are known bugs related to this. I didn't want to bother you with this earlier in order not to mix things up here, but given your comment I have to say now that you NEED to change your global slurm.conf parameters to:

> > MemLimitEnforce=no
> > JobAcctGatherParams=NoOverMemoryKill

In future versions, having both enabled will cause a fatal error in slurmd and slurmctld. So it is safe to set them as I suggested, as long as you constrain memory from task/cgroup.
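Put together, the enforcement split described here would look something like the fragment below. This is a sketch assembled from settings quoted in this thread; in particular, the parameter name JobAcctGatherParams is taken from the scontrol output reported later in the ticket:

```ini
# slurm.conf (global): disable the jobacctgather-based memory kill so only
# one enforcement mechanism is active
MemLimitEnforce=no
JobAcctGatherParams=NoOverMemoryKill

# cgroup.conf on most nodes: cgroup remains the sole memory enforcer
ConstrainRAMSpace=yes

# cgroup.conf on the alice nodes only: no memory enforcement at all
ConstrainRAMSpace=no
ConstrainSwapSpace=no
```

The per-node cgroup.conf difference is what lets one partition bypass memory limits while the shared slurm.conf stays unchanged for everyone else.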
Jacqueline,

Please tell me how you have progressed on this bug. Per comments 35 and 38, I would consider this bug resolved, as multiple paths have been presented to solve your problem.
It is not working as expected. I made the recommended changes, and I am still seeing jobs stay in the Pending state even when there are CPUs available to run the job.

NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 Feature=alice Weight=1 # C6320 28 cores 128G RAM
PartitionName=alice Nodes=n000[0-3].alice[0] Shared=Yes DefMemPerCPU=1 MaxMemPerCPU=1 OverSubscribe=FORCE LLN=Yes

> scontrol show config | grep -i mem
AccountingStorageTRES = cpu,mem,energy,node,billing
DefMemPerNode = UNLIMITED
JobAcctGatherParams = NoOverMemoryKill
MaxMemPerNode = UNLIMITED
MemLimitEnforce = No
SelectTypeParameters = CR_CPU_MEMORY
JobAcctGatherType = jobacct_gather/linux

alice nodes' cgroup.conf file:

pdsh -g alice0 cat /etc/slurm/cgroup.conf | dshbak -c
----------------
n[0000-0003]
----------------
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
CgroupMountpoint="/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=no
ConstrainSwapSpace=no

Node state:

NodeName=n0000.alice0 Arch=x86_64 CoresPerSocket=1 CPUAlloc=28 CPUErr=0 CPUTot=56 CPULoad=22.08
AvailableFeatures=alice ActiveFeatures=alice Gres=(null)
NodeAddr=10.0.4.0 NodeHostName=n0000.alice0 Version=17.11
OS=Linux 3.10.0-693.11.6.el7.x86_64 #1 SMP Wed Jan 3 18:09:42 CST 2018
RealMemory=128824 AllocMem=128000 FreeMem=67706 Sockets=56 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=503836 Weight=1 Owner=N/A MCS_label=N/A
Partitions=alice
BootTime=2018-08-22T15:49:17 SlurmdStartTime=2018-08-22T15:51:08
CfgTRES=cpu=56,mem=128824M,billing=56
AllocTRES=cpu=28,mem=125G
CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Node status:

n0000.alice0 1 alice mixed 56 56:1:1 128824 503836 1 alice none
n0001.alice0 1 alice mixed 56 56:1:1 128824 503836 1 alice none
n0002.alice0 1 alice mixed 56 56:1:1 128824 503836 1 alice none
n0003.alice0 1 alice mixed 56 56:1:1 128824 503836 1 alice none

Node memory:

NodeAddr=10.0.4.0 NodeHostName=n0000.alice0 Version=17.11 RealMemory=128824 AllocMem=128000 FreeMem=67706 Sockets=56 Boards=1
NodeAddr=10.0.4.1 NodeHostName=n0001.alice0 Version=17.11 RealMemory=128824 AllocMem=128000 FreeMem=69844 Sockets=56 Boards=1
NodeAddr=10.0.4.2 NodeHostName=n0002.alice0 Version=17.11 RealMemory=128824 AllocMem=128000 FreeMem=70323 Sockets=56 Boards=1
NodeAddr=10.0.4.3 NodeHostName=n0003.alice0 Version=17.11 RealMemory=128824 AllocMem=128000 FreeMem=68962 Sockets=56 Boards=1

Totals:
128 running jobs; all nodes are running 32 CPUs and 128000 memory.
10 pending jobs, all requesting 1 CPU and 4000M of memory.

What they would like is for all nodes that have CPUs available to go up to 56 processes and ignore, or oversubscribe, memory. None of the settings you have provided me so far have satisfied this request from the customer. Is it possible to set RealMemory larger than what the node holds, i.e. make the node think it has almost double the memory, or use some virtual memory to increase the value? Would the users have to make a special request, or add anything to their batch script or srun command?

I hope this information is helping you see what we are seeing after the changes were made; it is not any different than it was before.

Thanks
Jackie
> Totals
>
> 128 - Running jobs
> All nodes are Running 32 cpus and 128000 memory
>
> 10 - Pending jobs
> all requesting 1 CPU and 4000M of memory

This is because your jobs are asking for memory. Try to run 56 jobs without specifying memory (or specifying very little) and it will work, without memory being enforced (since you are not enforcing the limit in cgroup.conf):

for i in $(seq 1 56); do srun -n1 -c1 -w n0000.alice0 sleep 120 & done

> What they would like to happen is for all nodes that can use the CPU's to
> go up to 56 processes and ignore or oversubscribe memory. None of these
> settings you have provided me yet have been able to satisfy this request
> from the customer.

I understand what you want, and in your user's e-mail it was clear that memory would be controlled by ALICE, not Slurm, so I assumed that jobs were not going to ask for any memory.

> Is it possible to have realmemory larger that what the
> node hold? Make the node think it has almost doubled the size of memory or
> use some virtual memory to increase the value?

Yes, using FastSchedule=2. If you are going to ask for memory (I don't see why you would if Slurm is not controlling memory) or fake the node's memory, then you must follow the comment 31 approach: set FastSchedule=2 globally, set the memory on the node, and so on.

> Would the users have to do a special request or add any request to their batch script or srun command?

Nothing different from what would be done under normal circumstances.

> I hope this information is helping you see what we are seeing after the
> changes were made. - not any different that it was before.

It is different: memory is not constrained (as long as you don't ask for memory), and you can run up to 56 jobs.
I will suggest to the user that he change his program so it doesn't request memory, and see if that works. The last time he made the change to not request it, we saw fewer jobs running; I know that cgroups were still set to constrain memory then, so I'll see what happens this time.

Setting FastSchedule to 2 is not an option I want to take, because it would require me to change all of the other systems' nodes, and I don't want to do that.

Thanks
Jackie Scoggins
(In reply to Jacqueline Scoggins from comment #42)
> I will suggest to the user to change his program so he doesn’t request
> memory and see if it works.

Hi Jacqueline, do you have any news about this issue?
We were able to make the necessary changes, and it works for the user. There are a few more things we need to fix, and I would like to have your input.

Since the user has only 4 nodes, if one node goes offline for any reason he would like to pack as many jobs onto the remaining nodes as possible without over-utilizing their resources. He wants to limit the number of jobs across all 4 nodes to a max of 164; with 4 nodes that is evenly distributed, but if he has only 3 nodes, how will Slurm distribute the jobs? His concern is that the pack of jobs running on the nodes could potentially kill a node due to resources being oversubscribed/over-utilized.

Any assistance would be great.

Thanks
Jackie
I wanted to share what the user wrote:

"When I look at the partition, I see:

[user@alice ~]$ sinfo -N -o "%N %C %O %e" -p alice
NODELIST CPUS(A/I/O/T) CPU_LOAD FREE_MEM
n0000.alice0 37/19/0/56 36.16 29164
n0001.alice0 38/18/0/56 40.53 4670
n0002.alice0 37/19/0/56 36.18 28979
n0003.alice0 37/19/0/56 38.90 20314

If I increase the # of jobs, or if 1 node goes away, will the available nodes fill to 56 jobs? That would be too many for the memory needs of ALICE jobs. We should shoot for ~42/node. Can you add that limit?"

So, yes, it appears he wants to make that the limit for all 4 nodes.
(In reply to Jacqueline Scoggins from comment #44)
> Since the user has only 4 nodes if one node goes offline for any reason he
> would like to pack as many jobs onto the nodes without over utilizing
> resources on the node.

Jackie,

What you are asking for now is to limit jobs again so they do not exceed the available memory, which is exactly the limit we removed with the previous approach. Now that ALICE is constraining memory, it is his responsibility not to exceed it. Do the jobs always have the same memory constraint from ALICE (i.e. 6GB)?

There is another option that would spread the jobs across the different nodes; please look at SelectTypeParameters=CR_LLN. But keep in mind this will affect the entire system.

Another option, if you implemented what was in comment 40, is to set the default memory per job to NodeMemory/MaxJobs.

Tell me what you think.
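A quick sketch of the NodeMemory/MaxJobs arithmetic, using figures reported earlier in this ticket (128824 MB RealMemory per alice node, and the ~42 jobs/node the user asked for; both values come from the thread, not a recommendation):

```python
# Sketch of the "default memory = NodeMemory / MaxJobs" suggestion,
# using the alice node figures from this ticket (values in MB).
node_memory_mb = 128824        # RealMemory reported by sinfo/scontrol
max_jobs_per_node = 42         # per-node cap the user asked for

def_mem_per_job = node_memory_mb // max_jobs_per_node
print(def_mem_per_job)         # 3067 MB per job slot
```

Setting the default per-job memory at or below this value would let the scheduler itself prevent more than 42 jobs from fitting on a node.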
I have the LLN setting already. We don't want to change the memory settings we have; they are just fine. He just wants to make sure that the limit of 164 MaxJobsPerAccount doesn't mean that, if one node goes offline, the other three nodes will be over-utilized and killed by the OOM killer or by too high a load.

If 1 node is offline, just limit the number of jobs on each remaining node to, say, 42 jobs only.

Thanks
Jackie Scoggins
(In reply to Jacqueline Scoggins from comment #47)
> I have the LLN setting already. We don't want to change the memory settings
> we have; they are just fine. He just wants to make sure that, with the limit
> of 164 MaxJobsPerAccount, if one node goes offline the other three nodes
> won't be over-utilized and killed by OOM or by too high a load.
>
> If 1 node is offline, just limit the number of jobs on the remaining nodes
> to, say, 42 jobs only.
>
> Thanks
>
> Jackie Scoggins

Jackie,

There's no direct option to dynamically limit the maximum number of jobs able to run on a partition.

If you allow just one account in the partition, you could modify the MaxJobsPerAccount limit dynamically, calling a script through the use of:

A) NHC (Node Health Check)
B) strigger

You would have two scripts that, when a node is set to down, drained, or even resumed, act and modify the QOS of the partition, changing the MaxJobsPA limit.

IMHO, this is a bit of an ugly tuning. When the users decided to accept the possibility of OOMs, they knew about the consequences; trying to mitigate those consequences now seems to me like trying to patch something "broken" on purpose. At the same time, I understand your concerns and see what you are trying to do, so the two proposed options give you a way to do it.
(In reply to Felip Moll from comment #48)

Hi Jackie, any thoughts about this?

Thanks,
Felip
I have not set anything yet. We could try the strigger setting for only the ALICE nodes; if we can count the number of idle nodes when the trigger happens, then we can set the appropriate MaxJobsPerAccount value. How does the trigger get cleared once the node is back online? Is that also in the script? I've never used striggers before, so I'm just checking.

Thanks

Jackie Scoggins
(In reply to Jacqueline Scoggins from comment #50)
> I have not set anything yet. We could try the strigger setting for only the
> ALICE nodes; if we can count the number of idle nodes when the trigger
> happens, then we can set the appropriate MaxJobsPerAccount value. How does
> the trigger get cleared once the node is back online? Is that also in the
> script? I've never used striggers before, so I'm just checking.
>
> Thanks
>
> Jackie Scoggins

Jackie,

The triggers are executed when events occur. For example, some options of interest to you (read man strigger for more info):

-u, --up
    Trigger an event if the specified node is returned to service from a DOWN state.

-d, --down
    Trigger an event if the specified node goes into a DOWN state.

-D, --drained
    Trigger an event if the specified node goes into a DRAINED state.

-n, --node[=host]
    Host name(s) of interest.

--flags=PERM
    Make the trigger permanent. Do not purge it after the event occurs.

-p, --program=path
    Execute the program at the specified fully qualified pathname when the event occurs.

SYNOPSIS
    strigger --set [OPTIONS...]
    strigger --get [OPTIONS...]
    strigger --clear [OPTIONS...]

The trigger program must set a new trigger before the end of the next interval to ensure that no trigger events are missed, OR the trigger must be created with "--flags=PERM".

This command can only set triggers if run by the user SlurmUser. This is required for the slurmctld daemon to set the appropriate user and group IDs for the executed program.

I think this can correctly accomplish what you want to do.
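As a rough illustration of how the strigger approach could be wired up (every name, path, and figure below is a hypothetical assumption, to be adapted to the real QOS and node list):

```shell
#!/bin/bash
# Sketch of a trigger program that rescales the per-account job limit to the
# number of usable nodes. It would be installed (as SlurmUser) with e.g.:
#   strigger --set --down --node=n[0001-0004] --flags=PERM \
#            --program=/usr/local/sbin/rescale_maxjobs
# plus a matching --up trigger pointing at the same script.

JOBS_PER_NODE=41   # 164 jobs spread evenly over 4 nodes

# Print the new MaxJobsPerAccount for a given count of usable nodes.
compute_limit() {
    echo $(( $1 * JOBS_PER_NODE ))
}

# In production the node count would come from Slurm, e.g.:
#   up=$(sinfo -h -N -n n[0001-0004] -t idle,alloc,mix | wc -l)
# and the new limit would be applied with something like:
#   sacctmgr -i modify qos alice_qos set MaxJobsPerAccount=$(compute_limit "$up")
compute_limit 3    # one node down: prints 123
```

Because the trigger is created with --flags=PERM, it does not need to re-arm itself after each event; otherwise the script would have to call strigger --set again before the next interval.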
Thanks, I'll take a look.
(In reply to Jacqueline Scoggins from comment #52)
> Thanks, I'll take a look.

Hi Jacqueline,

Have all your questions been answered? May I close this bug?
No, I want to keep this open because we are not able to apply the changes you are suggesting in this environment. I would like to talk to someone at SchedMD live, via a Zoom conference or simply a phone call. We have a customer who is really displeased with the functionality of the product. Their collaborating site, ORNL, is also not pleased. Is there some way we can get a conference call set up soon to discuss this issue?

I would really appreciate it if this could be escalated.

Thanks

Jackie Scoggins
(In reply to Jacqueline Scoggins from comment #54)
> No, I want to keep this open because we are not able to apply the changes
> you are suggesting in this environment. I would like to talk to someone at
> SchedMD live, via a Zoom conference or simply a phone call. We have a
> customer who is really displeased with the functionality of the product.
> Their collaborating site, ORNL, is also not pleased. Is there some way we
> can get a conference call set up soon to discuss this issue?
>
> I would really appreciate it if this could be escalated.
>
> Thanks
>
> Jackie Scoggins

Hi Jacqueline,

I am a bit surprised that after ~20 days you haven't told me you cannot apply these changes. Please tell me exactly what issues you had and I will try to help you as best I can. I'd also like to know exactly why your customer is so displeased, and also ORNL.

I will escalate your issue and give you a response soon.
Hello Felip,

I just got this information from the customer when our team went out to visit this past week; it isn't that I waited, I'm just reporting the outcome of an onsite visit.

The project they are working on auto-generates its job scripts. They dispatch several jobs at a time, each requesting only 1 node, and expect the jobs to be spread over their nodes evenly, with control only over the number of jobs to run per node. They are adding 40 more nodes to their cluster with different CPU/memory configurations than the existing ones, but they don't want an additional QOS or partition to be passed in their job script. Since we have a complex scheduler configuration, making the changes to their one cluster could potentially cause issues for other customers. We need to set up a meeting with your group for additional advice to come up with a solution for their setup. If you have time to talk today or tomorrow, let's schedule something.

Thanks

Jackie Scoggins

[Felip Moll changed bug 5467: CC added jbooth@schedmd.com]
Greetings Jackie Scoggins,

> They are adding 40 more nodes to their cluster with different CPU/memory configurations than the existing ones, but they don't want an additional QOS or partition to be passed in their job script.

Slurm cannot guess the SelectTypeParameters for the user. It can be configured on the partition as Felip has outlined; however, there are other limitations with regard to CR_Core (SelectTypeParameters at the partition level only works with CR_Core_* or CR_Socket_* set globally; see comment #6). Additionally, a user must tell the scheduler which partition they need so that the correct SelectTypeParameters can be used. If users do not wish to make these changes, then there is simply no way for Slurm to accommodate you. You might be able to make use of the job_submit plugin, but this is something you will need to look into and write.

> We need to set up a meeting with your group for additional advice to come up with a solution for their setup.

We do not offer telephone support, so you should direct the problems you run into through the bug system. Felip has offered several updates on how you can configure your system since the end of July. Although we offer configuration suggestions, it is not up to SchedMD support to configure and manage your system; this is left up to the site admin (you) to make these changes and understand them. Felip has done a great job of working with the changing requirements you have sent him, despite the inaction on your part to implement them. If you wish to continue to work through this bug, you will need to provide meaningful updates that we can work with, such as any issues you run into with the suggestions Felip has offered. If you are unwilling to make these changes, then we will proceed to close out this issue.

Thanks,
-Jason
Director of Support
Hello Jason,

Thank you for the update. I want to say that I never implied that Felip had not provided us with good support; I think he has done a great job, and I was not making any complaints about his service. You can go ahead and close this case; for the reasons you mentioned, the problem is solved as far as SchedMD is concerned.

I was not aware that you no longer do phone support. How does one get advanced support via a live person to help figure out the best solution for the customer regarding your product? I don't feel that a ticketing system is getting the point across or allowing us to really explain our situation. If at all possible, we would like to talk with an engineer to work through the user's issues; I will do the work, I just need some guidance from SchedMD. Please advise.

Thanks

Jackie
Hi Jackie,

> I wanted to say that I was not aware that you no longer do phone support.

SchedMD has never offered phone support, so I am not sure why you would have this impression. We have always used the bug system as the primary way to communicate information and to fix issues.

> How does one get advanced support via a live person to help figure out the best solution for the customer regarding your product?

We offer direct developer access via the ticketing system. What you have mentioned above is consulting, and I can have Jacob Jenson reach out to you with more details if you want to pursue such an engagement.

> I don't feel that a ticketing system is getting the point across or allowing us to really explain our situation. If at all possible, we would like to talk with an engineer to work through the user's issues; I will do the work, I just need some guidance from SchedMD.

I understand that you wish to explain the situation over the phone, but we kindly ask that you do so via the ticket so that the information is not lost or forgotten.

Kind regards,
Jason
Thanks, and yes, a follow-up with Jacob would be great. Is there a fee for this service? By the way, in the past I've spoken directly with Danny and/or Moe when we had issues, but I do understand your policy.

You can close this ticket now; I'll reopen it when needed.

Thanks

Jackie Scoggins
Jacqueline,

As per the comments above, I am closing the bug. I am sorry about this situation and the misunderstandings. Please reopen this bug, or create a new one, whenever you have specific, concrete problems with some Slurm component; I will be more than glad to help you in whatever way I can.

Danny and Moe used to do this kind of support in the past, but the situation changed and they could no longer manage the volume of bugs, so the rule became to always use Bugzilla. One advantage is having everything recorded here; another is being able to spread the work among all of us; and the last is giving us room to analyze properly and respond with more accuracy than in a spontaneous conference call.

If you want to contact Jacob directly, his e-mail is publicly available on this site (jacob@schedmd.com), or you can open a new bug and ask for this kind of consulting.

Thanks, and sorry again for the misunderstanding.

Best regards,
Felip M
Thank you Felip.

Thanks

Jackie Scoggins

[Felip Moll changed bug 5467: Resolution set to INFOGIVEN; Status changed from UNCONFIRMED to RESOLVED; Severity 3 - Medium removed; Impact set to 4 - Minor Issue.]