Ticket 5467

Summary: overwrite cgroup limits on memory
Product: Slurm Reporter: Wei Feinstein <wfeinstein>
Component: Configuration    Assignee: Felip Moll <felip.moll>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue    
Priority: --- CC: jbooth, kmfernsler
Version: 17.11.3   
Hardware: Linux   
OS: Linux   
Site: LBNL - Lawrence Berkeley National Laboratory
Attachments: attachment-20858-0.html
slurm.conf

Description Wei Feinstein 2018-07-23 14:46:20 MDT
Is there a way of overriding the cgroup limits on a per-partition basis?  We have a number of partitions and QOSes set per department, all sharing one global slurm.conf file.  I want to allow one of the groups/departments to run on their nodes bypassing the cgroup limit, so they can use their resources without limitations - they want to oversubscribe memory.

scontrol show config | egrep -i "cgroup|params|sched"
SelectTypeParameters    = CR_CPU_MEMORY
ProctrackType           = proctrack/cgroup
TaskPlugin              = task/cgroup 

I can include the slurm.conf file if it is needed.
Comment 1 Felip Moll 2018-07-24 05:13:38 MDT
(In reply to Jacqueline Scoggins from comment #0)
> Is there a way of overriding the limits in cgroup on a per-partition basis?  We
> have a number of partitions and qos's set per department and globally
> sharing one slurm.conf file.  I want to allow for one of the
> groups/departments to be able to run on their nodes bypassing the cgroup
> limit to allow them to use their resources without limitations - they want
> to oversubscribe memory.
> 
> scontrol show config | egrep -i "cgroup|params|sched"
> SelectTypeParameters    = CR_CPU_MEMORY
> ProctrackType           = proctrack/cgroup
> TaskPlugin              = task/cgroup 
> 
> I can include the slurm.conf file if it is needed.

Hi Jacqueline,

You may try setting Oversubscribe, SelectTypeParameters, and DefMemPerCPU on the partition in question, i.e.:

PartitionName=xxx Nodes=xxxx Default=NO Oversubscribe=FORCE SelectTypeParameters=CR_CORE DefMemPerCPU=0

In that case, for each job, the cgroup limit for memory will be set to the maximum of the node.

Let me know whether it works for you.
Comment 2 Felip Moll 2018-07-24 05:36:30 MDT
> PartitionName=xxx Nodes=xxxx Default=NO Oversubscribe=FORCE
> SelectTypeParameters=CR_CORE DefMemPerCPU=0

Note that the FORCE setting also allows cores to be oversubscribed; I don't know whether you want that as well.
Comment 3 Wei Feinstein 2018-07-24 10:30:12 MDT
We tried the partition-level setting yesterday and it brought slurmd down on the nodes. I'll send you the message from a node when I get to my computer.

Thanks

Jackie Scoggins

Comment 4 Wei Feinstein 2018-07-24 11:04:51 MDT
I set the parameters you requested; now I am seeing the following message in the slurmctld log file -

[2018-07-24T10:01:06.579] cons_res: Can't use Partition SelectType unless using CR_Socket or CR_Core
[2018-07-24T10:01:06.579] cons_res: Can't use Partition SelectType unless using CR_Socket or CR_Core

Does this mean that the SelectType global variable needs to be changed?

Current value - 
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
PartitionName=alice        Nodes=n000[0-3].alice[0]                    Shared=Yes          SelectTypeParameters=CR_CORE DefMemPerCPU=0 DefMemPerNode=260000 OverSubscribe=FORCE


All of the other partitions are set up without the SelectTypeParameters setting, and they should be assuming the global one, correct?
Comment 5 Wei Feinstein 2018-07-24 16:19:06 MDT
We don't want to oversubscribe the cores, but the memory. So should we use CR_CPU or CR_Memory instead for the SelectTypeParameters in the partition settings?
Comment 6 Felip Moll 2018-07-25 03:59:37 MDT
Hi, sorry, I was checking before giving you a response.

> Does this mean that the SelectType global variable need to be changed?

Yes, SelectTypeParameters at the partition level only works with CR_Core_* or CR_Socket_* set globally.

Are you using hyperthreading on the nodes?

If not, it should be safe to move to CR_Core_*.

If you have it enabled, you may still want CR_Core_* if you want each job bound to a full core rather than to individual hyper-threads.

Do you have any particular reason to use CR_CPU_*?


> All of the other partitions are setup without the SelectTypeParameter
> setting and they should be assuming the global one correct?

Yes. They are assuming the global value.

> We dont want to oversubscribe the cores but the memory.
> So should we use CR_CPU or CR_Memory instead for the SelectTypeParameter in the partition settings.

You should use:

PartitionName=xxx Nodes=xxxx ... SelectTypeParameters=CR_Core DefMemPerCPU=0

This way only cores will be constrained for this partition, but not memory. Doing this you are "removing" the *_Memory part that is set in the global value (SelectTypeParameters=CR_Core_Memory), which means memory is no longer controlled for this partition.
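Put together, a minimal sketch of the combination described here (partition and node names are placeholders, not the site's actual configuration):

```ini
# Global: schedule by core AND track/constrain memory cluster-wide
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# Partition override: cores only -- dropping the *_Memory part means
# memory is no longer constrained for jobs in this partition;
# DefMemPerCPU=0 removes the default per-CPU memory limit
PartitionName=dept_a Nodes=node[00-03] SelectTypeParameters=CR_Core DefMemPerCPU=0
```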
Comment 7 Jason Booth 2018-07-27 10:33:14 MDT
Hi Jacque,

 I hope you are doing well. I wanted to follow up with you on this ticket to see if you need any further clarification about what Felip has proposed.

In his last response, he mentioned that you would need to change the global "SelectTypeParameters=CR_CPU_Memory" to one of the "CR_Core_*" variants for this to work properly and override DefMemPerCPU on the partition.

For example:

SelectTypeParameters=CR_Core_Memory
PartitionName=alice        Nodes=n000[0-3].alice[0]                    Shared=Yes          SelectTypeParameters=CR_CORE DefMemPerCPU=0 

We are also curious to know if you are using hyperthreading on the nodes.

Best regards,
Jason
Comment 8 Wei Feinstein 2018-07-27 17:07:28 MDT
Jason

We tried the parameter changes and they did not work as expected. Do you have
time to talk now?  I have a follow-up meeting with the user on Monday, and
I want to get it squared away before then.

Thanks

Jackie Scoggins

Comment 9 Felip Moll 2018-07-30 00:03:14 MDT
(In reply to Jacqueline Scoggins from comment #8)
> Jason
> 
> We tried the parameter changes and it did not work as expected. Do you have
> time to talk now?  I have a follow up meeting with the user on Monday. And
> I want to get it squared up before then.
> 
> Thanks
> 
> Jackie Scoggins

Hi Jacqueline, 

Can you please tell me why exactly it didn't work?

Changing the global to CR_Core_Memory + the partition to SelectTypeParameters=CR_Core DefMemPerCPU=0 should work, I tested it in my environment before answering you and I had no problems.

Thanks
Felip
Comment 10 Wei Feinstein 2018-07-30 00:12:28 MDT
The jobs are getting queued even though we believe there are enough resources to run more jobs if oversubscription is set. We ran a test right afterwards and it did not behave as we expected. I'll send you some stats tomorrow if they're still on my computer. I have a follow-up meeting with the user tomorrow and I can provide more examples.

Thanks

Jackie Scoggins

Comment 11 Wei Feinstein 2018-07-30 00:14:46 MDT
One additional issue is that changing the global parameter could affect our other customers' configurations. This is a single Slurm configuration managing about 10+ clusters. That's why I'm trying to do it only at the partition level.

Thanks

Jackie Scoggins

Comment 12 Felip Moll 2018-07-30 00:20:05 MDT
(In reply to Jacqueline Scoggins from comment #11)
> One additional issue is changing the global parameter could affect or not
> affect our other customers configuration. This is a single slurm
> configuration managing about 10+ clusters. That’s why I’m trying to do it
> only at the partition level.
> 
> Thanks
> 
> Jackie Scoggins

This was the reason I asked whether you were using hyperthreading or not. If not, or if you are scheduling at the core level, this shouldn't change how things work.

Can you attach your current slurm.conf (I have an old one) and I'll take a look too?
Comment 13 Wei Feinstein 2018-07-30 04:53:11 MDT
Created attachment 7451 [details]
attachment-20858-0.html

Here’s the slurm.conf file.
Comment 14 Wei Feinstein 2018-07-30 04:53:12 MDT
Created attachment 7452 [details]
slurm.conf
Comment 16 Wei Feinstein 2018-07-30 09:49:19 MDT
I am not 100% sure, but I think it is working as expected. The user changed his
mem size from 6G to 4G per job and now he is seeing 128G of memory allocated. I
will check in with him today to see if everything is working as expected.

Thanks

Jackie

Comment 17 Felip Moll 2018-07-30 10:29:48 MDT
(In reply to Jacqueline Scoggins from comment #16)
> I am not 100% but I think it is working as expected. The user changed his
> mem size from 6G to 4G per job and now he is seeing 128G of mem alloc. I
> will check in with him today to see if everything is working as expected.
> 
> Thanks
> 
> Jackie

With DefMemPerCPU=0 this is expected; it just changes the amount of memory requested by the user to "infinite". Without changing the global to CR_Core_Memory and the partition SelectTypeParameters to CR_Core, the maximum memory that can be used at a time on the node is 128G, so no memory overcommit can happen.

Changing global from CR_CPU_Memory to CR_Core_Memory shouldn't make any noticeable difference for you and is needed for changing the select type partition parameter.

Before telling you to change it, I will test it twice and make sure everything works as expected.
Comment 18 Wei Feinstein 2018-07-30 15:50:57 MDT
I spoke too soon.  Here is what the user wants, and he is not getting it.

He wants the number of jobs scheduled on the system to be 75% of the number of cores (28 cores, or 56 with HT), so he's looking for 42 jobs to run.  He is requesting the jobs in his script as follows -

#SBATCH --qos=alice_normal --partition=alice --ntasks=1 --cpus-per-task=1 --mem=4000M --time=48:00:00 --account=alice

The qos alice_normal -

Name|Priority|GraceTime|Preempt|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|

alice_normal|0|00:00:00||cluster|||1.000000|||||||||||||||140|||


The partition alice -
PartitionName=alice   Nodes=n000[0-3].alice[0]   Shared=Yes   SelectTypeParameters=CR_CORE DefMemPerCPU=0 OverSubscribe=FORCE:56


PartitionName=alice
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=n000[0-3].alice[0]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:56
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=224 TotalNodes=4 SelectTypeParameters=CR_CORE
   DefMemPerCPU=UNLIMITED MaxMemPerNode=UNLIMITED


node information


sinfo -leN --partition=alice
Mon Jul 30 14:44:45 2018

NODELIST      NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
n0000.alice0      1     alice       mixed   56   2:28:1 128824   503836      1    alice  none
n0001.alice0      1     alice       mixed   56   2:28:1 128824   503836      1    alice  none
n0002.alice0      1     alice       mixed   56   2:28:1 128824   503836      1    alice  none
n0003.alice0      1     alice       mixed   56   2:28:1 128824   503836      1    alice  none

They also want to make sure that if a job requests 6GB, it does not run over 6GB, and is killed if it does.

If you have time to talk this over please let me know. We can setup a zoom
conference and I can share my screen with you.

Thanks

Jackie


Comment 19 Felip Moll 2018-07-31 09:26:37 MDT
(In reply to Jacqueline Scoggins from comment #18)
> I spoke too soon.  Here is what the user wants and he is not getting it.
> 
> He wants the number of jobs scheduled to the system to be 75% of the
>  number of cores (28 cores with HT its 56) so he's looking for 42 jobs to
> run.  He is requesting the jobs in his script as follows -

Judging from your slurm.conf file, you *are not* using hyper-threading.

NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 Sockets=2 CoresPerSocket=28 Feature=alice Weight=1   # C6320 28 cores  128G RAM

This seems to be a Dell PowerEdge C6320 with 2 sockets and 28 cores per socket, therefore you have 56 cores. Am I right?

> The partition alice -
> PartitionName=alice Nodes=n000[0-3].alice[0]  Shared=Yes SelectTypeParameters=CR_CORE DefMemPerCPU=0 OverSubscribe=FORCE:56

In your configuration, FORCE:56 means one core can run up to 56 jobs, so you could theoretically have 56*56 = 3136 jobs on a node.
That maximum doesn't make much sense here.
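That ceiling falls out of simple arithmetic (a sketch; both 56s come from the node and FORCE values above):

```shell
# OverSubscribe=FORCE:56 permits up to 56 jobs per core;
# with 56 cores per node, the theoretical per-node job ceiling is:
cores_per_node=56
jobs_per_core=56
echo $(( cores_per_node * jobs_per_core ))
```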


> NODELIST      NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> n0000.alice0      1     alice       mixed   56   2:28:1 128824   503836      1    alice  none
> n0001.alice0      1     alice       mixed   56   2:28:1 128824   503836      1    alice  none
> n0002.alice0      1     alice       mixed   56   2:28:1 128824   503836      1    alice  none
> n0003.alice0      1     alice       mixed   56   2:28:1 128824   503836      1    alice  none

This listing shows that the nodes have only 1 thread per core (S:C:T), so no HyperThreading is enabled.


> They also want to make sure that if they request 6GB the job does not run
> over 6GB and will be killed.

I am confused; in the initial comment and in comment 5 you say you want to oversubscribe memory.

Let me understand:

What the user wants is to launch 42 jobs, each one on a single core, and each job to have the possibility to reach a 6GB RAM limit, maybe exceeding the total system memory.
Is that it?
Comment 20 Wei Feinstein 2018-07-31 13:00:55 MDT
Hello,

> What the user wants is to launch 42 jobs, each one on a single core, and each
> job to have the possibility to reach a 6GB RAM limit, maybe exceeding the total
> system memory.
> Is that it?

Yes.

Here is what I have done so far -

OK, after reviewing the setup and seeing that HT was not properly configured on the nodes, I have fixed it.  I have done the following -
SelectTypeParameters is set to CR_CPU_MEMORY because when the user was requesting --cpus-per-tasks=1 I saw that he was being allocated 2 CPUs instead of 1.  With CR_CPU_MEMORY the jobs are now being allocated 1 CPU.  If I have it set to CR_CORE_MEMORY, would you recommend they use --cores-per-tasks instead?

I added ThreadsPerCore=2 Sockets=2 CoresPerSocket=14, which gives me a total of 56 CPUs.  I also had to add LLN because I noticed that all the jobs were being stacked on a single node first, then the next node, and they don't want it to behave that way.

i.e.
NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 ThreadsPerCore=2 CoresPerSocket=14 Sockets=2 Feature=alice Weight=1
PartitionName=alice   Nodes=n000[0-3].alice[0]   Shared=Yes   SelectTypeParameters=CR_CORE DefMemPerCPU=0 OverSubscribe=FORCE LLN=Yes

I am watching the jobs in the queue and the system using the following commands:

squeue --user=rjporter --state=R -o "%T|%h| %r |%A | %C| %u| %j| %m| %Y|%N"

sinfo -lNe | grep alice0
scontrol show node n000[0-3].alice[0] | egrep "NodeAddr|Mem"

And I'm watching the slurmctld log file to see when a job is started.


Is there anything else you would recommend I run to verify that it is behaving as expected - that all cores will be allocated jobs, with at least 42+ jobs expected to run per node?

Thanks


Jackie

Comment 21 Felip Moll 2018-08-01 10:49:30 MDT
Jacque,

Now you have configured HyperThreading, and thus using CR_CPU_Memory makes your initial request impossible.

I am currently looking at alternatives for you.

One of them is to make each "alice" node look like it has more memory than it actually does, so your user will be able to submit more jobs to the nodes.

Note that in any case this can result in OOM kills, but it allows some amount of extra memory for scheduling.


Is this user aware that he can receive OOMs?
Comment 22 Felip Moll 2018-08-01 10:58:56 MDT
> when the user was requesting --cpus-per-tasks=1  I saw that he was being allocated 2 CPU's
> instead of 1.  Changing it to CR_CPU_MEMORY the jobs are now being
> allocated 1 CPU.

This is working as expected.

With CR_CPU_* each task is bound to a "hyper-thread".
With CR_Core_* each task is bound to a physical core.

Depending on the application, it may be preferable to schedule to cores instead of to hyper-threads.
Are you sure you really want to bind processes to hyper-threads and not to physical cores?

One hyper-thread shares resources with the others.

> If I have it set to CR_CORE_MEMORY would you recommend
> they use --cores-per-tasks instead?

In that case, --cores-per-tasks=1 will bind 1 task to 1 core. The granularity will be the core, not the hyper-thread.
 

> I also had to add LLN because I noticed that all the jobs
> were being stacked on a single node first and then the next node and they
> don't want it to behave that way.

That's fine.

> i.e.
> NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 ThreadsPerCore=2
> CoresPerSocket=14 Sockets=2 Feature=alice Weight=1
> PartitionName=alice   Nodes=n000[0-3].alice[0]  Shared=Yes
> SelectTypeParameters=CR_CORE DefMemPerCPU=0 OverSubscribe=FORCE LLN=Yes

What's not good here is SelectTypeParameters=CR_CORE at the partition level when you have SelectTypeParameters=CR_CPU_Memory at the global level. This won't work.


> Is there anything else you would recommend I run to verify that it is
> behaving as expected - all Cores will be allocated a job expecting at least
> 42+ jobs to run per node?

With your setup, memory will still be constrained and there will be no possibility to overcommit.

I.e., if one job requests 6GB and 1 CPU, you will not be able to run more than 21 jobs (128GB RAM / 6GB per job).
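As a sanity check of that 21-job figure (integer division of the node's ~128GB of RAM by the 6GB request; a sketch, not site-specific tooling):

```shell
# 128 GB of node RAM divided by 6 GB per job, rounded down:
node_mem_gb=128
mem_per_job_gb=6
echo $(( node_mem_gb / mem_per_job_gb ))
```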


See my previous comment for more info. I am looking at alternatives for you.
Comment 24 Felip Moll 2018-08-01 12:15:51 MDT
Jacque,

I'm going to suggest that you try:
---------------------
FastSchedule=2

SelectTypeParameters=CR_CPU_Memory
NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 Sockets=2 CoresPerSocket=28 Feature=alice Weight=1 RealMemory=258048 # C6320 28 cores 128G RAM
PartitionName=alice Nodes=n000[0-3].alice[0] DefMemPerCPU=2000
---------------------

Explanation:

FastSchedule is currently set to 0. This indicates that each node reports its memory to slurmctld, and that is what is used to define the maximum memory available on the node. If slurm.conf has a node definition with more memory than the real memory, the node is *not* set to drain. Are you sure you want FastSchedule to be 0 and not 1?

For your situation, where you want to oversubscribe memory, FastSchedule must be set to 2. This implies that CPUs and Memory must be correctly set in slurm.conf for each node. '2' allows you to define more memory than is actually available on the node; e.g. my laptop has 8G of real memory, but with FastSchedule=2 I faked it to have 256GB.

Thanks to this parameter, and to the NodeName definition where I fake RealMemory to be 42*6GB (258048 MB), you will be able to oversubscribe memory on these nodes while still constraining memory for jobs.
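The RealMemory figure can be reproduced with a quick calculation (42 jobs times 6GB each, expressed in MB as slurm.conf expects; a sketch of the sizing, not an official formula):

```shell
# 42 concurrent jobs x 6 GB each, in MB (1 GB = 1024 MB):
jobs=42
mem_per_job_gb=6
echo $(( jobs * mem_per_job_gb * 1024 ))
```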

I also removed the Shared=Yes parameter; there is no need for it if you don't want to oversubscribe cores (moreover, it is deprecated in favor of OverSubscribe=FORCE).


Let me know whether this works for you.
Comment 25 Wei Feinstein 2018-08-01 12:41:26 MDT
(In reply to Felip Moll from comment #22)

> Are you sure you really want to bind processes to hyper-threads and not to
> physical cores?

We want to bind to cores and not the hyper-threads, so I will change it back to CR_Core. That means the user needs to request --cores-per-tasks instead of --cpus-per-tasks.

> What's not good here is SelectTypeParameters=CR_CORE at a partition level when
> you have SelectTypeParameters=CR_CPU_Memory at a global level. This won't work.

Will keep this once the global changes are made.

> With your setup, memory will still be constrained and there will not be
> possibility to overcommit.

Please advise how to make it so that memory is not constrained. They want to use their cluster to the fullest and not have memory constrain the use of the cores. If a job uses more than 6GB they want it to be killed, and they are OK with that. They are submitting 2-day jobs via a program that just keeps spawning them. They are all 1 CPU, 1 task, with 6GB of RAM requested, but they vary in what they actually do. Most of them run under 4GB and some maybe above, but they don't know which will run in which manner, so they set all of them to the highest memory they expect the job to use. If it goes over, they want it to die.

Any help here would be greatly appreciated.
Comment 26 Wei Feinstein 2018-08-01 15:21:06 MDT
The only concern I have with setting FastSchedule to 2 is that this is not the only cluster we have under this slurm.conf, and I don't want to impact the entire set of clusters with this change. The other clusters do want their memory constrained, and we're using cgroups as intended for the rest of them. What effect will this global setting have on the other clusters, where users don't request memory for their jobs and some nodes are exclusive and some are shared? Have you reviewed the slurm.conf file to verify that this setting will not impact other clusters?

Please advise before I make this change.

Jackie

On Wed, Aug 1, 2018 at 11:15 AM, <bugs@schedmd.com> wrote:

> *Comment # 24 <https://bugs.schedmd.com/show_bug.cgi?id=5467#c24> on bug
> 5467 <https://bugs.schedmd.com/show_bug.cgi?id=5467> from Felip Moll
> <felip.moll@schedmd.com> *
>
> Jacque,
>
> I'm gonna suggest you to try:
> ---------------------
> FastSchedule=2
>
> SelectTypeParameters=CR_CPU_Memory
> NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 Sockets=2
> CoresPerSocket=28 Feature=alice Weight=1 RealMemory=258048 # C6320 28 cores
> 128G RAM
> PartitionName=alice Nodes=n000[0-3].alice[0] DefMemPerCPU=2000
> ---------------------
>
> Explanation:
>
> FastSchedule is now set to 0. This indicates that each node reports its memory
> to slurmctld, and that's what is used to define the maximum memory available to
> the node. If slurm.conf has a node definition with > memory than the real
> memory, the node is *not* set to drain. Are you sure you want FastSchedule to
> be 0 and not 1?
>
> For your situation where you want to oversubscribe memory, FastSchedule must be
> set to 2. This implies that slurm.conf CPUs and Memory must be correctly set
> for each node. '2' allows you to define more memory than what's currently
> available in the node. i.e. my laptop has 8G real memory, but with
> FastSchedule=2 I faked it to have 256GB.
>
> Thanks to this parameter, and to the NodeName definition, where I fake the
> RealMemory to be 42*6GB (258048), you will be able to oversubscribe memory on
> these node plus constraining memory to jobs.
>
> I also removed Shared=Yes parameter, no need for it if you don't want to
> oversubscribe cores (moreover it is deprecated in favor of
> Oversubscribe=FORCE).
>
>
> Tell me if it does work for you.
>
Comment 27 Jason Booth 2018-08-01 16:25:22 MDT
Hi Jacque,

 Jacob mentioned that you have asked for a phone call about this issue. We work exclusively through Bugzilla, so, unfortunately, this will not be possible. I have reviewed your ticket, "5467", and I see that Felip has given valid responses to your requests. Please note that the requirements you have specified have changed throughout the ticket, such as in comments #18 and #20, which has added to the time it takes to respond with meaningful information. Also, the request is not clear to us and has the potential to have serious repercussions for those nodes. We understand the user wants to oversubscribe; however, it is not clear why they wish to do this.

1) Based on the information you have given, do you want to keep using cgroups yet disregard the memory limit enforced by the cgroup?
2) Are the user's tasks ballooning in size to consume all the RAM and then decreasing in size over the lifecycle of the job?
3) Does the user just want to use the entire node (exclusively)?
4) What type of jobs are these (Matlab)?

Please ask the user why they want to do this, because if two jobs land on a node and each consumes all the memory, the OOM killer will be invoked, which is never good, as it will stomp on other processes.

In answer to your last question about "FastSchedule 2": this should not cause any issues, since this option looks at slurm.conf and honors the configured node attributes over the detected ones.

"Consider the configuration of each node to be that specified in the slurm.conf configuration file and any node with less than the configured resources will not be set DRAIN."

Regarding your other concerns about setting and unsetting these parameters in the conf files: have you considered trying them out in a test cluster before deploying them to production?

Best regards,
Jason
Director of Support
Comment 28 Wei Feinstein 2018-08-01 17:22:01 MDT
This is why email is not the most efficient way to communicate.

We have about 13 clusters configured in our slurm.conf file, with multiple
configurations per cluster because they serve different users/departments.
Cgroups have been set up on all of the clusters, and we want to keep using
them on all of them except this one cluster (alice).  Below is what the
user requested for his cluster, and what I don't want is for changes made
just for this partition to affect the global configuration we have for all
of the other 12 clusters.

"Hey Karen & John,

I mentioned briefly to John & Gary that I'd like to optimize the number of
concurrent jobs run on the cluster.  Using hyper threaded slots, we found
we can get about 30% boost by doubling the # of slots relative to cores,
but then we also hit occasional memory problems.  So what we do at ORNL is
more cautiously push hyperthreaded slots.  I'd like to do the same at the
HPCS cluster.  Here are the conditions:

1. ALICE requires sites provide 2GB/job slot plus swap (~2-3GB/slot).
2. (Most) ALICE sites do not impose a memory limit request per job.
3. The average ALICE job, weighted by length (e.g. ignoring all the little
jobs), is ~2.1GB/job
4. That average includes a long tail out to ~6 or 7 GBs
5. Internally, each ALICE jobagent monitors its payload and kills the job
if it uses 8GB of mem.
6. Sites that implement #1 & #2 report on occasion (few times per year)
that nodes need to be rebooted because of memory exhaustion.

What we do at ORNL is implement #1 and #2, but at 3GB/slot.   The ORNL folks
report that they have not had to reboot a node due to memory exhaustion, but
in checking the logs they find a handful of single-node OOM events per
month that go into swap but are able to self-recover within a couple of
hours.  This is a tiny loss of processing and does not impact cluster
operations.

To implement this on HPCS, we should first allow for 35 slots/node = 28 x
4GB/3GB.  Then either: do not have memory as a consumable resource or tell
the scheduler that it has more memory than physical RAM. e.g. say we define
the alice job request at 6.5GB and then tell the scheduler it has 228GB of
RAM.  Then the scheduler will allow 35 such jobs on each node.

What do you think?

thanks,

Jeff"

Keep in mind, from looking at the slurm.conf file, that we have multiple
partitions and we use QOSs to control and manage limits.  We have cgroups
set up on all clusters, and this user wants to use his cluster differently
from the global configuration settings we have for the other cluster users.
And by the way, if I had a test cluster I would definitely be trying these
changes there; at this time there is not one set up.  I hope this clears up
the request from the user's email to my team members.

Thanks

Jackie



Comment 31 Felip Moll 2018-08-02 04:57:49 MDT
Jacqueline,

It is a bit clearer now, but I still don't fully understand some parts, like:

> we should first allow for 35 slots/node = 28 x 4GB/3GB

Or this:
> ALICE sites do not impose a memory limit request per job. <--- I guess this is Slurm job request limit?
+
> we define the alice job request at 6.5GB  <--- I guess this is internal ALICE limit?

-----------

In any case, the easiest and most flexible way to implement what you want is to set:

FastSchedule=2
SelectTypeParameters=CR_CPU_Memory
NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 Feature=alice Weight=1 RealMemory=258048 # C6320 28 cores 128G RAM, faked RealMemory
PartitionName=alice Nodes=n000[0-3].alice[0] DefMemPerCPU=3000

(Adjust RealMemory to NumJobsYouWant*SizePerJobYouWant, i.e. 35 jobs x 6.5GB/job = 232960)
(Adjust DefMemPerCPU to be less than or equal to SizePerJobYouWant, i.e. 6.5GB = 6656)
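The arithmetic above can be sketched in a few lines of shell (the values are the ones discussed in this comment; adjust the job count and size for your actual workload):

```shell
# Sketch of the RealMemory / DefMemPerCPU arithmetic described above.
# 6.5 GB per job expressed in MB: 6.5 * 1024 = 6656.
jobs_per_node=35
mb_per_job=6656

# Faked RealMemory = number of jobs you want per node * memory per job.
faked_realmemory=$((jobs_per_node * mb_per_job))

echo "RealMemory=${faked_realmemory}"   # -> RealMemory=232960
echo "DefMemPerCPU=${mb_per_job}"       # -> DefMemPerCPU=6656
```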


*BE AWARE*: Setting FastSchedule=2 implies that your node definitions must be correctly and carefully specified in slurm.conf. Information gathered automatically from slurmd when the node registers will *not* be honored. e.g. if you don't define RealMemory for a node, after startup that node will have RealMemory=1. Any parameter not defined in the NodeName definition in slurm.conf will get the default = 1.

So your work here would be to check the RealMemory/Sockets/CoresPerSocket/ThreadsPerCore/CPUs for all nodes to be correctly set, before applying the FastSchedule change.

Then apply the proposed change.

Restart slurmctld.
Comment 32 Wei Feinstein 2018-08-02 05:49:56 MDT
OK, I will look at this today and see if we want to make this change, which
will affect all of our node configuration settings.

Thanks

Jackie Scoggins

Comment 33 Felip Moll 2018-08-07 06:49:45 MDT
Hi Jacque,

Have you finally taken a decision on this matter?

Please, keep me informed.
Comment 34 Wei Feinstein 2018-08-09 19:03:40 MDT
We have not made a decision on this. My concern is the amount of change that needs to occur in the slurm.conf file just because we need to change FastSchedule from 0 to 2 and then add memory resource limits for each node.  We currently have 63 NodeName lines that would need to be updated, and I really don't want to do this just to allow one cluster (a set of 4 nodes, and 1 line of change for that cluster) to use the majority of its resources without limitations on memory or the number of cores.  I feel that we have a complex setup and that this change is more work than it should be.

Currently alice is set up as follows:


sinfo -leN --partition=alice
Thu Aug  9 17:54:38 2018
NODELIST      NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
n0000.alice0      1     alice   allocated   56   2:14:2 128824   503836      1    alice none                
n0001.alice0      1     alice   allocated   56   2:14:2 128824   503836      1    alice none                
n0002.alice0      1     alice   allocated   56   2:14:2 128824   503836      1    alice none                
n0003.alice0      1     alice   allocated   56   2:14:2 128824   503836      1    alice none                
[root]# grep -i alice /etc/slurm/slurm.conf
## ALICE nodes
NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 ThreadsPerCore=2 CoresPerSocket=14 Sockets=2 Feature=alice Weight=1   # C6320 28 cores  128G RAM
PartitionName=alice        Nodes=n000[0-3].alice[0]                    Shared=Yes          SelectTypeParameters=CR_CORE DefMemPerCPU=0 OverSubscribe=FORCE LLN=Yes


Number of running jobs per node
    32 n0000.alice0
    32 n0001.alice0
    32 n0002.alice0
    32 n0003.alice0

Number of pending jobs and reasons
     9 (Priority)
     1 (Resources)


Memory allocated per node
RealMemory=128824 AllocMem=128000 FreeMem=42331 Sockets=2 Boards=1

and the jobs pending are requesting the same parameters as the ones running
STATE|OVER_SUBSCRIBE| REASON |JOBID | CPUS| USER| NAME| MIN_MEMORY| SCHEDNODES|NODELIST
PENDING|YES| Priority |13838925 | 1| rjporter| AliEn.8300.155| 4000M| n0003.alice0|
PENDING|YES| Priority |13838920 | 1| rjporter| AliEn.8300.155| 4000M| n0000.alice0|
PENDING|YES| Priority |13838909 | 1| rjporter| AliEn.8300.155| 4000M| n0002.alice0|
PENDING|YES| Resources |13838903 | 1| rjporter| AliEn.8300.155| 4000M| n0001.alice0|
PENDING|YES| Priority |13838929 | 1| rjporter| AliEn.8300.155| 4000M| (null)|
PENDING|YES| Priority |13838933 | 1| rjporter| AliEn.8300.155| 4000M| (null)|
PENDING|YES| Priority |13839005 | 1| rjporter| AliEn.8300.155| 4000M| (null)|
PENDING|YES| Priority |13839658 | 1| rjporter| AliEn.8300.156| 4000M| (null)|
PENDING|YES| Priority |13839682 | 1| rjporter| AliEn.8300.156| 4000M| (null)|
PENDING|YES| Priority |13840022 | 1| rjporter| AliEn.8300.156| 4000M| (null)|


If there was another option that would be better I would love to hear it.
Comment 35 Felip Moll 2018-08-10 06:15:59 MDT
(In reply to Jacqueline Scoggins from comment #34)
> We have not taken a decision on this. My concern is the amount of changes
> that need to occur in the slurm.conf file just because we need to change
> FastSchedule  from 0 to 2 and then add for each node memory resource limits.
> We currently have 63 Nodename lines that now need to be updated and I really
> don't want to have to do this just to allow one cluster to use a majority of
> their resources without limitations on memory or the number of cores.

I talked with other engineers here.

This is the most reliable way to do so.

To get the RealMemory of the nodes you can just run:

$ scontrol show node | grep -i "nodename\|realmemory"

and you will have each node with its current RealMemory. This is the value that you must set in slurm.conf.
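To avoid hand-editing all 63 NodeName lines, the reported values can be scraped into a pasteable list. A minimal sketch (the `extract_realmemory` helper name is made up, and the field layout is assumed from the scontrol output shown elsewhere in this ticket):

```shell
# Hedged sketch: print "nodename RealMemory=<value>" for each node,
# using the output of `scontrol show node`.
extract_realmemory() {
  awk '
    # "NodeName=n0000.alice0 Arch=..." starts each node record.
    /^NodeName=/ { split($1, a, "="); node = a[2] }
    # Any field of the form RealMemory=NNN belongs to the current node.
    {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^RealMemory=/) {
          split($i, m, "=")
          printf "%s RealMemory=%s\n", node, m[2]
        }
    }'
}

# Usage on the cluster (not run here):
#   scontrol show node | extract_realmemory
```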



> If there was another option that would be better I would love to hear it.

3 more options.


You may want to try option 1, that doesn't involve changes to other partitions. Let me know if it works for you:

1. Keep CR_CPU_Memory globally and FastSchedule=0 as you currently have. Set only "CPUs=56" in the node definition, skipping the other parameters (no ThreadsPerCore, Sockets, Boards, or Cores). Set DefMemPerCPU=1 MaxMemPerCPU=1 in the partition definition.

On the 'alice' nodes, change cgroup.conf individually and set ConstrainRAMSpace=no and ConstrainSwapSpace=no.

i.e.
PartitionName=alice  Nodes=n000[0-3].alice[0]  CPUS=56 DefMemPerCPU=1 MaxMemPerCPU=1
NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 Feature=alice Weight=1   # C6320 28 cores  128G RAM

Set globally:

MemLimitEnforce=no
TaskPluginParams=NoOverMemoryKill

2. I can try to code a patch to let you set CR_CPU in the partition specification when you have CR_CPU_* set globally. You would have to recompile Slurm with this patch, deploy it, and maintain the patch locally, because it won't be part of any standard release; it won't go into 18.08 either, as this is considered a new feature.

3. You can use Federated Cluster Support. This would imply starting another slurmctld instance with its own configuration. https://slurm.schedmd.com/SLUG16/FederatedScheduling.pdf



Let me know if 1) works for you.
Comment 36 Felip Moll 2018-08-16 08:45:05 MDT
> You may want to try option 1, that doesn't involve changes to other
> partitions. Let me know if it works for you:
> 
> 1. Keep CR_CPU_Memory globally and FastSchedule=0 as you currently have. Set
> only "CPUs=56" to the node definition skipping other parameters (no
> ThreadsPerCore, no Sockets, no Boards, no Cores). Set DefMemPerCPU=1
> MaxMemPerCPU=1 to the partition definition. 
> 
> In 'alice' nodes, change cgroup.conf individually and set
> ConstrainRAMSpace=no + ConstrainSwapSpace=no.
> 
> i.e.
> PartitionName=alice  Nodes=n000[0-3].alice[0]  CPUS=56 DefMemPerCPU=1
> MaxMemPerCPU=1
> NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 Feature=alice
> Weight=1   # C6320 28 cores  128G RAM
> 
> Set globally:
> 
> MemLimitEnforce=no
> TaskPluginParams=NoOverMemoryKill
> 
> Let me know if 1) works for you.

Jacqueline,

Have you been able to try this option?

Thanks
Felip
Comment 37 Wei Feinstein 2018-08-16 09:31:07 MDT
Not yet; we have to discuss it, and this weekend we have a power outage. It
might be implemented after that, on Monday or Tuesday. I will try option 1,
but my concern is what the NoOverMemoryKill global setting will do to the
other partitions that have cgroup enforcing limits.

Thanks

Jackie Scoggins

Comment 38 Felip Moll 2018-08-22 06:49:17 MDT
(In reply to Jacqueline Scoggins from comment #37)
> Not yet we have to discuss it and this weekend we have a power outage. It
> might be implemented after that on Monday or Tuesday. I will try option 1
> but my concern is what will the NoOverMemorykill global setting do to the
> other partitions that have cgroup enforcing limits.
> 
> Thanks
> 
> Jackie Scoggins

From the start of this bug I assumed you had the following settings:

cgroup.conf:
ConstrainRAMSpace=yes

slurm.conf:
TaskPlugin = task/cgroup


If you really have these two settings enabled, you MUST disable the other enforcement mechanism, which uses jobacctgather to kill jobs and steps. If both mechanisms are enabled there could be conflicts, where one kills a step or a job while the other is still checking the limit. There are known bugs related to this.

I didn't want to bother you with this matter in order to not mix things here, but given your comment I have to say now that you NEED to change your global slurm.conf parameters to:

> > MemLimitEnforce=no
> > TaskPluginParams=NoOverMemoryKill

In future versions, having both enabled will cause a fatal error in slurmd and slurmctld.

So it is safe to set it as I suggested as long as you constrain memory from task/cgroup.
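Putting the pieces of this comment together, the intended end state would look roughly like the fragments below (a sketch assembled from the settings quoted in this ticket, not a verified 17.11 configuration):

```
# slurm.conf (global): disable the jobacctgather-based memory killing
MemLimitEnforce=no
TaskPluginParams=NoOverMemoryKill
TaskPlugin=task/cgroup

# cgroup.conf on the alice nodes only: leave memory unconstrained
ConstrainCores=yes
ConstrainRAMSpace=no
ConstrainSwapSpace=no
```

On the other clusters' nodes, cgroup.conf keeps ConstrainRAMSpace=yes, so the cgroup remains the single memory enforcement mechanism there.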
Comment 39 Felip Moll 2018-08-28 02:18:28 MDT
Jacqueline,

Please tell me how you progressed on this bug.

Per comments 35 and 38, I would consider this bug resolved, as multiple paths have been presented to solve your problem.
Comment 40 Wei Feinstein 2018-08-28 14:13:59 MDT
It is not working as expected.  I made the recommended changes and I am
still seeing jobs stay in the Pending state even when there are CPUs
available to run the job.

NodeName=n000[0-3].alice[0] NodeAddr=10.0.4.[0-3] CPUs=56 Feature=alice
Weight=1   # C6320 28 cores  128G RAM

PartitionName=alice        Nodes=n000[0-3].alice[0]
 Shared=Yes          DefMemPerCPU=1  MaxMemPerCPU=1  OverSubscribe=FORCE
LLN=Yes

>  scontrol show config | grep -i mem
AccountingStorageTRES   = cpu,mem,energy,node,billing
DefMemPerNode           = UNLIMITED

JobAcctGatherParams     = NoOverMemoryKill
MaxMemPerNode           = UNLIMITED
MemLimitEnforce         = No
SelectTypeParameters    = CR_CPU_MEMORY
JobAcctGatherType       = jobacct_gather/linux

alice nodes cgroup.conf file:
pdsh -g alice0 cat /etc/slurm/cgroup.conf| dshbak -c
----------------
n[0000-0003]
----------------
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
CgroupMountpoint="/cgroup"

ConstrainCores=yes
ConstrainRAMSpace=no
ConstrainSwapSpace=no


NodeName=n0000.alice0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=28 CPUErr=0 CPUTot=56 CPULoad=22.08
   AvailableFeatures=alice
   ActiveFeatures=alice
   Gres=(null)
   NodeAddr=10.0.4.0 NodeHostName=n0000.alice0 Version=17.11
   OS=Linux 3.10.0-693.11.6.el7.x86_64 #1 SMP Wed Jan 3 18:09:42 CST 2018
   RealMemory=128824 AllocMem=128000 FreeMem=67706 Sockets=56 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=503836 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=alice
   BootTime=2018-08-22T15:49:17 SlurmdStartTime=2018-08-22T15:51:08
   CfgTRES=cpu=56,mem=128824M,billing=56
   AllocTRES=cpu=28,mem=125G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0

     ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


Node status
n0000.alice0          1         alice       mixed   56   56:1:1 128824   503836      1    alice none
n0001.alice0          1         alice       mixed   56   56:1:1 128824   503836      1    alice none
n0002.alice0          1         alice       mixed   56   56:1:1 128824   503836      1    alice none
n0003.alice0          1         alice       mixed   56   56:1:1 128824   503836      1    alice none

Node memory

   NodeAddr=10.0.4.0 NodeHostName=n0000.alice0 Version=17.11
   RealMemory=128824 AllocMem=128000 FreeMem=67706 Sockets=56 Boards=1

   NodeAddr=10.0.4.1 NodeHostName=n0001.alice0 Version=17.11
   RealMemory=128824 AllocMem=128000 FreeMem=69844 Sockets=56 Boards=1

   NodeAddr=10.0.4.2 NodeHostName=n0002.alice0 Version=17.11
   RealMemory=128824 AllocMem=128000 FreeMem=70323 Sockets=56 Boards=1

   NodeAddr=10.0.4.3 NodeHostName=n0003.alice0 Version=17.11
   RealMemory=128824 AllocMem=128000 FreeMem=68962 Sockets=56 Boards=1

Totals

128 - Running jobs
All nodes are Running 32 cpus and 128000 memory

10 - Pending jobs
all requesting 1 CPU and 4000M of memory

What they would like is for all nodes to be able to run up to 56 processes
on their CPUs and to ignore or oversubscribe memory.  None of the settings
you have provided have yet satisfied this request from the customer.  Is it
possible to have RealMemory larger than what the node holds?  Can we make
the node think it has almost double its memory, or use some virtual memory
to increase the value?  Would the users have to add any special request to
their batch script or srun command?

I hope this information helps you see what we are seeing after the changes
were made - it is not any different than it was before.

Thanks

Jackie
Comment 41 Felip Moll 2018-08-29 05:20:27 MDT
> Totals
> 
> 128 - Running jobs
> All nodes are Running 32 cpus and 128000 memory
> 
> 10 - Pending jobs
> all requesting 1 CPU and 4000M of memory
> 

This is because your jobs are asking for memory.

Try running 56 jobs without specifying memory, or specifying it very low, and it will work without memory being enforced (since you are not enforcing the limit in cgroup.conf):

for i in $(seq 1 56); do srun -n1 -c1 -w n0000.alice0 sleep 120 & done

> What they would like to happen is for all nodes that can use the CPU's to
> go up to 56 processes and ignore or oversubscribe memory.  None of these
> settings you have provided me yet have been able to satisfy this request
> from the customer. 

I understand what you want, and in your user's e-mail it was clear that ALICE, not Slurm, would control memory, so I assumed that jobs were not going to ask for any memory.


> Is it possible to have realmemory larger that what the
> node hold?  Make the node think it has almost doubled the size of memory or
> use some virtual memory to increase the value? 

Yes. Using FastSchedule=2.

If you are going to ask for memory (I don't see why you would if Slurm is not controlling memory) or fake the node's memory, then you must follow the comment 31 approach: set FastSchedule=2 globally, set the memory on the node, and so on.


> Would the users have to do a special request or add any request to their batch script or srun command?

Nothing different of what would be done under normal circumstances.

>  I hope this information is helping you see what we are seeing after the
> changes were made. - not any different that it was before.

It is different: memory is not constrained (as long as you don't ask for memory), and you can run up to 56 jobs.
Comment 42 Wei Feinstein 2018-08-29 06:33:32 MDT
I will suggest that the user change his program so he doesn't request
memory, and see if it works.  The last time he made the change to not
request it we saw fewer jobs running, but I know that cgroups were still
set to constrain memory, so I'll see what happens this time.

Setting FastSchedule to 2 is not an option I want to set, because it would
require me to change all of the other systems' nodes, and I don't want to
do that.


Thanks

Jackie Scoggins

Comment 43 Felip Moll 2018-09-13 17:19:21 MDT
(In reply to Jacqueline Scoggins from comment #42)
> I will suggest to the user to change his program so he doesn’t request
> memory and see if it works.  The last time he made the change to not
>  request it we saw fewer jobs running.  I know that cgroup was still set to
> constrain memory so I’ll see what happens this time.
> 
> Setting fast schedule to 2 is not an option I desire to set because this
> would require me to change all of the other systems nodes and I don’t want
> to do that.
> 
> 
> Thanks

Hi Jacqueline, do you have any news about this issue?
Comment 44 Wei Feinstein 2018-09-13 18:01:38 MDT
We were able to make the necessary changes and it works for the user. There
are a few more things we need to fix, and I would like to have your input.

Since the user has only 4 nodes, if one node goes offline for any reason he
would still like to pack as many jobs onto the remaining nodes as possible
without over-utilizing their resources. He wants to limit the number of
jobs across all 4 nodes to a max of 164, which is evenly distributed across
4 nodes.  But if only 3 nodes are available, how will Slurm distribute the
jobs?  His concern is that the pack of jobs running on the nodes could
potentially kill a node due to resources being oversubscribed/over-utilized.

Any assistance would be great.

Thanks

Jackie

Comment 45 Wei Feinstein 2018-09-13 18:29:16 MDT
I wanted to share what the user wrote:


When I look at the partition, I see:

[user@alice ~]$ sinfo -N -o "%N %C %O %e" -p alice

NODELIST CPUS(A/I/O/T) CPU_LOAD FREE_MEM
n0000.alice0 37/19/0/56 36.16 29164
n0001.alice0 38/18/0/56 40.53 4670
n0002.alice0 37/19/0/56 36.18 28979
n0003.alice0 37/19/0/56 38.90 20314

If I increase the # of jobs or if 1 node goes away, will the available
nodes fill to 56 jobs?  That would be too many for the memory needs of
ALICE jobs.   We should shoot for ~42/node.  Can you add that limit?

So, yes. It appears he wants to make that the limit for all 4 nodes.

Comment 46 Felip Moll 2018-09-14 05:33:43 MDT
(In reply to Jacqueline Scoggins from comment #44)
> We were able to make the necessary changes and it works for the user. There
> are a few more things we need to fix and I would like to have your input.
> 
> Since the user has only 4 nodes if one node goes offline for any reason he
> would like to pack as many jobs onto the nodes without over utilizing
> resources on the node. He wants to be able to limit the number of jobs
> across all 4 nodes to a max of 164. With 4 nodes that is evenly
> distributed.  But if he has only 3 nodes how will slurm distribute the
> jobs.  His concern is that the pack of jobs running on the nodes would
> potentially kill the node due to resources being over subscribed/utilized.
> 
> Any assistance would be great.
> 
> Thanks
> 
> Jackie

Jackie, 

What you are asking now is to limit jobs again so they do not exceed the available memory, which is exactly the limit we removed with the previous approach.

Now that ALICE is constraining memory, it is his responsibility not to exceed it. Do the jobs always have the same memory constraint from ALICE (e.g. 6GB)?

There's another option that would spread the jobs across the different nodes; please look at SelectTypeParameters=CR_LLN. But keep in mind this will affect the entire system.

Another option, if you implemented what was in comment 40, is to set the default memory per job (DefMemPerNode) to NodeMemory/MaxJobs.
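As a rough illustration only (node and partition names are taken from this ticket, the CPU count from the sinfo output; the RealMemory figure is an invented placeholder), that could look something like:

```
# Hypothetical sketch -- RealMemory is a placeholder, adjust to the real nodes.
# With RealMemory=128000 MB and a target of ~42 jobs per node, the default
# memory per job would be about 128000 / 42 ~= 3000 MB, so memory-based
# packing caps the node at roughly 42 jobs.
NodeName=n00[00-03].alice0 CPUs=56 RealMemory=128000
PartitionName=alice Nodes=n00[00-03].alice0 LLN=YES DefMemPerNode=3000
```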

Tell me what you think.
Comment 47 Wei Feinstein 2018-09-14 05:42:06 MDT
I already have the LLN setting. We don't want to change the memory settings
we have; they are just fine. He just wants to make sure that the
MaxJobsPerAccount limit of 164 doesn't mean that, if one node goes offline,
the other three nodes get over-utilized and killed by the OOM killer or by
too high a load.

If 1 node is offline, just limit the number of jobs on the remaining nodes
to, say, 42 jobs each.


Thanks

Jackie Scoggins

Comment 48 Felip Moll 2018-09-18 06:37:39 MDT
(In reply to Jacqueline Scoggins from comment #47)
> I have the lln setting already.  We don’t want to change the memory setting
> we have. They are just fine.   He just want to make sure that the limit of
> 164 maxjobperacct doesn’t mean that if one node goes offline the other
> three nodes won’t be over utilized and killed by oom or by to high of a
> load.
> 
> If 1 node is offline just limit the number of jobs to the remaining nodes
> to say 42 jobs only.
> 
> 
> Thanks
> 
> Jackie Scoggins

Jackie,

There's no direct option to dynamically limit the maximum number of jobs able to run on a partition.

If you only allow one account in the partition, you could modify MaxJobsPerAccount dynamically by calling a script through:

A) NHC (Node Health Check)
B) strigger

You would have two scripts that, when a node is set to down, drained, or resumed, modify the QoS of the partition, changing MaxJobsPA.

IMHO this is a bit of ugly tuning. When the users decided to accept the possibility of OOMs, they knew about the consequences; trying to mitigate those consequences now seems to me like patching something "broken" on purpose. At the same time, I understand your concerns and see what you are trying to do, so the two options above give you a way to do it.
Comment 49 Felip Moll 2018-09-28 03:03:10 MDT
(In reply to Felip Moll from comment #48)
> (In reply to Jacqueline Scoggins from comment #47)
> > I have the lln setting already.  We don’t want to change the memory setting
> > we have. They are just fine.   He just want to make sure that the limit of
> > 164 maxjobperacct doesn’t mean that if one node goes offline the other
> > three nodes won’t be over utilized and killed by oom or by to high of a
> > load.
> > 
> > If 1 node is offline just limit the number of jobs to the remaining nodes
> > to say 42 jobs only.
> > 
> > 
> > Thanks
> > 
> > Jackie Scoggins
> 
> Jackie,
> 
> There's no direct option to define the maximum number of jobs able to run on
> a partition dynamically.
> 
> If you just allow one account in the partition, you could modify the
> maxjobperacct dynamically calling a script through the use of:
> 
> A) NHC (Node Health Check)
> B) strigger
> 
> You should have two scripts that, when a node is set to down, drain or even
> resumed, act and modify the QoS of the partition changing the MaxJobPA.
> 
> IMHO I think this is a bit ugly tuning. When the users decided and accepted
> the possibility to have OOM's, the knew about the consequences. Now trying
> to mitigate this consequences seems to me like trying to patch something
> "broken" on purpose. At the same time I understand your concerns and see
> what you are trying to do, so I give you the possibility to do it by the two
> proposed options.

Hi Jackie, any thoughts about this?

Thanks,
Felip
Comment 50 Wei Feinstein 2018-09-28 04:27:16 MDT
I have not set anything yet. We could try the strigger setting for only the
ALICE nodes: if we can count the number of available nodes when the trigger
fires, then we can set the appropriate MaxJobsPerAccount value. How does
the trigger get cleared once the node is back online? Is that also handled
in the script? I've never used striggers before, so I'm just checking.

Thanks

Jackie Scoggins

Comment 51 Felip Moll 2018-09-28 06:05:11 MDT
(In reply to Jacqueline Scoggins from comment #50)
> I have not set anything yet we could try the strigger setting for only the
> Alice nodes if we can count the number of idle nodes when the trigger
> happens then we can set the appropriate maxjobperaccount value.  How does
> the trigger get cleared once the node is back online.  Is that also in the
> script?   I’ve never used the striggers before so I’m just checking.
> 
> Thanks
> 
> Jackie Scoggins

Jackie,

Triggers are executed when events occur.

For example, some options of interest for you (see man strigger for more info):

       -u, --up
              Trigger an event if the specified node is  returned  to  service
              from a DOWN state.
       -d, --down
              Trigger an event if the specified node goes into a DOWN state.

       -D, --drained
              Trigger an event if the  specified  node  goes  into  a  DRAINED
              state.

       -n, --node[=host]
              Host name(s) of interest.

       --flags=PERM
                     Make the trigger permanent. Do not  purge  it  after  the
                     event occurs.

       -p, --program=path
              Execute  the  program  at the specified fully qualified pathname
              when the event occurs. 

SYNOPSIS
       strigger --set   [OPTIONS...]
       strigger --get   [OPTIONS...]
       strigger --clear [OPTIONS...]

The trigger program must set a new trigger before the end of the next  interval  to  ensure
that  no  trigger events are missed OR the trigger must be created with an argument of "--flags=PERM".

This command can only set triggers if run by the user SlurmUser. This  is required for the slurmctld 
daemon to set the appropriate user and group IDs for the executed program.
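For the use case in this ticket, the trigger program could look roughly like the sketch below. This is an untested illustration, not a recommended implementation: the QoS name, partition name, and 42-jobs-per-node figure are assumptions taken from this thread, and the sinfo/sacctmgr invocations should be checked against your Slurm version.

```shell
#!/usr/bin/env bash
# Sketch of a trigger program for strigger --down / --up (untested).
# Assumptions: one account in the "alice" partition, and a QoS named
# "alice" that carries the MaxJobsPerAccount limit.
PARTITION="alice"
QOS="alice"
PER_NODE_JOBS=42   # hypothetical jobs-per-node target from this thread

# limit = usable nodes * jobs allowed per node
compute_max_jobs() {
    echo $(( $1 * PER_NODE_JOBS ))
}

# Only touch the cluster when the Slurm commands are actually present.
if command -v sinfo >/dev/null 2>&1; then
    # Count nodes in a usable state (idle, allocated, or mixed).
    nodes=$(sinfo -h -p "$PARTITION" -t idle,alloc,mix -o "%D" |
            awk '{s += $1} END {print s + 0}')
    limit=$(compute_max_jobs "$nodes")
    # Adjust the per-account job limit on the QoS (must run as SlurmUser).
    sacctmgr -i modify qos "$QOS" set MaxJobsPerAccount="$limit"
fi
```

The script would be registered once per event type, e.g. `strigger --set --down --node=n00[00-03].alice0 --program=/path/to/script --flags=PERM` plus a matching `--up` trigger, so it persists after each event.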

I think that this can correctly accomplish what you want to do.
Comment 52 Wei Feinstein 2018-09-28 13:38:05 MDT
Thanks, I'll take a look.

Comment 53 Felip Moll 2018-10-19 03:29:12 MDT
(In reply to Jacqueline Scoggins from comment #52)
> Thanks, I'll take a look.
> 


Hi Jacqueline,

Have all your questions been answered?

May I close this bug?
Comment 54 Wei Feinstein 2018-10-19 10:18:21 MDT
No, I want to keep this open because we are not able to apply the changes
you are suggesting in this environment. I would like to talk to someone at
SchedMD live, via a Zoom conference or simply a phone call. We have a
customer who is really displeased with the functionality of the product,
and their collaborating site, ORNL, is also not pleased. Is there some way
we can get a conference call set up soon to discuss this issue?

I would really appreciate if this could be escalated.

Thanks

Jackie Scoggins

Comment 55 Felip Moll 2018-10-22 00:57:24 MDT
(In reply to Jacqueline Scoggins from comment #54)
> No, I want to keep this open because we are not able to apply those changes
> you are requesting for this environment. I would like to talk to someone at
> Schedmd live via a zoom conference or simple a phone call.  We have a
> customer who is really displeased with the functionality of the product.
> Their collaborated site ORNL is also not pleased. Is there some way we can
> get a conference call setup soon to discuss this issue?
> 
> I would really appreciate if this could be escalated.
> 
> Thanks
> 
> Jackie Scoggins

Hi Jacqueline,

I am a bit surprised that after ~20 days this is the first I hear that you cannot apply these changes. Please tell me exactly what issues you ran into and I will try to help you as best I can. I'd also like to know exactly why your customer, and ORNL, are so displeased.

I will escalate your issue and give you a response soon.
Comment 57 Wei Feinstein 2018-10-22 05:22:49 MDT
Hello Felip,

I just got this information from the customer when our team went out to
visit this past week. It isn't that I waited; I'm just reporting the outcome
of an onsite visit. The project they are working on auto-generates its job
scripts. They dispatch several jobs at a time, each requesting only 1 node,
and expect the jobs to be spread evenly over their nodes, with control only
over the number of jobs run per node. They are adding 40 more nodes to
their cluster with different CPU/memory configurations than the existing
ones, but they don't want an additional QoS or partition passed in their
job scripts. Since we have a complex scheduler configuration, making the
changes for their one cluster could potentially cause issues for other
customers. We need to set up a meeting with your group for additional
advice to come up with a solution for their setup.

If you have time to talk today or tomorrow, let's schedule something.


Thanks

Jackie Scoggins

Comment 58 Jason Booth 2018-10-23 13:30:53 MDT
Greeting Jackie Scoggins,

> They are adding 40 more nodes to their cluster with different cpu/memory configurations than the existing one but they don’t want an additional qos nor partition to be passed in their job script.

SLURM cannot guess the SelectTypeParameters for the user. It can be configured at the partition level as Felip has outlined; however, there are other limitations with regard to CR_Core (partition-level SelectTypeParameters only works with CR_Core_* or CR_Socket_* set globally; see comment #6). Additionally, a user must tell the scheduler which partition they need so that the correct SelectTypeParameters can be used. If users do not wish to make these changes, then there is simply no way for SLURM to accommodate you. You might be able to make use of the job submit plugin, but that is something you will need to look into and write.

> We need to set up a meeting with your group for additional advice to come up with a solution for their setup.


We do not offer telephone support, so you should direct the problems you run into through the bug system. Felip has offered several updates since the end of July on how you can configure your system. Although we offer configuration suggestions, it is not up to SchedMD support to configure and manage your system; that is left to the site admin (you), who needs to make these changes and understand them. Felip has done a great job of working with the changing requirements you have sent him despite the inaction on your part to implement them. If you wish to continue working through this bug, you will need to provide meaningful updates we can work with, such as any issues you run into with the suggestions Felip has offered. If you are unwilling to make these changes, then we will proceed to close out this issue.


Thanks,
-Jason
Director of Support
Comment 59 Wei Feinstein 2018-10-23 16:09:59 MDT
Hello Jason,

Thank you for the update. I want to say that I never implied that Felip
had not provided us with good support; I think he has done a great job, and
I was not making any complaints about his service. You can go ahead and
close this case for the reasons you mentioned; the problem is solved as far
as SchedMD is concerned.

I wanted to say that I was not aware that you no longer do phone support.
How does one get advanced support via a live person to help figure out the
best solution for the customer regarding your product? I don't feel that a
ticketing system is getting the point across or allowing us to really
explain our situation. If at all possible, we would like to talk with an
engineer to work through the user's issues; I will do the work, I just need
some guidance from SchedMD.

Please advise.

Thanks

Jackie




Comment 61 Jason Booth 2018-10-24 09:29:59 MDT
Hi Jackie,

> I wanted to say that I was not aware that you no longer do phone support.

SchedMD has never offered phone support so I am not sure why you would have this impression. We have always used the bug system as the primary way to communicate information and to fix issues.

> How does one get advanced support via a live person to help figure out the best solution for the customer regarding your product? 

We offer direct developer access via the ticketing system. What you have mentioned above is consulting and I can have Jacob Jenson reach out to you with more details if you wanted to pursue such an engagement.

> I don't feel that a ticketing system is getting the point across or allowing for us to really explain our situation.  If at all possible we would like to talk with an engineer to work through the users issues and I will do the work just need some guidance from SchedMD.

I understand that you wish to explain the situation over the phone but we kindly ask that you do so via the ticket so that the information is not lost or forgotten. 

Kind regards,
Jason
Comment 62 Wei Feinstein 2018-10-24 10:06:18 MDT
Thanks, and yes, a follow-up with Jacob would be great. Is there a fee for
this service? By the way, in the past I've spoken directly with Danny and/or
Moe when we had issues, but I do understand your policy.

You can close this ticket now.  I’ll reopen when needed.

Thanks

Jackie Scoggins

Comment 64 Felip Moll 2018-10-25 03:04:10 MDT
Jacqueline,

Given the comments above, I am closing this bug.

I am sorry about this situation and the misunderstandings. Please reopen this bug or create a new one whenever you have a specific, concrete problem with a Slurm component; I will be more than glad to help however I can.

Danny and Moe used to do this kind of support in the past, but over time they could no longer manage the volume of bugs, so the rule became to always use Bugzilla. One advantage is having everything recorded here; another is being able to spread the work among all of us; and finally it gives us room to analyze things properly and respond with more accuracy than in a spontaneous conference call.

If you want to contact Jacob directly, his e-mail is publicly available on this site (jacob@schedmd.com), or you can open a new bug and ask about this kind of consulting.

Thanks, and sorry again for the misunderstanding.

Best regards,
Felip M
Comment 65 Wei Feinstein 2018-10-25 05:02:31 MDT
Thank you Felip.

Thanks

Jackie Scoggins
