Created attachment 4957 [details]
slurm.conf

We need from time to time to dedicate selected hosts to a specific group of users, with the possibility of node sharing across the group members. According to

https://slurm.schedmd.com/cons_res_share.html

it should be valid to have shared nodes in different partitions, some of which set OverSubscribe=FORCE and some OverSubscribe=NO. Would an addition like

# ... shared partition
PartitionName=shared Default=NO MaxNodes=1 AllowGroups=special Nodes=max-cfel[003-008] OverSubscribe=FORCE:16 Shared=Yes PriorityTier=500 PreemptMode=off

be the right way to go, and should we expect problems when removing/adding shared/OverSubscribed partitions from time to time? We possibly encountered a problem related to adding/removing shared partitions earlier on.

We don't have to rely on pre-emption, since we need to set a reservation for the dedicated/shared nodes anyway, so presumably the "starvation" mentioned in the doc shouldn't be an issue either?

Any advice on the general and shared configuration would be highly appreciated.

[apologies for spamming]
Hi Frank.

(In reply to frank.schluenzen from comment #0)
> Created attachment 4957 [details]
> slurm.conf
>
> We need from time to time to dedicate selected hosts to a specific group of
> users, with the possibility of node sharing across the group members.
> According to
>
> https://slurm.schedmd.com/cons_res_share.html
>
> it should be valid to have shared nodes in different partitions, some of
> which set OverSubscribe=FORCE and some set OverSubscribe=NO.

Correct.

> Would an addition like
>
> # ... shared partition
> PartitionName=shared Default=NO MaxNodes=1 AllowGroups=special
> Nodes=max-cfel[003-008] OverSubscribe=FORCE:16 Shared=Yes PriorityTier=500
> PreemptMode=off
>
> be the right way to go, and should we expect problems when removing/adding
> shared/OverSubscribed partitions from time to time? We possibly encountered
> a problem related to adding/removing shared partitions earlier on.

This is the right way to go. Just one thing: the "Shared" option is obsoleted by the "OverSubscribe" option, so "Shared" can be removed from the partition definition line.

I would not expect problems when removing/adding OverSubscribed partitions from time to time, but if you find any issue with that, just let us know. I've been doing some tests locally with two partitions hitting the same nodes, one with OverSubscribe=NO and the other with OverSubscribe=FORCE:16, and the behavior seems correct to me.

> We don't have to rely on pre-emption since we anyway need to set a
> reservation for the dedicated/shared nodes, so also the "starvation"
> mentioned in the doc presumably shouldn't be an issue?
Jobs are ordered in a single queue by these factors (in order):

1. Preemption order (preemptor higher priority than preemptee)
2. Advanced reservation (jobs with an advanced reservation are higher priority than other jobs)
3. Partition PriorityTier
4. Job priority
5. Job ID

If the shared nodes are in a reservation, or the "shared" partition has a higher Partition PriorityTier than the rest of the partitions hitting the same nodes, jobs from the other partitions might starve if the only nodes they can fall into are the same as the "shared" partition's. Otherwise jobs should not starve. Does that make sense?

> Any advice on the general and shared configuration would be highly
> appreciated.

From your slurm.conf, I'd switch:

- ProctrackType from proctrack/pgid to proctrack/cgroup. This helps with job cleanup: when the job finishes, anything spawned in the cgroup will be cleaned up, which prevents runaway jobs (i.e. jobs that double-forked themselves). NOTE: the proctrack/pgid mechanism is not entirely reliable for process tracking.
- TaskPlugin from task/none to 'task/affinity,task/cgroup', both stacked.

You use select/linear, but you might consider using select/cons_res. Linear is perfectly fine though.

> [apologies for spamming]

We're glad to help with your system.
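Putting the suggestions above together, a sketch of the relevant slurm.conf lines (illustrative only, showing just the parameters the recommendations touch; node names and limits are taken from the partition line in comment #0, with the obsolete Shared option dropped):

```ini
# Process tracking via cgroups: the only entirely reliable flavor,
# and cleans up anything spawned inside the job's cgroup on exit
ProctrackType=proctrack/cgroup

# Stack both task plugins: CPU affinity plus cgroup confinement
TaskPlugin=task/affinity,task/cgroup

# select/linear is fine; cons_res would allow consumable-resource scheduling
#SelectType=select/cons_res

# Shared partition; "Shared=Yes" removed, as OverSubscribe=FORCE:16 replaces it
PartitionName=shared Default=NO MaxNodes=1 AllowGroups=special Nodes=max-cfel[003-008] OverSubscribe=FORCE:16 PriorityTier=500 PreemptMode=OFF
```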
Hi Alejandro, thanks a lot for the detailed explanation and recommendations, greatly appreciated. We'll give it a try ...

Cheers, Frank.
Frank, is there anything else we can assist you with this bug? Thanks.
(In reply to Alejandro Sanchez from comment #4)
> Frank, is there anything else we can assist you with this bug? Thanks.

Maybe a last question: we already have a process cleanup in the epilog, and we don't have more than one job on a node at a time, which makes it easy. Is there any advantage of proctrack/cgroup other than cleaning up processes?

Apart from that, all fine; consider the "issue" closed. Thanks.
(In reply to frank.schluenzen from comment #5)
> (In reply to Alejandro Sanchez from comment #4)
> > Frank, is there anything else we can assist you with this bug? Thanks.
>
> Maybe a last question: we already have a process-cleanup in the epilog and
> we don't have more than one job on a node at a time, which makes it easy. Is
> there any advantage of proctrack/cgroup other than cleaning up processes?
> Apart from that all fine, consider the "issue" closed. Thanks.

According to the documentation in:

https://slurm.schedmd.com/proctrack_plugins.html

the proctrack plugins are responsible for process tracking: a container id is created and tasks are placed in that container. The plugin also makes it possible to signal the specific pids associated with a step (e.g. SIGSTOP/SIGCONT), which is used by other components such as gang scheduling, the preemption logic and/or scancel. proctrack/cgroup is the only implementation that is entirely reliable for process tracking; we encourage avoiding the other flavors.
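One practical note if you do switch: the cgroup-based plugins (proctrack/cgroup, task/cgroup) read their settings from cgroup.conf next to slurm.conf. A minimal sketch (the parameter values here are illustrative, not a recommendation for your site):

```ini
# cgroup.conf - settings for the cgroup-based Slurm plugins
CgroupAutomount=yes       # mount the cgroup subsystems if not already mounted
ConstrainCores=yes        # task/cgroup confines tasks to their allocated cores
ConstrainRAMSpace=no      # set to yes to also enforce memory limits per job
```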
Frank, I'm closing this as per the info provided. Please, reopen if you have any further questions. Thanks.