Created attachment 4957 [details]
slurm.conf

We need from time to time to dedicate selected hosts to a specific group of users, with the possibility of node sharing across the group members. According to

https://slurm.schedmd.com/cons_res_share.html

it should be valid to have shared nodes in different partitions, some of which set OverSubscribe=FORCE and some OverSubscribe=NO. Would an addition like

# ... shared partition
PartitionName=shared Default=NO MaxNodes=1 AllowGroups=special Nodes=max-cfel[003-008] OverSubscribe=FORCE:16 Shared=Yes PriorityTier=500 PreemptMode=off

be the right way to go, and should we expect problems when removing/adding shared/OverSubscribed partitions from time to time? We possibly encountered a problem related to adding/removing shared partitions earlier on.

We don't have to rely on pre-emption, since we need to set a reservation for the dedicated/shared nodes anyway, so presumably the "starvation" mentioned in the doc shouldn't be an issue either?

Any advice on the general and shared configuration would be highly appreciated.

[apologies for spamming]
Hi Frank.

(In reply to frank.schluenzen from comment #0)
> Created attachment 4957 [details]
> slurm.conf
>
> We need from time to time to dedicate selected hosts to a specific group of
> users, with the possibility of node sharing across the group members.
> According to
>
> https://slurm.schedmd.com/cons_res_share.html
>
> it should be valid to have shared nodes in different partitions, some of
> which set OverSubscribe=FORCE and some set OverSubscribe=NO.

Correct.

> Would an addition like
>
> # ... shared partition
> PartitionName=shared Default=NO MaxNodes=1 AllowGroups=special
> Nodes=max-cfel[003-008] OverSubscribe=FORCE:16 Shared=Yes PriorityTier=500
> PreemptMode=off
>
> be the right way to go, and should we expect problems when removing/adding
> shared/OverSubscribed partitions from time to time? We possibly encountered
> a problem related to adding/removing shared partitions earlier on.

This is the right way to go. Just one thing: the "Shared" option is obsoleted by the "OverSubscribe" option, so "Shared" can be removed from the partition definition line.

I would not expect problems when removing/adding OverSubscribed partitions from time to time, but if you find any issue with that, just let us know. I've been doing some tests locally with two partitions hitting the same nodes, one with OverSubscribe=NO and the other with OverSubscribe=FORCE:16, and the behavior seems correct to me.

> We don't have to rely on pre-emption since we anyway need to set a
> reservation for the dedicated/shared nodes, so also the "starvation"
> mentioned in the doc presumably shouldn't be an issue?
Jobs are ordered in a single queue by these factors (in order):

1. Preemption order (preemptor higher priority than preemptee)
2. Advanced reservation (jobs with an advanced reservation are higher priority than other jobs)
3. Partition PriorityTier
4. Job priority
5. Job ID

If the shared nodes are in a reservation, or the "shared" partition has a higher Partition PriorityTier than the rest of the partitions hitting the same nodes, jobs from the other partitions might starve if the only nodes they can fall into are the same as the "shared" partition's. Otherwise jobs should not starve. Does that make sense?

> Any advice on the general and shared configuration would be highly
> appreciated.

From your slurm.conf, I'd switch:

- ProctrackType from proctrack/pgid to proctrack/cgroup. This helps with job cleanup: when the job finishes, anything spawned in the cgroup will be cleaned up, which prevents runaway jobs (i.e. jobs that double-forked themselves). NOTE: the proctrack/pgid mechanism is not entirely reliable for process tracking.
- TaskPlugin from task/none to 'task/affinity,task/cgroup', both stacked.

You use select/linear, but you might consider using select/cons_res. Linear is perfectly fine though.

> [apologies for spamming]

We're glad to help with your system.
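Putting the suggestions above together, a sketch of the relevant slurm.conf lines (illustrative only, showing just the parameters the recommendations touch; node names and limits are taken from the partition line in comment #0, with the obsolete Shared option dropped):

```ini
# Process tracking via cgroups: the only entirely reliable flavor,
# and cleans up anything spawned inside the job's cgroup on exit
ProctrackType=proctrack/cgroup

# Stack both task plugins: CPU affinity plus cgroup confinement
TaskPlugin=task/affinity,task/cgroup

# select/linear is fine; cons_res would allow consumable-resource scheduling
#SelectType=select/cons_res

# Shared partition; "Shared=Yes" removed, as OverSubscribe=FORCE:16 replaces it
PartitionName=shared Default=NO MaxNodes=1 AllowGroups=special Nodes=max-cfel[003-008] OverSubscribe=FORCE:16 PriorityTier=500 PreemptMode=OFF
```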
Hi Alejandro, thanks a lot for the detailed explanation and recommendations, greatly appreciated. We'll give it a try ...

Cheers, Frank.
Frank, is there anything else we can assist you with this bug? Thanks.
(In reply to Alejandro Sanchez from comment #4)
> Frank, is there anything else we can assist you with this bug? Thanks.

Maybe a last question: we already have a process cleanup in the epilog, and we don't have more than one job on a node at a time, which makes it easy. Is there any advantage of proctrack/cgroup other than cleaning up processes?

Apart from that, all fine; consider the "issue" closed. Thanks.
(In reply to frank.schluenzen from comment #5)
> (In reply to Alejandro Sanchez from comment #4)
> > Frank, is there anything else we can assist you with this bug? Thanks.
>
> Maybe a last question: we already have a process-cleanup in the epilog and
> we don't have more than one job on a node at a time, which makes it easy. Is
> there any advantage of proctrack/cgroup other than cleaning up processes?
> Apart from that all fine, consider the "issue" closed. Thanks.

According to the documentation in:

https://slurm.schedmd.com/proctrack_plugins.html

the proctrack plugins are responsible for process tracking: a container id is created and tasks are placed in that container. The plugin also makes it possible to signal the specific pids associated with a step (e.g. SIGSTOP/SIGCONT), which is used by other components such as gang scheduling, the preemption logic and/or scancel. proctrack/cgroup is the only implementation that is entirely reliable for process tracking; we encourage avoiding the other flavors.
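One practical note if you do switch: the cgroup-based plugins (proctrack/cgroup, task/cgroup) read their settings from cgroup.conf next to slurm.conf. A minimal sketch (the parameter values here are illustrative, not a recommendation for your site):

```ini
# cgroup.conf - settings for the cgroup-based Slurm plugins
CgroupAutomount=yes       # mount the cgroup subsystems if not already mounted
ConstrainCores=yes        # task/cgroup confines tasks to their allocated cores
ConstrainRAMSpace=no      # set to yes to also enforce memory limits per job
```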
Frank, I'm closing this as per the info provided. Please, reopen if you have any further questions. Thanks.