I'm mostly just looking for some info ...

We have a partition with bigger memory, but inconsistently sized nodes. A user would like to a) get the one with the most memory available at a given time, get the entire node/memory allocated for that node and b) would like the freedom to swap as needed.

For a) we're thinking about just creating another partition with all the nodes which fit their requirements or maybe creating a feature they can use.

It'd be nice, but I don't see a way, to specify a memory range, say something like --mem=200G-1024G as there is with --nodes=min-max. Am I missing that? We're using 14.11 (but really any week now it'll be 15).

It'd also be nice, but I don't see a way, to specify a node list with ORs in it. This just chops off the other node when I really want to say, give me one node and use any of the named nodes.

drdws0109:~$ garden with -m desres-settings/slurmcl2/8/all srun -p drdmem -t 1200 --mem 0 --exclusive --nodes=1-1 --nodelist=drdmem[0006,0009-0011,0013-0017,0021] -l /bin/hostname
srun: error: Required nodelist includes more nodes than permitted by max-node count (10 > 1). Eliminating nodes from the nodelist.
srun: job 92039435 queued and waiting for resources
srun: job 92039435 has been allocated resources
0: drdmem0006.en.desres.deshaw.com

I guess that is not what it is meant for, eh?

For b) If they say srun --mem=200G and they land on the node with 768GB of memory then they still only get 200G allocated and their job ends up getting killed.

[fennm@drdbfe1 ~]$ zgrep drdmem /var/log/messages* | grep "being killed"
/var/log/messages-20160308.gz:2016-03-07T10:58:56-05:00 drdmem0011.en.desres.deshaw.com slurmstepd-drdmem0011[3163]: error: Job 90458115 exceeded memory limit (76477892 > 18432000), being killed
/var/log/messages-20160308.gz:2016-03-07T15:12:48-05:00 drdmem0009.en.desres.deshaw.com slurmstepd-drdmem0009[14381]: error: Job 90470268 exceeded memory limit (266862956 > 102400000), being killed
/var/log/messages-20160308.gz:2016-03-07T21:33:14-05:00 drdmem0011.en.desres.deshaw.com slurmstepd-drdmem0011[26940]: error: Job 90536641 exceeded memory limit (273248696 > 204800000), being killed
/var/log/messages-20160313.gz:2016-03-12T16:10:05-05:00 drdmem0014.en.desres.deshaw.com slurmstepd-drdmem0014[1587]: error: Step 91470287.0 exceeded memory limit (232607968 > 204800000), being killed
/var/log/messages-20160315.gz:2016-03-15T04:00:42-04:00 drdmem0017.en.desres.deshaw.com slurmstepd-drdmem0017[28709]: error: Step 91968467.0 exceeded memory limit (176315284 > 147456000), being killed

I suppose we could also toggle MemLimitEnforce, but that is a global right? All jobs and all partitions? I don't see a way to over-ride that for certain user/jobs/partitions or anything (which is probably for the best).
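[Editorial note, not part of the original report: one way to see which node in the partition currently has the most memory configured and free is sinfo's node-oriented output. This is only a sketch; the partition name is borrowed from the commands above, and the FreeMem (%e) field may not be available on older Slurm releases.]

sinfo -N -p drdmem -o "%N %m %e"   # %N node name, %m configured memory (MB), %e free memory (MB)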
(In reply to Chris from comment #0)
> I'm mostly just looking for some info ...
>
> We have a partition with bigger memory, but inconsistently sized nodes. A
> user would like to a) get the one with the most memory available at a given
> time, get the entire node/memory allocated for that node and b) would like
> the freedom to swap as needed.

Are you using cgroups for the enforcement? You can set AllowedSwapSpace for the node, although that's global, not per job.

> For a) we're thinking about just creating another partition with all the
> nodes which fit their requirements or maybe creating a feature they can use.
>
> It'd be nice, but I don't see a way, to specify a memory range, say
> something like --mem=200G-1024G as there is with --nodes=min-max. Am I
> missing that? We're using 14.11 (but really any week now it'll be 15).

It's an interesting idea - specify a range (potentially unbounded at the high end), and allocate as much as possible for that job?

Such a feature doesn't exist at present. I'm still mulling over whether there may be a way to approximate that now.

> It'd also be nice, but I don't see a way, to specify a node list with ORs in
> it. This just chops off the other node when I really want to say, give me
> one node and use any of the named nodes.
>
> drdws0109:~$ garden with -m desres-settings/slurmcl2/8/all srun -p drdmem
> -t 1200 --mem 0 --exclusive --nodes=1-1
> --nodelist=drdmem[0006,0009-0011,0013-0017,0021] -l /bin/hostname
> srun: error: Required nodelist includes more nodes than permitted by
> max-node count (10 > 1). Eliminating nodes from the nodelist.
> srun: job 92039435 queued and waiting for resources
> srun: job 92039435 has been allocated resources
> 0: drdmem0006.en.desres.deshaw.com
>
> I guess that is not what it is meant for, eh?

Nope. That's really only meant for specific testing where you want to land back on the exact same nodes as before.

I'd suggest setting appropriate Features on the nodes you want to target, and using --constraint to limit the nodes slurmctld will consider the job for. That does support some set manipulation with AND and OR logic, and you can even construct a job with a given number of one type of node and a number of a second type. E.g., if you had an application running as a central master plus a bunch of workers, you could specify --constraint="[bigmem*1&fastproc*12]" to get your 13 nodes.

The sbatch man page gives some examples; you'd just need to add appropriate Features labels to your nodes. There aren't any types for these, they're just text labels, so you can be as detailed as you'd like - I've seen sites list a high-level CPU architecture ("Xeon") as well as the specific model ("e5-2670v2").

Or adding additional partitions can accomplish much the same thing. The one advantage to using features is you can give the users the option to use them or not, and they may come up with some combination of interesting nodes that don't readily lend themselves to a partition.

> For b) If they say srun --mem=200G and they land on the node with 768GB of
> memory then they still only get 200G allocated and their job ends up getting
> killed.
>
> [fennm@drdbfe1 ~]$ zgrep drdmem /var/log/messages* | grep "being killed"
> /var/log/messages-20160308.gz:2016-03-07T10:58:56-05:00 drdmem0011.en.desres.deshaw.com slurmstepd-drdmem0011[3163]: error: Job 90458115 exceeded memory limit (76477892 > 18432000), being killed
> /var/log/messages-20160308.gz:2016-03-07T15:12:48-05:00 drdmem0009.en.desres.deshaw.com slurmstepd-drdmem0009[14381]: error: Job 90470268 exceeded memory limit (266862956 > 102400000), being killed
> /var/log/messages-20160308.gz:2016-03-07T21:33:14-05:00 drdmem0011.en.desres.deshaw.com slurmstepd-drdmem0011[26940]: error: Job 90536641 exceeded memory limit (273248696 > 204800000), being killed
> /var/log/messages-20160313.gz:2016-03-12T16:10:05-05:00 drdmem0014.en.desres.deshaw.com slurmstepd-drdmem0014[1587]: error: Step 91470287.0 exceeded memory limit (232607968 > 204800000), being killed
> /var/log/messages-20160315.gz:2016-03-15T04:00:42-04:00 drdmem0017.en.desres.deshaw.com slurmstepd-drdmem0017[28709]: error: Step 91968467.0 exceeded memory limit (176315284 > 147456000), being killed
>
> I suppose we could also toggle MemLimitEnforce, but that is a global right?
> All jobs and all partitions? I don't see a way to over-ride that for certain
> user/jobs/partitions or anything (which is probably for the best).

MemLimitEnforce is global. I don't think anyone's ever asked to override it per-job; that introduces a lot of odd edge cases, and it'd probably be better long-term to allow for --mem as a range rather than consider that. Slurm doesn't like being partially in control of a resource, as it prevents the scheduler from knowing when the next job needing a specific allocation will launch. So everything tends to be either tracked or enforced, or ignored altogether.

- Tim
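[Editorial note: a minimal sketch of the Features/--constraint approach described above. The "bigmem" label is hypothetical, the node names are borrowed from the report, and the rest of the existing NodeName attributes are elided.]

# slurm.conf - add a feature label to the nodes the user should be able to target
# (keep your existing NodeName attributes; only Feature= is new here)
NodeName=drdmem[0013-0017,0021] Feature=bigmem ...

# after "scontrol reconfigure", the user can request one whole node from that set,
# with --mem=0 --exclusive giving them all of that node's memory (as in the report):
srun -p drdmem -t 1200 --exclusive --mem=0 --nodes=1 --constraint=bigmem -l /bin/hostname

Labels can also be combined with the AND/OR syntax Tim mentions, e.g. --constraint="[bigmem*1&fastproc*12]" for a mixed allocation.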
Thanks - we went with the feature option since we already had a bunch of those and that appears to work.

We haven't converted to cgroups (yet), but it is on our list... is pretty much everyone else using them?
(In reply to Chris from comment #2)
> Thanks - we went with the feature option since we already had a bunch of
> those and that appears to work.
>
> We haven't converted to cgroups (yet), but it is on our list... is
> pretty much everyone else using them?

We heavily recommend using them, although not everyone does. The active isolation they provide is quite nice and avoids odd behavior when the OOM killer starts deciding what to kill. slurmd/slurmstepd's memory limit enforcement is passive and only triggered on the accounting data collection polling every 30 seconds, and depending on the application that can be enough time to crash the node.

If you have no other questions, can I go ahead and mark this as resolved?

- Tim
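[Editorial note: a rough sketch of the cgroup-based enforcement being recommended here; the values are illustrative, not site recommendations.]

# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf
CgroupAutomount=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedSwapSpace=0    # percent of allocated memory usable as swap, cluster-wide (not per job); raise to permit some swapping

With ConstrainRAMSpace the kernel enforces the limit as the job runs, rather than waiting on the 30-second accounting poll mentioned above.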
Please do. Thanks :>
Closing.