| Summary: | Error: Exceeded job memory limit | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | John Villa <jv2575> |
| Component: | slurmd | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | Priority: | --- |
| Version: | 17.11.2 | CC: | ca2783 |
| Hardware: | Linux | OS: | Linux |
| Site: | Columbia University | | |
| Attachments: | slurm.conf, cgroup.conf, slurmd, sdiag-output, slurmctld | | |
Description
John Villa
2018-03-30 09:30:39 MDT
Marshall Garey:
It sounds like jobs are exceeding their memory limits and are getting killed, and the slurmds are being placed into a drain state as a result. Can you upload your slurm.conf file? There are a number of different things that affect how memory limits are enforced. If you're using the task/cgroup plugin, can you also upload your cgroup.conf file? See also bug 3562 - there was a long discussion there about job memory limits and how they're handled.

John Villa:
Hello, please cc hpc-support@columbia.edu on any future correspondence. We are using cgroups. I will provide you with the files; please stand by. Thanks, John Villa

Marshall Garey:
You can add that email to the CC list, but the email address needs to be registered with SchedMD's bugzilla. I don't think it has been registered yet.

John Villa:
Created attachment 6510 [details] slurm.conf
Hello, please see the attached configs. We will stand by for your feedback. In the meantime, why would slurmd go into a drain state due to a job failure? Thanks, John

John Villa:
Created attachment 6511 [details] cgroup.conf
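As an aside for readers following along: the drain reason Slurm records can be checked directly on the cluster. The commands below are a general illustration, not output from this ticket; "node266" is simply the node name that shows up later in the attached logs.

    # List drained/down nodes together with the reason slurmctld recorded
    sinfo -R

    # Show the full state, including the Reason= field, for one node
    scontrol show node node266

    # After the underlying problem is fixed, an admin returns the node to service
    scontrol update nodename=node266 state=resume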
John Villa:
Hello, please be advised that this is happening again. We have seven more nodes that have thrown this error and been put into a drain state. Please increase the urgency level of this bug. Thanks, John Villa

John Villa:
Hello, we didn't experience this issue until upgrading. Would you recommend we use 'ConstrainKmemSpace=no'? Perhaps the enhancements in 17.x concerning memory errors are what we are seeing here? Thanks, John Villa

Marshall Garey:
Thanks for the configs. Do you want to enforce job memory limits? I noticed you have

ConstrainRAMSpace=no
ConstrainSwapSpace=no

in your cgroup.conf, which means memory limits won't be enforced by cgroups, but they will still be enforced by the JobAcctGatherType=jobacct_gather/linux plugin. You also have

SelectTypeParameters=CR_Core (not memory)

and you don't have DefMemPerCPU. So the only time a job has a memory limit is when it specifically requests one; those jobs are then exceeding the limit they requested and getting killed. I'm not sure why the node is being set to DRAIN. Can you upload a slurmd log file from one of those nodes?

If you want to enforce memory, then we recommend using the task/cgroup plugin to do it: ConstrainRAMSpace=yes, and then disabling memory enforcement by the JobAcctGather plugin (JobAcctGatherParams=NoOverMemoryKill). You may also want to set UsePss. Be careful, though! It's case sensitive - this was fixed, but I don't remember in which version.
If you don't want to enforce memory limits, you'll still want to set NoOverMemoryKill, and also consider setting UsePss.

John Villa:
Created attachment 6512 [details] slurmd
Hello, I think what is more important here is to find out why the nodes are being set to drain. We have had this configuration in place for months (maybe even a year or more) and this error is just starting to appear. In fact, we have more nodes down now. Please see the attached log. Thank you, John Villa

Marshall Garey:
Thanks. Can you also upload the output of sdiag and the slurmctld log? I think there may be something more going on here. For example:

[2018-03-30T12:05:55.000] [5880794.batch] error: *** JOB 5880794 STEPD TERMINATED ON node266 AT 2018-03-30T12:05:54 DUE TO JOB NOT ENDING WITH SIGNALS ***

indicates that jobs are not getting killed right away. One possible reason is that they are stuck on IO; see the FAQ: https://slurm.schedmd.com/faq.html#comp But it could be other things; let me keep looking through the slurmd log.

John Villa:
Hello, I can see that the jobs being cancelled are part of a larger job array from one of our users. I will prepare the other data you have requested. Please stand by. Thank you, John
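For context, with SelectTypeParameters=CR_Core and no DefMemPerCPU, an array job only picks up a memory limit if the submission explicitly asks for one. The sketch below is a hypothetical submission of that kind, not the user's actual script; the script name, array size, and memory value are made up for illustration.

    # Hypothetical array submission: each task gets a 4 GB limit only because
    # --mem is requested explicitly; without it, no per-job memory limit applies
    # under this site's configuration
    sbatch --array=1-500 --mem=4G --time=02:00:00 analyze.sh

A task whose memory use grows past that requested limit is what the jobacct_gather/linux plugin then kills.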
John Villa:
Created attachment 6513 [details] sdiag-output
Hello, please see the files attached. Please consider this of high priority. Thank you, John Villa

John Villa:
Created attachment 6514 [details] slurmctld
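For anyone gathering the same data on their own cluster, the attachments above correspond to standard outputs; a rough sketch, reading the log locations from the running configuration rather than assuming paths:

    # Scheduler diagnostics (the "sdiag-output" attachment)
    sdiag > sdiag-output.txt

    # Find where this cluster actually writes its daemon logs
    scontrol show config | grep -i logfile   # shows SlurmctldLogFile / SlurmdLogFile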
Marshall Garey:
Thank you. Feel free to adjust the severity level as you see fit, following our guidelines (https://www.schedmd.com/support.php), since you have more information about the status of your site than I do. When did this start happening? Today? Yesterday? You said they're jobs that are part of a large job array. I advise avoiding large job arrays for now.

John Villa:
Hello, we have seen this error periodically, maybe twice a week, for the past month. But this week we started seeing it more often, and today it appears to be happening constantly. It appears one user is spawning the array jobs. Why would this be an issue now? We have not made any major changes on our end. Thank you, John Villa

John Villa:
Hello, please be advised that our main concern here is that Slurm should be killing these jobs when they exceed their limits and then reallocating the resources. What we are seeing is the job being killed and then the node being set to drain. Is there a parameter we can tweak for this in the meantime? Advising our end users to refrain from array jobs is not the preferred course of action. Again, this just started happening. Thank you, John Villa
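As a side note for investigation (the commands are an illustration, not something from this ticket): the accounting database can show whether the killed array tasks had actually requested a memory limit and how much they used. Job ID 5880794 is taken from the slurmd log quoted earlier; the user name is a placeholder.

    # Per-task memory request vs. peak usage for one of the killed jobs
    sacct -j 5880794 --format=JobID,State,ExitCode,ReqMem,MaxRSS,Elapsed

    # List the individual tasks of the user's running/pending arrays
    squeue -u someuser -r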
Marshall Garey:
This explains why the nodes are being set to drain:

[2018-03-30T12:05:55.033] error: slurmd error running JobId=5880787 on node(s)=node266: Kill task failed
[2018-03-30T12:05:55.034] drain_nodes: node node266 state set to DRAIN

I advise constraining memory limits using cgroups, and disabling the JobAcctGather plugin from constraining memory limits (since that would be redundant):

# cgroup.conf:
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

# slurm.conf
JobAcctGatherParams=NoOverMemoryKill

There may be other things, too - I'm continuing to investigate.

Marshall Garey:
There is a significant patch for job arrays in 17.11.4: https://github.com/SchedMD/slurm/commit/f381e4e6abca6ce45709b86989112442487f856a An example of this bug is bug 4833. Thankfully, I haven't seen a sign of it in your log file. However, it is something to be aware of, so I advise applying this patch and limiting the use of job arrays (at least large ones) until you are able to apply it. Again, this doesn't appear to be the issue here, but it is quite possible it could become one.

Marshall Garey:
If steps cannot be killed, you may also need to use the slurm.conf parameters UnkillableStepTimeout and UnkillableStepProgram. If a step is not killed within UnkillableStepTimeout seconds, then UnkillableStepProgram will run as SlurmdUser (root for you).

John Villa:
Hello, we have looked into using these constraints in the past; however, we are not currently utilizing them for other reasons. I will talk with the rest of my team concerning the changes you are advising we make. In the meantime I will await more information. Thanks, John Villa

John Villa:
Hello, we will keep this in mind, but we do not want to apply these additional settings until we try the first ones recommended. Thanks, John Villa
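Pulling the suggestions from this and the surrounding comments together, the combined configuration would look roughly like the sketch below. The UnkillableStepTimeout value and the UnkillableStepProgram path are placeholders, not values from this ticket, and MemLimitEnforce=no is the extra setting Marshall brings up a bit further down (bug 4637).

    # cgroup.conf - enforce memory limits with the task/cgroup plugin
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes

    # slurm.conf - stop the JobAcctGather plugin from also enforcing limits
    JobAcctGatherParams=NoOverMemoryKill,UsePss
    MemLimitEnforce=no

    # slurm.conf - optional handling for steps that do not die to SIGKILL
    UnkillableStepTimeout=180                                    # placeholder value, in seconds
    UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh     # placeholder path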
John Villa:
Marshall, so the node is being set to drain because slurmd thinks there is an issue, due to the fact that the stepd couldn't kill the job? Please explain. Thanks, John

[2018-03-30T12:05:55.033] error: slurmd error running JobId=5880787 on node(s)=node266: Kill task failed
[2018-03-30T12:05:55.034] drain_nodes: node node266 state set to DRAIN

John Villa:
Marshall, basically what you are getting at is that we need to leave enough overhead on these machines for slurmd to operate properly? Why is this just happening now? Perhaps it is the perfect storm between this user's job arrays and operations? Thank you, John Villa

Marshall Garey:
slurmd couldn't kill the stepd, so it looks like the whole job was killed. When that happens, slurmctld places the node in the drain state to prevent future jobs from being scheduled on it. Please see bug 3941 for a longer explanation.

The backfill scheduling time from the sdiag output is also concerning - it had a max time of over 71 seconds and an average time of over 3 seconds. That indicates to me something was going wrong - backfill should never take 71 seconds, and 3 seconds seems long for an average time.
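For readers checking their own clusters: the backfill numbers Marshall refers to come straight from sdiag, which reports cycle times in microseconds, and the drain trigger shows up in the daemon logs. A minimal sketch; the log paths are assumptions, so substitute whatever SlurmdLogFile and SlurmctldLogFile point to in slurm.conf.

    # Backfill cycle statistics; a max cycle near 71,000,000 microseconds
    # corresponds to the ~71 seconds mentioned above
    sdiag | grep -A 15 -i backfill

    # Locate the errors quoted in this ticket in the daemon logs
    grep "NOT ENDING WITH SIGNALS" /var/log/slurm/slurmd.log
    grep -E "Kill task failed|state set to DRAIN" /var/log/slurm/slurmctld.log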
Marshall Garey:
See if you can find out why the job was unable to be killed - was it stuck on IO, or something else? What version did you upgrade from? 17.02 or something older? The node being placed in a draining state due to an unkillable step was introduced in 17.02, I believe - I'll need to double-check that. Also, before you set ConstrainRAMSpace: we actually advise not setting it if you're on RHEL 6, since RHEL 6 has lots of bugs in cgroups. I just remembered that.

Marshall Garey:
One more thing. If you do enforce memory limits with cgroups and want to disable memory enforcement by the jobacctgather plugin, you also need to add MemLimitEnforce=no to slurm.conf. See bug 4637. There was also discussion there about MaxRSS being inaccurate when pages are swapped out, and about using UsePss instead. (UsePss is case-sensitive in 17.11.2 - that's fixed in 17.11.3.) I'm hopeful that cgroup memory enforcement will solve the problems you're having. If not, or if you aren't able to use cgroup memory enforcement (e.g., if you're on RHEL 6), please let us know and we'll continue to look for alternate solutions.

John Villa:
Hello, I have tested memory limits with cgroups in my test environment and they appear to have worked in the past. We will not make the suggested changes until the next downtime. In the meantime, why would this just be happening now? Our workload has not changed. Have other users reported such errors after upgrading? Thank you, John Villa

Marshall Garey:
Why are you just now having nodes set to DRAIN because of steps exceeding memory limits and not getting killed? My best guess: because it's new behavior compared to the previous version of Slurm you were on. Looking at previous bugs submitted by your site, I'm guessing you upgraded from 16.05? I believe that prior to 17.02 you'd still get the errors, but the node wouldn't be set to DRAIN. (See comment 24.) I haven't noticed other sites report this particular error after upgrading.

Marshall Garey:
I have a little more information on this now. My guess in comment 28 was correct - when a step is unkillable, the node is put into the DRAIN state. This was introduced in 17.02 in commit f18390e81766a46b50ffa08a8cf1b7946ecdbf90, which references bug 3312 as the reason it was added. Your applications probably weren't getting killed properly before you upgraded. You'd have to check the application itself to see why it isn't dying to a SIGKILL, but it's possible that it's hung on IO.
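A process hung on IO sits in uninterruptible sleep ("D" state) and will not die to SIGKILL until the blocking IO completes, which is the situation the FAQ entry linked earlier describes. A small sketch of how one might check for that on an affected node (generic Linux commands, not from this ticket; <PID> is a placeholder):

    # Any process stuck in uninterruptible sleep, plus the kernel function it is waiting in
    ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /D/'

    # For a specific task, the kernel stack usually shows what it is blocked on (needs root)
    cat /proc/<PID>/stack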
Marshall Garey:
If cgroup memory enforcement isn't enough, you can increase UnkillableStepTimeout to be long enough for the application to get killed. You'd have to investigate the application itself to see how long you need to wait. This was a proper workaround for the customer in bug 4959, who experienced the same problem (nodes being set to DRAIN) because of a hung application.

John Villa:
Marshall, we appreciate the follow-up and the information provided. It makes sense that this was introduced in 17.02, since we didn't see this prior to upgrading. We will need to schedule a downtime with our researchers before we can move forward with any drastic changes in memory management. We will keep this bug open until we implement our configuration changes, as we might decide to add or remove some tweaks as we get closer. Thanks again for the follow-up here. Thanks, John Villa

Marshall Garey:
You're welcome. One question - how soon do you anticipate being able to schedule a downtime? If it's not very soon, would it be okay to close this bug for now as resolved/infogiven? If you do have problems when you make the changes, you can reopen this ticket by changing the status back to unconfirmed. That makes it easier for us to keep track of tickets that are actually active and what work we have left to do. However, it's fine if you'd prefer to keep the ticket open until then. - Marshall

John Villa:
Marshall, as long as we have the ability to re-open the ticket should we have questions before or after implementation, feel free to close it. Thanks, John
Marshall Garey:
Sounds good. Closing as resolved/infogiven. To re-open the ticket, simply change the ticket status back to unconfirmed, post an additional comment, and click "Save Changes."