| Summary: | Error: Exceeded job memory limit | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | John Villa <jv2575> |
| Component: | slurmd | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | Priority: | --- |
| Version: | 17.11.2 | CC: | ca2783 |
| Hardware: | Linux | OS: | Linux |
| Site: | Columbia University | | |
| Attachments: | slurm.conf, cgroup.conf, slurmd, sdiag-output, slurmctld | | |
Description
John Villa
2018-03-30 09:30:39 MDT
Marshall Garey:
It sounds like jobs are exceeding their memory limits and are getting killed, and the slurmds are being placed into a drain state as a result. Can you upload your slurm.conf file? There are a number of different things that affect how memory limits are enforced. If you're using the task/cgroup plugin, can you also upload your cgroup.conf file? See also bug 3562 - there was a long discussion there about job memory limits and how they're handled.

John Villa:
Hello, please cc hpc-support@columbia.edu on any future correspondence. We are using cgroups. I will provide you with the files; please stand by. Thanks, John Villa

Marshall Garey:
You can add that email to the CC list, but the email address needs to be registered with SchedMD's bugzilla. I don't think it has been registered yet.

John Villa:
Created attachment 6510 [details] slurm.conf
Hello, please see the attached configs. We will stand by for your feedback. In the meantime, why would slurmd go into a drain state due to a job failure? Thanks, John

John Villa:
Created attachment 6511 [details] cgroup.conf
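As an aside for readers following along: the drain reason Slurm records can be checked directly on the cluster. The commands below are a general illustration, not output from this ticket; "node266" is simply the node name that shows up later in the attached logs.

    # List drained/down nodes together with the reason slurmctld recorded
    sinfo -R

    # Show the full state, including the Reason= field, for one node
    scontrol show node node266

    # After the underlying problem is fixed, an admin returns the node to service
    scontrol update nodename=node266 state=resume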
John Villa:
Hello, please be advised that this is happening again. We have seven more nodes that have thrown this error and been put into a drain state. Please increase the urgency level of this bug. Thanks, John Villa

John Villa:
Hello, we didn't experience this issue until upgrading. Would you recommend we use 'ConstrainKmemSpace=no'? Perhaps the enhancements in 17.x concerning memory errors are what we are seeing here? Thanks, John Villa

Marshall Garey:
Thanks for the configs. Do you want to enforce job memory limits? I noticed you have

ConstrainRAMSpace=no
ConstrainSwapSpace=no

in your cgroup.conf, which means memory limits won't be enforced by cgroups, but they will still be enforced by the JobAcctGatherType=jobacct_gather/linux plugin. You also have

SelectTypeParameters=CR_Core (not memory)

and you don't have DefMemPerCPU. So the only time a job has a memory limit is when it specifically requests one; those jobs are then exceeding the limit they requested and getting killed. I'm not sure why the node is being set to DRAIN. Can you upload a slurmd log file from one of those nodes?

If you want to enforce memory, then we recommend using the task/cgroup plugin to do it: ConstrainRAMSpace=yes, and then disabling memory enforcement by the JobAcctGather plugin (JobAcctGatherParams=NoOverMemoryKill). You may also want to set UsePss. Be careful, though! It's case sensitive - this was fixed, but I don't remember in which version.
If you don't want to enforce memory limits, you'll still want to set NoOverMemoryKill, and also consider setting UsePss.

John Villa:
Created attachment 6512 [details] slurmd
Hello, I think what is more important here is to find out why the nodes are being set to drain. We have had this configuration in place for months (maybe even a year or more) and this error is just starting to appear. In fact, we have more nodes down now. Please see the attached log. Thank you, John Villa

Marshall Garey:
Thanks. Can you also upload the output of sdiag and the slurmctld log? I think there may be something more going on here. For example:

[2018-03-30T12:05:55.000] [5880794.batch] error: *** JOB 5880794 STEPD TERMINATED ON node266 AT 2018-03-30T12:05:54 DUE TO JOB NOT ENDING WITH SIGNALS ***

indicates that jobs are not getting killed right away. One possible reason is that they are stuck on IO; see the FAQ: https://slurm.schedmd.com/faq.html#comp But it could be other things; let me keep looking through the slurmd log.

John Villa:
Hello, I can see that the jobs being cancelled are part of a larger job array from one of our users. I will prepare the other data you have requested. Please stand by. Thank you, John
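For context, with SelectTypeParameters=CR_Core and no DefMemPerCPU, an array job only picks up a memory limit if the submission explicitly asks for one. The sketch below is a hypothetical submission of that kind, not the user's actual script; the script name, array size, and memory value are made up for illustration.

    # Hypothetical array submission: each task gets a 4 GB limit only because
    # --mem is requested explicitly; without it, no per-job memory limit applies
    # under this site's configuration
    sbatch --array=1-500 --mem=4G --time=02:00:00 analyze.sh

A task whose memory use grows past that requested limit is what the jobacct_gather/linux plugin then kills.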
John Villa:
Created attachment 6513 [details] sdiag-output
Hello, please see the files attached. Please consider this of high priority. Thank you, John Villa

John Villa:
Created attachment 6514 [details] slurmctld
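For anyone gathering the same data on their own cluster, the attachments above correspond to standard outputs; a rough sketch, reading the log locations from the running configuration rather than assuming paths:

    # Scheduler diagnostics (the "sdiag-output" attachment)
    sdiag > sdiag-output.txt

    # Find where this cluster actually writes its daemon logs
    scontrol show config | grep -i logfile   # shows SlurmctldLogFile / SlurmdLogFile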
Marshall Garey:
Thank you. Feel free to adjust the severity level as you see fit, following our guidelines (https://www.schedmd.com/support.php), since you have more information about the status of your site than I do. When did this start happening? Today? Yesterday? You said they're jobs that are part of a large job array. I advise avoiding large job arrays for now.

John Villa:
Hello, we have seen this error periodically, maybe twice a week, for the past month. But this week we started seeing it more often, and today it appears to be happening constantly. It appears one user is spawning the array jobs. Why would this be an issue now? We have not made any major changes on our end. Thank you, John Villa

John Villa:
Hello, please be advised that our main concern here is that Slurm should be killing these jobs when they exceed their limits and then reallocating the resources. What we are seeing is the job being killed and then the node being set to drain. Is there a parameter we can tweak for this in the meantime? Advising our end users to refrain from array jobs is not the preferred course of action. Again, this just started happening. Thank you, John Villa
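As a side note for investigation (the commands are an illustration, not something from this ticket): the accounting database can show whether the killed array tasks had actually requested a memory limit and how much they used. Job ID 5880794 is taken from the slurmd log quoted earlier; the user name is a placeholder.

    # Per-task memory request vs. peak usage for one of the killed jobs
    sacct -j 5880794 --format=JobID,State,ExitCode,ReqMem,MaxRSS,Elapsed

    # List the individual tasks of the user's running/pending arrays
    squeue -u someuser -r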
Marshall Garey:
This explains why the nodes are being set to drain:

[2018-03-30T12:05:55.033] error: slurmd error running JobId=5880787 on node(s)=node266: Kill task failed
[2018-03-30T12:05:55.034] drain_nodes: node node266 state set to DRAIN

I advise constraining memory limits using cgroups, and disabling the JobAcctGather plugin from constraining memory limits (since that would be redundant):

# cgroup.conf:
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

# slurm.conf
JobAcctGatherParams=NoOverMemoryKill

There may be other things, too - I'm continuing to investigate.

Marshall Garey:
There is a significant patch for job arrays in 17.11.4: https://github.com/SchedMD/slurm/commit/f381e4e6abca6ce45709b86989112442487f856a An example of this bug is bug 4833. Thankfully, I haven't seen a sign of it in your log file. However, it is something to be aware of, so I advise applying this patch and limiting the use of job arrays (at least large ones) until you are able to apply it. Again, this doesn't appear to be the issue here, but it is quite possible it could become one.

Marshall Garey:
If steps cannot be killed, you may also need to use the slurm.conf parameters UnkillableStepTimeout and UnkillableStepProgram. If a step is not killed within UnkillableStepTimeout seconds, then UnkillableStepProgram will run as SlurmdUser (root for you).

John Villa:
Hello, we have looked into using these constraints in the past; however, we are not currently utilizing them for other reasons. I will talk with the rest of my team concerning the changes you are advising we make. In the meantime I will await more information. Thanks, John Villa

John Villa:
Hello, we will keep this in mind, but we do not want to apply these additional settings until we try the first ones recommended. Thanks, John Villa
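Pulling the suggestions from this and the surrounding comments together, the combined configuration would look roughly like the sketch below. The UnkillableStepTimeout value and the UnkillableStepProgram path are placeholders, not values from this ticket, and MemLimitEnforce=no is the extra setting Marshall brings up a bit further down (bug 4637).

    # cgroup.conf - enforce memory limits with the task/cgroup plugin
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes

    # slurm.conf - stop the JobAcctGather plugin from also enforcing limits
    JobAcctGatherParams=NoOverMemoryKill,UsePss
    MemLimitEnforce=no

    # slurm.conf - optional handling for steps that do not die to SIGKILL
    UnkillableStepTimeout=180                                    # placeholder value, in seconds
    UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh     # placeholder path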
John Villa:
Marshall, so the node is being set to drain because slurmd thinks there is an issue, due to the fact that the stepd couldn't kill the job? Please explain. Thanks, John

[2018-03-30T12:05:55.033] error: slurmd error running JobId=5880787 on node(s)=node266: Kill task failed
[2018-03-30T12:05:55.034] drain_nodes: node node266 state set to DRAIN

John Villa:
Marshall, basically what you are getting at is that we need to leave enough overhead on these machines for slurmd to operate properly? Why is this just happening now? Perhaps it is the perfect storm between this user's job arrays and operations? Thank you, John Villa

Marshall Garey:
slurmd couldn't kill the stepd, so it looks like the whole job was killed. When that happens, slurmctld places the node in the drain state to prevent future jobs from being scheduled on it. Please see bug 3941 for a longer explanation.

The backfill scheduling time from the sdiag output is also concerning - it had a max time of over 71 seconds and an average time of over 3 seconds. That indicates to me something was going wrong - backfill should never take 71 seconds, and 3 seconds seems long for an average time.
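For readers checking their own clusters: the backfill numbers Marshall refers to come straight from sdiag, which reports cycle times in microseconds, and the drain trigger shows up in the daemon logs. A minimal sketch; the log paths are assumptions, so substitute whatever SlurmdLogFile and SlurmctldLogFile point to in slurm.conf.

    # Backfill cycle statistics; a max cycle near 71,000,000 microseconds
    # corresponds to the ~71 seconds mentioned above
    sdiag | grep -A 15 -i backfill

    # Locate the errors quoted in this ticket in the daemon logs
    grep "NOT ENDING WITH SIGNALS" /var/log/slurm/slurmd.log
    grep -E "Kill task failed|state set to DRAIN" /var/log/slurm/slurmctld.log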
Marshall Garey:
See if you can find out why the job was unable to be killed - was it stuck on IO, or something else? What version did you upgrade from? 17.02 or something older? The node being placed in a draining state due to an unkillable step was introduced in 17.02, I believe - I'll need to double-check that. Also, before you set ConstrainRAMSpace: we actually advise not setting it if you're on RHEL 6, since RHEL 6 has lots of bugs in cgroups. I just remembered that.

Marshall Garey:
One more thing. If you do enforce memory limits with cgroups and want to disable memory enforcement by the jobacctgather plugin, you also need to add MemLimitEnforce=no to slurm.conf. See bug 4637. There was also discussion there about MaxRSS being inaccurate when pages are swapped out, and about using UsePss instead. (UsePss is case-sensitive in 17.11.2 - that's fixed in 17.11.3.) I'm hopeful that cgroup memory enforcement will solve the problems you're having. If not, or if you aren't able to use cgroup memory enforcement (e.g., if you're on RHEL 6), please let us know and we'll continue to look for alternate solutions.

John Villa:
Hello, I have tested memory limits with cgroups in my test environment and they appear to have worked in the past. We will not make the suggested changes until the next downtime. In the meantime, why would this just be happening now? Our workload has not changed. Have other users reported such errors after upgrading? Thank you, John Villa

Marshall Garey:
Why are you just now having nodes set to DRAIN because of steps exceeding memory limits and not getting killed? My best guess: because it's new behavior compared to the previous version of Slurm you were on. Looking at previous bugs submitted by your site, I'm guessing you upgraded from 16.05? I believe that prior to 17.02 you'd still get the errors, but the node wouldn't be set to DRAIN. (See comment 24.) I haven't noticed other sites report this particular error after upgrading.

Marshall Garey:
I have a little more information on this now. My guess in comment 28 was correct - when a step is unkillable, the node is put into the DRAIN state. This was introduced in 17.02 in commit f18390e81766a46b50ffa08a8cf1b7946ecdbf90, which references bug 3312 as the reason it was added. Your applications probably weren't getting killed properly before you upgraded. You'd have to check the application itself to see why it isn't dying to a SIGKILL, but it's possible that it's hung on IO.
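A process hung on IO sits in uninterruptible sleep ("D" state) and will not die to SIGKILL until the blocking IO completes, which is the situation the FAQ entry linked earlier describes. A small sketch of how one might check for that on an affected node (generic Linux commands, not from this ticket; <PID> is a placeholder):

    # Any process stuck in uninterruptible sleep, plus the kernel function it is waiting in
    ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /D/'

    # For a specific task, the kernel stack usually shows what it is blocked on (needs root)
    cat /proc/<PID>/stack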
Marshall Garey:
If cgroup memory enforcement isn't enough, you can increase UnkillableStepTimeout to be long enough for the application to get killed. You'd have to investigate the application itself to see how long you need to wait. This was a proper workaround for the customer in bug 4959, who experienced the same problem (nodes being set to DRAIN) because of a hung application.

John Villa:
Marshall, we appreciate the follow-up and the information provided. It makes sense that this was introduced in 17.02, since we didn't see this prior to upgrading. We will need to schedule a downtime with our researchers before we can move forward with any drastic changes in memory management. We will keep this bug open until we implement our configuration changes, as we might decide to add or remove some tweaks as we get closer. Thanks again for the follow-up here. Thanks, John Villa

Marshall Garey:
You're welcome. One question - how soon do you anticipate being able to schedule a downtime? If it's not very soon, would it be okay to close this bug for now as resolved/infogiven? If you do have problems when you make the changes, you can reopen this ticket by changing the status back to unconfirmed. That makes it easier for us to keep track of tickets that are actually active and what work we have left to do. However, it's fine if you'd prefer to keep the ticket open until then. - Marshall

John Villa:
Marshall, as long as we have the ability to re-open the ticket should we have questions before or after implementation, feel free to close it. Thanks, John
Marshall Garey:
Sounds good. Closing as resolved/infogiven. To re-open the ticket, simply change the ticket status back to unconfirmed, post an additional comment, and click "Save Changes."