Ticket 2353 - GRES and memory underflow on job time limit exhausted
Summary: GRES and memory underflow on job time limit exhausted
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 15.08.6
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2016-01-15 13:14 MST by Moe Jette
Modified: 2018-02-16 12:10 MST

See Also:
Site: NERSC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 15.08.8
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Moe Jette 2016-01-15 13:14:11 MST
This is a follow-up on one issue discovered in the course of investigating bug 2350. Note that the job was created with "--gres=craynetwork:1" and every node has a Gres=craynetwork:4. See the original bug for configuration and log files.

[2016-01-15T04:05:42.016] Time limit exhausted for JobId=30731
[2016-01-15T04:05:42.021] debug:  backup controller responding
[2016-01-15T04:05:42.258] job_complete: JobID=30731 State=0x8006 NodeCnt=20 WTERMSIG 15
[2016-01-15T04:05:42.914] debug:  freed ports 63023 for step 31976.89
[2016-01-15T04:05:43.010] error: gres/craynetwork: job 28127 node nid00636 gres count underflow
[2016-01-15T04:05:43.010] error: cons_res: node nid00636 memory is under-allocated (0-64523) for job 28127
[2016-01-15T04:05:43.010] error: gres/craynetwork: job 28127 node nid00645 gres count underflow
[2016-01-15T04:05:43.010] error: cons_res: node nid00645 memory is under-allocated (0-64523) for job 28127
[2016-01-15T04:05:43.010] error: gres/craynetwork: job 28127 node nid00646 gres count underflow
[2016-01-15T04:05:43.010] error: cons_res: node nid00646 memory is under-allocated (0-64523) for job 28127
[2016-01-15T04:05:43.010] error: gres/craynetwork: job 28127 node nid00647 gres count underflow
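
For context, the setup amounts to configuration lines like the following (an illustrative sketch using the node names above; the exact files are attached to bug 2350):

# slurm.conf: every node advertises four craynetwork GRES
NodeName=nid00636 Gres=craynetwork:4 ...

# gres.conf on each node
Name=craynetwork Count=4

# submission requesting one craynetwork GRES per node
sbatch --gres=craynetwork:1 job.sh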
Comment 1 Moe Jette 2016-01-19 10:01:40 MST
In the process of working on bug 2350, I reproduced this problem. What I did was suspend a running job, then run "scontrol reconfig". I am not sure that was the specific code path involved in your case. You could check to see if job 28127 or any other job associated with nid00636 was ever suspended around that time frame. My fix will be in version 15.08.7 when released later this week. The commit is here:
https://github.com/SchedMD/slurm/commit/21c52d2f61e8086209d0c4d18f4700c07588ead9

Note, I do not believe this was responsible for the node state of "mixed" when only whole node allocations were possible.
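
For reference, the reproduction described above boils down to the following sequence (the job ID is illustrative):

scontrol suspend 30731     # suspend a running job
scontrol reconfigure       # slurmctld rebuilds its state
scontrol resume 30731
# after the reconfiguration, the suspended job's resource accounting can
# be rebuilt inconsistently; when the job's resources are later released,
# the counters are decremented below zero, producing the underflow and
# under-allocated errors shown in the description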
Comment 2 Moe Jette 2016-01-21 08:03:43 MST
A race condition involving the NHC (node health check) could also result in mis-counting of memory and GRES; this commit addresses it:
https://github.com/SchedMD/slurm/commit/79a21bd697cf2fd365e497872387628a1c670b39

I'm going to close this ticket on the assumption that these two patches fix the problem.
Comment 3 Doug Jacobsen 2016-01-21 08:32:04 MST
Thanks, Moe!

-Doug
Comment 4 Moe Jette 2016-01-22 02:01:34 MST
Somehow this ticket got re-opened. Closing it again.
Comment 5 Miguel Gila 2016-05-31 17:30:45 MDT
Hi, 

At CSCS we just installed native Slurm 15.08.11 on our XC30, Daint, and we're seeing this problem under similar conditions:

<27>1 2016-06-01T08:27:44.768593+02:00 c2-0c1s1n2 slurmctld 19662 p0-20160531t074203 -  error: gres/craynetwork: job 41846 node nid04585 gres count underflow (0 1)
<27>1 2016-06-01T08:27:44.768612+02:00 c2-0c1s1n2 slurmctld 19662 p0-20160531t074203 -  error: cons_res: node nid04585 memory is under-allocated (0-32000) for job 41846
<27>1 2016-06-01T08:27:44.768623+02:00 c2-0c1s1n2 slurmctld 19662 p0-20160531t074203 -  error: gres/craynetwork: job 41846 node nid04905 gres count underflow (0 1)
<27>1 2016-06-01T08:27:44.768633+02:00 c2-0c1s1n2 slurmctld 19662 p0-20160531t074203 -  error: cons_res: node nid04905 memory is under-allocated (0-32000) for job 41846
<27>1 2016-06-01T08:27:44.778120+02:00 c2-0c1s1n2 slurmctld 19662 p0-20160531t074203 -  error: gres/craynetwork: job 41846 node nid04585 gres count underflow (0 1)
<27>1 2016-06-01T08:27:44.778139+02:00 c2-0c1s1n2 slurmctld 19662 p0-20160531t074203 -  error: cons_res: node nid04585 memory is under-allocated (0-32000) for job 41846
<27>1 2016-06-01T08:27:44.778149+02:00 c2-0c1s1n2 slurmctld 19662 p0-20160531t074203 -  error: gres/craynetwork: job 41846 node nid04905 gres count underflow (0 1)
<27>1 2016-06-01T08:27:44.778160+02:00 c2-0c1s1n2 slurmctld 19662 p0-20160531t074203 -  error: cons_res: node nid04905 memory is under-allocated (0-32000) for job 41846

We did run a scontrol reconfigure prior to getting these messages. 

Any idea on how to solve it?

Thanks,
Miguel
Comment 6 Moe Jette 2016-06-01 02:37:42 MDT
Please attach more of your slurmctld log file. I need to see the history of the jobs and nodes from before the errors.

(In reply to Miguel Gila from comment #5)
> Hi, 
> 
> at CSCS we just installed native Slurm 15.08.11 on our XC30 Daint and we're
> seeing this problem happening under similar conditions:
> 
> We did run a scontrol reconfigure prior to getting these messages. 
> 
> Any idea on how to solve it?
Comment 7 Miguel Gila 2016-06-03 05:43:22 MDT
Hi Moe,

I think this is a false positive; the problem seems gone after we rebooted a bunch of problematic nodes that had never picked up the right configuration. Yesterday we learned that /etc/init.d/wlm_switch overwrites gres.conf with its own generated config and bind-mounts over it, so even if you change this value in your xtopview, the wlm_switch service needs to be restarted (not just slurmd) for the nodes to see the changes.

After playing with the system a bit, we resorted to marking the problematic nodes DOWN and warm-rebooting them. The problem is no longer happening.
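
In other words, a gres.conf change on these systems needs something along these lines to take effect (a sketch of the two options described above):

# restarting slurmd alone is not enough: wlm_switch regenerates its own
# gres.conf and bind-mounts it over the edited file
/etc/init.d/wlm_switch restart

# or drain the affected nodes and warm-reboot them, as we did
scontrol update NodeName=nid04585,nid04905 State=DOWN Reason="stale gres.conf"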

Apologies for the noise.

Kind regards,
Miguel
Comment 8 Moe Jette 2016-06-03 05:51:19 MDT
(In reply to Miguel Gila from comment #7)
> After playing with the system a bit, we resorted marking the problematic
> nodes in DOWN state and warmreboot them. Now the problem is no longer
> happening.
> 
> Apologies for the noise.

No problem. I am just pleased that it is working now.
Comment 9 Doug Jacobsen 2016-06-03 07:36:58 MDT
Hmmm... note that NERSC does not use Cray's gres.conf, nor do we allow its bind mounts to happen or get in the way.

We simply have a global gres.conf in our sysconfig path
(/opt/slurm/etc/gres.conf) that sets the craynetwork gres for all our nodes.

-Doug
Comment 10 Moe Jette 2016-06-03 08:35:17 MDT
(In reply to Doug Jacobsen from comment #9)
> hmmm.. note that NERSC does not use Cray's gres.conf, nor allow their bind
> mounts to happen, or get in the way.
> 
> We simply have a global gres.conf in our sysconfig path
> (/opt/slurm/etc/gres.conf) that sets the craynetwork gres for all our nodes.
> 
> -Doug

If you are still seeing this error, could you attach some logs showing the problem?
Comment 11 Doug Jacobsen 2016-06-03 08:53:38 MDT
I'll try to send one later today once I get a few more things done; otherwise I will on Monday.

I find an easy way to provoke this issue is to run `scontrol reconfigure`
on a busy system while jobs are running.
We've also seen some failure modes where slurmctld will start throwing huge quantities of these messages (filling up log space), but restarting slurmctld usually clears that up.

Typically we see memory underflows as well as gres underflows. But since it hasn't really impacted functionality, it hasn't been at the top of my list to clear up. The only scary issue is the log space filling up, since that can be disruptive if we don't notice in time.

I'll send logs soon.
Doug
Comment 12 Moe Jette 2016-06-03 09:31:03 MDT
I'm reopening this bug based upon Doug's last comment.

I did recently fix a related bug that affected the tracking of jobs in a suspended state (either manually suspended or gang-scheduled), but I'm guessing that would not be a factor here. This is the commit:
https://github.com/SchedMD/slurm/commit/4ce626789dbfff156254345362abb54ebda92784

The other thing that comes to mind is reconfiguration taking place while a job is in the completing state, especially if NHC is running. I'm thinking there could be a race condition involved, but that's just a guess.
Comment 13 Moe Jette 2016-06-06 09:43:01 MDT
(In reply to Moe Jette from comment #12)
> The other thing that comes to my mind is if reconfiguration takes place when
> a job is in a completing state, especially if NHC is running. I'm thinking
> there could be a race condition involved, but that's just a guess.

Doug, I don't think I'll need your logs. I was able to reproduce this failure when "scontrol reconfig" happens while the NHC is running for a job. Based upon my experience, that's probably 80% of the way toward a solution.
Comment 15 Moe Jette 2016-06-07 02:59:46 MDT
I have confirmed a fix for this problem. The original code would release a job's resources when the slurmctld was reconfigured if the NHC was running. Later, when the NHC completed, it would release the resources a second time and generate the underflow errors.

I modified the code so that if the NHC has been running for 5 minutes or more, it keeps the old behaviour, on the expectation that NHC may be hung and the system administrator may want to make those resources (memory, GPUs, etc.) available to other jobs; if/when the NHC finally completes, an underflow error will still be generated. If the NHC has been running for less than 5 minutes, it marks the job's resources as in use so that there will be no underflow error when NHC completes.

I do think we want some sort of NHC timeout so that a system administrator can "kick" Slurm to get jobs running again if NHC hangs, but I'm not sure what that timeout should be. Also note that I am making this change only in Slurm version 16.05 rather than risk introducing a new problem late in the 15.08 release cycle. The change is here:
https://github.com/SchedMD/slurm/commit/de1400c986fb2a879e466f9bda210865fce95579
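
In pseudocode terms, the change amounts to the sketch below (the function and field names are illustrative, not the actual slurmctld symbols; see the commit for the real code):

/* Illustrative sketch: decide, at reconfigure time, what to do with a
 * completing job whose node health check (NHC) is still running. */
#define NHC_GRACE_SECS (5 * 60)   /* 5 minutes */

static void reconfig_completing_job(struct job_record *job_ptr, time_t now)
{
    if ((now - job_ptr->nhc_start_time) >= NHC_GRACE_SECS) {
        /* NHC may be hung: release the job's memory/GRES now so other
         * jobs can use them; a (harmless) underflow error is still
         * logged if NHC eventually completes. */
        release_job_resources(job_ptr);
    } else {
        /* NHC presumed healthy: keep the job's resources marked as
         * allocated so the release at NHC completion does not
         * underflow. */
        mark_job_resources_allocated(job_ptr);
    }
}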

Comments?
Comment 16 Moe Jette 2016-06-29 11:53:50 MDT
Fixed in v16.05.1 (to be released today).
Comment 17 Doug Jacobsen 2016-06-29 12:00:29 MDT
What time?  I'm getting ready to start image builds, but can delay a little
if it's about to be released...
