Ticket 230 - SLURM erroneously marking block in error
Summary: SLURM erroneously marking block in error
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Bluegene select plugin
Version: 2.5.x
Hardware: IBM BlueGene Linux
Severity: 1 - System not usable
Assignee: Danny Auble
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2013-02-08 05:25 MST by Don Lipari
Modified: 2013-02-08 10:14 MST

See Also:
Site: LLNL
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Don Lipari 2013-02-08 05:25:33 MST
Since the addition of the 1K, we have occasionally seen Slurm create a block for the non-existent rzuseq0001 node (R00-M1).  smap shows this right now:

  no part RMP08Fe104921195 Error(F        -  T,T,T,T   512 rzuseq0001

This happened after the following sequence of events.  I removed all blocks related to R1.  I ran a 1K job in pbatch, which is set up for rzuseq[0010x0011] (i.e. R1-M[0-1]).  That worked.  Then I submitted a 512-node job.  Slurm created the bogus block above, then freed and deleted the 1K block (RMP08Fe104734636), and then attempted to boot the bad block (RMP08Fe104921195).  That boot failed, which is why the block is in error now.

Now we have this:

  $ sinfo
  PARTITION AVAIL  TIMELIMIT  NODES  STATE MIDPLANELIST
  pdebug*      up    8:00:00      1    err rzuseq0000
  pdebug*      up    8:00:00    257  alloc rzuseq0000
  pdebug*      up    8:00:00    254   idle rzuseq0000
  pbatch       up    4:00:00    512    err rzuseq0010
  pbatch       up    4:00:00    512   idle rzuseq0011

  $ sinfo -Rl
  Fri Feb  8 10:59:46 2013
  REASON               USER         TIMESTAMP           STATE  MIDPLANELIST
  Block(s) in error st Unknown      Unknown             err    rzuseq0000
  Block(s) in error st Unknown      Unknown             alloc  rzuseq0000
  Block(s) in error st Unknown      Unknown             idle   rzuseq0000
  status_check: Boot f slurm(101)   2013-02-08T10:50:21 err    rzuseq0010

So Slurm is marking rzuseq0010 in error in response to the bogus
rzuseq0001 being in error.  Also, I cannot run a 512-node job on the remaining midplane (rzuseq0011) even though it appears fine.
My submitted job goes into pending on "Resources".
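For reference, a block stuck in an error state like this can usually be cleared by hand.  A sketch, assuming the block name from the smap output above and the administrator block-state controls described on the bluegene.html page:

```shell
# Free the bogus block so the midplane can leave the error state
# (block name taken from the smap output above; run as an administrator)
scontrol update BlockName=RMP08Fe104921195 State=FREE
```

This only clears the symptom; the underlying cause of the bogus block still needs to be addressed.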

Comment 1 Danny Auble 2013-02-08 05:31:20 MST
Does this system only have 1.5k nodes?  I'm not sure what this system looks like.

So what does your slurm.conf say for node declarations?

My guess is it is something like rzuseq[0000x0011]; at least it should be something like that.  Even if the nodes don't exist in reality, they exist in the DB2 database, so they must be represented in Slurm.

More information on the system would be nice.  I am unaware of what it looks like.
Comment 2 Don Lipari 2013-02-08 05:34:23 MST
(In reply to comment #1)
> Does this system only have 1.5k nodes?  Not sure what this system looks like.
> 
> So what does your slurm.conf say for node declarations?
> 
> My guess is it is something like rzuseq[0000x0011]; at least it should be
> something like that.  Even if the nodes don't exist in reality, they exist
> in the DB2 database, so they must be represented in Slurm.
> 
> More information on the system would be nice.  I am unaware of what it looks
> like.

# COMPUTE NODES
FrontendName=rzuseqlac1
FrontendName=rzuseqlac2
NodeName=DEFAULT Procs=8192 RealMemory=2097152 State=UNKNOWN
NodeName=rzuseq[0000,0010x0011]

Include /etc/slurm/slurm.conf.updates

PartitionName=pbatch Nodes=rzuseq[0010x0011] Default=No State=UP Shared=FORCE DefaultTime=60 MaxTime=4:00:00  AllowGroups=bgldev,gupta7,jdelsign

cat /etc/slurm/slurm.conf.updates

PartitionName=pdebug Nodes=rzuseq0000 Default=YES State=UP Shared=FORCE DefaultTime=60 MaxTime=8:00:00 MaxNodes=64
Comment 3 Danny Auble 2013-02-08 05:43:55 MST
This is most likely a bad configuration.

I would put 

NodeName=rzuseq[0000,0010x0011]
NodeName=rzuseq0001 State=DOWN

For the reasons cited in the last comment.  You most likely don't need the second line, and the two can be combined.  Slurm creates that block on the missing node on purpose, in most cases, to avoid running jobs there.

See http://schedmd.com/slurmdocs/bluegene.html under "Naming Conventions" and look at the second "IMPORTANT" note.  Perhaps it isn't very clear in this situation, since your system does start on 0000, but the idea still applies; if you have better wording to help prevent this in the future, please submit a patch.

If this doesn't fix your problem please reopen the ticket.
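With Danny's suggestion folded in, the node-declaration section of the slurm.conf from comment 2 would look something like this (a sketch; only the rzuseq0001 line is new):

```
# COMPUTE NODES
FrontendName=rzuseqlac1
FrontendName=rzuseqlac2
NodeName=DEFAULT Procs=8192 RealMemory=2097152 State=UNKNOWN
NodeName=rzuseq[0000,0010x0011]
NodeName=rzuseq0001 State=DOWN
```

Declaring the midplane that exists only in DB2 as DOWN tells Slurm about it without ever scheduling jobs there.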
Comment 4 Don Lipari 2013-02-08 10:14:41 MST
(In reply to comment #3)
> This is most likely a bad configuration.
> 
> I would put 
> 
> NodeName=rzuseq[0000,0010x0011]
> NodeName=rzuseq0001 State=DOWN
> 
> For the reasons cited in the last comment.  You most likely don't need the
> second line, and the two can be combined.  Slurm creates that block on the
> missing node on purpose, in most cases, to avoid running jobs there.
> 
> See http://schedmd.com/slurmdocs/bluegene.html under "Naming Conventions"
> and look at the second "IMPORTANT" note.  Perhaps it isn't very clear in
> this situation, since your system does start on 0000, but the idea still
> applies; if you have better wording to help prevent this in the future,
> please submit a patch.
> 
> If this doesn't fix your problem please reopen the ticket.

We added Danny's recommended line to slurm.conf and have not been able to recreate the problem. Jobs are running.