Ticket 212

Summary: Fatal Error on BG/Q
Product: Slurm Reporter: Don Lipari <lipari1>
Component: Bluegene select plugin    Assignee: Danny Auble <da>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
Version: 2.5.x   
Hardware: IBM BlueGene   
OS: Linux   
Site: LLNL

Description Don Lipari 2013-01-22 04:58:16 MST
I just updated Sequoia to v2.5.1.  There was a problem converting the state files, and I did not have time to diagnose it.  So, I deleted all the state files.  When I restarted, I saw this:

[2013-01-22T08:51:34-08:00] RMP20Ja225511530 not found in the state file, adding
[2013-01-22T08:51:34-08:00] debug:  We don't allow X passthoughs
[2013-01-22T08:51:34-08:00] _fill_in_wires: we can't use this so return
[2013-01-22T08:51:34-08:00] fatal: I was unable to make the requested block.

Adam had seen this same problem over the weekend.  Here is what he found:

If hardware on the system is in an error state that would prevent the creation of a block, but the block was created before the hardware went into error, and you do a startclean on Slurm, this is what happens: Slurm tries to re-create the block to sync up its internal state, and when it sees that its internal rules prevent that block creation, it exits.

The workaround is to delete any blocks from the Blue Gene control system that cause this error prior to doing the clean start of Slurm.

The block name is the one in the log immediately before the error; in this case, RMP20Ja225511530.
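As a sketch of that workaround, the offending block name can be pulled from slurmctld.log before the clean start. The log format is taken from the lines quoted above; the temp file here only makes the example self-contained, so in practice point LOG at your actual SlurmctldLogFile path.

```shell
# Sample log mirroring the ticket; set LOG to your real SlurmctldLogFile
# instead of generating one.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
[2013-01-22T08:51:34-08:00] RMP20Ja225511530 not found in the state file, adding
[2013-01-22T08:51:34-08:00] debug:  We don't allow X passthoughs
[2013-01-22T08:51:34-08:00] _fill_in_wires: we can't use this so return
[2013-01-22T08:51:34-08:00] fatal: I was unable to make the requested block.
EOF
# The block logged right before the fatal is the one to free in the BG/Q
# control system before the clean start of slurmctld.
grep 'not found in the state file, adding' "$LOG" | tail -n 1 | awk '{print $2}'
rm -f "$LOG"
```

With the log above, this prints RMP20Ja225511530, the block to delete from the control system.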
Comment 1 Don Lipari 2013-01-22 06:31:25 MST
The following was in the bluegene.conf file at the time of the problem:

DenyPassthrough=A,X
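For context, DenyPassthrough in bluegene.conf lists the torus dimensions in which blocks may not use passthrough midplanes. With A and X denied, Slurm refuses to rebuild any saved block that would need an A or X passthrough, which matches the "We don't allow X passthoughs" line in the log. An illustrative excerpt (the comment is mine, not from the site's actual config):

```
# bluegene.conf excerpt (illustrative). DenyPassthrough prevents block
# creation that would require passthrough midplanes in the listed
# dimensions (A, X, Y, or Z on BG/Q).
DenyPassthrough=A,X
```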
Comment 2 Danny Auble 2013-01-22 07:24:18 MST
Having the state files would be very valuable.  Perhaps the ".old" files still exist?

Any idea how RMP20Ja225511530 was generated?

From what you say here it sounds like it was generated outside of Slurm.  Is that not the case?

Doing a startclean should remove all old blocks.  Is this not what is happening?  I wouldn't expect it to try to re-create them unless you have a non-dynamic setup.
Comment 3 Danny Auble 2013-01-22 07:28:58 MST
Even having the logs from the bad state files would be helpful.
Comment 4 Don Lipari 2013-01-22 07:52:04 MST
(In reply to comment #2)
> Having the state files would be very valuable.  Perhaps the ".old" files
> still exist?
> 
> Any idea how RMP20Ja225511530 was generated?
> 
> From what you say here it sounds like it was generated outside of Slurm.  Is
> that not the case?
> 
> Doing a startclean should remove all old blocks.  Is this not what is
> happening?  I wouldn't expect it to try to re-create them unless you have a
> non-dynamic setup.

I'm afraid I did not take care to preserve the old state files for diagnosis.  The system was in a dynamic setup.  The slurmctld.log files are still there, as are some core files (in /tmp) from when the 2.4 state files were read in.

To summarize, there were two bugs: one when reading the 2.4 state files, the other when starting with no state files.  I'm not too concerned with the first bug, as it is a one-time-only event.  This ticket is for the second bug.  The block creation record is in the logs:

[2013-01-20T22:55:11] Record: BlockID:RMP20Ja225511530 Nodes:seq[4120x4220] Conn:T,T,T,T

You can log in and look at the slurmctld.logs or I can mail them to you if it is easier.
Comment 5 Danny Auble 2013-01-22 10:34:32 MST
If you could email them it would be great.  Any backtrace from the cores would be of interest as well.  They probably point to the problem.
Comment 6 Don Lipari 2013-01-23 04:09:28 MST
(In reply to comment #5)
> If you could email them it would be great.  Any backtrace from the cores
> would be of interest as well.  They probably point to the problem.

slurmctld.log and backtraces mailed
Comment 7 Moe Jette 2013-01-24 02:41:27 MST
The second bug reported in comment #4, an abort in the reservation logic when upgrading from Slurm v2.4 to v2.5.1, will be fixed in v2.5.2. The patch is here:

https://github.com/SchedMD/slurm/commit/604b869ecdd512e0aa4cc244e4138842e7556b1f
Comment 8 Danny Auble 2013-04-02 06:34:12 MDT
This is fixed in commit 6af6f6156a4f8bc71405b6ef22ad0d56f3e14d96 and will be in the next 2.5 release.