Ticket 236

Summary:	strange slurm behavior with overlapping block on Sequoia
Product:	Slurm	Reporter:	Don Lipari <lipari1>
Component:	Bluegene select plugin	Assignee:	Danny Auble <da>
Status:	RESOLVED DUPLICATE	QA Contact:
Severity:	2 - High Impact
Priority:	---
Version:	2.5.x
Hardware:	IBM BlueGene
OS:	Linux
Site:	LLNL	Slinky Site:	---

Description Don Lipari 2013-02-15 05:52:51 MST

This could be another manifestation of the problem reported in bug 218, but...

slurmctld log says:

# seqsn3 /var/log/slurm > grep 63820 slurmctld.log
[2013-02-15T09:19:03-08:00] sched: _slurm_rpc_allocate_resources JobId=63820 NodeList=(null) usec=36397
[2013-02-15T09:19:32-08:00] Queue start of job 63820 in BG block RMP15Fe091931090
[2013-02-15T09:19:32-08:00] backfill: Started JobId=63820 on seq[1110x1210,1112x1213]
[2013-02-15T09:29:34-08:00] Pending job 63820 on block RMP15Fe091931090 will try to be requeued because overlapping block RMP15Fe063326881 is in an error state.
[2013-02-15T09:29:34-08:00] We are freeing a block (RMP15Fe091931090) that has job 63820(63820).
[2013-02-15T09:29:34-08:00] error: Couldn't requeue job 63820, failing it: Requested operation is presently disabled
[2013-02-15T09:29:34-08:00] Queue termination of job 63820 in BG block RMP15Fe091931090
# seqsn3 /var/log/slurm >


User saw:

seqlac2@mgd:salloc -t 240 -p pbatch -N 3072 -n 49152 -x seq3200
salloc: Pending job allocation 63820
salloc: job 63820 queued and waiting for resources
salloc: job 63820 has been allocated resources
salloc: Granted job allocation 63820
salloc: Waiting for block RMP15Fe091931090 to become ready for job

FR issue 596198.

It appears that after this the block was booted successfully.  The 
question is: How are we ending up with overlapping blocks in dynamic 
mode?  Should this ever happen?
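
For reference, the node list in the backfill line can be expanded to see which midplanes the block covers, and the same expansion could be used to test two blocks for overlap. This is only a rough sketch, not how the select plugin does it. It assumes each four-character coordinate is one base-36 digit per dimension, that `SSSSxEEEE` denotes an inclusive box from start to end, and that torus wraparound can be ignored. The helper names are made up for illustration. The node list of the overlapping block RMP15Fe063326881 is not in the log excerpt, so only the count for job 63820 is shown.

```python
from itertools import product

DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumed base-36 coordinate digits
NODES_PER_MIDPLANE = 512  # c-nodes in one BG/Q midplane


def expand_box(box):
    """Expand 'SSSSxEEEE' (or a single 'SSSS') into a set of coordinate tuples."""
    start, _, end = box.partition("x")
    end = end or start  # a lone coordinate is a one-midplane box
    axes = [range(DIGITS.index(s), DIGITS.index(e) + 1)
            for s, e in zip(start, end)]
    return set(product(*axes))


def expand_nodelist(nodelist):
    """Expand e.g. 'seq[1110x1210,1112x1213]' into a set of coordinate tuples."""
    inner = nodelist[nodelist.index("[") + 1:nodelist.rindex("]")]
    coords = set()
    for box in inner.split(","):
        coords |= expand_box(box)
    return coords


def blocks_overlap(nodelist_a, nodelist_b):
    """True if the two node lists share at least one midplane (no wraparound)."""
    return not expand_nodelist(nodelist_a).isdisjoint(expand_nodelist(nodelist_b))


midplanes = expand_nodelist("seq[1110x1210,1112x1213]")
print(len(midplanes), len(midplanes) * NODES_PER_MIDPLANE)  # 6 3072
```

Under these assumptions the block spans 6 midplanes, i.e. 3072 nodes, which matches the `-N 3072` in the salloc request.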

Comment 1 Danny Auble 2013-02-15 06:17:11 MST

(In reply to comment #0)
> This could be another manifestation of the problem reported in bug 218,
> but...

This could be the expected behavior with respect to 218.  If you could send the complete log from

[2013-02-15T09:19:32-08:00]
[2013-02-15T09:29:34-08:00] 

it would be more apparent.
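
If it helps, something like the following could pull that window out of slurmctld.log. It is a minimal sketch that assumes every entry starts with a bracketed ISO-8601 timestamp carrying a UTC offset, as in the excerpt above; `log_window` is a name made up here, not an existing tool. Lines without a leading timestamp are treated as continuations of the previous entry.

```python
from datetime import datetime


def log_window(lines, start, end):
    """Yield lines whose bracketed timestamp falls within [start, end].

    Lines that do not begin with a parseable timestamp inherit the
    keep/skip decision of the entry before them.
    """
    lo = datetime.fromisoformat(start)
    hi = datetime.fromisoformat(end)
    keep = False
    for line in lines:
        if line.startswith("[") and "]" in line:
            stamp = line[1:line.index("]")]
            try:
                keep = lo <= datetime.fromisoformat(stamp) <= hi
            except ValueError:
                pass  # not a timestamp, treat as a continuation line
        if keep:
            yield line


def extract(path, start, end):
    """Return the matching lines from the log file at `path` as a list."""
    with open(path) as fh:
        return list(log_window(fh, start, end))
```

For this ticket the call would be `extract("slurmctld.log", "2013-02-15T09:19:32-08:00", "2013-02-15T09:29:34-08:00")`.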

> 
> slurmctld log says:
> 
> # seqsn3 /var/log/slurm > grep 63820 slurmctld.log
> [2013-02-15T09:19:03-08:00] sched: _slurm_rpc_allocate_resources JobId=63820 NodeList=(null) usec=36397
> [2013-02-15T09:19:32-08:00] Queue start of job 63820 in BG block RMP15Fe091931090
> [2013-02-15T09:19:32-08:00] backfill: Started JobId=63820 on seq[1110x1210,1112x1213]
> [2013-02-15T09:29:34-08:00] Pending job 63820 on block RMP15Fe091931090 will try to be requeued because overlapping block RMP15Fe063326881 is in an error state.
> [2013-02-15T09:29:34-08:00] We are freeing a block (RMP15Fe091931090) that has job 63820(63820).
> [2013-02-15T09:29:34-08:00] error: Couldn't requeue job 63820, failing it: Requested operation is presently disabled
> [2013-02-15T09:29:34-08:00] Queue termination of job 63820 in BG block RMP15Fe091931090
> # seqsn3 /var/log/slurm >
> 
> 
> User saw:
> 
> seqlac2@mgd:salloc -t 240 -p pbatch -N 3072 -n 49152 -x seq3200
> salloc: Pending job allocation 63820
> salloc: job 63820 queued and waiting for resources
> salloc: job 63820 has been allocated resources
> salloc: Granted job allocation 63820
> salloc: Waiting for block RMP15Fe091931090 to become ready for job
> 
> FR issue 596198.
> 
> It appears that after this the block was booted successfully.  The 
> question is: How are we ending up with overlapping blocks in dynamic 
> mode?  Should this ever happen?

Are you asking this question or is the user?  If this is related to 218 then this should be obvious.  The requested log will tell us for sure, but my guess is that the block that went into error was being freed and didn't quite make it, which put it into the error state.

The last comment claims this block booted successfully later.  Is that really the case?  The log here claims the block this job was to run on is being freed, and I am guessing destroyed.  Based on this partial log it is hard to tell what happened to both blocks.  If this is bug 218 then both blocks would be left.  It doesn't seem like this is the case though.

Comment 2 Don Lipari 2013-02-15 07:41:34 MST

(In reply to comment #1)
> (In reply to comment #0)
> > This could be another manifestation of the problem reported in bug 218,
> > but...
> 
> This could be the expected behavior with respect to 218.  If you could send
> the complete log from
> 
> [2013-02-15T09:19:32-08:00]
> [2013-02-15T09:29:34-08:00] 
> 
> it would be more apparent.

mailed.
[...]
> Are you asking this question or is the user?
Adam Bertsch

Comment 3 Danny Auble 2013-02-15 08:57:12 MST


*** This ticket has been marked as a duplicate of ticket 218 ***