This could be another manifestation of the problem reported in bug 218, but...

slurmctld log says:

# seqsn3 /var/log/slurm > grep 63820 slurmctld.log
[2013-02-15T09:19:03-08:00] sched: _slurm_rpc_allocate_resources JobId=63820 NodeList=(null) usec=36397
[2013-02-15T09:19:32-08:00] Queue start of job 63820 in BG block RMP15Fe091931090
[2013-02-15T09:19:32-08:00] backfill: Started JobId=63820 on seq[1110x1210,1112x1213]
[2013-02-15T09:29:34-08:00] Pending job 63820 on block RMP15Fe091931090 will try to be requeued because overlapping block RMP15Fe063326881 is in an error state.
[2013-02-15T09:29:34-08:00] We are freeing a block (RMP15Fe091931090) that has job 63820(63820).
[2013-02-15T09:29:34-08:00] error: Couldn't requeue job 63820, failing it: Requested operation is presently disabled
[2013-02-15T09:29:34-08:00] Queue termination of job 63820 in BG block RMP15Fe091931090
# seqsn3 /var/log/slurm >

User saw:

seqlac2@mgd:salloc -t 240 -p pbatch -N 3072 -n 49152 -x seq3200
salloc: Pending job allocation 63820
salloc: job 63820 queued and waiting for resources
salloc: job 63820 has been allocated resources
salloc: Granted job allocation 63820
salloc: Waiting for block RMP15Fe091931090 to become ready for job

FR issue 596198.

It appears that after this the block was booted successfully. The question is: How are we ending up with overlapping blocks in dynamic mode? Should this ever happen?
(In reply to comment #0)
> This could be another manifestation of the problem reported in bug 218,
> but...

This could be the correct happenings in reference to 218. If you could send the complete log from

[2013-02-15T09:19:32-08:00]
[2013-02-15T09:29:34-08:00]

it would be more apparent.

>
> slurmctld log says:
>
> # seqsn3 /var/log/slurm > grep 63820 slurmctld.log
> [2013-02-15T09:19:03-08:00] sched: _slurm_rpc_allocate_resources
> JobId=63820 NodeList=(null) usec=36397
> [2013-02-15T09:19:32-08:00] Queue start of job 63820 in BG block
> RMP15Fe091931090
> [2013-02-15T09:19:32-08:00] backfill: Started JobId=63820 on
> seq[1110x1210,1112x1213]
> [2013-02-15T09:29:34-08:00] Pending job 63820 on block RMP15Fe091931090
> will try to be requeued because overlapping block RMP15Fe063326881 is in
> an error state.
> [2013-02-15T09:29:34-08:00] We are freeing a block (RMP15Fe091931090)
> that has job 63820(63820).
> [2013-02-15T09:29:34-08:00] error: Couldn't requeue job 63820, failing
> it: Requested operation is presently disabled
> [2013-02-15T09:29:34-08:00] Queue termination of job 63820 in BG block
> RMP15Fe091931090
> # seqsn3 /var/log/slurm >
>
> User saw:
>
> seqlac2@mgd:salloc -t 240 -p pbatch -N 3072 -n 49152 -x seq3200
> salloc: Pending job allocation 63820
> salloc: job 63820 queued and waiting for resources
> salloc: job 63820 has been allocated resources
> salloc: Granted job allocation 63820
> salloc: Waiting for block RMP15Fe091931090 to become ready for job
>
> FR issue 596198.
>
> It appears that after this the block was booted successfully. The
> question is: How are we ending up with overlapping blocks in dynamic
> mode? Should this ever happen?

Are you asking this question or is the user? If this is related to 218 then this should be obvious. The requested log will tell us for sure, but my guess is that the block that went into error was being freed and didn't quite make it, which caused the error state.
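For reference, the log window requested above could be cut out of slurmctld.log with a small filter rather than by hand. This is only a minimal sketch: it assumes every log entry starts with a bracketed ISO-8601 timestamp, as in the excerpt from comment #0, and the names `parse_stamp` and `log_window` are hypothetical helpers, not part of Slurm.

```python
from datetime import datetime


def parse_stamp(line):
    """Return the timestamp at the start of a log line, or None if absent."""
    if not line.startswith("["):
        return None
    end = line.find("]")
    if end == -1:
        return None
    try:
        # Assumed format: [2013-02-15T09:19:32-08:00] ...
        return datetime.fromisoformat(line[1:end])
    except ValueError:
        return None


def log_window(lines, start, end):
    """Yield lines stamped within [start, end], both bounds inclusive.

    Lines without a timestamp are treated as continuations and follow
    the decision made for the last stamped line.
    """
    lo = datetime.fromisoformat(start)
    hi = datetime.fromisoformat(end)
    keep = False
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is not None:
            keep = lo <= stamp <= hi
        if keep:
            yield line
```

Passing an open file handle for slurmctld.log together with the two timestamps above as `start` and `end` would yield every entry in that window, not just the lines that mention job 63820, which is what is needed to see what happened to the overlapping block.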
The last comment claims this block booted successfully later. Is that really the case? The log here claims the block this job was to run on is being freed, and I am guessing destroyed. Based on this partial log it is hard to tell what happened to both blocks. If this is bug 218 then both blocks would be left. It doesn't seem like this is the case though.
(In reply to comment #1)
> (In reply to comment #0)
> > This could be another manifestation of the problem reported in bug 218,
> > but...
>
> This could be the correct happenings in reference to 218. If you could send
> the complete log from
>
> [2013-02-15T09:19:32-08:00]
> [2013-02-15T09:29:34-08:00]
>
> it would be more apparent.

mailed.

[...]

> Are you asking this question or is the user?

Adam Bertsch
*** This ticket has been marked as a duplicate of ticket 218 ***