| Summary: | strange slurm behavior with overlapping block on Sequoia | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | Bluegene select plugin | Assignee: | Danny Auble <da> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | ||
| Version: | 2.5.x | ||
| Hardware: | IBM BlueGene | ||
| OS: | Linux | ||
| Site: | LLNL | Slinky Site: | --- |
Description
Don Lipari
2013-02-15 05:52:51 MST
(In reply to comment #0)
> This could be another manifestation of the problem reported in bug 218,
> but...

This could be the correct happenings in reference to 218. If you could send the complete log from

[2013-02-15T09:19:32-08:00]
[2013-02-15T09:29:34-08:00]

it would be more apparent.

>
> slurmctld log says:
>
> # seqsn3 /var/log/slurm > grep 63820 slurmctld.log
> [2013-02-15T09:19:03-08:00] sched: _slurm_rpc_allocate_resources JobId=63820 NodeList=(null) usec=36397
> [2013-02-15T09:19:32-08:00] Queue start of job 63820 in BG block RMP15Fe091931090
> [2013-02-15T09:19:32-08:00] backfill: Started JobId=63820 on seq[1110x1210,1112x1213]
> [2013-02-15T09:29:34-08:00] Pending job 63820 on block RMP15Fe091931090 will try to be requeued because overlapping block RMP15Fe063326881 is in an error state.
> [2013-02-15T09:29:34-08:00] We are freeing a block (RMP15Fe091931090) that has job 63820(63820).
> [2013-02-15T09:29:34-08:00] error: Couldn't requeue job 63820, failing it: Requested operation is presently disabled
> [2013-02-15T09:29:34-08:00] Queue termination of job 63820 in BG block RMP15Fe091931090
> # seqsn3 /var/log/slurm >
>
> User saw:
>
> seqlac2@mgd:salloc -t 240 -p pbatch -N 3072 -n 49152 -x seq3200
> salloc: Pending job allocation 63820
> salloc: job 63820 queued and waiting for resources
> salloc: job 63820 has been allocated resources
> salloc: Granted job allocation 63820
> salloc: Waiting for block RMP15Fe091931090 to become ready for job
>
> FR issue 596198.
>
> It appears that after this the block was booted successfully. The
> question is: How are we ending up with overlapping blocks in dynamic
> mode? Should this ever happen?

Are you asking this question or is the user? If this is related to 218 then this should be obvious. The requested log will tell us for sure, but my guess is that the block that went into error was being freed and didn't quite make it, which produced the error state.
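For anyone who needs to pull the requested window out of a large slurmctld.log, here is a minimal sketch in Python. It assumes every log entry starts with a bracketed ISO-8601 timestamp carrying a UTC offset, as in the excerpt quoted above. The function names `parse_stamp` and `lines_in_window` are hypothetical helpers for illustration and are not part of Slurm.

```python
from datetime import datetime


def parse_stamp(line):
    """Return the leading [ISO-8601] timestamp of a log line, or None.

    Assumes the format seen in the excerpt, e.g.
    [2013-02-15T09:19:32-08:00] Queue start of job ...
    """
    if not line.startswith("["):
        return None
    end = line.find("]")
    if end == -1:
        return None
    try:
        return datetime.fromisoformat(line[1:end])
    except ValueError:
        return None


def lines_in_window(lines, start, end):
    """Yield the lines whose timestamp lies in [start, end], inclusive.

    Lines without a timestamp are treated as continuations and follow the
    decision made for the most recent stamped line. The bounds must carry
    a UTC offset, like the log timestamps, so the comparison is valid.
    """
    lo = datetime.fromisoformat(start)
    hi = datetime.fromisoformat(end)
    keep = False
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is not None:
            keep = lo <= stamp <= hi
        if keep:
            yield line


# Example use (the path is an assumption based on the prompt shown above):
# with open("/var/log/slurm/slurmctld.log") as log:
#     for line in lines_in_window(log,
#                                 "2013-02-15T09:19:32-08:00",
#                                 "2013-02-15T09:29:34-08:00"):
#         print(line, end="")
```

Unlike `grep 63820`, this keeps every entry in the interval, including messages about the overlapping block that do not mention the job ID.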
The last comment claims this block booted successfully later. Is that really the case? The log here claims the block this job was to run on is being freed, and I am guessing destroyed. Based on this partial log it is hard to tell what happened to both blocks. If this is bug 218 then both blocks would be left. It doesn't seem like this is the case though.

(In reply to comment #1)
> (In reply to comment #0)
> > This could be another manifestation of the problem reported in bug 218,
> > but...
>
> This could be the correct happenings in reference to 218. If you could send
> the complete log from
>
> [2013-02-15T09:19:32-08:00]
> [2013-02-15T09:29:34-08:00]
>
> it would be more apparent.

mailed.

[...]
> Are you asking this question or is the user?

Adam Bertsch

*** This ticket has been marked as a duplicate of ticket 218 ***