I was puzzled to see jobs from a user with invalid node counts. SLURM accepted the jobs at submission, queued them, and then failed them at run time with the message "_pick_best_nodes: job 65398 never runnable". A simple test confirms the behavior:

salloc -N95500
salloc: Pending job allocation 84765
salloc: job 84765 queued and waiting for resources

Please fix SLURM to reject a BG/Q job at submission time if it asks for a node count which, if greater than 512 nodes,
* is not a multiple of MidplaneNodeCnt nodes, or
* cannot complete a 2, 3, or 4 dimensional block of midplanes
This applies to any platform, not only Bluegene systems, and this fix is being made only to the v2.5 code. You can back-port fairly similar code to v2.4 if necessary.

https://github.com/SchedMD/slurm/commit/d46c7607d374eb36bf64fac74ed17d922b3df2fe
(In reply to comment #1)
> This applies to any platform, not only Bluegene systems and this fix is
> being made only to the v2.5 code. You can back-port fairly similar code to
> v2.4 is necesary.
>
> https://github.com/SchedMD/slurm/commit/d46c7607d374eb36bf64fac74ed17d922b3df2fe

With the v2.5.1 code now installed on Sequoia, I attempted to test this fix. I asked for 92K nodes, an impossibility. While salloc now rejects the job when an active partition is specified, it still accepts the job for down partitions. This results in the same problem: users submit their jobs to partitions that will only be enabled later in the week, wait in the queue for days, and then fail when they are scheduled to run on the newly activated partition:

lipari@seqlac2$ salloc -N92K -p pscale
salloc: error: Failed to allocate resources: Requested node configuration is not available
lipari@seqlac2$ salloc -N92K -p pbatch
salloc: Requested partition configuration not available now
salloc: Pending job allocation 35492
salloc: job 35492 queued and waiting for resources
I've changed the component from Bluegene plugin to scheduling. While this can be observed on a Bluegene, it is a generic Slurm bug.
Created attachment 220 [details] fix for v2.5.5
I was able to reproduce the problem and make a fix. This will be in v2.5.5 when released, probably in the coming days. What is your schedule for bringing BGQ back up? We could probably tag v2.5.5 before you bring the system up.
> What is your scheduling for bringing bgq back up? we could probably tag
> v2.5.5 before you bring the system up.

The plan calls for around April 12.