| Summary: | Invalid node counts accepted in Sequoia | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | Scheduling | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | da |
| Version: | 2.4.x | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Site: | LLNL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | fix for v2.5.5 | ||
|
Description
Don Lipari
2012-12-06 02:56:32 MST
This applies to any platform, not only Bluegene systems, and this fix is being made only to the v2.5 code. You can back-port fairly similar code to v2.4 if necessary.
https://github.com/SchedMD/slurm/commit/d46c7607d374eb36bf64fac74ed17d922b3df2fe

(In reply to comment #1)
> This applies to any platform, not only Bluegene systems, and this fix is
> being made only to the v2.5 code. You can back-port fairly similar code to
> v2.4 if necessary.
>
> https://github.com/SchedMD/slurm/commit/d46c7607d374eb36bf64fac74ed17d922b3df2fe

With the v2.5.1 code now installed on Sequoia, I attempted to test this fix. I asked for 92K nodes, an impossibility. While salloc now rejects the job when an active partition is specified, it still accepts the job for down partitions. This results in the same problem: users submit their jobs to partitions that will only be enabled later in the week. The jobs wait in the queue for days and then fail when they are scheduled to run on a newly activated partition:

lipari@seqlac2$ salloc -N92K -p pscale
salloc: error: Failed to allocate resources: Requested node configuration is not available
lipari@seqlac2$ salloc -N92K -p pbatch
salloc: Requested partition configuration not available now
salloc: Pending job allocation 35492
salloc: job 35492 queued and waiting for resources

I've changed the component from Bluegene plugin to scheduling. While this can be observed on a Bluegene, it is a generic Slurm bug.

Created attachment 220
fix for v2.5.5
I was able to reproduce the problem and make a fix. This will be in v2.5.5 when released, probably in the coming days. What is your schedule for bringing bgq back up? We could probably tag v2.5.5 before you bring the system up.

> What is your schedule for bringing bgq back up? We could probably tag
> v2.5.5 before you bring the system up.
The plan calls for around April 12.
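
For anyone reading this later, the behaviour reported above can be sketched roughly as follows. This is only an illustration of the symptom and of the check one would expect, not the code from the commit or the attachment. The `Partition` class, the `check_as_reported` and `check_expected` functions, and the node counts are all hypothetical and do not correspond to Slurm data structures.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    # Message fragments taken from the salloc output quoted in this report.
    REJECT = "Requested node configuration is not available"
    QUEUE = "queued and waiting for resources"


@dataclass
class Partition:
    # Hypothetical stand-in for a partition record; not a Slurm structure.
    name: str
    total_nodes: int  # nodes configured in the partition, whether up or down
    is_up: bool


def check_as_reported(part: Partition, requested_nodes: int) -> Verdict:
    """Symptom seen with v2.5.1: the size check only applies to active partitions."""
    if part.is_up and requested_nodes > part.total_nodes:
        return Verdict.REJECT
    return Verdict.QUEUE


def check_expected(part: Partition, requested_nodes: int) -> Verdict:
    """Expected behaviour: an impossible node count is rejected at submit time,
    regardless of whether the partition is currently up."""
    if requested_nodes > part.total_nodes:
        return Verdict.REJECT
    return Verdict.QUEUE


if __name__ == "__main__":
    # Illustrative sizes only; the real partition sizes are not given here.
    active = Partition("pscale", total_nodes=1024, is_up=True)
    down = Partition("pbatch", total_nodes=1024, is_up=False)
    for part in (active, down):
        print(part.name, check_as_reported(part, 2048).name,
              check_expected(part, 2048).name)
```

The point of the second function is that the comparison against the partition's configured size does not depend on partition state, so an oversized job never sits in the queue waiting for a partition that could not hold it anyway.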