Ticket 187 - Invalid node counts accepted in Sequoia
Summary: Invalid node counts accepted in Sequoia
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 2.4.x
Hardware: All Linux
Severity: 3 - Medium Impact
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2012-12-06 02:56 MST by Don Lipari
Modified: 2013-03-28 02:45 MDT (History)
1 user

See Also:
Site: LLNL


Attachments
fix for v2.5.5 (2.96 KB, patch)
2013-03-27 06:19 MDT, Moe Jette

Description Don Lipari 2012-12-06 02:56:32 MST
I was puzzled to see jobs from a user with invalid node counts.  SLURM accepted the jobs' submission, queued them, and then when it went to run the jobs, failed them with the message, "_pick_best_nodes: job 65398 never runnable".

I just did a simple test to confirm the behavior:

salloc -N95500
salloc: Pending job allocation 84765
salloc: job 84765 queued and waiting for resources

Please fix SLURM to reject a BG/Q job at submission time if it asks for a node count greater than 512 nodes that:

* is not a multiple of MidplaneNodeCnt nodes, or
* cannot form a 2-, 3-, or 4-dimensional block of midplanes
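A submit-time check along the lines requested above might look like the following sketch. The function name, the hard-coded 512-node midplane size, and the simplified rules are assumptions for illustration, not Slurm's actual validation code; the block-shape test (2-, 3-, or 4-D block of midplanes) is omitted for brevity.

```c
#include <stdbool.h>

/* Hypothetical submit-time sanity check (illustrative only). */
#define MIDPLANE_NODE_CNT 512	/* assumed BG/Q midplane size */

static bool node_count_runnable(unsigned req_nodes, unsigned total_nodes)
{
	if (req_nodes == 0 || req_nodes > total_nodes)
		return false;	/* can never be satisfied on this system */
	if ((req_nodes > MIDPLANE_NODE_CNT) &&
	    (req_nodes % MIDPLANE_NODE_CNT))
		return false;	/* large jobs must use whole midplanes */
	return true;
}
```

With a check like this, the `salloc -N95500` request above would be rejected at submission time (95500 is larger than a midplane but not a multiple of 512) instead of being queued and later failed as "never runnable".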
Comment 1 Moe Jette 2012-12-06 06:27:45 MST
This applies to any platform, not only Bluegene systems, and this fix is being made only to the v2.5 code. You can back-port fairly similar code to v2.4 if necessary.

https://github.com/SchedMD/slurm/commit/d46c7607d374eb36bf64fac74ed17d922b3df2fe
Comment 2 Don Lipari 2013-01-23 04:33:12 MST
(In reply to comment #1)
> This applies to any platform, not only Bluegene systems, and this fix is
> being made only to the v2.5 code. You can back-port fairly similar code to
> v2.4 if necessary.
> 
> https://github.com/SchedMD/slurm/commit/
> d46c7607d374eb36bf64fac74ed17d922b3df2fe

With the v2.5.1 code now installed on Sequoia, I attempted to test this fix.  I asked for 92K nodes - an impossibility.  While salloc will now reject the job when an active partition is specified, it still accepts the job for down partitions.  This results in the same problem:  users submit their jobs to partitions that will only be enabled later in the week.  They will wait in the queue for days and then fail when they are scheduled to run on a newly activated partition:

lipari@seqlac2$ salloc -N92K -p pscale
salloc: error: Failed to allocate resources: Requested node configuration is not available
lipari@seqlac2$ salloc -N92K -p pbatch
salloc: Requested partition configuration not available now
salloc: Pending job allocation 35492
salloc: job 35492 queued and waiting for resources
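The gap shown above can be pictured as a size check that is skipped when the target partition is down. A minimal sketch of the buggy versus fixed behavior, using hypothetical struct and field names (not Slurm's internal types):

```c
#include <stdbool.h>

/* Illustrative only: request should be tested against the partition's
 * *configured* node count even when the partition is down. */
struct part_info {
	bool	 state_up;	/* partition currently enabled? */
	unsigned total_nodes;	/* configured size; known even when down */
};

static bool submit_ok_buggy(unsigned req_nodes, const struct part_info *p)
{
	if (!p->state_up)
		return true;	/* bug: size check skipped when down */
	return req_nodes <= p->total_nodes;
}

static bool submit_ok_fixed(unsigned req_nodes, const struct part_info *p)
{
	/* Fix: enforce the configured limit unconditionally, so an
	 * impossible request is rejected even against a down partition. */
	return req_nodes <= p->total_nodes;
}
```

With the fixed check, an oversized request against a down partition would be rejected immediately rather than waiting in the queue for days and failing when the partition comes up.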
Comment 3 Moe Jette 2013-01-28 08:50:35 MST
I've changed the component from Bluegene plugin to scheduling. While this can be observed on a Bluegene, it is a generic Slurm bug.
Comment 4 Moe Jette 2013-03-27 06:19:11 MDT
Created attachment 220 [details]
fix for v2.5.5
Comment 5 Moe Jette 2013-03-27 06:21:08 MDT
I was able to reproduce the problem and make a fix. This will be in v2.5.5 when released, probably in the coming days.

What is your schedule for bringing BGQ back up? We could probably tag v2.5.5 before you bring the system up.
Comment 6 Don Lipari 2013-03-28 02:45:36 MDT
> What is your schedule for bringing BGQ back up? We could probably tag
> v2.5.5 before you bring the system up.

The plan calls for around April 12.