Ticket 82 - Jobs Fail to Be Scheduled on Available BG/Q Nodes
Summary: Jobs Fail to Be Scheduled on Available BG/Q Nodes
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Bluegene select plugin
Version: 2.4.x
Hardware: IBM BlueGene Linux
Severity: 2 - High Impact
Assignee: Danny Auble
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2012-07-11 02:50 MDT by Don Lipari
Modified: 2012-07-12 09:38 MDT

See Also:
Site: LLNL


Attachments
Log Snippet (2.12 KB, text/plain)
2012-07-11 02:50 MDT, Don Lipari

Description Don Lipari 2012-07-11 02:50:00 MDT
Created attachment 89
Log Snippet

Dave Fox: I logged in this morning and found the machine idle.  No nodes were down
and everything looked fine.  The last job that ran was 32K and ended on the 4th at
about 17:40.  The next job in the queue was also 32K, but Slurm was not starting it.
I double-checked that the control system looked OK, then did an sfree on the block
to see if that would help.  Slurm still would not start a job.  In the end, I did a
/etc/init.d/slurm stop followed by a start, and then Slurm started scheduling again.

Again, on July 5, from Adam Bertsch:
Just before sending this message I was clearing up the last few nodes
that had been replaced, and Slurm was for some reason not scheduling the
next job.  We had a 48K job trying to run, and the log made it look like
it was trying to run, but it never started.  I restarted Slurm, and the
job was able to go.

slurmctld.log.thur copied to ~da on seq
Comment 1 Danny Auble 2012-07-11 06:20:03 MDT
This is fixed!

This is the same problem that has been seen for quite some time.  The issue is that if MMCS was hanging onto a job and, while that was happening, some other force (an admin or otherwise) removed the block that was running the job, then num_unused_cpus was not updated correctly.

It is fixed in this patch:

https://github.com/SchedMD/slurm/commit/11e2759f5c390e6d8c0448d6916104c28b4c3344
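For illustration, here is a minimal C sketch of the accounting pattern described in comment 1.  This is not the actual select/bluegene plugin code: bg_block_t, remove_block(), and the struct layout are assumptions made for this sketch; num_unused_cpus is the counter named above.

/* Minimal sketch (not actual Slurm code) of the num_unused_cpus
 * accounting described above.  bg_block_t and remove_block() are
 * hypothetical; the point is that a block torn down while MMCS is
 * still holding its job must credit its CPUs back to the free count. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
	int  cpu_cnt;      /* CPUs belonging to this block */
	bool job_running;  /* MMCS still hanging onto a job here? */
} bg_block_t;

static int num_unused_cpus;

static void remove_block(bg_block_t *block)
{
	/* The bug: this credit was skipped when a block was removed
	 * out from under a job that MMCS had not yet released, so the
	 * scheduler kept believing those CPUs were busy. */
	if (block->job_running) {
		block->job_running = false;
		num_unused_cpus += block->cpu_cnt;
	}
}

int main(void)
{
	bg_block_t block = { .cpu_cnt = 32768, .job_running = true };

	num_unused_cpus = 0;    /* job still charged against the count */
	remove_block(&block);   /* admin removes the block */
	printf("num_unused_cpus = %d\n", num_unused_cpus);
	return 0;
}

Without the credit in remove_block(), num_unused_cpus would stay at 0 and a queued 32K job would never be scheduled even though the hardware sat idle, matching the symptom reported above; restarting slurmctld rebuilt the state from scratch, which is presumably why the stop/start workaround helped.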