Ticket 872

Summary: jobs failed on controller restart
Product: Slurm
Reporter: Michael Gutteridge <mrg>
Component: Scheduling
Assignee: Moe Jette <jette>
Status: RESOLVED FIXED
Severity: 6 - No support contract
CC: da
Version: 2.6.7
Hardware: Linux
OS: Linux
Site: FHCRC - Fred Hutchinson Cancer Research Center
Version Fixed: 14.03.4

Description Michael Gutteridge 2014-06-10 06:17:53 MDT
We've had some trouble with curious job failures: the jobs aren't even assigned nodes and accumulate no run time:

       JobID        NodeList      State ExitCode
------------ --------------- ---------- --------
     7229124   None assigned     FAILED      0:1
We finally got some better log data (I'd had the log level turned way too low), which suggests that restarting and/or reconfiguring the controller is at the root. After some preliminaries (purging job records, recovering active jobs) there will be these sorts of messages:

> [2014-06-09T23:10:15.920] No nodes satisfy job 7228909 requirements in partition full
> [2014-06-09T23:10:15.920] sched: schedule: JobId=7228909 non-runnable: Requested node configuration is not available

The indicated job specified --mem and --tmp, but the values are within the capacities of all nodes in that "full" partition. Typically, if a user requests resources exceeding those available on nodes in the partition, the submission fails at submit time. This failure appears to occur only for jobs with memory and/or disk constraints. Worse yet, it's not consistent: it only seems to happen sometimes. I also cannot reproduce it in our test environment.
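For reference, a job of the affected kind might be submitted with a script along these lines (the partition and node limits are from the ticket; the --mem/--tmp values and program name are illustrative, since the failing job's actual values aren't given):

```shell
#!/bin/bash
# Hypothetical job script of the affected kind; the actual --mem/--tmp
# values of the failing job are not given in the ticket.
#SBATCH --partition=full
#SBATCH --mem=24000    # MB, within the RealMemory=48000 of these nodes
#SBATCH --tmp=10000    # MB of temporary disk requested
srun ./my_program      # placeholder command
```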

A typical node configuration line looks like this:

NodeName=gizmod[51-60] Sockets=2 CoresPerSocket=6 RealMemory=48000 Weight=10 Feature=full,restart,rx200,ssd

I do have FastSchedule=0... the RealMemory specification is some cruft left over from the test configuration used to build this one.  
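For context, with FastSchedule=0 the scheduler bases decisions on the resources each slurmd actually reports at registration rather than the values configured in slurm.conf, so the controller's view of node memory depends on whether each node has re-registered yet. A minimal slurm.conf sketch (the NodeName line is from the ticket; the comments are my gloss):

```shell
# slurm.conf (fragment)
FastSchedule=0   # schedule against resources reported by each slurmd,
                 # not the configured values below
NodeName=gizmod[51-60] Sockets=2 CoresPerSocket=6 RealMemory=48000 Weight=10 Feature=full,restart,rx200,ssd
```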

Honestly, it *feels* like there's a window after restart where the node data hasn't yet been fully loaded from the slurmds, so the scheduler doesn't see any nodes that satisfy the requirements.

Thanks

Michael
Comment 1 Moe Jette 2014-06-16 10:11:34 MDT
This is fixed in the version 14.03.4 release, which just came out today. The commit with the fix is here:

https://github.com/SchedMD/slurm/commit/a8c0b7017edac8d27be51e7b4d5c2e66aee74bc1