Ticket 872

Summary: jobs failed on controller restart
Product: Slurm
Reporter: Michael Gutteridge <mrg>
Component: Scheduling
Assignee: Moe Jette <jette>
Status: RESOLVED FIXED
Severity: 6 - No support contract
CC: da
Version: 2.6.7
Hardware: Linux
OS: Linux
Site: FHCRC - Fred Hutchinson Cancer Research Center
Version Fixed: 14.03.4

Description Michael Gutteridge 2014-06-10 06:17:53 MDT
We've had some trouble with curious job failures: the jobs aren't even assigned nodes and accumulate no run time:

       JobID        NodeList      State ExitCode
------------ --------------- ---------- --------
     7229124   None assigned     FAILED      0:1
We finally got some better log data (I'd had the log level turned way too low), which suggests that restarting and/or reconfiguring the controller is at the root. After some preliminaries (purging job records, recovering active jobs) there will be these sorts of messages:

> [2014-06-09T23:10:15.920] No nodes satisfy job 7228909 requirements in partition full
> [2014-06-09T23:10:15.920] sched: schedule: JobId=7228909 non-runnable: Requested node configuration is not available

The indicated job specified --mem and --tmp, but the values are within the capacities of all nodes in that "full" partition. Typically, if a user requests resources exceeding those available on nodes in the partition, the submission fails at submit time. This failure appears to occur only for jobs with memory and/or disk constraints. Worse yet, it's not consistent: it only seems to happen sometimes. I also cannot reproduce it in our test environment.
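For reference, a job of the affected kind might be submitted with a script along these lines (the partition and node limits are from the ticket; the --mem/--tmp values and program name are illustrative, since the failing job's actual values aren't given):

```shell
#!/bin/bash
# Hypothetical job script of the affected kind; the actual --mem/--tmp
# values of the failing job are not given in the ticket.
#SBATCH --partition=full
#SBATCH --mem=24000    # MB, within the RealMemory=48000 of these nodes
#SBATCH --tmp=10000    # MB of temporary disk requested
srun ./my_program      # placeholder command
```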

A typical node configuration line looks like this:

NodeName=gizmod[51-60] Sockets=2 CoresPerSocket=6 RealMemory=48000 Weight=10 Feature=full,restart,rx200,ssd

I do have FastSchedule=0... the RealMemory specification is some cruft left over from the test configuration used to build this one.  
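For context, with FastSchedule=0 the scheduler bases decisions on the resources each slurmd actually reports at registration rather than the values configured in slurm.conf, so the controller's view of node memory depends on whether each node has re-registered yet. A minimal slurm.conf sketch (the NodeName line is from the ticket; the comments are my gloss):

```shell
# slurm.conf (fragment)
FastSchedule=0   # schedule against resources reported by each slurmd,
                 # not the configured values below
NodeName=gizmod[51-60] Sockets=2 CoresPerSocket=6 RealMemory=48000 Weight=10 Feature=full,restart,rx200,ssd
```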

Honestly, it *feels* like there's a window after restart where the node data hasn't yet been fully loaded from the slurmds, so the scheduler doesn't see any nodes that satisfy the requirements.

Thanks

Michael
Comment 1 Moe Jette 2014-06-16 10:11:34 MDT
This is fixed in the version 14.03.4 release, which just came out today. The commit with the fix is here:

https://github.com/SchedMD/slurm/commit/a8c0b7017edac8d27be51e7b4d5c2e66aee74bc1