| Summary: | jobs failed on controller restart | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Michael Gutteridge <mrg> |
| Component: | Scheduling | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 6 - No support contract | | |
| Priority: | --- | CC: | da |
| Version: | 2.6.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | FHCRC - Fred Hutchinson Cancer Research Center | | |
| Version Fixed: | 14.03.4 | Target Release: | --- |
This is fixed in the version 14.03.4 release, which just came out today. The commit with the fix is here: https://github.com/SchedMD/slurm/commit/a8c0b7017edac8d27be51e7b4d5c2e66aee74bc1
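(For anyone applying the upgrade, a quick sketch of how to confirm a controller is running the fixed release; these are standard Slurm commands, not taken from this ticket:)

```
# Report the version of the local Slurm tools.
scontrol version

# The controller's own idea of its version, from its running configuration.
scontrol show config | grep SLURM_VERSION
```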
We've had some trouble with curious job failures: the jobs aren't even assigned nodes and accumulate no run time:

```
       JobID        NodeList      State ExitCode
------------ --------------- ---------- --------
     7229124   None assigned     FAILED      0:1
```

We finally got some better log data (I'd turned the log level way too low), which suggests that restarting and/or reconfiguring the controller is at the root of it. After some preliminaries (purging job records, recovering active jobs) there will be these sorts of messages:

> [2014-06-09T23:10:15.920] No nodes satisfy job 7228909 requirements in partition full
> [2014-06-09T23:10:15.920] sched: schedule: JobId=7228909 non-runnable: Requested node configuration is not available

The indicated job has specified --mem and --tmp, but the values are within the capacities of all nodes in that "full" partition. Typically, if a user requests resources exceeding those available on nodes in this partition, the submission is rejected. It appears that this failure only occurs for jobs with memory and/or disk constraints. Worse yet, it's not consistent; it only seems to happen sometimes. I also cannot reproduce this in our test environment.

A typical node configuration line looks like this:

```
NodeName=gizmod[51-60] Sockets=2 CoresPerSocket=6 RealMemory=48000 Weight=10 Feature=full,restart,rx200,ssd
```

I do have FastSchedule=0... the RealMemory specification is some cruft left over from the test configuration used to build this one. Honestly, it *feels* like there's a moment where the node data isn't fully loaded from the slurmd, and thus the scheduler doesn't see any nodes that satisfy the requirements.

Thanks

Michael
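(A sketch of the checks involved, assuming standard Slurm CLI behavior; the job ID and node name are taken from the report above, and `job.sh` is a hypothetical batch script:)

```
# Hypothetical submission with memory and tmp-disk constraints, both well
# within the RealMemory=48000 (MB) configured on the gizmod nodes:
sbatch --mem=8000 --tmp=4000 --partition=full job.sh

# What the controller currently believes about a node's resources; with
# FastSchedule=0 these values come from the slurmd's registration rather
# than from slurm.conf:
scontrol show node gizmod51 | grep -E 'RealMemory|TmpDisk'

# The failed job as recorded in accounting (matches the listing above):
sacct -j 7229124 --format=JobID,NodeList,State,ExitCode
```

With FastSchedule=0 the scontrol output reflects what each slurmd last registered, which is exactly the data the hypothesis above suggests may be momentarily incomplete after a controller restart.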