| Summary: | JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
| Component: | Scheduling | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 17.11.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | DTU Physics | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf | | |
Description
Ole.H.Nielsen@fysik.dtu.dk 2018-06-13 05:27:23 MDT
Comment from Alejandro Sanchez:

Ole, could you show me the exact sbatch command and the #SBATCH options, as well as the slurmctld.log? Also, are any of these nodes in a reservation, in DRAIN/DOWN state, or owned by a user who requested --exclusive=user?

a117,c001,g[079-110],h[001-002],i[002-051]

Which job is running on h002, and how much time passed between:

# scontrol show job 592174

and

# scontrol show node h002

I'm wondering if h002 was resumed back to available from down/drained in between the two scontrol requests.

Comment from Ole.H.Nielsen:

(In reply to Alejandro Sanchez from comment #3)
> which job is running on h002 and how much time passed between:
>
> # scontrol show job 592174
> and
> # scontrol show node h002
>
> I'm wondering if h002 was resumed back to available from down/drained in
> between the two scontrol requests.

These commands were issued within a few minutes, and no changes were made to the system in between. The node a117 was down and node c001 was drained, but those nodes belong to a completely different partition.

Unfortunately, we have not been able to reproduce this error. I guess the case should be closed, since we can't come up with a reproducer.

Comment from Alejandro Sanchez:

All right. Once we get a reproducer we'll at least have something to work off of. Please reopen if you encounter this again. Thanks.
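For reference, the checks discussed in this thread can be summarized as a short command sketch. This is illustrative only, reusing the job ID and node names quoted above; it is not a transcript of the commands actually run at DTU Physics, and the exact output fields depend on the site's configuration.

```
# Show the job's state, pending reason, and requested/allocated nodes
scontrol show job 592174 | grep -E "JobState|Reason|ReqNodeList|NodeList"

# Show state and reason per partition for the nodes in question
sinfo -n a117,c001,h002 -o "%N %P %T %E"

# List reservations that could make nodes temporarily unavailable
scontrol show reservation
```

A node that is down, drained, or inside an active reservation will normally cause Reason=ReqNodeNotAvail for jobs requesting it; the open question in this thread was whether h002 changed state between the two scontrol calls, which the reporter ruled out.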