| Summary: | Job stuck PENDING with ReqNodeNotAvail but all nodes are available | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Jeff White <jeff.white> |
| Component: | Scheduling | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 15.08.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Washington State University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmctld.log.gz, slurmd.log.gz | | |
Description

Jeff White 2016-05-26 08:26:38 MDT

Created attachment 3154 [details]
slurmctld.log.gz

Created attachment 3155 [details]
slurmd.log.gz

Tim Wickberg (assignee)

I wouldn't recommend routinely running jobs in reservations with the MAINT flag set: that flag indicates the hardware may be sporadically unavailable, and it also changes how the time is accounted for by sreport/sacct. The problem here is that the job was not submitted against the reservation. Jobs must explicitly request a reservation in Slurm; if the job had specified "--reservation=sn3", it would have launched when the reservation started. Running 'scontrol update jobid=74806 reservation=sn3' should get the job going now.

Jeff White

Well, that was fast... I found the cause too. It was a problem with the program being used to generate the sbatch file: it didn't handle the reservation correctly, so it never asked Slurm for it, and it produced no error or log message saying it had ignored the reservation we specified.
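For reference, the fix described in the reply can be sketched as a job script that explicitly requests the reservation (a minimal hypothetical example; the reservation name sn3 and job id 74806 come from this report, while the job name and workload are placeholders):

```
#!/bin/bash
# Hypothetical sbatch script. A job does not inherit a reservation
# automatically: it must request it, or Slurm holds it PENDING with
# reason ReqNodeNotAvail while the reserved nodes are unavailable
# to ordinary jobs.
#SBATCH --job-name=example
#SBATCH --reservation=sn3   # run inside reservation "sn3"
#SBATCH --nodes=1

srun hostname
```

For a job already queued without the reservation, an operator can attach it after the fact with `scontrol update jobid=74806 reservation=sn3`, as suggested above.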