Ticket 3058 - Job's pending reason lists unavailable nodes that are not in the job's partition
Summary: Job's pending reason lists unavailable nodes that are not in the job's partition
Status: RESOLVED CANNOTREPRODUCE
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 16.05.4
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2016-09-06 15:00 MDT by Kilian Cavalotti
Modified: 2017-10-17 08:51 MDT
CC List: 2 users

See Also:
Site: Stanford
Linux Distro: ---
Machine Name: Sherlock
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---


Description Kilian Cavalotti 2016-09-06 15:00:51 MDT
Hi!

Some of our users are confused by the Reason that squeue sometimes gives for their pending jobs: when nodes are drained, squeue lists their names in the "Reason" field even when they are outside the partition the job was submitted to.

For instance, job 9691308 has been submitted to the "hns" partition, in which sh-27-11 is drained:

# scontrol show partition hns
PartitionName=hns
   [...]
   Nodes=sh-17-12,sh-27-[11-20],gpu-27-21
   [...]

# sinfo -p hns
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
hns          up   infinite      1  drain sh-27-11
hns          up   infinite      8    mix sh-27-[12-16,18-20]
hns          up   infinite      3  alloc gpu-27-21,sh-17-12,sh-27-17

The job is pending with a "Reason" of "ReqNodeNotAvailable":

# scontrol show job 9691308
JobId=9691308 JobName=CDf14r6
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:gpu-14-[4,6],sh-27-11 Dependency=(null)
   [...]
   Partition=hns AllocNode:Sid=sherlock-ln01:7726
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   [...]


I guess it would be more user-friendly if the reason were "Resources", because "ReqNodeNotAvailable" makes it sound like the user explicitly requested a specific node, which they haven't. And listing in _UnavailableNodes all the cluster nodes that are drained, including those that are not in the job's partition, doesn't help users figure things out either.

So would it make sense to only report ReqNodeNotAvail when a specific node has been requested, or at least to list in _UnavailableNodes only the nodes that would be considered for the job if they were not drained?
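The filtering proposed above can be illustrated outside of Slurm. The snippet below is a hypothetical sketch, not Slurm code: it expands bracketed node lists like the ones in the outputs above and keeps only the unavailable nodes that belong to the job's partition. The `expand_hostlist` helper is a simplified stand-in for Slurm's own hostlist handling (it only supports the single-bracket ranges seen in this ticket, no zero-padding or nesting).

```python
import re

def expand_hostlist(hostlist):
    """Expand a Slurm-style hostlist such as 'sh-27-[11-20],gpu-27-21'
    into individual node names. Simplified: at most one bracket group
    per entry, plain integer ranges only."""
    nodes = []
    # Split on commas that are not inside a bracket group.
    for entry in re.split(r',(?![^\[]*\])', hostlist):
        m = re.match(r'^(.*)\[([^\]]+)\]$', entry)
        if not m:
            nodes.append(entry)
            continue
        prefix, ranges = m.groups()
        for part in ranges.split(','):
            if '-' in part:
                lo, hi = part.split('-')
                nodes.extend(f'{prefix}{i}' for i in range(int(lo), int(hi) + 1))
            else:
                nodes.append(f'{prefix}{part}')
    return nodes

def partition_unavailable(unavailable, partition_nodes):
    """Keep only the unavailable nodes that are in the job's partition."""
    part = set(expand_hostlist(partition_nodes))
    return [n for n in expand_hostlist(unavailable) if n in part]

# Values taken from the scontrol outputs in this ticket:
print(partition_unavailable('gpu-14-[4,6],sh-27-11',
                            'sh-17-12,sh-27-[11-20],gpu-27-21'))
# → ['sh-27-11']
```

With the job's partition taken into account, only sh-27-11 would be reported, rather than the unrelated gpu-14-[4,6] nodes.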

Thanks!
-- 
Kilian
Comment 1 Alejandro Sanchez 2016-09-09 06:33:11 MDT
Hi Kilian. We're looking into this and will come back to you.
Comment 2 Alejandro Sanchez 2016-09-12 08:15:47 MDT
Kilian, can you please upload your slurm.conf and indicate a specific sinfo state and a detailed job submission to reproduce this? I'm able to reproduce it on 15.08 but not on 16.05 (where some changes to the job reason logic have been made). We have a local copy of sherlock from 11 days ago and of xstream from 19 days ago. I'm not sure whether this is happening on one of those clusters or another one. Thanks.
Comment 4 Alejandro Sanchez 2016-09-12 08:22:19 MDT
I see the Machine Name in the bug is sherlock; anyhow, a specific sinfo state + job submission would help. Also an updated slurm.conf, just in case something changed during these days.
Comment 5 Kilian Cavalotti 2016-09-13 23:09:05 MDT
Hi Alejandro, 

(In reply to Alejandro Sanchez from comment #4)
> I see the Machine Name in the bug is sherlock, anyhow a specific sinfo state
> + job submission would help. Also an updated slurm.conf just in case
> something changed during these days.

Yes, it's on Sherlock, and the configuration hasn't changed since last time.
Which options would you need for sinfo and for the job submission info?

Cheers,
Kilian
Comment 6 Alejandro Sanchez 2016-09-14 02:51:21 MDT
Just 'sinfo' run right before the submission, plus the whole request/batch script with the parameters you are using for the job submission. Let's see if I'm able to reproduce with this and then work on the problem.
Comment 7 Kilian Cavalotti 2016-09-16 12:57:07 MDT
(In reply to Alejandro Sanchez from comment #6)
> Just 'sinfo' just before the submission and the whole request/batch script
> with the parameters you are using for the job submission. Let's see if I'm
> able to reproduce with this and then be able to work on the problem.

Mmmh, I can't seem to be able to replicate the issue right now. I guess it's OK to close the ticket; I'll reopen it if it happens again.

Cheers,
-- 
Kilian
Comment 8 Alejandro Sanchez 2016-09-21 02:59:12 MDT
(In reply to Kilian Cavalotti from comment #7)
> Mmmh, I can't seem to be able to replicate the issue right now. I guess it's
> OK to close the ticket; I'll reopen it if it happens again.
> 
> Cheers,
> -- 
> Kilian

Ok, closing the ticket as WORKSFORME. Please reopen if you happen to reproduce this.