Hi!

Some of our users are confused by the Reason squeue sometimes gives for why their jobs are pending: when nodes are drained, even outside of the partition a job has been submitted to, squeue still lists those node names in the "Reason" field.

For instance, job 9691308 has been submitted to the "hns" partition, in which sh-27-11 is drained:

# scontrol show partition hns
PartitionName=hns
   [...]
   Nodes=sh-17-12,sh-27-[11-20],gpu-27-21
   [...]

# sinfo -p hns
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
hns          up   infinite      1  drain sh-27-11
hns          up   infinite      8    mix sh-27-[12-16,18-20]
hns          up   infinite      3  alloc gpu-27-21,sh-17-12,sh-27-17

The job is pending with a "Reason" of "ReqNodeNotAvailable":

# scontrol show job 9691308
JobId=9691308 JobName=CDf14r6
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:gpu-14-[4,6],sh-27-11 Dependency=(null)
   [...]
   Partition=hns AllocNode:Sid=sherlock-ln01:7726
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   [...]

I guess it would be more user-friendly if the reason were "Resources", because ReqNodeNotAvailable makes it sound like the user explicitly requested a specific node, which they haven't. And listing in _UnavailableNodes every drained node in the cluster, including those that are not in the job's partition, doesn't help users figure things out either.

So would it make sense to only report ReqNodeNotAvail when a specific node has actually been requested, or at least to only list in _UnavailableNodes the nodes that would be considered for the job if they were not drained?

Thanks!
--
Kilian
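To illustrate the filtering I have in mind, here is a quick Python sketch. It is purely hypothetical (not Slurm's actual hostlist code, which handles more general expressions): it expands simple hostlist expressions like the ones above and intersects the unavailable set with the job's partition, so only the drained nodes the job could actually use would be reported.

```python
import re

def expand_hostlist(expr):
    """Expand a simplified Slurm hostlist expression such as
    "sh-17-12,sh-27-[11-20],gpu-27-21" into individual node names.
    Handles one [a-b,c] group per name; a sketch only, real Slurm
    hostlist parsing is more general."""
    # Grab comma-separated entries, keeping any bracketed range group
    # attached to its prefix (commas inside brackets are not separators).
    parts = re.findall(r'[^,\[]+(?:\[[^\]]*\])?', expr)
    nodes = []
    for part in parts:
        m = re.match(r'^(.*)\[([^\]]*)\]$', part)
        if not m:
            nodes.append(part)       # plain name, e.g. "sh-17-12"
            continue
        prefix, ranges = m.groups()
        for r in ranges.split(','):
            if '-' in r:             # numeric range, e.g. "11-20"
                lo, hi = r.split('-')
                width = len(lo)      # preserve zero-padding width
                for i in range(int(lo), int(hi) + 1):
                    nodes.append(f"{prefix}{i:0{width}d}")
            else:                    # single index, e.g. "4"
                nodes.append(f"{prefix}{r}")
    return nodes

# Values taken from the scontrol output above.
partition_nodes = set(expand_hostlist("sh-17-12,sh-27-[11-20],gpu-27-21"))
unavailable = set(expand_hostlist("gpu-14-[4,6],sh-27-11"))

# Only the drained nodes the job could actually run on:
relevant = sorted(partition_nodes & unavailable)
print(relevant)  # → ['sh-27-11']
```

With this kind of intersection, _UnavailableNodes for job 9691308 would show only sh-27-11 instead of also dragging in gpu-14-[4,6], which are irrelevant to the hns partition.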
Hi Kilian. We're looking into this and will get back to you.
Kilian, can you please upload your slurm.conf and indicate a specific sinfo state and detailed job submission with which to reproduce this? I'm able to reproduce on 15.08 but not on 16.05 (where some changes to the job reason logic have been made). We have a local copy of sherlock's configuration from 11 days ago and xstream's from 19 days ago; I'm not sure if this is happening on one of these clusters or another one. Thanks.
I see the Machine Name in the bug is sherlock, anyhow a specific sinfo state + job submission would help. Also an updated slurm.conf just in case something changed during these days.
Hi Alejandro,

(In reply to Alejandro Sanchez from comment #4)
> I see the Machine Name in the bug is sherlock, anyhow a specific sinfo state
> + job submission would help. Also an updated slurm.conf just in case
> something changed during these days.

Yes, it's on Sherlock, and the configuration hasn't changed since last time. What options would you need for sinfo and for the job submission info?

Cheers,
Kilian
Just the output of 'sinfo' right before the submission, plus the whole request/batch script with the parameters you are using for the job submission. Let's see if I'm able to reproduce with that and then work on the problem.
(In reply to Alejandro Sanchez from comment #6)
> Just 'sinfo' just before the submission and the whole request/batch script
> with the parameters you are using for the job submission. Let's see if I'm
> able to reproduce with this and then be able to work on the problem.

Mmmh, I can't seem to be able to replicate the issue right now. I guess it's ok to close the ticket; I'll reopen if it happens again.

Cheers,
--
Kilian
(In reply to Kilian Cavalotti from comment #7)
> Mmmh, I can't seem to be ab;e to replicate the issue right now. I guess it's
> ok to close the ticket, I'll reopen if it happens again.
>
> Cheers,
> --
> Kilian

Ok, closing the ticket as WORKSFORME. Please reopen if you happen to reproduce this.