Ticket 17800

Summary: How to know whether a job is waiting for a license or for a machine in Slurm?
Product: Slurm
Reporter: Openfive Support <it_support>
Component: User Commands
Assignee: Jason Booth <jbooth>
Status: RESOLVED TIMEDOUT
Severity: 3 - Medium Impact    
Priority: ---    
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: Alphawave

Description Openfive Support 2023-09-28 00:36:43 MDT
Hi Slurm Support Team,


We see that even when a job is waiting for a license, the Reason field still shows Priority.

Example:-

[debajitd@osvnc001 ~]$ srun -p normal -L Innovus_Impl_System --pty /bin/tcsh
srun: job 2377664 queued and waiting for resources


And license status:-

[debajitd@tmpxon014 ~]$ scontrol show lic Innovus_Impl_System
LicenseName=Innovus_Impl_System
    Total=160 Used=160 Free=0 Reserved=0 Remote=no


Here, the above job is waiting for a license, but the scontrol output shows "Reason" as "Priority".


Scontrol command output:-

[root@hpcmaster ~]# scontrol show job 2377664
JobId=2377664 JobName=tcsh
   UserId=debajitd(3403) GroupId=engr(500) MCS_label=N/A
   Priority=4444 Nice=0 Account=(null) QOS=normal WCKey=*
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=15-00:00:00 TimeMin=N/A
   SubmitTime=2023-09-28T12:01:56 EligibleTime=2023-09-28T12:01:56
   AccrueTime=2023-09-28T12:01:56
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-09-28T12:02:11
   Partition=normal AllocNode:Sid=osvnc001:17136
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=4000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=Innovus_Impl_System Network=(null)
   Command=/bin/tcsh
   WorkDir=/home/debajitd
   Power=
   NtasksPerTRES:0

[root@hpcmaster ~]# 
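
The two outputs above can be cross-checked by script. The following is a rough sketch of a heuristic, not a Slurm feature: it only parses text in the format pasted above, the function names are made up for illustration, and it assumes a `Licenses=` value is a comma-separated list of `name` or `name:count` entries. A license that `scontrol show lic` does not list is treated as having zero free seats.

```python
import re

def parse_fields(text):
    """Collect Key=Value tokens from scontrol-style output into a dict."""
    return dict(re.findall(r"(\S+?)=(\S*)", text))

def free_licenses(lic_text):
    """Map each license name to its Free count from 'scontrol show lic' text."""
    pattern = r"LicenseName=(\S+)\s+Total=\d+\s+Used=\d+\s+Free=(\d+)"
    return {name: int(free) for name, free in re.findall(pattern, lic_text)}

def requested_licenses(job_text):
    """Map each license the job requests to the requested count (default 1)."""
    value = parse_fields(job_text).get("Licenses", "(null)")
    if value in ("", "(null)"):
        return {}
    wanted = {}
    for item in value.split(","):
        name, _, count = item.partition(":")
        wanted[name] = int(count) if count else 1
    return wanted

def short_licenses(job_text, lic_text):
    """Return requested licenses with fewer free seats than the job asks for."""
    free = free_licenses(lic_text)
    return [name for name, count in requested_licenses(job_text).items()
            if free.get(name, 0) < count]

# Excerpts of the outputs pasted in this ticket.
job_text = """JobState=PENDING Reason=Priority Dependency=(null)
   OverSubscribe=OK Contiguous=0 Licenses=Innovus_Impl_System Network=(null)"""
lic_text = """LicenseName=Innovus_Impl_System
    Total=160 Used=160 Free=0 Reserved=0 Remote=no"""

print(short_licenses(job_text, lic_text))
```

An empty list only means the requested licenses currently have free seats; it does not prove the job is waiting for CPUs or memory.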


The problem is that, from the Slurm side, we cannot tell whether the job is waiting for CPUs/RAM or for licenses, because in either case it shows Reason=Priority.

Is there any command or option we can use to see exactly what the job is waiting for?


Regards,
Debajit Dutta
Comment 1 Jason Booth 2023-09-28 13:30:16 MDT
Please attach your current slurm.conf and slurmctld.log. Also, how many jobs are in the queue scheduled to run before this one?

Normally, once a job has been considered for scheduling, the specific reason is applied to that job.

For example:

>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>               5512     debug hostname    jason PD       0:00      1 (Licenses)
>               5511     debug     wrap    jason  R       0:26      1 n1


> $ scontrol show jobs 5512
> JobId=5512 JobName=hostname
>    JobState=PENDING Reason=Licenses
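
To see at a glance which pending jobs already carry a specific reason, the reasons can be tallied from `squeue` output. This is a minimal sketch, assuming the text comes from `squeue -h -t PD -o "%i %r"` (one `jobid reason` pair per line); the helper name is hypothetical, and job IDs 5513 and 5514 in the sample are made up for illustration.

```python
from collections import defaultdict

def jobs_by_reason(squeue_text):
    """Group job IDs by pending reason from 'jobid reason' lines."""
    groups = defaultdict(list)
    for line in squeue_text.splitlines():
        parts = line.split(None, 1)  # job ID first, the rest is the reason
        if len(parts) == 2:
            groups[parts[1].strip()].append(parts[0])
    return dict(groups)

# Sample input; on a real cluster this text would come from something like
# subprocess.run(["squeue", "-h", "-t", "PD", "-o", "%i %r"], ...).stdout
sample = "5512 Licenses\n5513 Priority\n5514 Priority\n"
print(jobs_by_reason(sample))
```

Jobs grouped under Priority may simply not have been evaluated against their own limiting resource yet, which is why the number of jobs queued ahead matters.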
Comment 2 Jason Booth 2023-10-20 13:23:53 MDT
Timing this out due to no reply. Should you need this reopened, please reply and attach the information requested in comment#1.