Ticket 11571

Summary: Unusual GrpTRES preventing job from launching
Product: Slurm Reporter: Greg Wickham <greg.wickham>
Component: User Commands Assignee: Scott Hilton <scott>
Status: RESOLVED CANNOTREPRODUCE
Severity: 4 - Minor Issue    
Priority: --- CC: albert.gil, scott
Version: 20.11.6   
Hardware: Linux   
OS: Linux   
Site: KAUST
Attachments: Output of scontrol show assoc
slurm.conf

Description Greg Wickham 2021-05-09 06:39:58 MDT
One of our users was using a QOS but their job wasn't launching.

Upon inspection the QOS GrpTRES was defined as:

$ sacctmgr -P show qos ibex-cs format=grptres
GrpTRES
cpu=18446744073709548616,gres/gpu=256
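[Editorial note] It may be worth observing that 18446744073709548616 is exactly 2^64 - 3000, i.e. the bit pattern produced when an unsigned 64-bit counter is driven 3000 below zero and wraps around. This is a plausible but unconfirmed origin for the value; the sketch below only illustrates the arithmetic, it does not assert that Slurm did this.

```python
# Illustration: the odd GrpTRES cpu value equals 2**64 - 3000, which is what
# unsigned 64-bit wraparound yields when 0 - 3000 is computed in C-style
# arithmetic. This is an assumption about the value's origin, not a confirmed
# Slurm behaviour.

MASK64 = 2**64 - 1  # uint64 arithmetic is modulo 2**64


def u64(x):
    """Truncate a Python int to unsigned 64 bits, as C uint64_t would."""
    return x & MASK64


observed = 18446744073709548616
print(observed == 2**64 - 3000)   # the observed value is 2**64 - 3000
print(observed == u64(0 - 3000))  # i.e. an unsigned counter gone 3000 below zero
```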

Modifying it to a sane value, the job started running immediately:

$ sacctmgr modify qos ibex-cs set grptres=cpu=4096,gres/gpu=256
 Modified qos...
  ibex-cs
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
$ sacctmgr -P show qos ibex-cs format=grptres
GrpTRES
cpu=4096,gres/gpu=256

  -Greg
Comment 2 Scott Hilton 2021-05-10 15:51:47 MDT
Greg,

Could you send me your slurm.conf? Was there anything special about the job?

-Scott
Comment 3 Scott Hilton 2021-05-10 15:58:25 MDT
Greg,

I was able to get a job to run with cpu=18446744073709548616, so there must be something more needed to reproduce your issue.

Do you know if someone set this value? It is 2997 less than the max value for that option.

-Scott
Comment 4 Greg Wickham 2021-05-11 06:49:50 MDT
Hi Scott,

No one here has claimed responsibility for setting the odd cpu= value.

This ticket was raised because a user's job was pending with reason QOSMaxGRESPerUser for no discernible cause. The user now has another job pending, and again we're not able to tell why:

$ squeue -j 15275586
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          15275586     batch jobscrip shaima0d PD       0:00      8 (QOSMaxGRESPerUser)


$ scontrol show job=15275586
JobId=15275586 JobName=jobscript.slurm
   UserId=shaima0d(174988) GroupId=g-shaima0d(1174988) MCS_label=N/A
   Priority=11552 Nice=0 Account=ibex-cs QOS=ibex-cs
   JobState=PENDING Reason=QOSMaxGRESPerUser Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=10:00:00 TimeMin=N/A
   SubmitTime=2021-05-10T18:35:49 EligibleTime=2021-05-10T18:35:49
   AccrueTime=2021-05-10T18:35:49
   StartTime=2021-05-15T13:07:26 EndTime=2021-05-15T23:07:26 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-05-11T15:45:27
   Partition=batch AllocNode:Sid=login510-22:48133
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=gpu208-02,gpu210-[06,18],gpu212-[14,18],gpu213-14,gpu214-[06,10]
   NumNodes=8-8 NumCPUs=1 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=700G,node=1,billing=1,gres/gpu=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=700G MinTmpDiskNode=0
   Features=v100&gpu_ai DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/ibex/scratch/shaima0d/weka_storage_benchmarking/beegfs_basline/8_jobs_on_8_nodes/jobscript.slurm
   WorkDir=/ibex/scratch/shaima0d/weka_storage_benchmarking/beegfs_basline/8_jobs_on_8_nodes
   StdErr=/ibex/scratch/shaima0d/weka_storage_benchmarking/beegfs_basline/8_jobs_on_8_nodes/slurm-15275586.out
   StdIn=/dev/null
   StdOut=/ibex/scratch/shaima0d/weka_storage_benchmarking/beegfs_basline/8_jobs_on_8_nodes/slurm-15275586.out
   Power=
   CpusPerTres=gpu:4
   TresPerJob=gpu:8
   TresPerNode=gpu:1
   NtasksPerTRES:0

$ sacctmgr -P show qos ibex-cs
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES
ibex-cs|5000|00:00:00|||cluster|||1.000000|cpu=4096,gres/gpu=256||||||||||||||||
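[Editorial note] For context, the QOSMaxGRESPerUser pending reason means the scheduler judged that starting the job would push the user's total of some GRES (here gres/gpu) over the QOS's per-user cap (MaxTRESPU). The sketch below is a simplified illustration of that kind of check, not Slurm's actual code; notably, MaxTRESPU is empty in the sacctmgr output above, which is part of the puzzle.

```python
# Simplified, illustrative sketch of a per-user GPU cap check. This is NOT
# Slurm's implementation; the function name and parameters are made up for
# the example.

def would_exceed_user_cap(used_gpus, requested_gpus, max_gpus_per_user):
    """Return True if starting the job would exceed the per-user GPU cap.

    max_gpus_per_user is None when no MaxTRESPU limit is set.
    """
    if max_gpus_per_user is None:
        return False
    return used_gpus + requested_gpus > max_gpus_per_user


# The pending job above requests gres/gpu=8 (TresPerJob=gpu:8).
print(would_exceed_user_cap(0, 8, None))  # False: with no cap set, it should not pend
print(would_exceed_user_cap(4, 8, 8))     # True: 4 already in use + 8 requested > 8
```

With no per-user cap configured, a job pending on QOSMaxGRESPerUser is unexpected, which is consistent with this ticket being closed as cannot-reproduce.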
Comment 5 Greg Wickham 2021-05-11 06:50:57 MDT
Created attachment 19414 [details]
Output of scontrol show assoc
Comment 6 Greg Wickham 2021-05-11 06:51:14 MDT
Created attachment 19415 [details]
slurm.conf
Comment 7 Scott Hilton 2021-05-13 16:00:03 MDT
Greg,

How many times has this issue occurred? How long have you been seeing it? Does it only happen for this user?

The output of scontrol show assoc doesn't show the user or QOS near any limits. Was this output captured while the issue was occurring?

Can you turn on debug2 and send me the slurmctld logs next time you see the issue? 
SlurmctldDebug=debug2
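[Editorial note] An operational aside, not from the ticket: the controller's log level can also be raised at runtime with scontrol, avoiding a slurmctld restart while waiting for the issue to recur.

```shell
# Raise slurmctld log verbosity at runtime (no restart required).
scontrol setdebug debug2
# ...reproduce the issue and collect the slurmctld logs, then revert:
scontrol setdebug info
```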

-Scott
Comment 8 Greg Wickham 2021-05-13 16:00:11 MDT
The University is closed for Eid al-Fitr, reopening on Sunday 23rd of May.

Until the University re-opens, assistance with Ibex can be obtained by either:


   - sending a request to the Ibex Slack channel #general

      (sign up at https://kaust-ibex.slack.com/signup)


   - opening a ticket by sending an email to ibex@hpc.kaust.edu.sa

Please note that reduced staffing is in effect during Eid, so assistance will be prioritised; this may delay responses to some requests.


 -Greg

Comment 10 Scott Hilton 2021-06-03 15:37:56 MDT
Greg,

Are you still seeing the issue?

If so, could you respond to the questions in my last comment?

Thanks, 

Scott
Comment 11 Scott Hilton 2021-06-14 11:11:22 MDT
Greg, 

Would you like me to keep working on this ticket? If so, please respond to comment 7.

-Scott
Comment 12 Greg Wickham 2021-06-14 23:18:52 MDT
Hi Scott,

The issue hasn't happened again, so without more information I doubt anything more can be done now.

If it does happen again, I'll reopen this ticket.

Please close it for now.

   -Greg
Comment 13 Scott Hilton 2021-06-15 11:51:05 MDT
Closing Ticket