Ticket 11571

Summary: Unusual GrpTRES preventing job from launching
Product: Slurm Reporter: Greg Wickham <greg.wickham>
Component: User Commands Assignee: Scott Hilton <scott>
Status: RESOLVED CANNOTREPRODUCE
Severity: 4 - Minor Issue    
Priority: --- CC: albert.gil, scott
Version: 20.11.6   
Hardware: Linux   
OS: Linux   
Site: KAUST
Attachments: Output of scontrol show assoc
slurm.conf

Description Greg Wickham 2021-05-09 06:39:58 MDT
One of our users was using a QOS but their job wasn't launching.

Upon inspection the QOS GrpTRES was defined as:

$ sacctmgr -P show qos ibex-cs format=grptres
GrpTRES
cpu=18446744073709548616,gres/gpu=256
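[Editorial note] It may be worth observing that 18446744073709548616 is exactly 2^64 - 3000, i.e. the bit pattern produced when an unsigned 64-bit counter is driven 3000 below zero and wraps around. This is a plausible but unconfirmed origin for the value; the sketch below only illustrates the arithmetic, it does not assert that Slurm did this.

```python
# Illustration: the odd GrpTRES cpu value equals 2**64 - 3000, which is what
# unsigned 64-bit wraparound yields when 0 - 3000 is computed in C-style
# arithmetic. This is an assumption about the value's origin, not a confirmed
# Slurm behaviour.

MASK64 = 2**64 - 1  # uint64 arithmetic is modulo 2**64


def u64(x):
    """Truncate a Python int to unsigned 64 bits, as C uint64_t would."""
    return x & MASK64


observed = 18446744073709548616
print(observed == 2**64 - 3000)   # the observed value is 2**64 - 3000
print(observed == u64(0 - 3000))  # i.e. an unsigned counter gone 3000 below zero
```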

Modifying it to a sane value, the job started running immediately:

$ sacctmgr modify qos ibex-cs set grptres=cpu=4096,gres/gpu=256
 Modified qos...
  ibex-cs
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
$ sacctmgr -P show qos ibex-cs format=grptres
GrpTRES
cpu=4096,gres/gpu=256

  -Greg
Comment 2 Scott Hilton 2021-05-10 15:51:47 MDT
Greg,

Could you send me your slurm.conf? Was there anything special about the job?

-Scott
Comment 3 Scott Hilton 2021-05-10 15:58:25 MDT
Greg,

I was able to get a job to run with cpu=18446744073709548616, so there must be something more needed to reproduce your issue.

Do you know if someone set this value? It is 2997 less than the max value for that option.

-Scott
Comment 4 Greg Wickham 2021-05-11 06:49:50 MDT
Hi Scott,

No one here has claimed responsibility for setting the odd cpu= value.

This ticket was raised because a user's job was pending with reason QOSMaxGRESPerUser for no discernible cause. The user now has another job pending, and again we're not able to tell why:

$ squeue -j 15275586
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          15275586     batch jobscrip shaima0d PD       0:00      8 (QOSMaxGRESPerUser)


$ scontrol show job=15275586
JobId=15275586 JobName=jobscript.slurm
   UserId=shaima0d(174988) GroupId=g-shaima0d(1174988) MCS_label=N/A
   Priority=11552 Nice=0 Account=ibex-cs QOS=ibex-cs
   JobState=PENDING Reason=QOSMaxGRESPerUser Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=10:00:00 TimeMin=N/A
   SubmitTime=2021-05-10T18:35:49 EligibleTime=2021-05-10T18:35:49
   AccrueTime=2021-05-10T18:35:49
   StartTime=2021-05-15T13:07:26 EndTime=2021-05-15T23:07:26 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-05-11T15:45:27
   Partition=batch AllocNode:Sid=login510-22:48133
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=gpu208-02,gpu210-[06,18],gpu212-[14,18],gpu213-14,gpu214-[06,10]
   NumNodes=8-8 NumCPUs=1 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=700G,node=1,billing=1,gres/gpu=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=700G MinTmpDiskNode=0
   Features=v100&gpu_ai DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/ibex/scratch/shaima0d/weka_storage_benchmarking/beegfs_basline/8_jobs_on_8_nodes/jobscript.slurm
   WorkDir=/ibex/scratch/shaima0d/weka_storage_benchmarking/beegfs_basline/8_jobs_on_8_nodes
   StdErr=/ibex/scratch/shaima0d/weka_storage_benchmarking/beegfs_basline/8_jobs_on_8_nodes/slurm-15275586.out
   StdIn=/dev/null
   StdOut=/ibex/scratch/shaima0d/weka_storage_benchmarking/beegfs_basline/8_jobs_on_8_nodes/slurm-15275586.out
   Power=
   CpusPerTres=gpu:4
   TresPerJob=gpu:8
   TresPerNode=gpu:1
   NtasksPerTRES:0

$ sacctmgr -P show qos ibex-cs
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES
ibex-cs|5000|00:00:00|||cluster|||1.000000|cpu=4096,gres/gpu=256||||||||||||||||
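[Editorial note] For context, the QOSMaxGRESPerUser pending reason means the scheduler judged that starting the job would push the user's total of some GRES (here gres/gpu) over the QOS's per-user cap (MaxTRESPU). The sketch below is a simplified illustration of that kind of check, not Slurm's actual code; notably, MaxTRESPU is empty in the sacctmgr output above, which is part of the puzzle.

```python
# Simplified, illustrative sketch of a per-user GPU cap check. This is NOT
# Slurm's implementation; the function name and parameters are made up for
# the example.

def would_exceed_user_cap(used_gpus, requested_gpus, max_gpus_per_user):
    """Return True if starting the job would exceed the per-user GPU cap.

    max_gpus_per_user is None when no MaxTRESPU limit is set.
    """
    if max_gpus_per_user is None:
        return False
    return used_gpus + requested_gpus > max_gpus_per_user


# The pending job above requests gres/gpu=8 (TresPerJob=gpu:8).
print(would_exceed_user_cap(0, 8, None))  # False: with no cap set, it should not pend
print(would_exceed_user_cap(4, 8, 8))     # True: 4 already in use + 8 requested > 8
```

With no per-user cap configured, a job pending on QOSMaxGRESPerUser is unexpected, which is consistent with this ticket being closed as cannot-reproduce.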
Comment 5 Greg Wickham 2021-05-11 06:50:57 MDT
Created attachment 19414 [details]
Output of scontrol show assoc
Comment 6 Greg Wickham 2021-05-11 06:51:14 MDT
Created attachment 19415 [details]
slurm.conf
Comment 7 Scott Hilton 2021-05-13 16:00:03 MDT
Greg,

How many times has this issue occurred? How long have you been seeing it? Does it only happen for this user?

The output of scontrol show assoc doesn't show the user or QOS near any limits. Was this output captured while the issue was occurring?

Can you turn on debug2 and send me the slurmctld logs next time you see the issue? 
SlurmctldDebug=debug2
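[Editorial note] An operational aside, not from the ticket: the controller's log level can also be raised at runtime with scontrol, avoiding a slurmctld restart while waiting for the issue to recur.

```shell
# Raise slurmctld log verbosity at runtime (no restart required).
scontrol setdebug debug2
# ...reproduce the issue and collect the slurmctld logs, then revert:
scontrol setdebug info
```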

-Scott
Comment 8 Greg Wickham 2021-05-13 16:00:11 MDT
The University is closed for Eid al-Fitr, reopening on Sunday 23rd of May.

Until the University re-opens, assistance with Ibex can be obtained by either:


   - sending a request to the Ibex Slack channel #general

      (sign up at https://kaust-ibex.slack.com/signup)


   - opening a ticket by sending an email to ibex@hpc.kaust.edu.sa

Please note that reduced staffing is in effect during Eid, so assistance will be prioritised; this may delay responses to some requests.


 -Greg

Comment 10 Scott Hilton 2021-06-03 15:37:56 MDT
Greg,

Are you still seeing the issue?

If so, could you respond to the questions in my last comment?

Thanks, 

Scott
Comment 11 Scott Hilton 2021-06-14 11:11:22 MDT
Greg, 

Would you like me to keep working on this ticket? If so, please respond to comment 7.

-Scott
Comment 12 Greg Wickham 2021-06-14 23:18:52 MDT
Hi Scott,

The issue hasn't happened again, so without more information I doubt anything more can be done now.

If it does happen again, I'll reopen this ticket.

Please close it for now.

   -Greg
Comment 13 Scott Hilton 2021-06-15 11:51:05 MDT
Closing Ticket