One of our users was using a QOS but their job wasn't launching. Upon inspection, the QOS GrpTRES was defined as:

$ sacctmgr -P show qos ibex-cs format=grptres
GrpTRES
cpu=18446744073709548616,gres/gpu=256

Modifying it to a sane value, the job started running immediately:

$ sacctmgr modify qos ibex-cs set grptres=cpu=4096,gres/gpu=256
 Modified qos...
  ibex-cs
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

$ sacctmgr -P show qos ibex-cs format=grptres
GrpTRES
cpu=4096,gres/gpu=256

-Greg
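For reference, a quick audit sketch (my own, not from this ticket) for catching similarly implausible GrpTRES cpu values across all QOSes. The sample dump below stands in for live `sacctmgr -nP show qos format=name,grptres` output, and the 8-digit threshold is an arbitrary choice:

```shell
# Sample sacctmgr -nP output captured for illustration (the live command
# requires a Slurm installation); one sane QOS and the corrupted one:
qos_dump='normal|cpu=512,gres/gpu=32
ibex-cs|cpu=18446744073709548616,gres/gpu=256'

# Flag any QOS whose cpu limit has more than 8 digits -- almost certainly
# a corrupted/underflowed value rather than an intentional limit.
suspects=$(printf '%s\n' "$qos_dump" | while IFS='|' read -r name tres; do
  cpus=$(printf '%s\n' "$tres" | tr ',' '\n' | awk -F= '$1=="cpu"{print $2}')
  if [ -n "$cpus" ] && [ "${#cpus}" -gt 8 ]; then
    echo "suspect QOS: $name (cpu=$cpus)"
  fi
done)
echo "$suspects"
```

In production the `qos_dump` variable would simply be replaced by the real `sacctmgr -nP` call.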
Greg,

What is in your slurm.conf? Was there anything special about the job?

-Scott
Greg,

I was able to get a job to run with cpu=18446744073709548616, so there must be more needed to reproduce your issue. Do you know if someone set this value? It is 2997 less than the max value for that option.

-Scott
Hi Scott,

We don't have anyone claiming responsibility for the odd value for cpu=. This ticket was raised as a result of a user experiencing QOSMaxGRESPerUser with no discernible reason. The user now has another job pending, however we're not able to tell why:

$ squeue -j 15275586
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
15275586     batch jobscrip shaima0d PD  0:00     8 (QOSMaxGRESPerUser)

$ scontrol show job=15275586
JobId=15275586 JobName=jobscript.slurm
   UserId=shaima0d(174988) GroupId=g-shaima0d(1174988) MCS_label=N/A
   Priority=11552 Nice=0 Account=ibex-cs QOS=ibex-cs
   JobState=PENDING Reason=QOSMaxGRESPerUser Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=10:00:00 TimeMin=N/A
   SubmitTime=2021-05-10T18:35:49 EligibleTime=2021-05-10T18:35:49
   AccrueTime=2021-05-10T18:35:49
   StartTime=2021-05-15T13:07:26 EndTime=2021-05-15T23:07:26 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-05-11T15:45:27
   Partition=batch AllocNode:Sid=login510-22:48133
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=gpu208-02,gpu210-[06,18],gpu212-[14,18],gpu213-14,gpu214-[06,10]
   NumNodes=8-8 NumCPUs=1 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=700G,node=1,billing=1,gres/gpu=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=700G MinTmpDiskNode=0
   Features=v100&gpu_ai DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/ibex/scratch/shaima0d/weka_storage_benchmarking/beegfs_basline/8_jobs_on_8_nodes/jobscript.slurm
   WorkDir=/ibex/scratch/shaima0d/weka_storage_benchmarking/beegfs_basline/8_jobs_on_8_nodes
   StdErr=/ibex/scratch/shaima0d/weka_storage_benchmarking/beegfs_basline/8_jobs_on_8_nodes/slurm-15275586.out
   StdIn=/dev/null
   StdOut=/ibex/scratch/shaima0d/weka_storage_benchmarking/beegfs_basline/8_jobs_on_8_nodes/slurm-15275586.out
   Power=
   CpusPerTres=gpu:4
   TresPerJob=gpu:8
   TresPerNode=gpu:1
   NtasksPerTRES:0

$ sacctmgr -P show qos ibex-cs
Name|Priority|GraceTime|Preempt|PreemptExemptTime|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES
ibex-cs|5000|00:00:00|||cluster|||1.000000|cpu=4096,gres/gpu=256||||||||||||||||
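One hypothetical diagnostic (not something from the ticket) is to total the GPUs held by the user's running jobs and compare against any per-user TRES cap. The captured sample below stands in for a live `squeue -h -u shaima0d -t RUNNING -O tres-alloc` call, and the two job lines are invented for illustration:

```shell
# Captured sample standing in for live squeue output (the real command
# needs a Slurm cluster); two hypothetical running jobs for this user:
alloc='cpu=4,mem=700G,node=1,billing=4,gres/gpu=8
cpu=4,mem=700G,node=1,billing=4,gres/gpu=8'

# Sum the gres/gpu component across jobs to get the user's total GPU hold,
# which can then be compared against the QOS MaxTRESPU value (empty here).
gpus=$(printf '%s\n' "$alloc" | tr ',' '\n' \
  | awk -F= '$1=="gres/gpu"{n+=$2} END{print n+0}')
echo "running GPU total: $gpus"
```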
Created attachment 19414 [details] Output of scontrol show assoc
Created attachment 19415 [details] slurm.conf
Greg,

How many times has this issue occurred? How long have you been seeing it for? Does it only happen for this user?

The output of scontrol show assoc doesn't show the user or qos near any limits. Was this query taken during the issue?

Can you turn on debug2 and send me the slurmctld logs next time you see the issue?

SlurmctldDebug=debug2

-Scott
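For reference, that setting goes in slurm.conf roughly as in this minimal fragment (the log path is a site-specific assumption, not from the ticket):

```
# slurm.conf fragment (sketch): raise controller logging to debug2
SlurmctldDebug=debug2
SlurmctldLogFile=/var/log/slurm/slurmctld.log   # path is an assumption
```

Alternatively, `scontrol setdebug debug2` changes the level at runtime without restarting slurmctld, and `scontrol setdebug info` restores the default afterwards.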
The University is closed for Eid al-Fitr, reopening on Sunday 23rd of May. Until the University re-opens, assistance with using Ibex can be obtained by either:
- sending a request to the Ibex Slack channel #general (sign up at https://kaust-ibex.slack.com/signup)
- opening a ticket by sending an email to ibex@hpc.kaust.edu.sa

Please note that reduced staffing is in effect during Eid, so assistance will be prioritised. This may delay responses to some requests.

-Greg
Greg,

Are you still seeing the issue? If so, could you respond to the questions in my last comment.

Thanks,
Scott
Greg,

Would you still like me to look at this bug? If so, please respond to comment 7.

-Scott
Hi Scott,

The issue hasn't happened again, so without more information I doubt there is more that can be done now. If it does happen again, I'll reopen this ticket. Please close it for now.

-Greg
Closing Ticket