Ticket 1744

Summary: Job requesting --gres=
Product: Slurm Reporter: Susan Chacko <susanc>
Component: Scheduling    Assignee: Brian Christiansen <brian>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: brian, da, rl303f
Version: 14.11.7   
Hardware: Linux   
OS: Linux   
Site: NIH
Attachments: gres.conf
slurm.conf

Description Susan Chacko 2015-06-15 00:50:01 MDT
Problem: With --gres and --ntasks-per-core=1, jobs go into PENDING state ('Resources') and then into JobHeldAdmin state. Releasing the job causes it to cycle through the same states. 

% sbatch --partition=gpu --gres=gpu:k20x:1 --ntasks=2 --ntasks-per-core=1 ~/test.bat

% scontrol --details show job 138316
JobId=138316 JobName=test.bat
   UserId=susanc(906) GroupId=staff(49)
   Priority=0 Nice=0 Account=sb QOS=gpu
   JobState=PENDING Reason=JobHeldAdmin Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=10-00:00:00 TimeMin=N/A
   SubmitTime=2015-06-15T08:47:50 EligibleTime=2015-06-15T08:47:50
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=gpu AllocNode:Sid=biowulf2:24819
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=2 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) Gres=gpu:k20x:1 Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/susanc/test.bat
   WorkDir=/usr/local/www/vhost.hpc/htdocs/apps
   StdErr=/usr/local/www/vhost.hpc/htdocs/apps/slurm-138316.out
   StdIn=/dev/null
   StdOut=/usr/local/www/vhost.hpc/htdocs/apps/slurm-138316.out

slurmctld.log has:

------------------------
[2015-06-04T07:59:21.461] job_submit.lua: NIH_job_submit: job from susanc
[2015-06-04T07:59:21.461] job_submit.lua: NIH_job_submit: job from susanc, partition = gpu, setting default qos: gpu
[2015-06-04T07:59:21.461] Job submit request: account:(null) begin_time:0 dependency:(null) name:run.gpu partition:gpu qos:gpu   submit_uid:906 time_limit:4294967294 user_id:906 
[2015-06-04T07:59:21.467] _slurm_rpc_submit_batch_job JobId=69715 usec=6445
[2015-06-04T07:59:23.064] error: cons_res: sync loop not progressing, holding job 69715
[2015-06-04T07:59:23.064] backfill: Failed to start JobId=69715 on cn[0603-0626]: Requested nodes are busy
However, all GPU nodes are free. 


We have gres configured to allocate some CPUs along with each GPU. gres.conf:
###
###  NIH slurm generic resource (GRES) configuration file
###
NodeName=cn[0603-0626] Name=gpu Type=k20x File=/dev/nvidia0 CPUs=0-7,16-23
NodeName=cn[0603-0626] Name=gpu Type=k20x File=/dev/nvidia1 CPUs=8-15,24-31
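To see how many CPUs each GPU is tied to, the CPUs= ranges above can be expanded. A small illustrative sketch (not Slurm code; just expanding the range syntax by hand):

```python
def expand_cpu_list(spec):
    """Expand a Slurm-style CPU list like '0-7,16-23' into a sorted list of ints."""
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)

# Each GPU line binds 16 CPUs (8 cores x 2 hyperthreads), i.e. likely one
# socket's worth of CPUs per GPU on these 2-socket nodes:
print(len(expand_cpu_list("0-7,16-23")))
```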


Relevant lines from slurm.conf:
GresTypes=gpu
NodeName=cn[0603-0626] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=129007  TmpDisk=6450  State=UNKNOWN Weight=300 Feature=cpu32,core16,g128,gpuk20x,ssd800,x2650 Gres=gpu:k20x:2
Comment 1 Brian Christiansen 2015-06-15 10:33:43 MDT
I'm not able to reproduce this yet. Will you attach your slurm.conf and gres.conf?
Comment 2 Susan Chacko 2015-06-15 23:38:38 MDT
Created attachment 1971 [details]
gres.conf
Comment 3 Susan Chacko 2015-06-15 23:39:22 MDT
Created attachment 1972 [details]
slurm.conf
Comment 4 Brian Christiansen 2015-06-16 11:27:39 MDT
What does your lua job submit plugin do? I noticed this in your scontrol show job output; the job has:

   NumNodes=1 NumCPUs=2 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=2000M MinTmpDiskNode=0

I'm looking at the 16 CPUs/Task and MinCPUsNode.

I can get my jobs into the same JobHeldAdmin scenario if I submit as:

sbatch --partition=gpu --gres=gpu:k20x:1 --ntasks=2 --ntasks-per-core=1 --wrap="sleep 60" -c16
Comment 5 Susan Chacko 2015-06-16 23:48:58 MDT
Yes, that's exactly what our job submit plugin does. We want 8 cores (16 CPUs) to be allocated along with each GPU. Here's the relevant section from the plugin:

		elseif job_desc.partition == "gpu" then
			NIH_qos = "gpu"
			job_desc.cpus_per_task = 16
Comment 6 Brian Christiansen 2015-06-17 09:13:05 MDT
Thanks for that information. So your request is asking for 1 core with 16 CPUs, which doesn't exist.

Are you trying to keep the tasks on one socket?
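To make that arithmetic concrete, here is a small illustrative sketch (not Slurm code; the node geometry comes from the slurm.conf line earlier in the ticket: Sockets=2 CoresPerSocket=8 ThreadsPerCore=2):

```python
# Node geometry from slurm.conf: Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
threads_per_core = 2

# --ntasks-per-core=1 confines each task to a single core, so a task can
# use at most ThreadsPerCore CPUs.
max_cpus_per_task = 1 * threads_per_core      # = 2

# The job_submit plugin forces cpus_per_task = 16, which no single core
# can provide, so the request can never be scheduled.
requested_cpus_per_task = 16
feasible = requested_cpus_per_task <= max_cpus_per_task
print(feasible)
```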
Comment 7 Brian Christiansen 2015-06-25 05:57:33 MDT
Do you need any more assistance on this?
Comment 8 Susan Chacko 2015-06-26 07:22:29 MDT
Thanks for your explanation. We were setting cpus_per_task = 16 as a way to limit per-user allocation of GPUs: with each GPU we allocated 16 CPUs, and we set a limit on total CPU allocation per user in the GPU queue. Now that we understand what's happening, we can find an alternate way that avoids using --ntasks-per-core=1, and then we shouldn't hit this problem.

When trackable resources are available we'll switch to using those.
Comment 9 Brian Christiansen 2015-06-26 11:25:12 MDT
Good to hear. One option is to have the jobs in the gpu queue allocate a full socket.

Here's the configuration to do that:
slurm.conf:
TaskPlugin=cgroup,affinity
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory,CR_ALLOCATE_FULL_SOCKET

PartitionName=gpu Nodes=cn[0603-0626] ... SelectTypeParameters=CR_Socket

and turn off TaskAffinity in the cgroup.conf since you are using the task/affinity plugin for task placement.
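For reference, a minimal cgroup.conf fragment matching that note might look like the following (a sketch only; ConstrainCores/ConstrainRAMSpace are shown as typical companions and may differ from your site's settings):

```
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
TaskAffinity=no
```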


brian@compy:~/slurm/14.11/compy$ sbatch -p gpu -n3 -mblock:block --wrap="hostname"
Submitted batch job 432633
brian@compy:~/slurm/14.11/compy$ sbatch -p gpu -n13 -mblock:block --wrap="hostname"
Submitted batch job 432634

brian@compy:~/slurm/14.11/compy$ sacct -j 432633,432634
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
432633             wrap        gpu     normal         12  COMPLETED      0:0 
432633.batch      batch                normal         12  COMPLETED      0:0 
432634             wrap        gpu     normal         24  COMPLETED      0:0 
432634.batch      batch                normal         24  COMPLETED      0:0 


knc is 2 sockets with 12 CPUs each. By default, tasks are distributed cyclically across sockets, so -m block:block packs the tasks onto one socket before allocating on the next socket. 
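The AllocCPUS numbers in the sacct output above follow from rounding each request up to whole sockets. A small illustrative sketch (not Slurm code; assumes knc's 12-CPU sockets):

```python
import math

CPUS_PER_SOCKET = 12  # knc: 2 sockets x 12 CPUs each

def alloc_cpus(ntasks, cpus_per_task=1):
    """CPUs charged when the allocation is rounded up to whole sockets."""
    needed = ntasks * cpus_per_task
    return math.ceil(needed / CPUS_PER_SOCKET) * CPUS_PER_SOCKET

print(alloc_cpus(3))   # -n3  fits on one socket, so a full socket is charged
print(alloc_cpus(13))  # -n13 spills onto the second socket
```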



Let us know if you need any more help.

Thanks,
Brian