| Summary: | Job requesting --gres= | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Susan Chacko <susanc> |
| Component: | Scheduling | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | brian, da, rl303f |
| Version: | 14.11.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NIH | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | gres.conf, slurm.conf | | |
I'm not able to reproduce this yet. Will you attach your slurm.conf and gres.conf?

---

Created attachment 1971 [details]
gres.conf

Created attachment 1972 [details]
slurm.conf

---

What does your lua job submit plugin do?

I noticed this from your scontrol show jobs output; the job has:

```
NumNodes=1 NumCPUs=2 CPUs/Task=16 ReqB:S:C:T=0:0:*:* Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=* MinCPUsNode=16 MinMemoryCPU=2000M MinTmpDiskNode=0
```

I'm looking at the 16 CPUs/Task and MinCPUsNode. I can get my jobs into the same JobHeldAdmin scenario if I submit as:

```
sbatch --partition=gpu --gres=gpu:k20x:1 --ntasks=2 --ntasks-per-core=1 --wrap="sleep 60" -c16
```

---

Yes, that's exactly what our job submit plugin does. We want 8 cores (16 CPUs) to be allocated along with each GPU. Here's the relevant section from the plugin:

```
elseif job_desc.partition == "gpu" then
    NIH_qos = "gpu"
    job_desc.cpus_per_task = 16
```

---

Thanks for that information. So your request is asking for 1 core with 16 CPUs, which doesn't exist. Are you trying to keep the tasks on one socket?

---

Do you need any more assistance on this?

---

Thanks for your explanation. We were setting cpus_per_task = 16 as a way to limit per-user allocation of GPUs, i.e. with each GPU we allocated 16 CPUs and set a limit on total CPU allocation per user in the GPU queue. Now that we understand what's happening, we can find an alternate way to avoid using --ntasks-per-core=1, and then we shouldn't hit this problem. When trackable resources are available, we'll switch to using those.

---

Good to hear. One option is to have the jobs in the gpu queue allocate a full socket.
Here's the configuration to do that:

slurm.conf:

```
TaskPlugin=cgroup,affinity
SelectType=select/cons_res
SelectTypeParameters=CR_CORE_Memory,CR_ALLOCATE_FULL_SOCKET
PartitionName=gpu Nodes=cn[0603-0626] ... SelectTypeParameters=CR_SOCKET
```

Also turn off TaskAffinity in cgroup.conf, since you are using the task/affinity plugin for task placement.
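For completeness, the cgroup.conf change mentioned here is a single line (shown in isolation; any other settings already in your cgroup.conf stay as they are):

```
# cgroup.conf
TaskAffinity=no
```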
```
brian@compy:~/slurm/14.11/compy$ sbatch -p gpu -n3 -mblock:block --wrap="hostname"
Submitted batch job 432633
brian@compy:~/slurm/14.11/compy$ sbatch -p gpu -n13 -mblock:block --wrap="hostname"
Submitted batch job 432634
brian@compy:~/slurm/14.11/compy$ sacct -j 432633,432634
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
432633             wrap        gpu     normal         12  COMPLETED      0:0
432633.batch      batch                normal         12  COMPLETED      0:0
432634             wrap        gpu     normal         24  COMPLETED      0:0
432634.batch      batch                normal         24  COMPLETED      0:0
```
knc has 2 sockets with 12 CPUs each. By default, tasks are allocated cyclically across sockets, so block:block packs the tasks onto one socket before allocating on the next socket.
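The AllocCPUS values in the sacct output follow from rounding each request up to whole sockets. A quick sketch of that arithmetic (illustration only, not Slurm code; the 2x12 geometry comes from the comment above):

```python
import math

# Socket geometry assumed from the description above: knc has 2 sockets x 12 CPUs.
CPUS_PER_SOCKET = 12

def alloc_cpus(ntasks):
    """With whole-socket allocation (CR_SOCKET / CR_ALLOCATE_FULL_SOCKET),
    the allocated CPU count rounds up to a multiple of the socket size."""
    return math.ceil(ntasks / CPUS_PER_SOCKET) * CPUS_PER_SOCKET

print(alloc_cpus(3))   # 12 -> matches AllocCPUS for job 432633
print(alloc_cpus(13))  # 24 -> matches AllocCPUS for job 432634
```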
Let us know if you need any more help.
Thanks,
Brian
Problem: With --gres and --ntasks-per-core=1, jobs go into PENDING state ('Resources') and then into JobHeldAdmin state. Releasing the job causes it to cycle through the same states.

```
% sbatch --partition=gpu --gres=gpu:k20x:1 --ntasks=2 --ntasks-per-core=1 ~/test.bat
% scontrol --details show job 138316
JobId=138316 JobName=test.bat
   UserId=susanc(906) GroupId=staff(49) Priority=0 Nice=0 Account=sb QOS=gpu
   JobState=PENDING Reason=JobHeldAdmin Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=10-00:00:00 TimeMin=N/A
   SubmitTime=2015-06-15T08:47:50 EligibleTime=2015-06-15T08:47:50
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=gpu AllocNode:Sid=biowulf2:24819
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   NumNodes=1 NumCPUs=2 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) Gres=gpu:k20x:1 Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/susanc/test.bat
   WorkDir=/usr/local/www/vhost.hpc/htdocs/apps
   StdErr=/usr/local/www/vhost.hpc/htdocs/apps/slurm-138316.out
   StdIn=/dev/null
   StdOut=/usr/local/www/vhost.hpc/htdocs/apps/slurm-138316.out
```

slurmctld.log has:

```
[2015-06-04T07:59:21.461] job_submit.lua: NIH_job_submit: job from susanc
[2015-06-04T07:59:21.461] job_submit.lua: NIH_job_submit: job from susanc, partition = gpu, setting default qos: gpu
[2015-06-04T07:59:21.461] Job submit request: account:(null) begin_time:0 dependency:(null) name:run.gpu partition:gpu qos:gpu submit_uid:906 time_limit:4294967294 user_id:906
[2015-06-04T07:59:21.467] _slurm_rpc_submit_batch_job JobId=69715 usec=6445
[2015-06-04T07:59:23.064] error: cons_res: sync loop not progressing, holding job 69715
[2015-06-04T07:59:23.064] backfill: Failed to start JobId=69715 on cn[0603-0626]: Requested nodes are busy
```

However, all GPU nodes are free. We have gres configured to allocate some CPUs along with each GPU.

gres.conf:

```
###
### NIH slurm generic resource (GRES) configuration file
###
NodeName=cn[0603-0626] Name=gpu Type=k20x File=/dev/nvidia0 CPUs=0-7,16-23
NodeName=cn[0603-0626] Name=gpu Type=k20x File=/dev/nvidia1 CPUs=8-15,24-31
```

Relevant lines from slurm.conf:

```
GresTypes=gpu
NodeName=cn[0603-0626] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=129007 TmpDisk=6450 State=UNKNOWN Weight=300 Feature=cpu32,core16,g128,gpuk20x,ssd800,x2650 Gres=gpu:k20x:2
```
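The "asking for 1 core with 16 cpus" diagnosis from the comments reduces to one line of arithmetic. A sketch (illustration only, not Slurm code; the geometry is taken from the NodeName line above):

```python
# Why cpus_per_task=16 cannot be satisfied together with --ntasks-per-core=1.
THREADS_PER_CORE = 2   # ThreadsPerCore=2 -> each core exposes 2 CPUs (threads)
CPUS_PER_TASK = 16     # forced by the job_submit plugin for the gpu partition

# --ntasks-per-core=1 pins each task to a single core, so the CPUs a task
# can be given are just that one core's threads:
cpus_per_core = THREADS_PER_CORE
print(cpus_per_core)                  # 2
print(CPUS_PER_TASK > cpus_per_core)  # True -> no core has 16 CPUs; job is held
```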