Description
Dylan Simon
2021-06-22 13:11:41 MDT
Hi Dylan,

Thank you for providing the steps to reproduce the issue. From them I can reproduce a different, but related, issue in how `scontrol update job` is handled with GPUs:

```
[2021-06-23T09:35:12.184] sched: _slurm_rpc_allocate_resources JobId=183 NodeList=qa-n[0-3] usec=6051
[2021-06-23T09:35:12.184] prolog_running_decr: Configuration for JobId=183 is complete
[2021-06-23T09:35:26.378] sched: _update_job: setting nodes to qa-n1,qa-n2,qa-n3 for JobId=183
[2021-06-23T09:35:26.385] _slurm_rpc_update_job: complete JobId=183 uid=1000 usec=7567
[2021-06-23T09:35:54.824] _job_complete: JobId=183 WEXITSTATUS 0
[2021-06-23T09:35:54.824] error: gres/gpu: job 183 dealloc node qa-n3 GRES count underflow (0 < 1)
[2021-06-23T09:35:54.824] _job_complete: JobId=183 done
```

Could you please provide the node configurations as defined in your slurm.conf and gres.conf for workergpu[17,20-21,32]?

Thanks,
Skyler

I have definitely seen that "GRES underflow" message in the logs for other jobs like this (just not in my quick test). I think it happens after the conflicting allocations, which only appear after some time (I suspect because the job keeps using the originally assigned resource, and a new job is then assigned the same resource after the resize -- or it may be the other way around). We first noticed this because we found two jobs trying to use the same GPU. I can try to run some more tests to pin this down if it's helpful.
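The "GRES count underflow (0 < 1)" error fires when job deallocation tries to subtract more GRES from a node than the job is currently recorded as holding there. One way this can happen after a resize is that the per-node accounting is cleared or shifted by the node-list update while deallocation still runs against the old allocation. A simplified conceptual model (plain Python, not Slurm's actual code):

```python
# Conceptual model of per-node GRES accounting around a job resize.
# This is NOT Slurm source code; it only illustrates how a decrement
# can underflow if the recorded count was already dropped by a resize.

class GresAccount:
    def __init__(self):
        self.alloc = {}  # node -> GPU count recorded for the job

    def allocate(self, node, count):
        self.alloc[node] = self.alloc.get(node, 0) + count

    def resize_away(self, node):
        # Node removed from the job by "scontrol update job ... NodeList=...";
        # its recorded count for this job is cleared.
        self.alloc.pop(node, None)

    def deallocate(self, node, count):
        have = self.alloc.get(node, 0)
        if have < count:
            # Mirrors the slurmctld log:
            # "dealloc node ... GRES count underflow (0 < 1)"
            return f"GRES count underflow ({have} < {count})"
        self.alloc[node] = have - count
        return None

acct = GresAccount()
acct.allocate("qa-n3", 1)
acct.resize_away("qa-n3")            # accounting for qa-n3 dropped by resize
print(acct.deallocate("qa-n3", 1))   # -> GRES count underflow (0 < 1)
```

This is only a sketch of the failure shape; which bitstrings or counters actually get out of sync is what Skyler investigates below.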
slurm.conf:

```
NodeName=workergpu[12-17] CoresPerSocket=18 RealMemory=768000 Sockets=2 TmpDisk=450000 Weight=55 Gres=gpu:v100-32gb:1,gpu:v100-32gb:1,gpu:v100-32gb:1,gpu:v100-32gb:1 Feature=gpu,skylake,v100,v100-32gb
NodeName=workergpu[18-42] CoresPerSocket=20 RealMemory=768000 Sockets=2 TmpDisk=450000 Weight=65 Gres=gpu:v100-32gb:1,gpu:v100-32gb:1,gpu:v100-32gb:1,gpu:v100-32gb:1 Feature=gpu,skylake,v100,v100-32gb
```

gres.conf:

```
NodeName=workergpu[12-17] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia0 Cores=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
NodeName=workergpu[18-42] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia0 Cores=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38
NodeName=workergpu[12-17] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia1 Cores=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
NodeName=workergpu[18-42] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia1 Cores=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38
NodeName=workergpu[18-42] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia2 Cores=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39
NodeName=workergpu[12-17] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia2 Cores=18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35
NodeName=workergpu[18-42] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia3 Cores=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39
NodeName=workergpu[12-17] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia3 Cores=18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35
```

Thanks for the information. I think I have enough information to track down the source of the issue. I'll keep you posted.

I have confirmed that there appears to be incongruent data between `scontrol show -d job $SLURM_JOBID` and `scontrol show -d nodes $SLURM_JOB_NODELIST` after a job resize. Basically the job data is lying and the node data is correct. While the resource allocation itself is correct, certain environment variables may be incorrect. I will work on a patch to correct this behavior.

Sounds good.
Just want to confirm that I definitely see cases where two jobs end up allocated to the same GPUs after these resizes. Maybe this is a byproduct of the job info. I haven't been able to reproduce it exactly, but for a real example right now, see workergpu40:
NodeName=workergpu40 Arch=x86_64 CoresPerSocket=20
CPUAlloc=4 CPUTot=40 CPULoad=3.94
AvailableFeatures=gpu,skylake,v100,v100-32gb
ActiveFeatures=gpu,skylake,v100,v100-32gb
Gres=gpu:v100-32gb:4(S:0-1)
GresDrain=N/A
GresUsed=gpu:v100-32gb:4(IDX:0-3)
NodeAddr=workergpu40 NodeHostName=workergpu40 Version=20.02.5
OS=Linux 5.4.96.1.fi #1 SMP Sun Feb 7 20:29:42 EST 2021
RealMemory=768000 AllocMem=72000 FreeMem=749208 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=450000 Weight=60 Owner=N/A MCS_label=N/A
Partitions=gpu
BootTime=2021-06-30T09:56:59 SlurmdStartTime=2021-06-30T09:56:58
CfgTRES=cpu=40,mem=750G,billing=40,gres/gpu=4
AllocTRES=cpu=4,mem=72000M,gres/gpu=4
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
JobId=1056557 JobName=g2c2
UserId=XXX(1413) GroupId=XXX(1413) MCS_label=N/A
Priority=4294691783 Nice=0 Account=ccm QOS=gen
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=1-01:42:03 TimeLimit=10-00:00:00 TimeMin=N/A
SubmitTime=2021-06-30T10:07:50 EligibleTime=2021-06-30T10:07:50
AccrueTime=2021-06-30T10:07:51
StartTime=2021-06-30T10:07:51 EndTime=2021-07-10T10:07:51 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-06-30T10:07:51
Partition=gpu AllocNode:Sid=rusty1:1668636
ReqNodeList=(null) ExcNodeList=(null)
NodeList=workergpu40
BatchHost=workergpu40
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,mem=36000M,node=1,billing=2,gres/gpu=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
JOB_GRES=gpu:2
Nodes=workergpu40 CPU_IDs=0-1 Mem=36000 GRES=gpu:2(IDX:0-1)
MinCPUsNode=2 MinMemoryCPU=18000M MinTmpDiskNode=0
Features=v100-32gb&skylake DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=XXX
WorkDir=XXX
StdErr=XXX
StdIn=/dev/null
StdOut=XXX
Power=
TresPerNode=gpu:2
MailUser=(null) MailType=NONE
JobId=1057529 JobName=wrap
UserId=YYY(1567) GroupId=YYY(1567) MCS_label=N/A
Priority=4294690811 Nice=0 Account=ccm QOS=gen
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=03:39:55 TimeLimit=7-00:00:00 TimeMin=N/A
SubmitTime=2021-07-01T08:09:31 EligibleTime=2021-07-01T08:09:31
AccrueTime=2021-07-01T08:09:31
StartTime=2021-07-01T08:10:00 EndTime=2021-07-08T08:10:00 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-07-01T08:10:00
Partition=gpu AllocNode:Sid=rusty1:4050954
ReqNodeList=(null) ExcNodeList=workergpu[00-01]
NodeList=workergpu[14,40,46]
BatchHost=workergpu14
NumNodes=3 NumCPUs=6 NumTasks=6 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=6,mem=108000M,node=3,billing=6,gres/gpu=6
Socks/Node=* NtasksPerN:B:S:C=2:0:*:* CoreSpec=*
JOB_GRES=gpu:6
Nodes=workergpu14 CPU_IDs=0-1 Mem=36000 GRES=gpu:2(IDX:0-1)
Nodes=workergpu40 CPU_IDs=2-3 Mem=36000 GRES=gpu:2(IDX:0-1)
Nodes=workergpu46 CPU_IDs=0-1 Mem=36000 GRES=gpu:2(IDX:0-1)
MinCPUsNode=2 MinMemoryCPU=18000M MinTmpDiskNode=0
Features=v100-32gb DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=YYY
StdErr=YYY
StdIn=/dev/null
StdOut=YYY
Power=
TresPerNode=gpu:2
MailUser=(null) MailType=NONE
These jobs are definitely both sharing GPUs 0-1 (according to nvidia-smi), and nothing is using GPUs 2-3. (I'm not sure if there's a way to check the cgroup devices list for the processes.) Another job had previously resized off of workergpu40 and now looks like this (unfortunately I don't have the pre-resize state):
JobId=1057528 JobName=wrap
UserId=YYY(1567) GroupId=YYY(1567) MCS_label=N/A
Priority=4294690812 Nice=0 Account=ccm QOS=gen
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=03:43:46 TimeLimit=7-00:00:00 TimeMin=N/A
SubmitTime=2021-07-01T08:08:18 EligibleTime=2021-07-01T08:08:18
AccrueTime=2021-07-01T08:08:18
ResizeTime=2021-07-01T08:09:17
StartTime=2021-07-01T08:08:46 EndTime=2021-07-08T08:08:46 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-07-01T08:08:30
Partition=gpu AllocNode:Sid=rusty1:4050954
ReqNodeList=(null) ExcNodeList=workergpu[00-01]
NodeList=workergpu13
BatchHost=workergpu13
NumNodes=1 NumCPUs=2 NumTasks=6 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,mem=36000M,node=1,billing=2,gres/gpu=6
Socks/Node=* NtasksPerN:B:S:C=2:0:*:* CoreSpec=*
JOB_GRES=gpu:6
Nodes=workergpu13 CPU_IDs=0-1 Mem=36000 GRES=gpu:2(IDX:0-1)
MinCPUsNode=2 MinMemoryCPU=18000M MinTmpDiskNode=0
Features=v100-32gb DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=YYY
StdErr=YYY
StdIn=/dev/null
StdOut=YYY
Power=
TresPerNode=gpu:2
MailUser=(null) MailType=NONE
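The sharing described above can also be checked mechanically by parsing the per-node `GRES=...(IDX:...)` lines from `scontrol show -d job` output for each running job and intersecting the GPU index sets per node. A standalone sketch (this is not a Slurm tool; the sample lines are copied from the two job records above):

```python
import re

# Matches lines like:
#   Nodes=workergpu40 CPU_IDs=2-3 Mem=36000 GRES=gpu:2(IDX:0-1)
LINE = re.compile(r"Nodes=(\S+).*GRES=gpu:\d+\(IDX:([\d,-]+)\)")

def expand(idx):
    """Expand an IDX string like '0-1,3' into a set of ints."""
    out = set()
    for part in idx.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            out.update(range(lo, hi + 1))
        else:
            out.add(int(part))
    return out

def find_overlaps(jobs):
    """jobs: {jobid: 'scontrol show -d job' text}.
    Returns (node, jobid_a, jobid_b, shared_indices) tuples."""
    claims = {}  # node -> list of (jobid, index set)
    for jobid, text in jobs.items():
        for node, idx in LINE.findall(text):
            claims.setdefault(node, []).append((jobid, expand(idx)))
    overlaps = []
    for node, entries in claims.items():
        for i in range(len(entries)):
            for j in range(i + 1, len(entries)):
                shared = entries[i][1] & entries[j][1]
                if shared:
                    overlaps.append((node, entries[i][0], entries[j][0],
                                     sorted(shared)))
    return overlaps

# Both job records above claim GPUs 0-1 on workergpu40:
jobs = {
    "1056557": "Nodes=workergpu40 CPU_IDs=0-1 Mem=36000 GRES=gpu:2(IDX:0-1)",
    "1057529": "Nodes=workergpu40 CPU_IDs=2-3 Mem=36000 GRES=gpu:2(IDX:0-1)",
}
print(find_overlaps(jobs))  # -> [('workergpu40', '1056557', '1057529', [0, 1])]
```

Note that, per Skyler's finding above, the job data itself may be lying after a resize, so this only confirms what the controller believes, not what the cgroups actually enforce.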
Thanks for the additional outputs.
> Just want to confirm that I definitely see cases where two jobs end up
> allocated to the same GPUs after these resizes. Maybe this is a
> byproduct of the job info.
The job update might be incorrectly updating certain bitstrings, which could explain the erroneous GRES sharing, the GRES underflow on deallocation, and the incorrect `scontrol show job` output. I am still investigating, though.
Typically we suggest against shrinking running jobs, and enforcing that via `SchedulerParameters=disable_job_shrink`. This may also mitigate the issue until a patch can be applied.
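For reference, the suggested mitigation is a one-line slurm.conf change (if `SchedulerParameters` is already set, append the option to the existing comma-separated value), followed by `scontrol reconfigure`:

```
# slurm.conf -- deny user requests to shrink the size of running jobs
SchedulerParameters=disable_job_shrink
```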
You have me curious as to your use case for shrinking running jobs.
Thanks. We have a number of cases where people want to run very short serial jobs. We also generally have an interest in keeping the queue size small and focusing on exclusive allocations (though obviously not for the GPUs in this case). As such we developed, and encourage the use of, a batch tool to run multiple tasks per allocation so nodes/data can be staged once: https://github.com/flatironinstitute/disBatch. It's essentially a very simple mini scheduler. One of its features is that it automatically releases nodes when it runs out of tasks for them. For now we can just disable resizing on GPU nodes.

Hey Dylan,

I'm able to reproduce the issue locally, so hopefully I can track down a fix. Thanks!

-Michael

Hi Dylan,

Would you be interested in testing out a patch to see if it solves the problem for you, once a patch is available?

-Michael

Absolutely... in theory. The one complication is that we use Bright, which has its own build configuration for Slurm that we don't know how to replicate, but we can definitely send them a patch and ask for a custom build; it just may take a bit more time.

(In reply to Dylan Simon from comment #11)
> Absolutely... in theory. The one complication is that we use Bright, which
> has its own build configuration for Slurm that we don't know how to
> replicate, but we can definitely send them a patch and ask for a custom
> build; it just may take a bit more time.

Don't worry about it. This change would most likely only go into 22.05, since it's been broken like this for a long time, so I would need to backport the patch to 20.02 for you if you were to test it, which is just added work.

Hi Dylan,

GRES resizing should now work properly in the upcoming 21.08.2 release. For more details, see the following commits: https://github.com/SchedMD/slurm/compare/5a0a5c331285...5d6b93a1e1e8

I'll go ahead and close this bug out, but feel free to reopen if you think we missed something. Thanks!

-Michael