Ticket 11881

Summary: gres/gpu resources incorrect after job resize
Product: Slurm
Reporter: Dylan Simon <dsimon>
Component: GPU
Assignee: Director of Support <support>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
CC: skyler
Version: 20.02.5
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=12143
https://bugs.schedmd.com/show_bug.cgi?id=6251
https://bugs.schedmd.com/show_bug.cgi?id=12728
https://bugs.schedmd.com/show_bug.cgi?id=7390
Site: Simons Foundation & Flatiron Institute
Version Fixed: 21.08.2
Ticket Blocks: 12143    

Description Dylan Simon 2021-06-22 13:11:41 MDT
After resizing a job to remove nodes with "scontrol update job nodelist", the specific gres/gpu devices allocated to the job on each node are incorrect.  This can cause the same GPU to be allocated to multiple jobs.  The example below demonstrates the problem; notice the specific gpu IDX on each node.

>salloc -p gpu --gpus-per-task=1 -n7
salloc: Granted job allocation 1049427
salloc: Waiting for resource configuration
salloc: Nodes workergpu[17,20-21,32] are ready for job
>scontrol show -d job $SLURM_JOBID
JobId=1049427 JobName=zsh
   UserId=dylan(1135) GroupId=dylan(1135) MCS_label=N/A
   Priority=4294698854 Nice=0 Account=scc QOS=gen
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:18 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2021-06-22T14:58:21 EligibleTime=2021-06-22T14:58:21
   AccrueTime=Unknown
   StartTime=2021-06-22T14:58:21 EndTime=2021-06-29T14:58:21 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-06-22T14:58:21
   Partition=gpu AllocNode:Sid=rusty1:12863
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=workergpu[17,20-21,32]
   BatchHost=workergpu17
   NumNodes=4 NumCPUs=7 NumTasks=7 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=7,mem=126000M,node=4,billing=7,gres/gpu=7
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=gpu:7
     Nodes=workergpu17 CPU_IDs=18-19 Mem=36000 GRES=gpu:2(IDX:2-3)
     Nodes=workergpu20 CPU_IDs=20-21 Mem=36000 GRES=gpu:2(IDX:2-3)
     Nodes=workergpu21 CPU_IDs=0,20 Mem=36000 GRES=gpu:2(IDX:0,2)
     Nodes=workergpu32 CPU_IDs=21 Mem=18000 GRES=gpu:1(IDX:3)
   MinCPUsNode=1 MinMemoryCPU=18000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/mnt/home/dylan
   Power=
   TresPerTask=gpu:1
   MailUser=(null) MailType=NONE

>scontrol update job=$SLURM_JOBID NodeList=workergpu17,workergpu21,workergpu32
To reset Slurm environment variables, execute
  For bash or sh shells:  . ./slurm_job_1049427_resize.sh
  For csh shells:         source ./slurm_job_1049427_resize.csh
>scontrol show -d job $SLURM_JOBID
JobId=1049427 JobName=zsh
   UserId=dylan(1135) GroupId=dylan(1135) MCS_label=N/A
   Priority=4294698854 Nice=0 Account=scc QOS=gen
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:57 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2021-06-22T14:58:21 EligibleTime=2021-06-22T14:58:21
   AccrueTime=Unknown
   ResizeTime=2021-06-22T14:59:16
   StartTime=2021-06-22T14:58:21 EndTime=2021-06-29T14:58:21 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-06-22T14:58:21
   Partition=gpu AllocNode:Sid=rusty1:12863
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=workergpu[17,21,32]
   BatchHost=workergpu17
   NumNodes=3 NumCPUs=5 NumTasks=7 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=5,mem=90000M,node=3,billing=7,gres/gpu=7
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=gpu:7
     Nodes=workergpu17 CPU_IDs=18-19 Mem=36000 GRES=gpu:2(IDX:2-3)
     Nodes=workergpu21 CPU_IDs=0,20 Mem=36000 GRES=gpu:2(IDX:2-3)
     Nodes=workergpu32 CPU_IDs=21 Mem=18000 GRES=gpu:2(IDX:0,2)
   MinCPUsNode=1 MinMemoryCPU=18000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/mnt/home/dylan
   Power=
   TresPerTask=gpu:1
   MailUser=(null) MailType=NONE

>exit
salloc: Relinquishing job allocation 1049427

In particular, workergpu21 now has a different two GPUs, and workergpu32 now has 2 where it previously had 1.  (It looks like the gres entry for the removed node didn't get removed, so the remaining entries shifted onto the wrong nodes.)  The TRES gres/gpu count is also not updated, but that's a more minor issue.
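As an illustrative model of that hypothesis (this is not Slurm's actual internal representation, which keeps per-node GRES state in C bitmaps; the function names here are invented for the sketch): if the node is dropped from the node list but its per-node GRES entry is not dropped in step, the surviving entries re-pair positionally with the wrong nodes, which reproduces exactly the shifted IDX output above.

```python
def resize_buggy(nodes, gres_per_node, remove):
    """Drop a node from the node list but forget to drop its per-node
    GRES entry, so the remaining entries pair with the wrong nodes
    (the behavior observed in this ticket)."""
    new_nodes = [n for n in nodes if n != remove]
    # BUG: gres_per_node is left untouched and re-paired by position.
    return dict(zip(new_nodes, gres_per_node))

def resize_fixed(nodes, gres_per_node, remove):
    """Drop the node and its GRES entry together."""
    return {n: g for n, g in zip(nodes, gres_per_node) if n != remove}

nodes = ["workergpu17", "workergpu20", "workergpu21", "workergpu32"]
gres = ["IDX:2-3", "IDX:2-3", "IDX:0,2", "IDX:3"]

# Buggy: workergpu21 inherits workergpu20's "IDX:2-3" and workergpu32
# inherits "IDX:0,2" -- matching the post-resize scontrol output above.
print(resize_buggy(nodes, gres, "workergpu20"))
print(resize_fixed(nodes, gres, "workergpu20"))
```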

Not much unusual in the logs (though happy to enable more debug if it's helpful):

[2021-06-22T14:58:21.852] sched: _slurm_rpc_allocate_resources JobId=1049427 NodeList=workergpu[17,20-21,32] usec=4079
[2021-06-22T14:58:21.852] prolog_running_decr: Configuration for JobId=1049427 is complete
[2021-06-22T14:59:16.936] sched: _update_job: setting nodes to workergpu17,workergpu21,workergpu32 for JobId=1049427
[2021-06-22T14:59:16.936] Killing JobId=1049427 StepId=Extern on failed node workergpu20
[2021-06-22T14:59:16.980] _slurm_rpc_update_job: complete JobId=1049427 uid=1135 usec=44927
[2021-06-22T14:59:24.554] _job_complete: JobId=1049427 WEXITSTATUS 0
[2021-06-22T14:59:24.555] _job_complete: JobId=1049427 done
Comment 1 Skyler Malinowski 2021-06-23 08:30:09 MDT
Hi Dylan,

Thank you for providing the steps to reproduce the issue. From them I can reproduce a different issue, but one that is still related to how `scontrol update job` handles GPUs.

```
[2021-06-23T09:35:12.184] sched: _slurm_rpc_allocate_resources JobId=183 NodeList=qa-n[0-3] usec=6051
[2021-06-23T09:35:12.184] prolog_running_decr: Configuration for JobId=183 is complete
[2021-06-23T09:35:26.378] sched: _update_job: setting nodes to qa-n1,qa-n2,qa-n3 for JobId=183
[2021-06-23T09:35:26.385] _slurm_rpc_update_job: complete JobId=183 uid=1000 usec=7567
[2021-06-23T09:35:54.824] _job_complete: JobId=183 WEXITSTATUS 0
[2021-06-23T09:35:54.824] error: gres/gpu: job 183 dealloc node qa-n3 GRES count underflow (0 < 1)
[2021-06-23T09:35:54.824] _job_complete: JobId=183 done
```

Could you please provide the node configurations as defined in your slurm.conf and gres.conf for workergpu[17,20-21,32]?

Thanks,
Skyler
Comment 2 Dylan Simon 2021-06-23 10:01:51 MDT
I have definitely seen that "GRES underflow" message in the logs for other jobs like this (just not in my quick test).  I think they happen after the conflicting allocations, which only happen after some time (I think because the job keeps using the originally assigned resource, and then a new job is assigned the same resource after the resize -- or it may be the other way around).  We first noticed this because we found two jobs trying to use the same GPU.  I can try to run some more tests to pin this down if it's helpful.

NodeName=workergpu[12-17] CoresPerSocket=18 RealMemory=768000 Sockets=2 TmpDisk=450000 Weight=55 Gres=gpu:v100-32gb:1,gpu:v100-32gb:1,gpu:v100-32gb:1,gpu:v100-32gb:1 Feature=gpu,skylake,v100,v100-32gb
NodeName=workergpu[18-42] CoresPerSocket=20 RealMemory=768000 Sockets=2 TmpDisk=450000 Weight=65 Gres=gpu:v100-32gb:1,gpu:v100-32gb:1,gpu:v100-32gb:1,gpu:v100-32gb:1 Feature=gpu,skylake,v100,v100-32gb

NodeName=workergpu[12-17] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia0 Cores=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
NodeName=workergpu[18-42] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia0 Cores=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38
NodeName=workergpu[12-17] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia1 Cores=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
NodeName=workergpu[18-42] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia1 Cores=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38
NodeName=workergpu[18-42] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia2 Cores=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39
NodeName=workergpu[12-17] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia2 Cores=18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35
NodeName=workergpu[18-42] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia3 Cores=1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39
NodeName=workergpu[12-17] Name=gpu Type=v100-32gb Count=1 File=/dev/nvidia3 Cores=18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35
Comment 3 Skyler Malinowski 2021-06-24 13:19:43 MDT
Thanks for the information. I think I have enough information to track down the source of the issue. I'll keep you posted.
Comment 4 Skyler Malinowski 2021-06-30 13:30:23 MDT
I have confirmed that, after a job resize, the data from `scontrol show -d job $SLURM_JOBID` and `scontrol show -d nodes $SLURM_JOB_NODELIST` are inconsistent. Basically the job data is lying and the node data is correct. So while the underlying resource allocation is correct, certain environment variables may be incorrect.

I will work on a patch to correct this behavior.
Comment 5 Dylan Simon 2021-07-01 09:59:29 MDT
Sounds good.  Just want to confirm that I definitely see cases where two jobs end up allocated to the same GPUs after these resizes.  Maybe this is a byproduct of the job info.  I haven't been able to reproduce it exactly, but for a real example right now, see workergpu40:

NodeName=workergpu40 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=4 CPUTot=40 CPULoad=3.94
   AvailableFeatures=gpu,skylake,v100,v100-32gb
   ActiveFeatures=gpu,skylake,v100,v100-32gb
   Gres=gpu:v100-32gb:4(S:0-1)
   GresDrain=N/A
   GresUsed=gpu:v100-32gb:4(IDX:0-3)
   NodeAddr=workergpu40 NodeHostName=workergpu40 Version=20.02.5
   OS=Linux 5.4.96.1.fi #1 SMP Sun Feb 7 20:29:42 EST 2021
   RealMemory=768000 AllocMem=72000 FreeMem=749208 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=450000 Weight=60 Owner=N/A MCS_label=N/A
   Partitions=gpu
   BootTime=2021-06-30T09:56:59 SlurmdStartTime=2021-06-30T09:56:58
   CfgTRES=cpu=40,mem=750G,billing=40,gres/gpu=4
   AllocTRES=cpu=4,mem=72000M,gres/gpu=4
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

JobId=1056557 JobName=g2c2
   UserId=XXX(1413) GroupId=XXX(1413) MCS_label=N/A
   Priority=4294691783 Nice=0 Account=ccm QOS=gen
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=1-01:42:03 TimeLimit=10-00:00:00 TimeMin=N/A
   SubmitTime=2021-06-30T10:07:50 EligibleTime=2021-06-30T10:07:50
   AccrueTime=2021-06-30T10:07:51
   StartTime=2021-06-30T10:07:51 EndTime=2021-07-10T10:07:51 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-06-30T10:07:51
   Partition=gpu AllocNode:Sid=rusty1:1668636
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=workergpu40
   BatchHost=workergpu40
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=36000M,node=1,billing=2,gres/gpu=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=gpu:2
     Nodes=workergpu40 CPU_IDs=0-1 Mem=36000 GRES=gpu:2(IDX:0-1)
   MinCPUsNode=2 MinMemoryCPU=18000M MinTmpDiskNode=0
   Features=v100-32gb&skylake DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=XXX
   WorkDir=XXX
   StdErr=XXX
   StdIn=/dev/null
   StdOut=XXX
   Power=
   TresPerNode=gpu:2
   MailUser=(null) MailType=NONE

JobId=1057529 JobName=wrap
   UserId=YYY(1567) GroupId=YYY(1567) MCS_label=N/A
   Priority=4294690811 Nice=0 Account=ccm QOS=gen
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=03:39:55 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2021-07-01T08:09:31 EligibleTime=2021-07-01T08:09:31
   AccrueTime=2021-07-01T08:09:31
   StartTime=2021-07-01T08:10:00 EndTime=2021-07-08T08:10:00 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-07-01T08:10:00
   Partition=gpu AllocNode:Sid=rusty1:4050954
   ReqNodeList=(null) ExcNodeList=workergpu[00-01]
   NodeList=workergpu[14,40,46]
   BatchHost=workergpu14
   NumNodes=3 NumCPUs=6 NumTasks=6 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=6,mem=108000M,node=3,billing=6,gres/gpu=6
   Socks/Node=* NtasksPerN:B:S:C=2:0:*:* CoreSpec=*
   JOB_GRES=gpu:6
     Nodes=workergpu14 CPU_IDs=0-1 Mem=36000 GRES=gpu:2(IDX:0-1)
     Nodes=workergpu40 CPU_IDs=2-3 Mem=36000 GRES=gpu:2(IDX:0-1)
     Nodes=workergpu46 CPU_IDs=0-1 Mem=36000 GRES=gpu:2(IDX:0-1)
   MinCPUsNode=2 MinMemoryCPU=18000M MinTmpDiskNode=0
   Features=v100-32gb DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=YYY
   StdErr=YYY
   StdIn=/dev/null
   StdOut=YYY
   Power=
   TresPerNode=gpu:2
   MailUser=(null) MailType=NONE

These jobs are definitely both sharing GPUs 0-1 (according to nvidia-smi), and nothing is using 2-3.  (I'm not sure if there's a way to check the cgroup devices list for the processes.)  Another job had previously been resized off of workergpu40; it now looks like this (unfortunately I don't have the pre-resize state):

JobId=1057528 JobName=wrap
   UserId=YYY(1567) GroupId=YYY(1567) MCS_label=N/A
   Priority=4294690812 Nice=0 Account=ccm QOS=gen
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=03:43:46 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2021-07-01T08:08:18 EligibleTime=2021-07-01T08:08:18
   AccrueTime=2021-07-01T08:08:18
   ResizeTime=2021-07-01T08:09:17
   StartTime=2021-07-01T08:08:46 EndTime=2021-07-08T08:08:46 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-07-01T08:08:30
   Partition=gpu AllocNode:Sid=rusty1:4050954
   ReqNodeList=(null) ExcNodeList=workergpu[00-01]
   NodeList=workergpu13
   BatchHost=workergpu13
   NumNodes=1 NumCPUs=2 NumTasks=6 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=36000M,node=1,billing=2,gres/gpu=6
   Socks/Node=* NtasksPerN:B:S:C=2:0:*:* CoreSpec=*
   JOB_GRES=gpu:6
     Nodes=workergpu13 CPU_IDs=0-1 Mem=36000 GRES=gpu:2(IDX:0-1)
   MinCPUsNode=2 MinMemoryCPU=18000M MinTmpDiskNode=0
   Features=v100-32gb DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=YYY
   StdErr=YYY
   StdIn=/dev/null
   StdOut=YYY
   Power=
   TresPerNode=gpu:2
   MailUser=(null) MailType=NONE
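On the aside about checking the cgroup devices list: one way (a sketch, with assumptions) is to read the job step's cgroup v1 `devices.list`, e.g. somewhere under `/sys/fs/cgroup/devices/slurm/`, and decode the allowed character devices. This assumes the conventional NVIDIA numbering where `/dev/nvidiaN` is char major 195, minor N, and 195:255 is `/dev/nvidiactl`; the path layout and file format are assumptions, not verified on this cluster.

```python
import re

def allowed_gpus(devices_list: str):
    """Return GPU indices permitted by a cgroup v1 devices.list,
    assuming /dev/nvidiaN is char major 195 with minor N."""
    gpus = []
    for line in devices_list.splitlines():
        m = re.match(r"c 195:(\d+) ", line)
        # 195:255 is /dev/nvidiactl (and 195:254 nvidia-modeset), not a GPU
        if m and int(m.group(1)) < 254:
            gpus.append(int(m.group(1)))
    return sorted(gpus)

sample = "c 1:3 rwm\nc 195:0 rwm\nc 195:1 rwm\nc 195:255 rwm\n"
print(allowed_gpus(sample))  # [0, 1]
```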
Comment 6 Skyler Malinowski 2021-07-13 13:47:27 MDT
Thanks for the additional outputs.

> Just want to confirm that I definitely see cases where two jobs end up
> allocated to the same GPUs after these resizes.  Maybe this is a
> byproduct of the job info.
The job update might be incorrectly updating certain bitstrings, which could explain the erroneous gres sharing, the gres underflow on deallocation, and the incorrect `scontrol show job` output. I am still investigating, though.


Typically we suggest against shrinking running jobs, and enforcing that via `SchedulerParameters=disable_job_shrink` may also mitigate the issue until a patch can be applied.
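For reference, that restriction is a one-line slurm.conf setting (shown alone here; on a real cluster it would be comma-appended to any existing SchedulerParameters values):

```
# slurm.conf: reject scontrol requests that shrink a running job's node list
SchedulerParameters=disable_job_shrink
```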

You have me curious as to your use case for shrinking running jobs.
Comment 7 Dylan Simon 2021-07-13 14:58:16 MDT
Thanks. We have a number of cases where people want to run very short serial jobs. We also generally have an interest in keeping the queue size small and focusing on exclusive allocations (though obviously not for the GPUs in this case).  As such we developed, and encourage the use of, a batch tool that runs multiple tasks per allocation so nodes/data can be staged once: https://github.com/flatironinstitute/disBatch
It's essentially a very simple mini-scheduler. One of its features is that it automatically releases nodes when it runs out of tasks for them. For now we can just disable resizing on GPU nodes.
Comment 9 Michael Hinton 2021-08-27 19:16:20 MDT
Hey Dylan,

I'm able to reproduce the issue locally, so hopefully I can track down a fix.

Thanks!
-Michael
Comment 10 Michael Hinton 2021-08-30 15:20:12 MDT
Hi Dylan,

Would you be interested in testing out a patch to see if it solves the problem for you, once a patch is available?

-Michael
Comment 11 Dylan Simon 2021-08-31 08:23:29 MDT
Absolutely... in theory.  The one complication is that we use Bright, which has its own build configuration for Slurm that we don't know how to replicate, but we can definitely send them a patch and ask for a custom build; it may just take a bit more time.
Comment 12 Michael Hinton 2021-08-31 09:52:35 MDT
(In reply to Dylan Simon from comment #11)
> Absolutely... in theory.  The one complication is that we use bright, which
> has its own build configuration for slurm that we don't know how to
> replicate, but we can definitely send them a patch and ask for a custom
> build, it just make take a bit more time.
Don't worry about it. This change would most likely only go into 22.05, since it's been broken like this for a long time. So I would need to backport the patch to 20.02 for you if you were to test it, which is just added work.
Comment 22 Michael Hinton 2021-09-30 13:06:00 MDT
Hi Dylan,

GRES resizing should now work properly in the upcoming 21.08.2 release. For more details, see the following commits: https://github.com/SchedMD/slurm/compare/5a0a5c331285...5d6b93a1e1e8.

I'll go ahead and close this bug out, but feel free to reopen if you think we missed something.

Thanks!
-Michael