Ticket 6729

Summary: topo gres count underflow error
Product: Slurm Reporter: Davide Vanzo <davide.vanzo>
Component: slurmctld    Assignee: Director of Support <support>
Status: RESOLVED DUPLICATE
Severity: 4 - Minor Issue    
Priority: ---    
Version: 18.08.4   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=6746
Site: Vanderbilt
Attachments: slurmctld log
slurmd log (gpu0010)
Slurm configuration
GRES configuration

Description Davide Vanzo 2019-03-20 09:38:01 MDT
Hello there,

Recently we started observing in the slurmctld log error messages like this one:

> error: gres/gpu: job 7149683 dealloc node gpu0014 topo gres count underflow (0 1)
The gres.conf configuration has not changed, and the error persists after restarting both slurmctld and slurmd. No particular errors are evident in the slurmd log except for some controller connection failures, which may be the result of the changes we are testing in Bug#6639.
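
To see how widespread the problem was, a quick way to summarize these errors is to extract the job and node from each matching slurmctld log line. The log excerpt below is illustrative (hypothetical timestamps, job IDs, and nodes — not taken from the attached logs):

```shell
# Illustrative slurmctld log excerpt (not from the attached logs)
cat > /tmp/slurmctld_excerpt.log <<'EOF'
[2019-03-20T09:30:01] error: gres/gpu: job 7149683 dealloc node gpu0014 topo gres count underflow (0 1)
[2019-03-20T09:31:12] error: gres/gpu: job 7107167 dealloc node gpu0005 topo gres count underflow (0 1)
[2019-03-20T09:32:45] debug:  sched: Allocate JobId=7234306
EOF

# Print "<job> <node>" for every underflow error
grep 'topo gres count underflow' /tmp/slurmctld_excerpt.log |
  awk '{for (i=1;i<=NF;i++) {if ($i=="job") j=$(i+1); if ($i=="node") n=$(i+1)} print j, n}'
```

Piping the result through `sort | uniq -c` then shows whether the errors cluster on particular nodes.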

Attached you can find our current configuration files and the logs. Please let me know if you need additional information.

Davide
Comment 1 Davide Vanzo 2019-03-20 09:38:36 MDT
Created attachment 9633 [details]
slurmctld log
Comment 2 Davide Vanzo 2019-03-20 09:39:12 MDT
Created attachment 9634 [details]
slurmd log (gpu0010)
Comment 3 Davide Vanzo 2019-03-20 09:39:31 MDT
Created attachment 9635 [details]
Slurm configuration
Comment 4 Davide Vanzo 2019-03-20 09:39:53 MDT
Created attachment 9636 [details]
GRES configuration
Comment 6 Davide Vanzo 2019-03-20 09:48:01 MDT
For reference, here are the details of the two jobs reported in the error.

> JobId=7107167 JobName=titan.batch
>    UserId=khanfm(161909) GroupId=nbody(58698) MCS_label=N/A
>    Priority=10 Nice=0 Account=nbody_acc QOS=nbody_maxwell_acc
>    JobState=RUNNING Reason=None Dependency=(null)
>    Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
>    DerivedExitCode=0:15
>    RunTime=2-14:35:57 TimeLimit=5-00:00:00 TimeMin=N/A
>    SubmitTime=2019-03-17T20:05:59 EligibleTime=2019-03-17T20:08:00
>    AccrueTime=2019-03-17T20:08:00
>    StartTime=2019-03-17T20:08:52 EndTime=2019-03-22T20:08:52 Deadline=N/A
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    LastSchedEval=2019-03-17T20:08:52
>    Partition=maxwell AllocNode:Sid=gw346:21185
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=gpu[0001-0005,0008-0010]
>    BatchHost=gpu0001
>    NumNodes=8 NumCPUs=96 NumTasks=32 CPUs/Task=3 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=96,mem=800G,node=8,billing=96,gres/gpu=32
>    Socks/Node=* NtasksPerN:B:S:C=4:0:*:* CoreSpec=*
>      Nodes=gpu[0001-0005,0008-0010] CPU_IDs=0-11 Mem=102400 GRES_IDX=gpu(IDX:0-3)
>    MinCPUsNode=12 MinMemoryNode=100G MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=/gpfs23/data/nbody/khanfm/phi-GRAPE+GPU/5Mrun/run-3m/titan.batch
>    WorkDir=/gpfs23/data/nbody/khanfm/phi-GRAPE+GPU/5Mrun/run-3m
>    StdErr=/gpfs23/data/nbody/khanfm/phi-GRAPE+GPU/5Mrun/run-3m/run_phi.out
>    StdIn=/dev/null
>    StdOut=/gpfs23/data/nbody/khanfm/phi-GRAPE+GPU/5Mrun/run-3m/run_phi.out
>    Power=
>    TresPerNode=gpu:4

> JobId=7234306 JobName=svd_trainnew1.slurm
>    UserId=luoh3(649317) GroupId=grissom_lab(20461) MCS_label=N/A
>    Priority=1 Nice=0 Account=grissom_lab_acc QOS=grissom_lab_maxwell_acc
>    JobState=RUNNING Reason=None Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    DerivedExitCode=0:0
>    RunTime=13:21:42 TimeLimit=1-00:10:00 TimeMin=N/A
>    SubmitTime=2019-03-19T21:22:17 EligibleTime=2019-03-19T21:22:17
>    AccrueTime=2019-03-19T21:22:17
>    StartTime=2019-03-19T21:22:59 EndTime=2019-03-20T21:32:59 Deadline=N/A
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    LastSchedEval=2019-03-19T21:22:59
>    Partition=maxwell AllocNode:Sid=gw343:13962
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=gpu0007
>    BatchHost=gpu0007
>    NumNodes=1 NumCPUs=3 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=3,mem=20G,node=1,billing=3,gres/gpu=2
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>      Nodes=gpu0007 CPU_IDs=2-3,6 Mem=20480 GRES_IDX=gpu(IDX:1-2)
>    MinCPUsNode=1 MinMemoryNode=20G MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=/gpfs22/home/luoh3/code/svd_trainnew1.slurm
>    WorkDir=/gpfs22/home/luoh3/code
>    StdErr=/gpfs22/home/luoh3/code/slurm-7234306.out
>    StdIn=/dev/null
>    StdOut=/gpfs22/home/luoh3/code/slurm-7234306.out
>    Power=
>    TresPerNode=gpu:2
Comment 8 Michael Hinton 2019-03-20 10:16:24 MDT
Hi Davide,

This error is caused by restarting slurmctld while a job with multiple GPUs allocated to it is running.

This is fixed in Slurm 18.08.6; see bug 6370 for more information. If you still run into this error after upgrading, please reopen this ticket and let us know.
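
Since the fix landed in a specific micro release, one way to check whether a given controller already has it is to compare its version string against 18.08.6. The sketch below uses a placeholder string where a real check would run `slurmctld -V`, and relies on GNU `sort -V` for version-aware ordering:

```shell
# Placeholder version string; on a real system this would be "$(slurmctld -V)"
version="slurm 18.08.4"
fixed="18.08.6"

have=$(printf '%s\n' "$version" | awk '{print $2}')
# sort -V orders version strings component by component (GNU coreutils)
oldest=$(printf '%s\n%s\n' "$have" "$fixed" | sort -V | head -n1)
if [ "$oldest" = "$fixed" ]; then
  echo "fix present ($have >= $fixed)"
else
  echo "upgrade needed ($have < $fixed)"
fi
```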

Thanks!
Michael

*** This ticket has been marked as a duplicate of ticket 6370 ***
Comment 9 Davide Vanzo 2019-03-20 13:14:19 MDT
Michael,

I do not know if this is related to my previous inquiry or not. What I am observing is that jobs requesting multiple GPUs with "--gres-flags=enforce-binding" get GPUs belonging to different PCIe root complexes. For example:

> JobId=7235226 JobName=D1790G-eq
>    UserId=jsmith(90423) GroupId=csb(10005) MCS_label=N/A
>    Priority=1 Nice=0 Account=csb_gpu_acc QOS=csb_gpu_pascal_acc
>    JobState=RUNNING Reason=None Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    DerivedExitCode=0:0
>    RunTime=14:51:38 TimeLimit=1-16:00:00 TimeMin=N/A
>    SubmitTime=2019-03-19T22:47:22 EligibleTime=2019-03-19T22:47:22
>    AccrueTime=2019-03-19T22:47:22
>    StartTime=2019-03-19T22:47:34 EndTime=2019-03-21T14:47:34 Deadline=N/A
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    LastSchedEval=2019-03-19T22:47:34
>    Partition=pascal AllocNode:Sid=gw341:13528
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=gpu0024
>    BatchHost=gpu0024
>    NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=2,mem=4G,node=1,billing=2,gres/gpu=2
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>      Nodes=gpu0024 CPU_IDs=0,4 Mem=4096 GRES_IDX=gpu(IDX:0,2)
>    MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/D1790G/eq/./runeq.sb
>    WorkDir=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/D1790G/eq
>    StdErr=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/D1790G/eq/runeq.log
>    StdIn=/dev/null
>    StdOut=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/D1790G/eq/runeq.log
>    Power=
>    GresEnforceBind=Yes
>    TresPerNode=gpu:2

> JobId=7235218 JobName=E1784K-eq
>    UserId=jsmith(90423) GroupId=csb(10005) MCS_label=N/A
>    Priority=1 Nice=0 Account=csb_gpu_acc QOS=csb_gpu_pascal_acc
>    JobState=RUNNING Reason=None Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    DerivedExitCode=0:0
>    RunTime=14:53:22 TimeLimit=1-16:00:00 TimeMin=N/A
>    SubmitTime=2019-03-19T22:46:29 EligibleTime=2019-03-19T22:46:29
>    AccrueTime=2019-03-19T22:46:29
>    StartTime=2019-03-19T22:47:34 EndTime=2019-03-21T14:47:34 Deadline=N/A
>    PreemptTime=None SuspendTime=None SecsPreSuspend=0
>    LastSchedEval=2019-03-19T22:47:34
>    Partition=pascal AllocNode:Sid=gw341:13528
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=gpu0017
>    BatchHost=gpu0017
>    NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=2,mem=4G,node=1,billing=2,gres/gpu=2
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>      Nodes=gpu0017 CPU_IDs=0,4 Mem=4096 GRES_IDX=gpu(IDX:0,2)
>    MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/E1784K/eq/./runeq.sb
>    WorkDir=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/E1784K/eq
>    StdErr=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/E1784K/eq/runeq.log
>    StdIn=/dev/null
>    StdOut=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/E1784K/eq/runeq.log
>    Power=
>    GresEnforceBind=Yes
>    TresPerNode=gpu:2

Happy to open a new ticket if it is not related.

Davide
Comment 10 Michael Hinton 2019-03-20 15:13:30 MDT
Hey Davide,

I don't believe they are related, but then again I'm not sure I understand what the problem is.

At any rate, it would help us out if you could file a separate ticket for that.

Thanks!
Michael
Comment 11 Michael Hinton 2019-03-25 14:00:45 MDT
Marking this as a duplicate. The GPU enforce-binding issue mentioned in comment 9 has been moved to bug 6746. It may well turn out to be a symptom of the same issue, but until we know for sure, discussion will continue there.

*** This ticket has been marked as a duplicate of ticket 6370 ***