Hello there,

Recently we started observing error messages like this one in the slurmctld log:

> error: gres/gpu: job 7149683 dealloc node gpu0014 topo gres count underflow (0 1)

The gres.conf configuration has not changed, and the error persists after restarting either slurmctld or slurmd. No particular errors are evident in the slurmd log, except some controller connect failures that may be the result of the changes we are testing in Bug#6639.

Attached you can find our current configuration files and the logs. Please let me know if you need additional information.

Davide
Created attachment 9633 [details] slurmctld log
Created attachment 9634 [details] slurmd log (gpu0010)
Created attachment 9635 [details] Slurm configuration
Created attachment 9636 [details] GRES configuration
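For orientation, a gres.conf for a four-GPU node of this kind typically pins each device to the cores on its PCIe root complex, along the following lines. This is only a sketch; the device paths, GPU type, and core ranges are hypothetical, and the real values are in the attachment.

    # gres.conf sketch -- hypothetical values, not the attached file
    NodeName=gpu[0001-0016] Name=gpu Type=maxwell File=/dev/nvidia0 Cores=0-5
    NodeName=gpu[0001-0016] Name=gpu Type=maxwell File=/dev/nvidia1 Cores=0-5
    NodeName=gpu[0001-0016] Name=gpu Type=maxwell File=/dev/nvidia2 Cores=6-11
    NodeName=gpu[0001-0016] Name=gpu Type=maxwell File=/dev/nvidia3 Cores=6-11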
For reference, here are the details of the two jobs reported in the errors.

> JobId=7107167 JobName=titan.batch
> UserId=khanfm(161909) GroupId=nbody(58698) MCS_label=N/A
> Priority=10 Nice=0 Account=nbody_acc QOS=nbody_maxwell_acc
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:15
> RunTime=2-14:35:57 TimeLimit=5-00:00:00 TimeMin=N/A
> SubmitTime=2019-03-17T20:05:59 EligibleTime=2019-03-17T20:08:00
> AccrueTime=2019-03-17T20:08:00
> StartTime=2019-03-17T20:08:52 EndTime=2019-03-22T20:08:52 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-03-17T20:08:52
> Partition=maxwell AllocNode:Sid=gw346:21185
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=gpu[0001-0005,0008-0010]
> BatchHost=gpu0001
> NumNodes=8 NumCPUs=96 NumTasks=32 CPUs/Task=3 ReqB:S:C:T=0:0:*:*
> TRES=cpu=96,mem=800G,node=8,billing=96,gres/gpu=32
> Socks/Node=* NtasksPerN:B:S:C=4:0:*:* CoreSpec=*
> Nodes=gpu[0001-0005,0008-0010] CPU_IDs=0-11 Mem=102400 GRES_IDX=gpu(IDX:0-3)
> MinCPUsNode=12 MinMemoryNode=100G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/gpfs23/data/nbody/khanfm/phi-GRAPE+GPU/5Mrun/run-3m/titan.batch
> WorkDir=/gpfs23/data/nbody/khanfm/phi-GRAPE+GPU/5Mrun/run-3m
> StdErr=/gpfs23/data/nbody/khanfm/phi-GRAPE+GPU/5Mrun/run-3m/run_phi.out
> StdIn=/dev/null
> StdOut=/gpfs23/data/nbody/khanfm/phi-GRAPE+GPU/5Mrun/run-3m/run_phi.out
> Power=
> TresPerNode=gpu:4

> JobId=7234306 JobName=svd_trainnew1.slurm
> UserId=luoh3(649317) GroupId=grissom_lab(20461) MCS_label=N/A
> Priority=1 Nice=0 Account=grissom_lab_acc QOS=grissom_lab_maxwell_acc
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=13:21:42 TimeLimit=1-00:10:00 TimeMin=N/A
> SubmitTime=2019-03-19T21:22:17 EligibleTime=2019-03-19T21:22:17
> AccrueTime=2019-03-19T21:22:17
> StartTime=2019-03-19T21:22:59 EndTime=2019-03-20T21:32:59 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-03-19T21:22:59
> Partition=maxwell AllocNode:Sid=gw343:13962
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=gpu0007
> BatchHost=gpu0007
> NumNodes=1 NumCPUs=3 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=3,mem=20G,node=1,billing=3,gres/gpu=2
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> Nodes=gpu0007 CPU_IDs=2-3,6 Mem=20480 GRES_IDX=gpu(IDX:1-2)
> MinCPUsNode=1 MinMemoryNode=20G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/gpfs22/home/luoh3/code/svd_trainnew1.slurm
> WorkDir=/gpfs22/home/luoh3/code
> StdErr=/gpfs22/home/luoh3/code/slurm-7234306.out
> StdIn=/dev/null
> StdOut=/gpfs22/home/luoh3/code/slurm-7234306.out
> Power=
> TresPerNode=gpu:2
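These records appear to be the output of scontrol with the detail flag, which is what adds the per-node CPU_IDs/Mem/GRES_IDX lines:

    scontrol -d show job 7107167
    scontrol -d show job 7234306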
Hi Davide,

This error is caused by restarting the slurmctld while a job with multiple GPUs allocated to it is running. This is fixed in Slurm 18.08.6; see bug 6370 for more information. If you still run into this error after upgrading, please reopen this ticket and let us know.

Thanks!
Michael

*** This ticket has been marked as a duplicate of ticket 6370 ***
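After upgrading, the running versions on the controller and on a compute node can be confirmed with the standard commands:

    sinfo -V                                    # version of the command-line tools
    scontrol show config | grep SLURM_VERSION   # version reported by slurmctld
    slurmd -V                                   # version on a compute node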
Michael,

I do not know whether this is related to my previous inquiry or not. What I am observing is that jobs requesting multiple GPUs with "--gres-flags=enforce-binding" get GPUs belonging to different PCIe root complexes. For example (a minimal sketch of the request shape follows the job records):

> JobId=7235226 JobName=D1790G-eq
> UserId=jsmith(90423) GroupId=csb(10005) MCS_label=N/A
> Priority=1 Nice=0 Account=csb_gpu_acc QOS=csb_gpu_pascal_acc
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=14:51:38 TimeLimit=1-16:00:00 TimeMin=N/A
> SubmitTime=2019-03-19T22:47:22 EligibleTime=2019-03-19T22:47:22
> AccrueTime=2019-03-19T22:47:22
> StartTime=2019-03-19T22:47:34 EndTime=2019-03-21T14:47:34 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-03-19T22:47:34
> Partition=pascal AllocNode:Sid=gw341:13528
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=gpu0024
> BatchHost=gpu0024
> NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=2,mem=4G,node=1,billing=2,gres/gpu=2
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> Nodes=gpu0024 CPU_IDs=0,4 Mem=4096 GRES_IDX=gpu(IDX:0,2)
> MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/D1790G/eq/./runeq.sb
> WorkDir=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/D1790G/eq
> StdErr=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/D1790G/eq/runeq.log
> StdIn=/dev/null
> StdOut=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/D1790G/eq/runeq.log
> Power=
> GresEnforceBind=Yes
> TresPerNode=gpu:2

> JobId=7235218 JobName=E1784K-eq
> UserId=jsmith(90423) GroupId=csb(10005) MCS_label=N/A
> Priority=1 Nice=0 Account=csb_gpu_acc QOS=csb_gpu_pascal_acc
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=14:53:22 TimeLimit=1-16:00:00 TimeMin=N/A
> SubmitTime=2019-03-19T22:46:29 EligibleTime=2019-03-19T22:46:29
> AccrueTime=2019-03-19T22:46:29
> StartTime=2019-03-19T22:47:34 EndTime=2019-03-21T14:47:34 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-03-19T22:47:34
> Partition=pascal AllocNode:Sid=gw341:13528
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=gpu0017
> BatchHost=gpu0017
> NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=2,mem=4G,node=1,billing=2,gres/gpu=2
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> Nodes=gpu0017 CPU_IDs=0,4 Mem=4096 GRES_IDX=gpu(IDX:0,2)
> MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/E1784K/eq/./runeq.sb
> WorkDir=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/E1784K/eq
> StdErr=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/E1784K/eq/runeq.log
> StdIn=/dev/null
> StdOut=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/E1784K/eq/runeq.log
> Power=
> GresEnforceBind=Yes
> TresPerNode=gpu:2

Happy to open a new ticket if it is not related.

Davide
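A minimal sketch of the request shape involved, for concreteness: the partition, task count, and memory are taken from the records above, while the script body is hypothetical. With enforce-binding, both GPUs should be allocated from the same PCIe root complex as the assigned CPUs.

    #!/bin/bash
    #SBATCH --partition=pascal
    #SBATCH --ntasks=2
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=4G
    #SBATCH --gres=gpu:2
    #SBATCH --gres-flags=enforce-binding

    # Inspect the GPU/CPU affinity matrix on the allocated node to see
    # which root complex each GPU hangs off of:
    nvidia-smi topo -m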
Hey Davide,

I don't believe they are related, but then again I'm not sure I understand what the problem is. At any rate, it would help us out if you could file a separate ticket for that.

Thanks!
Michael
Marking this as a duplicate. The GPU enforce-binding issue mentioned in comment 9 has been moved to bug 6746. It may well turn out to be a symptom of the same issue, but until we know for sure, discussion will continue there.

*** This ticket has been marked as a duplicate of ticket 6370 ***