| Summary: | topo gres count underflow error | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Davide Vanzo <davide.vanzo> |
| Component: | slurmctld | Assignee: | Director of Support <support> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 18.08.4 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6746 | | |
| Site: | Vanderbilt | | |
| Attachments: | slurmctld log, slurmd log (gpu0010), Slurm configuration, GRES configuration | | |
Description
Davide Vanzo
2019-03-20 09:38:01 MDT
Created attachment 9633 [details]
slurmctld log
Created attachment 9634 [details]
slurmd log (gpu0010)
Created attachment 9635 [details]
Slurm configuration
Created attachment 9636 [details]
GRES configuration
For reference, here are the details of the two jobs reported in the error.

> JobId=7107167 JobName=titan.batch
> UserId=khanfm(161909) GroupId=nbody(58698) MCS_label=N/A
> Priority=10 Nice=0 Account=nbody_acc QOS=nbody_maxwell_acc
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:15
> RunTime=2-14:35:57 TimeLimit=5-00:00:00 TimeMin=N/A
> SubmitTime=2019-03-17T20:05:59 EligibleTime=2019-03-17T20:08:00
> AccrueTime=2019-03-17T20:08:00
> StartTime=2019-03-17T20:08:52 EndTime=2019-03-22T20:08:52 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-03-17T20:08:52
> Partition=maxwell AllocNode:Sid=gw346:21185
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=gpu[0001-0005,0008-0010]
> BatchHost=gpu0001
> NumNodes=8 NumCPUs=96 NumTasks=32 CPUs/Task=3 ReqB:S:C:T=0:0:*:*
> TRES=cpu=96,mem=800G,node=8,billing=96,gres/gpu=32
> Socks/Node=* NtasksPerN:B:S:C=4:0:*:* CoreSpec=*
> Nodes=gpu[0001-0005,0008-0010] CPU_IDs=0-11 Mem=102400 GRES_IDX=gpu(IDX:0-3)
> MinCPUsNode=12 MinMemoryNode=100G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/gpfs23/data/nbody/khanfm/phi-GRAPE+GPU/5Mrun/run-3m/titan.batch
> WorkDir=/gpfs23/data/nbody/khanfm/phi-GRAPE+GPU/5Mrun/run-3m
> StdErr=/gpfs23/data/nbody/khanfm/phi-GRAPE+GPU/5Mrun/run-3m/run_phi.out
> StdIn=/dev/null
> StdOut=/gpfs23/data/nbody/khanfm/phi-GRAPE+GPU/5Mrun/run-3m/run_phi.out
> Power=
> TresPerNode=gpu:4

> JobId=7234306 JobName=svd_trainnew1.slurm
> UserId=luoh3(649317) GroupId=grissom_lab(20461) MCS_label=N/A
> Priority=1 Nice=0 Account=grissom_lab_acc QOS=grissom_lab_maxwell_acc
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=13:21:42 TimeLimit=1-00:10:00 TimeMin=N/A
> SubmitTime=2019-03-19T21:22:17 EligibleTime=2019-03-19T21:22:17
> AccrueTime=2019-03-19T21:22:17
> StartTime=2019-03-19T21:22:59 EndTime=2019-03-20T21:32:59 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-03-19T21:22:59
> Partition=maxwell AllocNode:Sid=gw343:13962
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=gpu0007
> BatchHost=gpu0007
> NumNodes=1 NumCPUs=3 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=3,mem=20G,node=1,billing=3,gres/gpu=2
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> Nodes=gpu0007 CPU_IDs=2-3,6 Mem=20480 GRES_IDX=gpu(IDX:1-2)
> MinCPUsNode=1 MinMemoryNode=20G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/gpfs22/home/luoh3/code/svd_trainnew1.slurm
> WorkDir=/gpfs22/home/luoh3/code
> StdErr=/gpfs22/home/luoh3/code/slurm-7234306.out
> StdIn=/dev/null
> StdOut=/gpfs22/home/luoh3/code/slurm-7234306.out
> Power=
> TresPerNode=gpu:2

Hi Davide,

This error is caused by restarting the slurmctld while a job with multiple GPUs allocated to it is running. This is fixed in Slurm 18.08.6; see bug 6370 for more information. If you still run into this error after upgrading, please reopen this ticket and let us know.

Thanks!
Michael

*** This ticket has been marked as a duplicate of ticket 6370 ***

Michael,

I do not know if this is related to my previous inquiry or not. What I am observing is that jobs requesting multiple GPUs with "--gres-flags=enforce-binding" get GPUs belonging to different PCIe root complexes.
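For context, a request like this is typically submitted with a batch script along the following lines (a minimal illustrative sketch; the job name, task counts, and script body are assumptions, not taken from the actual submissions):

```bash
#!/bin/bash
#SBATCH --job-name=gpu-binding-test      # hypothetical job name, for illustration only
#SBATCH --partition=pascal               # GPU partition, as in the jobs quoted below
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:2                     # two GPUs on one node
#SBATCH --gres-flags=enforce-binding     # restrict the job to CPUs bound to the selected GPUs

# List the GPUs actually handed to the job step.
srun nvidia-smi -L
```

With enforce-binding the expectation raised in this ticket is that both GPUs come from the same CPU socket / PCIe root complex as the allocated cores.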
For example:

> JobId=7235226 JobName=D1790G-eq
> UserId=jsmith(90423) GroupId=csb(10005) MCS_label=N/A
> Priority=1 Nice=0 Account=csb_gpu_acc QOS=csb_gpu_pascal_acc
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=14:51:38 TimeLimit=1-16:00:00 TimeMin=N/A
> SubmitTime=2019-03-19T22:47:22 EligibleTime=2019-03-19T22:47:22
> AccrueTime=2019-03-19T22:47:22
> StartTime=2019-03-19T22:47:34 EndTime=2019-03-21T14:47:34 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-03-19T22:47:34
> Partition=pascal AllocNode:Sid=gw341:13528
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=gpu0024
> BatchHost=gpu0024
> NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=2,mem=4G,node=1,billing=2,gres/gpu=2
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> Nodes=gpu0024 CPU_IDs=0,4 Mem=4096 GRES_IDX=gpu(IDX:0,2)
> MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/D1790G/eq/./runeq.sb
> WorkDir=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/D1790G/eq
> StdErr=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/D1790G/eq/runeq.log
> StdIn=/dev/null
> StdOut=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/D1790G/eq/runeq.log
> Power=
> GresEnforceBind=Yes
> TresPerNode=gpu:2

> JobId=7235218 JobName=E1784K-eq
> UserId=jsmith(90423) GroupId=csb(10005) MCS_label=N/A
> Priority=1 Nice=0 Account=csb_gpu_acc QOS=csb_gpu_pascal_acc
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=14:53:22 TimeLimit=1-16:00:00 TimeMin=N/A
> SubmitTime=2019-03-19T22:46:29 EligibleTime=2019-03-19T22:46:29
> AccrueTime=2019-03-19T22:46:29
> StartTime=2019-03-19T22:47:34 EndTime=2019-03-21T14:47:34 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-03-19T22:47:34
> Partition=pascal AllocNode:Sid=gw341:13528
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=gpu0017
> BatchHost=gpu0017
> NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=2,mem=4G,node=1,billing=2,gres/gpu=2
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> Nodes=gpu0017 CPU_IDs=0,4 Mem=4096 GRES_IDX=gpu(IDX:0,2)
> MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/E1784K/eq/./runeq.sb
> WorkDir=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/E1784K/eq
> StdErr=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/E1784K/eq/runeq.log
> StdIn=/dev/null
> StdOut=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/E1784K/eq/runeq.log
> Power=
> GresEnforceBind=Yes
> TresPerNode=gpu:2

Happy to open a new ticket if it is not related.

Davide

Hey Davide,

I don't believe they are related, but then again I'm not sure I understand what the problem is. At any rate, it would help us out if you could file a separate ticket for that.

Thanks!
Michael

Marking this as a duplicate. The GPU enforce-binding issue mentioned in comment 9 has been moved to bug 6746. It may well turn out to be a symptom of the same issue, but until we know for sure, discussion will continue there.

*** This ticket has been marked as a duplicate of ticket 6370 ***
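One way to check how allocated GPUs map to PCIe root complexes on an affected node (a sketch, assuming NVIDIA GPUs and that nvidia-smi is installed on the compute nodes) is to print the topology matrix from inside an allocation:

```bash
# gpu0024 is used here only because it appears in the jobs above; substitute any affected node.
# In the matrix, GPU pairs marked PIX/PXB share a PCIe switch/root complex,
# while pairs marked SYS have to cross the inter-socket interconnect.
srun -p pascal -w gpu0024 --gres=gpu:2 nvidia-smi topo -m
```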