Hello all,

Recently we have observed that for some jobs two GPUs belonging to different PCIe root complexes are allocated even when "--gres-flags=enforce-binding" is specified. We have not been able to reproduce the issue on the same nodes, which suggests that the gres.conf configuration is correct. Here are two jobs that showed the association issue:

> JobId=7235226 JobName=D1790G-eq
> UserId=jsmith(90423) GroupId=csb(10005) MCS_label=N/A
> Priority=1 Nice=0 Account=csb_gpu_acc QOS=csb_gpu_pascal_acc
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=14:51:38 TimeLimit=1-16:00:00 TimeMin=N/A
> SubmitTime=2019-03-19T22:47:22 EligibleTime=2019-03-19T22:47:22
> AccrueTime=2019-03-19T22:47:22
> StartTime=2019-03-19T22:47:34 EndTime=2019-03-21T14:47:34 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-03-19T22:47:34
> Partition=pascal AllocNode:Sid=gw341:13528
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=gpu0024
> BatchHost=gpu0024
> NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=2,mem=4G,node=1,billing=2,gres/gpu=2
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> Nodes=gpu0024 CPU_IDs=0,4 Mem=4096 GRES_IDX=gpu(IDX:0,2)
> MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/D1790G/eq/./runeq.sb
> WorkDir=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/D1790G/eq
> StdErr=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/D1790G/eq/runeq.log
> StdIn=/dev/null
> StdOut=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/D1790G/eq/runeq.log
> Power=
> GresEnforceBind=Yes
> TresPerNode=gpu:2

> JobId=7235218 JobName=E1784K-eq
> UserId=jsmith(90423) GroupId=csb(10005) MCS_label=N/A
> Priority=1 Nice=0 Account=csb_gpu_acc QOS=csb_gpu_pascal_acc
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=14:53:22 TimeLimit=1-16:00:00 TimeMin=N/A
> SubmitTime=2019-03-19T22:46:29 EligibleTime=2019-03-19T22:46:29
> AccrueTime=2019-03-19T22:46:29
> StartTime=2019-03-19T22:47:34 EndTime=2019-03-21T14:47:34 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-03-19T22:47:34
> Partition=pascal AllocNode:Sid=gw341:13528
> ReqNodeList=(null) ExcNodeList=(null)
> NodeList=gpu0017
> BatchHost=gpu0017
> NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=2,mem=4G,node=1,billing=2,gres/gpu=2
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> Nodes=gpu0017 CPU_IDs=0,4 Mem=4096 GRES_IDX=gpu(IDX:0,2)
> MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/E1784K/eq/./runeq.sb
> WorkDir=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/E1784K/eq
> StdErr=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/E1784K/eq/runeq.log
> StdIn=/dev/null
> StdOut=/dors/csb/home/jsmith/proj/Nav/aMD/5x0m/DMPC/E1784K/eq/runeq.log
> Power=
> GresEnforceBind=Yes
> TresPerNode=gpu:2

On the two nodes, P2P communication is available only between GPUs (0,1) and (2,3). Please let me know if you need additional information.

Davide
Created attachment 9671 [details] Slurm configuration
Created attachment 9672 [details] GRES configuration
Created attachment 9673 [details] slurmd log gpu0017
Created attachment 9674 [details] slurmd log gpu0024
Hi Davide,

Let me see if I understand correctly: are you expecting to see `GRES_IDX=gpu(IDX:0,1)` instead of `GRES_IDX=gpu(IDX:0,2)` when enforce-binding is specified? And is this because GPUs 0-1 are on a separate PCIe root complex from GPUs 2-3?

The slurmd logs indicate that slurmd was restarted partway through the job. So this may in fact be related to bug 6729, like you mentioned. I'm interested to see whether the fix in Slurm 18.08.6 prevents this problem from happening. If you can reproduce: before the slurmd restart occurs, do jobs show `IDX:0,2` or `IDX:0,1`? In other words, is the restart messing up the binding, or is the binding wrong from the very start of the job?

Thanks,
-Michael
Michael,

The restarts were due to the work we were doing for #6639. Unfortunately I have not found a way to reproduce the issue yet, so I am relying on feedback from the user who first reported it. What I can do is look for a correlation between jobs reporting this issue and slurmd restarts.

Davide
Michael,

The incorrect GPU binding seems to have nothing to do with the slurmd restarts: I have identified some jobs that present the same issue and that started after the latest slurmd restart. No other changes to Slurm have been applied while those jobs have been running.

Davide
(In reply to Davide Vanzo from comment #7)
> The incorrect GPU binding seems to have nothing to do with the slurmd
> restarts: I have identified some jobs that present the same issue and that
> started after the latest slurmd restart. No other changes to Slurm have
> been applied while those jobs have been running.

Could you send the logs and Slurm commands for these jobs as well? Thanks
Created attachment 9725 [details] job 7395004 info
Created attachment 9726 [details] job 7395004 batch script
Created attachment 9727 [details] slurmd log gpu0021
Created attachment 9728 [details] slurmctld log
Michael, I have attached all the information you requested. Please let me know if you need anything else while the job is still running. Davide
Hi Davide,

You actually stumbled upon one of the motivations for the cons_tres select plugin work being done in Slurm 19.05. Unfortunately, it turns out that in Slurm 18.08 (with SelectType=cons_res), there is no way to say "I want two GPUs and I want them to be on the same socket." If this happens, it's only by chance. And since Slurm will try to schedule GPUs sequentially, it might happen the majority of the time. But it's not guaranteed, as you are occasionally seeing.

The enforce-binding flag only checks that each allocated GPU is given a core that matches its `CPUs` configuration in gres.conf; it does NOT guarantee that two allocated GPUs are on the same socket. You can see this with job 7395004 on node gpu0021. The job asked for 1 node, with 2 tasks (1 CPU/task, thus 2 CPUs) and 2 GPUs within that node. gres.conf shows this:

> NodeName=gpu[0013-0034] Name=gpu File=/dev/nvidia[0-1] CPUs=0-3
> NodeName=gpu[0013-0034] Name=gpu File=/dev/nvidia[2-3] CPUs=4-7

What Slurm actually allocated was this:

> Nodes=gpu0021 CPU_IDs=0,5 Mem=8192 GRES_IDX=gpu(IDX:0,3)

enforce-binding worked as expected here: CPU 0 was allocated due to GPU 0, and CPU 5 was allocated due to GPU 3. So what likely happened is that GPUs 1 and 2 were being used by some other single-GPU jobs, and Slurm therefore allocated GPU 3, since nothing requires the 2 GPUs for the job to be on the same socket.

In 19.05, if `SelectType=cons_tres` is specified in slurm.conf, a user could specify this:

> --gpus-per-socket=2

and that would guarantee the behavior you are looking for. In the meantime, here are a few ideas for dealing with the issue:

* You could allocate a whole node exclusively to a job. Then, if 2 GPUs are requested, they will likely be the first 2 GPUs found, which will naturally be on the same socket.
* You could allocate GPUs for a job in multiples of 2 (similar in spirit to memory alignment): if a job allocates 1 GPU, it breaks the "alignment," causing fragmentation and a chance that the remaining GPUs on the node are on different sockets for a 2-GPU job.
* You could resubmit a job and hope for a better allocation if it happens to be allocated GPUs that are not on the same socket (perhaps even excluding the misaligned node from the resubmission, so you don't get it again).
* Finally, you could put up with this behavior until Slurm 19.05 is released in a month.

This isn't an exhaustive list; there might be better ways to handle this that I haven't thought of. Let me know if that works for you!

Thanks,
Michael
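To illustrate, the enforce-binding check described above can be modeled with a small Python sketch (hypothetical code, not Slurm source; the GPU-to-CPU mapping is taken from the gres.conf shown above). It only requires each allocated GPU to have a matching allocated CPU, which is why the `IDX:0,3` allocation passes even though the GPUs sit on different sockets:

```python
# Hypothetical model of the enforce-binding check (not Slurm source code).
# Per the gres.conf above: GPUs 0-1 bind to CPUs 0-3 (socket 0),
# GPUs 2-3 bind to CPUs 4-7 (socket 1).
GPU_CPUS = {0: range(0, 4), 1: range(0, 4), 2: range(4, 8), 3: range(4, 8)}

def passes_enforce_binding(gpus, cpus):
    """Each allocated GPU must have at least one allocated CPU
    drawn from its gres.conf CPUs list."""
    return all(any(c in GPU_CPUS[g] for c in cpus) for g in gpus)

def same_socket(gpus):
    """True if all allocated GPUs share one socket / PCIe root complex."""
    return len({0 if g < 2 else 1 for g in gpus}) == 1

# Job 7395004's actual allocation: GPUs 0 and 3 with CPUs 0 and 5.
print(passes_enforce_binding([0, 3], [0, 5]))  # True: binding check satisfied
print(same_socket([0, 3]))                     # False: yet GPUs span both sockets
```

In 19.05 with cons_tres, `--gpus-per-socket=2` adds the same-socket constraint that this per-GPU check alone does not express.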
Michael,

Thank you for the clarification. This is quite strange, since I can successfully prevent a job from getting two GPUs on separate sockets with the following test:

1. Allocate three jobs on GPUs #0, #1 and #2
2. Cancel the job on GPU #1
3. Launch a job requesting two GPUs on the same socket

The last job remains pending until one of the two remaining jobs completes or is cancelled. So the logic seems to work, but as you said it may not always hold in less ideal situations.

When transitioning to 19.05, gres.conf is no longer supported, correct?
(In reply to Davide Vanzo from comment #15)
> Michael,
>
> Thank you for the clarification. This is quite strange, since I can
> successfully prevent a job from getting two GPUs on separate sockets with
> the following test:
>
> 1. Allocate three jobs on GPUs #0, #1 and #2
> 2. Cancel the job on GPU #1
> 3. Launch a job requesting two GPUs on the same socket
>
> The last job remains pending until one of the two remaining jobs completes
> or is cancelled. So the logic seems to work, but as you said it may not
> always hold in less ideal situations.

That does seem strange, because 18.08 doesn't look at sockets when scheduling GPUs. How are you doing step 3? Maybe some other resource is causing the job to pend instead.

> When transitioning to 19.05, gres.conf is no longer supported, correct?

Actually, gres.conf will still be needed. However, you won't have to manually specify all the fields if you don't want to (assuming you are using NVIDIA GPUs, have NVML installed, and specify AutoDetect=nvml).
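For reference, with autodetection a 19.05-style gres.conf can shrink to something like the sketch below. This is a hedged example, assuming NVIDIA GPUs with NVML installed; slurm.conf would still declare `GresTypes=gpu` and the per-node `Gres=` counts as it does today:

```
# gres.conf (19.05 sketch, assumes NVML is available):
# NVML autodetects the device files, core affinity, and GPU links.
AutoDetect=nvml
```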
> That does seem strange, because 18.08 doesn't look at sockets when
> scheduling GPUs. How are you doing step 3? Maybe some other resource is
> causing the job to pend instead.

Is it possible that it works in 18.08.4 (our currently running version) but not in 18.08.8? Step 3 is done by simply submitting a job requesting two GPUs with enforce-binding, nothing else. And no, no other job is contending for the resources.

> Actually, gres.conf will still be needed. However, you won't have to
> manually specify all the fields if you don't want to (assuming you are using
> NVIDIA GPUs, have NVML installed, and specify AutoDetect=nvml).

Ok, thanks
Could you output the resource allocation (job info) of the three test jobs? And are you sure that no other jobs are running on the node?
Michael,

This is the part of the batch script that is common to all three jobs:

> #SBATCH --account=accre_gpu_acc
> #SBATCH --partition=pascal
> #SBATCH --nodelist=gpu0022
> #SBATCH --ntasks=1
> #SBATCH --time=1:00:00
>
> deviceQuery
> sleep 3600

With deviceQuery I can see exactly which PCIe bus each GPU is connected to. Here is how I conducted the test:

1. Submit three jobs requesting one GPU each. From their output:

> slurm-7566843.out: Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
> slurm-7566845.out: Device PCI Domain ID / Bus ID / location ID: 0 / 130 / 0
> slurm-7566847.out: Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0

2. Kill job 7566847, so that one GPU is now allocated per PCIe root complex.

3. Submit a job requesting two GPUs WITHOUT enforce-binding:

> slurm-7566843.out: Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
> slurm-7566845.out: Device PCI Domain ID / Bus ID / location ID: 0 / 130 / 0
> slurm-7566940.out: Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
> slurm-7566940.out: Device PCI Domain ID / Bus ID / location ID: 0 / 131 / 0

As expected, the new job is allocated the two available GPUs on two different PCIe root complexes.

4. Kill job 7566940 and start a new job requesting two GPUs WITH enforce-binding.
Now, as expected, the job remains pending:

> $ scontrol show job 7567042 -dd
> JobId=7567042 JobName=gpu_2.sb
> UserId=vanzod(389801) GroupId=accre(36014) MCS_label=N/A
> Priority=9240 Nice=0 Account=accre_gpu_acc QOS=accre_gpu_pascal_acc
> JobState=PENDING Reason=Resources Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=00:00:00 TimeLimit=01:00:00 TimeMin=N/A
> SubmitTime=2019-04-05T11:54:54 EligibleTime=2019-04-05T11:54:54
> AccrueTime=2019-04-05T11:54:54
> StartTime=2019-04-05T12:50:04 EndTime=2019-04-05T13:50:04 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-04-05T12:03:49
> Partition=pascal AllocNode:Sid=gw343:30162
> ReqNodeList=gpu0022 ExcNodeList=(null)
> NodeList=(null) SchedNodeList=gpu[0014,0020,0022]
> NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=1,mem=1G,node=1,billing=1,gres/gpu=2
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/gpfs22/home/vanzod/Test/gpu_2.sb
> WorkDir=/gpfs22/home/vanzod/Test
> StdErr=/gpfs22/home/vanzod/Test/slurm-7567042.out
> StdIn=/dev/null
> StdOut=/gpfs22/home/vanzod/Test/slurm-7567042.out
> Power=
> GresEnforceBind=Yes
> TresPerNode=gpu:2

Here are the details for the other two single-GPU jobs:

> JobId=7566843 JobName=gpu_1.sb
> UserId=vanzod(389801) GroupId=accre(36014) MCS_label=N/A
> Priority=9232 Nice=0 Account=accre_gpu_acc QOS=accre_gpu_pascal_acc
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=00:16:06 TimeLimit=01:00:00 TimeMin=N/A
> SubmitTime=2019-04-05T11:49:48 EligibleTime=2019-04-05T11:49:48
> AccrueTime=2019-04-05T11:49:48
> StartTime=2019-04-05T11:49:49 EndTime=2019-04-05T12:49:49 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-04-05T11:49:49
> Partition=pascal AllocNode:Sid=gw343:30162
> ReqNodeList=gpu0022 ExcNodeList=(null)
> NodeList=gpu0022
> BatchHost=gpu0022
> NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=1,mem=1G,node=1,billing=1,gres/gpu=1
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> Nodes=gpu0022 CPU_IDs=0 Mem=1024 GRES_IDX=gpu(IDX:0)
> MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/gpfs22/home/vanzod/Test/gpu_1.sb
> WorkDir=/gpfs22/home/vanzod/Test
> StdErr=/gpfs22/home/vanzod/Test/slurm-7566843.out
> StdIn=/dev/null
> StdOut=/gpfs22/home/vanzod/Test/slurm-7566843.out
> Power=
> TresPerNode=gpu:1

> JobId=7566845 JobName=gpu_1.sb
> UserId=vanzod(389801) GroupId=accre(36014) MCS_label=N/A
> Priority=9232 Nice=0 Account=accre_gpu_acc QOS=accre_gpu_pascal_acc
> JobState=RUNNING Reason=None Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
> DerivedExitCode=0:0
> RunTime=00:16:10 TimeLimit=01:00:00 TimeMin=N/A
> SubmitTime=2019-04-05T11:50:04 EligibleTime=2019-04-05T11:50:04
> AccrueTime=2019-04-05T11:50:04
> StartTime=2019-04-05T11:50:04 EndTime=2019-04-05T12:50:04 Deadline=N/A
> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> LastSchedEval=2019-04-05T11:50:04
> Partition=pascal AllocNode:Sid=gw343:30162
> ReqNodeList=gpu0022 ExcNodeList=(null)
> NodeList=gpu0022
> BatchHost=gpu0022
> NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> TRES=cpu=1,mem=1G,node=1,billing=1,gres/gpu=1
> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> Nodes=gpu0022 CPU_IDs=4 Mem=1024 GRES_IDX=gpu(IDX:2)
> MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
> Features=(null) DelayBoot=00:00:00
> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> Command=/gpfs22/home/vanzod/Test/gpu_1.sb
> WorkDir=/gpfs22/home/vanzod/Test
> StdErr=/gpfs22/home/vanzod/Test/slurm-7566845.out
> StdIn=/dev/null
> StdOut=/gpfs22/home/vanzod/Test/slurm-7566845.out
> Power=
> TresPerNode=gpu:1
Hi Davide,

Ok, I think I see what's happening. The number of CPUs requested makes a huge difference.

Since job 7567042 requests 2 GPUs and 1 CPU, enforce-binding _does_ guarantee that both GPUs are on the same socket. This is because there is only one CPU for the GPUs to match against, so the GPUs of necessity _must_ be on the same socket. From https://slurm.schedmd.com/srun.html#OPT_enforce-binding:

> "The only CPUs available to the job will be those bound to the selected GRES (i.e. the CPUs identified in the gres.conf file will be strictly enforced)... For example a job requiring two GPUs and one CPU will be delayed until both GPUs on a single socket are available rather than using GPUs bound to separate sockets... Requires the node to be configured with more than one socket and resource filtering will be performed on a per-socket basis."

However, nothing guarantees that two GPUs will be on the same socket when two or more CPUs are requested. The jobs that originally raised this ticket all have 2 CPUs. This means that one CPU/GPU pair could be on one socket, and another CPU/GPU pair could be on a different socket. As long as each GPU is on the same socket as one of the CPUs, the enforce-binding check still passes. So if job 7567042 had requested 2 CPUs instead of 1, I would expect the job to run, even with enforce-binding specified.

What this means is that another solution to the original problem is to use only a single CPU with enforce-binding, which guarantees that both GPUs will be on the same socket.

(In reply to Davide Vanzo from comment #17)
> Is it possible that it works in 18.08.4 (our currently running version) but
> not in 18.08.8?

It's possible, but I highly doubt it.

Thanks,
-Michael
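The single-CPU guarantee can be checked exhaustively with a small combinatorial sketch (hypothetical Python, not Slurm source; the GPU-to-CPU mapping follows the gres.conf discussed in this ticket). With only one allocated CPU, the only 2-GPU pairs that can ever satisfy the binding check are same-socket pairs; with one CPU per socket, a cross-socket pair passes too:

```python
from itertools import combinations

# Hypothetical model (not Slurm source): GPUs 0-1 bind to CPUs 0-3,
# GPUs 2-3 bind to CPUs 4-7, per the gres.conf from this ticket.
GPU_CPUS = {0: range(0, 4), 1: range(0, 4), 2: range(4, 8), 3: range(4, 8)}

def binding_ok(gpus, cpus):
    # enforce-binding: every GPU needs a matching CPU among those allocated
    return all(any(c in GPU_CPUS[g] for c in cpus) for g in gpus)

# With a single allocated CPU, which 2-GPU pairs can ever pass the check?
for pair in combinations(range(4), 2):
    feasible = any(binding_ok(pair, [cpu]) for cpu in range(8))
    print(pair, feasible)  # only (0, 1) and (2, 3) are feasible

# With two CPUs, one per socket, a cross-socket pair also passes:
print(binding_ok((0, 3), [0, 5]))  # True
```

In sbatch terms, the workaround amounts to requesting something like `--ntasks=1 --cpus-per-task=1 --gres=gpu:2 --gres-flags=enforce-binding` (an illustrative combination, not taken from the ticket's scripts).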
Michael,

Good point. However, we enforce a 1:2 GPU:CPU ratio via job_submit.lua, and CPU IDs are bound to GPU IDs in gres.conf. Does this mean that the associations in gres.conf are ignored?
Anyway, you are right. If I ask for 2 CPU cores, the job starts despite the binding enforcement. Oh well, we'll wait for 19.05 then.
(In reply to Davide Vanzo from comment #21) > Michael, > > Good point. However we restrict a 1:2 GPU:CPU ratio via job_submit.lua and > CPU IDs are bound to GPU IDs in gres.conf. Does this mean that the > associations in gres.conf are ignored? I'm not quite sure. How are you implementing the 1:2 GPU:CPU ratio restriction in your lua script? I doubt that Slurm ignores what's in gres.conf at any point.
(In reply to Davide Vanzo from comment #22) > Anyway, you are right. If I ask for 2 CPU cores it starts despite the bind > enforcement. Oh well, we'll wait for 19.05 then. Thanks for double-checking that for me. Luckily, 19.05 is right around the corner!
> Thanks for double-checking that for me. Luckily, 19.05 is right around the
> corner!

You'll definitely hear from me if I need help with the upgrade. In the meantime, thanks for the help; feel free to close the ticket. Have a great weekend!
Ok great, thanks! Closing out ticket. -Michael