Description
Michael Hinton
2021-02-23 12:09:09 MST
Created attachment 18094 [details]
20.11.4 DEBUG v1
Kilian,
Would you mind applying the attached DEBUG patch, restarting slurmd on a few nodes with multiple GPUs that are using AutoDetect, and attaching the slurmd.log outputs? I'd like at least one for an Intel node and one for an AMD node. I'm hoping this will explain why the core affinities are being incorrectly converted in xcpuinfo_mac_to_abs().
Thanks,
-Michael
Hi Kilian,

Do you think you will be able to run the debug patch at some point?

Hi Michael,

(In reply to Michael Hinton from comment #3)
> Do you think you will be able to run the debug patch at some point?

Yep, sorry for the delay, I'll send you the info shortly.

Cheers,
--
Kilian

Created attachment 18163 [details]
slurmd logs w/ debug patch
Here are the slurmd logs with the debug patch, from an Intel node (sh02-14n08) and an AMD node (sh03-13n14). I provided the output of lstopo and nvidia-smi topo -m for both.
Let me know if you need anything else!
Cheers,
--
Kilian
Great! I'll investigate this some more and get back to you.

Thanks,
-Michael

Created attachment 18165 [details]
v2
I have some good news (at the expense of exposing myself as being stupid): I found the issue and have a patch for you!
Attached is a v2 that fixes the issue in commit [1/2] and then adds optional debugging in [2/2]. Once v2 appears to fix everything, you can go ahead and drop commit [2/2], though it won't hurt to leave it in, since it's just extra logging info in the slurmd.log.
It turns out that my effort to improve AutoDetect for 20.11 in the case where threads per core > 1 made AutoDetect erroneously give the detected GPUs affinity for ALL cores. Hence, I'm stupid :) Commit [1/2] should fix that.
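To illustrate what the fix is supposed to do, here is a simplified sketch of the idea only, not the actual Slurm code: the function name, the single-word mask, and the assumption that sibling hardware threads get consecutive CPU ids are all made up for the example. The per-thread CPU mask reported for a GPU should be folded down to one bit per core, rather than ending up covering every core on the node.

#include <stdint.h>
#include <stdio.h>

/*
 * Simplified sketch, not Slurm's implementation: fold a CPU (hardware
 * thread) mask into a core mask, assuming sibling threads of a core have
 * consecutive CPU ids. With threads_per_core == 2, CPUs 0-11 fold down to
 * cores 0-5; the regression instead ended up marking every core.
 */
static uint64_t cpu_mask_to_core_mask(uint64_t cpu_mask, int threads_per_core)
{
	uint64_t core_mask = 0;

	for (int cpu = 0; cpu < 64; cpu++)
		if (cpu_mask & (1ULL << cpu))
			core_mask |= 1ULL << (cpu / threads_per_core);
	return core_mask;
}

int main(void)
{
	uint64_t cpus_0_to_11 = (1ULL << 12) - 1;	/* bits 0-11 set */

	/* Prints 0x3f, i.e. cores 0-5 */
	printf("core mask: 0x%llx\n",
	       (unsigned long long)cpu_mask_to_core_mask(cpus_0_to_11, 2));
	return 0;
}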
Just to clarify what you should now see for the GPUs on sh03-13n14 in the slurmd.log NVML debug2 output. The old output:
CPU Affinity Range - Machine: 0-15
Core Affinity Range - Abstract: 0-31
...
CPU Affinity Range - Machine: 16-31
Core Affinity Range - Abstract: 0-31
Should now show up as:
CPU Affinity Range - Machine: 0-15
Core Affinity Range - Abstract: 0-15
...
CPU Affinity Range - Machine: 16-31
Core Affinity Range - Abstract: 16-31
For that node (and for a lot of your nodes), the physical/machine CPU ids happen to be the same as the logical/"Slurm abstract" CPU ids, as verified by the lstopo output. And since you have threads per core == 1, CPU ids == core ids.
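To make the two id spaces concrete, here is a hypothetical sketch (the lookup table and names are invented for illustration, not taken from Slurm): the machine/physical CPU id is first remapped to the Slurm abstract CPU id, then divided by threads per core to get the abstract core id. On nodes like these the remap is the identity and threads per core is 1, so all three numbers coincide.

#include <stdio.h>

#define NCPUS 32

/*
 * Hypothetical illustration, not Slurm code: converting a machine (physical)
 * CPU id to an abstract core id is a table lookup followed by a division by
 * threads per core. Here the physical->abstract map happens to be the
 * identity and threads_per_core == 1, so machine CPU 17 -> abstract CPU 17
 * -> abstract core 17. On a node whose physical ids alternate over sockets,
 * phys_to_abstract[] would instead interleave (0, 16, 1, 17, ...).
 */
static const int phys_to_abstract[NCPUS] = {
	 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
	16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
};

static int machine_cpu_to_abstract_core(int machine_cpu, int threads_per_core)
{
	return phys_to_abstract[machine_cpu] / threads_per_core;
}

int main(void)
{
	/* Prints 17 on this identity-mapped, 1 thread/core layout */
	printf("machine CPU 17 -> abstract core %d\n",
	       machine_cpu_to_abstract_core(17, 1));
	return 0;
}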
After applying v2 and restarting your slurmds, make sure to check a few of your nodes to see if they have the correct core affinity, and cross-check that with `nvidia-smi topo -m` (which will show physical/machine CPU ids, I think) and `lstopo-no-graphics` (which should show both physical ids (P#) and logical ids (L#)).
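If you'd rather not eyeball the lstopo output by hand, a few lines of hwloc (the library lstopo is built on) can print both ids per core; this is just a convenience sketch for cross-checking, not something that ships with Slurm:

/* Build with: gcc coreids.c -o coreids -lhwloc */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
	hwloc_topology_t topo;

	hwloc_topology_init(&topo);
	hwloc_topology_load(topo);

	int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
	for (int i = 0; i < ncores; i++) {
		hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);

		/* logical_index is the L# that lstopo prints; os_index is the P# */
		printf("Core L#%u  P#%u\n", core->logical_index, core->os_index);
	}

	hwloc_topology_destroy(topo);
	return 0;
}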
Sorry for the regression, and thanks for helping us find this. Let me know how it goes!
-Michael
(In reply to Michael Hinton from comment #8)
> Just to clarify what you should now see for the GPUs on sh03-13n14 in the
> slurmd.log NVML debug2 output.

Via commit [2/2], you should also see something like:

BUG_10932: xcpuinfo_mac_to_abs: INPUT: machine CPU range: 0-11
...
BUG_10932: xcpuinfo_mac_to_abs: OUTPUT: abstract core range: 0-5

proving that the conversion from the machine CPU range reported by NVML to the abstract core range used by Slurm occurred correctly.

Created attachment 18179 [details]
slurmd logs w/ topo info after patch v2

Hi Michael,

Thank you very much for the patch. I think it fixes the issue in all the topologies I was able to check (including another one where the physical CPU ids are spread around the sockets in a round-robin fashion). I'm attaching lstopo, nvidia-smi topo and slurmd debug logs with your patch applied for those 4 topologies:

* sh03-13n15: AMD node with the highest CPU ids on NUMA node 0 (and GPU ids not matching the device minor numbers). This is the same hardware as sh03-13n14.
* sh02-14n08: Intel node with contiguous physical CPU ids
* sh02-16n05: Intel node with physical CPU ids alternating over sockets, GPUs attached to a single socket
* sh01-27n21: Intel node with physical CPU ids alternating over sockets, GPUs attached to both sockets

I think that all those cases are covered correctly and the CPU affinities generated by the NVML gres plugin match the actual topology, so that looks great!

And I also believe that this may be fixing bug 10827, correct? On sh03-13n15, I now see a links matrix that seems to be correctly oriented:

sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: -1,0,0,0
sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,-1,0,0
sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,0,-1,0
sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: 0,0,0,-1

Thanks!
--
Kilian

(In reply to Kilian Cavalotti from comment #10)
> Hi Michael,
>
> Thank you very much for the patch. I think it fixes the issue in all the
> topologies I was able to check (including another one where the physical
> CPU ids are spread around the sockets in a round-robin fashion). I'm
> attaching lstopo, nvidia-smi topo and slurmd debug logs with your patch
> applied for those 4 topologies:
>
> * sh03-13n15: AMD node with the highest CPU ids on NUMA node 0 (and GPU
> ids not matching the device minor numbers). This is the same hardware as
> sh03-13n14.
> * sh02-14n08: Intel node with contiguous physical CPU ids
> * sh02-16n05: Intel node with physical CPU ids alternating over sockets,
> GPUs attached to a single socket
> * sh01-27n21: Intel node with physical CPU ids alternating over sockets,
> GPUs attached to both sockets
>
> I think that all those cases are covered correctly and the CPU affinities
> generated by the NVML gres plugin match the actual topology, so that
> looks great!

Excellent! I'll review your logs just to make sure everything looks correct, but barring anything strange, I'll go ahead and get the patch on the review queue so it can make it into 20.11.5.

> And I also believe that this may be fixing bug 10827, correct? On
> sh03-13n15, I now see a links matrix that seems to be correctly oriented:
>
> sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml:
> NVLinks: -1,0,0,0
> sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml:
> NVLinks: 0,-1,0,0
> sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml:
> NVLinks: 0,0,-1,0
> sh03-13n15.int slurmd[144136]: debug2: gpu/nvml: _get_system_gpu_list_nvml:
> NVLinks: 0,0,0,-1

I don't see how this patch could possibly fix the nvlinks issue... Maybe this is just proof of the intermittent nature of the issue. I'll continue my thoughts on this in bug 10827.

Thanks!
-Michael

(In reply to Michael Hinton from comment #11)
> Excellent! I'll review your logs just to make sure everything looks
> correct, but barring anything strange, I'll go ahead and get the patch on
> the review queue so it can make it into 20.11.5.

Great, thank you!

> I don't see how this patch could possibly fix the nvlinks issue... Maybe
> this is just proof of the intermittent nature of the issue. I'll continue
> my thoughts on this in bug 10827.

Ah, you're right: on sh03-13n14, even though the matrix looks correct, job submission with --gres gpu:4 still fails with the "Invalid generic resource (gres) specification" error. I'll follow up in bug 10827.

Cheers,
--
Kilian

Kilian,

This has now been fixed in 20.11.5 with commit https://github.com/SchedMD/slurm/commit/d98b3538ea. I'll go ahead and close this out.

Thanks for helping us find and fix this bug!
-Michael

Hi Michael,

On Mon, Mar 8, 2021 at 2:33 PM <bugs@schedmd.com> wrote:
> This has now been fixed in 20.11.5 with commit
> https://github.com/SchedMD/slurm/commit/d98b3538ea.

Awesome, thanks for the update!

Cheers,
--
Kilian