| Summary: | --gpu-bind=closest appears to result in wrong bindings for our hardware unless I lie to Slurm in gres.conf |
|---|---|
| Product: | Slurm |
| Component: | GPU |
| Reporter: | Chris Samuel (NERSC) <csamuel> |
| Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN |
| Severity: | 4 - Minor Issue |
| Priority: | --- |
| CC: | agaur, dmjacobsen, kilian |
| Version: | 20.11.8 |
| Hardware: | Linux |
| OS: | Linux |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=10933, https://bugs.schedmd.com/show_bug.cgi?id=10827 |
| Site: | NERSC |
| Attachments: | 20.11 v1 |
Description
Chris Samuel (NERSC)
2021-12-06 22:47:16 MST
Comment 1
Michael Hinton

Hi Chris,

(In reply to Chris Samuel (NERSC) from comment #0)
> From what I can tell the cause appears to be because the minor device
> ordering is the opposite of what you would expect, in that /dev/nvidia0 is
> GPU 3, /dev/nvidia1 is GPU 2, /dev/nvidia2 is GPU 1 and /dev/nvidia3 is GPU 0.

I believe this is a known set of issues with 20.11 and earlier that were fixed in 21.08 and beyond with commits https://github.com/SchedMD/slurm/commit/0ebfd37834 and https://github.com/SchedMD/slurm/commit/f589b480d8. I worked with Kilian on these issues in bugs 10827 and 10933 - bug 10827 is public, but bug 10933 is private.

Slurm's GPU code has always assumed that the minor numbering is in the same order as the device order detected by NVML (i.e. PCI bus ID order; nvidia-smi shows this NVML order). This assumption is what the "Note: GPU index X is different from minor number Y" warning was alluding to. (Note that this NVML device order will also match the CUDA device order if CUDA_DEVICE_ORDER=PCI_BUS_ID.)

However, sometimes the NVML device order and the minor number order are not the same, as you have seen. This seems to frequently be the case on newer AMD systems with NVIDIA GPUs, for whatever reason. AutoDetect exacerbated this issue, since it does a bunch of internal sorting, causing the GPU order in Slurm to be changed in unexpected ways. But these issues should now all be fixed with the commits above in 21.08.

Would you be willing to see if things are fixed for you on 21.08?

Thanks!
-Michael

P.S. Slurm further assumed that the trailing number in the device filename was equivalent to the minor number (e.g. X in /dev/nvidiaX). This is a bad assumption with AMD GPUs, so that issue was fixed as well with the above commits, since the device order was decoupled from the minor number and the device filename.

Comment 2
Chris Samuel (NERSC)

Hi Michael,

Thanks for the info, that's really useful. Sadly, due to time pressure on us, we're not going to have an opportunity to go to 21.08 this year, but I'm hoping to make that jump, on Perlmutter at least, very early next year.

In the meantime I'll see if I can backport these 2 commits - do you think that's feasible, or do they rely on too many other changes to work? I can tell at least the first one doesn't apply cleanly to 20.11. :-)

All the best,
Chris

Comment 3
Michael Hinton

You aren't using any AMD GPUs, right?

Comment 4
Michael Hinton

(In reply to Chris Samuel (NERSC) from comment #2)
> In the meantime I'll see if I can backport these 2 commits - do you think
> that's feasible, or do they rely on too many other changes to work? I can
> tell at least the first one doesn't apply cleanly to 20.11. :-)

I can do that for you. I think it's possible to backport to 20.11, but it apparently needs more than those two commits, as I am finding out.

Comment 5
Chris Samuel (NERSC)

(In reply to Michael Hinton from comment #4)
> I can do that for you. I think it's possible to backport to 20.11, but it
> apparently needs more than those two commits, as I am finding out.

Oh fantastic, thank you so much!

All the best,
Chris
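A quick way to check whether a node exhibits the ordering mismatch Michael describes in comment 1 is to compare nvidia-smi's device listing (NVML/PCI bus ID order) against the minor numbers it reports. This is only a diagnostic sketch using standard nvidia-smi options; it is not part of the patch discussed in this bug.

```bash
# nvidia-smi enumerates GPUs in NVML order (PCI bus ID order).
nvidia-smi --query-gpu=index,pci.bus_id --format=csv

# The full query also reports each GPU's "Minor Number", i.e. the X in
# /dev/nvidiaX. If the minor numbers do not increase with the bus IDs,
# the node has the reversed ordering described in comment 1.
nvidia-smi -q | grep -E 'Minor Number|Bus Id'
```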
Comment 6
Michael Hinton

Created attachment 22673 [details]
20.11 v1

Chris, can you try out 20.11 v1 and see if it solves the issue? I believe it should, but I might have missed something. Thanks!
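Before the test results below, here is a minimal sketch of one way the attached patch could be applied and rolled into RPMs for testing; the tarball and patch file names are assumptions, not details taken from this bug.

```bash
# Assumed file names: substitute the real attachment 22673 file and the
# 20.11.8 source tarball used locally.
tar xjf slurm-20.11.8.tar.bz2
cd slurm-20.11.8
patch -p1 < ../20.11_v1.patch                 # attachment 22673 ("20.11 v1")
cd ..
tar cjf slurm-20.11.8.tar.bz2 slurm-20.11.8   # re-pack under the original name so the spec's Source matches
rpmbuild -ta slurm-20.11.8.tar.bz2            # build RPMs from the spec shipped inside the tarball
```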
Comment 7
Chris Samuel (NERSC)

(In reply to Michael Hinton from comment #6)
> Chris, can you try out 20.11 v1 and see if it solves the issue? I believe it
> should, but I might have missed something. Thanks!

Thanks Michael! Will get some RPMs built later this afternoon, much obliged!

Comment 8
Michael Hinton

(In reply to Chris Samuel (NERSC) from comment #0)
> So I think I have a workaround (which will go into production tomorrow) but
> I thought I should report the issue to get it looked at!

What was your workaround, btw? Did it work?

Comment 9
Chris Samuel (NERSC)

(In reply to Michael Hinton from comment #8)
> (In reply to Chris Samuel (NERSC) from comment #0)
> > So I think I have a workaround (which will go into production tomorrow) but
> > I thought I should report the issue to get it looked at!
>
> What was your workaround, btw? Did it work?

Oh sorry! No, it didn't - the workaround was in the description, and whilst it did fix the test case I got from the user, things exploded messily when I ran our ReFrame tests to confirm it didn't have wider-reaching impacts, so I had to back it out. :-(

All the best,
Chris

Comment 10
Chris Samuel (NERSC)

(In reply to Michael Hinton from comment #6)
> Chris, can you try out 20.11 v1 and see if it solves the issue? I believe it
> should, but I might have missed something. Thanks!

Looks good Michael, thank you! Binding appears good.

> srun --gpus-per-task=1 -C gpu -A nstaff_g -N 1 -n 4 --ntasks-per-node=4 -c 32 --cpu-bind=cores --gpu-bind=closest -l ./gpus_for_tasks 2>&1 | sort
> 0: mpi=0 CUDA_VISIBLE_DEVICES=3 gpu=0000:C1:00.0 cpu=14 core_affinity=0-15,64-79
> 1: mpi=1 CUDA_VISIBLE_DEVICES=2 gpu=0000:81:00.0 cpu=24 core_affinity=16-31,80-95
> 2: mpi=2 CUDA_VISIBLE_DEVICES=1 gpu=0000:41:00.0 cpu=107 core_affinity=32-47,96-111
> 3: mpi=3 CUDA_VISIBLE_DEVICES=0 gpu=0000:03:00.0 cpu=115 core_affinity=48-63,112-127

Our ReFrame tests pass:

> [2021-12-14T17:19:25-08:00] [ PASSED ] Ran 47/47 test case(s) from 33 check(s) (0 failure(s), 0 skipped)

Much obliged!

Comment 11
Michael Hinton

(In reply to Chris Samuel (NERSC) from comment #10)
> Looks good Michael, thank you! Binding appears good.
> [...]
> Our ReFrame tests pass:
> [2021-12-14T17:19:25-08:00] [ PASSED ] Ran 47/47 test case(s) from 33 check(s) (0 failure(s), 0 skipped)

Excellent! I'm glad it's working, and that we have a patch for anyone else running into this issue on 20.11. I'll go ahead and close this out.

Thanks!
-Michael

Comment 12
Chris Samuel (NERSC)

Thanks Michael! This should go on to Perlmutter next week.
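The source for the gpus_for_tasks program used in comment 10 isn't attached to this bug. As a rough stand-in for reproducing that check, a per-task script (hypothetical name check_binding.sh) that prints each task's rank, visible GPUs, and CPU affinity could look like this:

```bash
#!/bin/bash
# Hypothetical stand-in for ./gpus_for_tasks: print this task's Slurm rank,
# the GPUs Slurm exposed to it, and the CPU affinity it was bound to.
affinity=$(taskset -cp $$ | awk -F': ' '{print $2}')
echo "rank=${SLURM_PROCID:-?} CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset} affinity=${affinity}"
```

Launched with the same options as in comment 10, e.g. `srun -N 1 -n 4 --ntasks-per-node=4 -c 32 --cpu-bind=cores --gpus-per-task=1 --gpu-bind=closest -l ./check_binding.sh | sort`, each rank should report a distinct GPU and a core range local to that GPU, matching the output shown above.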