Hello,

We are running Slurm 21.08.8 on RHEL 9 with Bright Computing. We have 8 A6000 GPUs on each of 13 nodes and are trying to enable MIG. See the error below, followed by our slurm.conf and gres.conf:

```
[root@m001 ~]# nvidia-smi -i 0 -mig 1
Unable to enable MIG Mode for GPU 00000000:01:00.0: Not Supported
```

slurm.conf (excerpt):

```
NodeName=m[001-013] Procs=192 CoresPerSocket=48 RealMemory=953674 Sockets=2 ThreadsPerCore=2 Gres=gpu:A6000:8 Feature=location=local
# Partitions
PartitionName=defq Default=YES MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscrib>
# Scheduler
SchedulerType=sched/backfill
# Generic resources types
GresTypes=gpu
# Epilog/Prolog section
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
# Power saving section (disabled)
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
# GPU related plugins
AccountingStorageTRES=gres/gpu
```

```
[root@m001 ~]# nvidia-smi
Mon Feb 27 14:45:51 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:01:00.0 Off |                  Off |
| 30%   27C    P8    10W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    On   | 00000000:25:00.0 Off |                  Off |
| 30%   26C    P8     8W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000    On   | 00000000:41:00.0 Off |                  Off |
| 30%   26C    P8     6W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000    On   | 00000000:61:00.0 Off |                  Off |
| 30%   26C    P8     7W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A6000    On   | 00000000:81:00.0 Off |                  Off |
| 30%   25C    P8    15W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A6000    On   | 00000000:A1:00.0 Off |                  Off |
| 30%   25C    P8     8W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA RTX A6000    On   | 00000000:C1:00.0 Off |                  Off |
| 30%   26C    P8    10W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA RTX A6000    On   | 00000000:E1:00.0 Off |                  Off |
| 30%   26C    P8     8W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```

gres.conf:

```
AutoDetect=nvml
NodeName=m[001-013] Name=gpu Type=A6000 Count=1 File=/dev/nvidia0
NodeName=m[001-013] Name=gpu Type=A6000 Count=1 File=/dev/nvidia1
NodeName=m[001-013] Name=gpu Type=A6000 Count=1 File=/dev/nvidia2
NodeName=m[001-013] Name=gpu Type=A6000 Count=1 File=/dev/nvidia3
NodeName=m[001-013] Name=gpu Type=A6000 Count=1 File=/dev/nvidia4
NodeName=m[001-013] Name=gpu Type=A6000 Count=1 File=/dev/nvidia5
NodeName=m[001-013] Name=gpu Type=A6000 Count=1 File=/dev/nvidia6
NodeName=m[001-013] Name=gpu Type=A6000 Count=1 File=/dev/nvidia7
```
The A6000 does not support MIG. See NVIDIA's list of GPUs that support MIG here: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#supported-gpus

Despite this, Slurm does still provide a generic mechanism for sharing GPUs among multiple jobs called "sharding". You can read more about how it works and how to set it up here: https://slurm.schedmd.com/gres.html#Sharding. Please read through that thoroughly if you want to test it on your own, and let me know if you still have any questions about sharding after reading the documentation and trying it out for yourself.
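For reference, a minimal sharding setup on these nodes might look like the sketch below, following the gres.html#Sharding documentation. The shard count of 32 (4 shards per GPU) is an arbitrary example value; with AutoDetect=nvml the shards are distributed evenly across the GPUs on each node.

```
# slurm.conf -- add "shard" to the GRES types and advertise shards on the nodes
GresTypes=gpu,shard
NodeName=m[001-013] Procs=192 CoresPerSocket=48 RealMemory=953674 Sockets=2 ThreadsPerCore=2 Gres=gpu:A6000:8,shard:32

# gres.conf -- with AutoDetect=nvml, a single Count line is enough;
# 32 shards across 8 GPUs = 4 shards per GPU
AutoDetect=nvml
Name=shard Count=32
```

A job would then request a fraction of a GPU with, e.g., `sbatch --gres=shard:1 job.sh`. Note that, as the documentation describes, sharding only arbitrates scheduling; it does not isolate or limit what each job actually uses on the GPU.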
Thanks Ben. I've read through the (very short) description of sharding, but can you describe the difference between MPS and sharding? For example, if a node has 4 GPUs and 8 users request a GPU at the same time, how would the workload be distributed? How well does it handle GPU memory sharing, as well as issues like TensorFlow being a bit "greedy" and trying to use all available GPU memory?
In the case of a more heterogeneous workload such as you describe, MPS is actually preferred over sharding. Sharding does nothing to fence off each process's resources, while MPS allocates only a percentage of a GPU's resources to each job, effectively controlling what is available to each one. This is important when one process may attempt to use the entire GPU, which would starve the other jobs of resources, or vice versa. I would suggest reading https://slurm.schedmd.com/gres.html#MPS_Management, as well as the several external links included in that documentation, to better understand everything.
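To make the contrast concrete, here is a sketch of an MPS configuration following gres.html#MPS_Management. In Slurm, gres/mps is counted in percentage units of a GPU, so a Count of 800 on an 8-GPU node gives 100 units per GPU; the numbers here are illustrative, not a recommendation:

```
# slurm.conf -- advertise MPS alongside the GPUs
GresTypes=gpu,mps
NodeName=m[001-013] Procs=192 CoresPerSocket=48 RealMemory=953674 Sockets=2 ThreadsPerCore=2 Gres=gpu:A6000:8,mps:800

# gres.conf -- 800 MPS units spread over the 8 GPUs (100 per GPU)
AutoDetect=nvml
Name=mps Count=800
```

A job requesting half of one GPU's compute resources would then use something like `srun --gres=mps:50 ./my_app`, and Slurm sets CUDA_MPS_ACTIVE_THREAD_PERCENTAGE for the job accordingly. Unlike a shard request, that percentage actually constrains how much of the GPU's SMs the job's kernels can occupy.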
Hi Robert, Did my last reply help clear things up? Do you have any other questions?
For MPS, with Tensorflow I noticed in my previous place once TF grabbed a GPU it would gradually start using more GPU memory, as in it's "greedy". Does MPS have a way to deal with this?
Support for MPS within Slurm ensures that jobs are scheduled with CUDA_MPS_ACTIVE_THREAD_PERCENTAGE set to whatever you requested, as well as ensuring that jobs are only scheduled when MPS resources are available. As for the actual MPS behavior, you'll need to review NVIDIA's documentation and/or contact their support for more help. There is some interesting information in this section of NVIDIA's MPS documentation: https://docs.nvidia.com/deploy/mps/index.html#topic_3_3_5_2. It does imply some interesting things about how idle resources are used (specifically when discussing the different provisioning strategies), but again, you'll need to contact NVIDIA's support for more information on the exact behavior of MPS. As a more general note: if you're seeing issues with MPS/sharding, specifically with individual jobs frequently getting bottlenecked by GPU resources (jobs being "greedy"), it might be best for users to instead scale their applications to use entire GPUs rather than just a fraction of them with MPS.
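On the TensorFlow side specifically, one mitigation I'm aware of (outside of Slurm itself) is to stop TF from pre-allocating all GPU memory up front: TensorFlow honors the TF_FORCE_GPU_ALLOW_GROWTH environment variable, which makes it grow its allocation incrementally as needed. A hypothetical job script combining that with an MPS request might look like this (the `--gres=mps:25` value and `train.py` are placeholders, and this assumes gres/mps has been configured as in the documentation above):

```
#!/bin/bash
#SBATCH --gres=mps:25          # request 25% of one GPU's MPS resources
#SBATCH --time=04:00:00

# Keep TensorFlow from grabbing the whole GPU's memory at startup;
# it will grow its allocation only as the model actually needs it.
export TF_FORCE_GPU_ALLOW_GROWTH=true

python train.py
```

This only limits TF's initial allocation behavior; it does not hard-cap memory, so jobs that genuinely need the whole GPU's memory will still contend with their neighbors.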
Any further questions on this topic?
If users were to adjust their jobs to use an entire GPU, doesn't that defeat the purpose of MPS? We haven't configured it, but it's good to know in case we ever do. Other than that, this can be closed.
(In reply to Robert Kudyba from comment #8)
> If users were to adjust their jobs to use an entire GPU, doesn't that defeat
> the purpose of MPS? We haven't configured it, but it's good to know in case
> we ever do. Other than that, this can be closed.

Yes, exactly. In our experience, some sites have configured MPS/shards, only to have users adjust their jobs to use entire GPUs and stop using those features. If users' jobs are using entire GPUs, or can be scaled to do so, that is often preferable to using MPS/sharding. Hopefully this makes more sense now :)

Closing this now.