| Summary: | bug with requesting gpu memory | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Xing Huang <x.huang> |
| Component: | GPU | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 21.08.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | WA St. Louis | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
Description
Xing Huang
2021-10-18 10:48:06 MDT
Comment 1, Marshall Garey (2021-10-18):

This is a duplicate of bug 9229. We fixed --mem-per-gpu in 21.08. However, in Slurm versions before 21.08, --mem-per-gpu is simply broken and you should not be using it. Unfortunately, the fixes involved quite a few commits and a few different issues, so they can't easily be backported to 20.02. My recommendation is not to use --mem-per-gpu until you upgrade to 21.08. By the way, 20.02 is no longer supported, so I recommend you make a plan to upgrade to 20.11 or 21.08. Is there anything else I can help with?

Xing Huang (2021-10-18):

Marshall,

Thanks for your reply! Is there guidance for a proper upgrade? Do I need to drain all compute nodes before upgrading?

Best,
Xing
Comment 3, Marshall Garey (2021-10-18):

Here is our online guide to upgrading Slurm:
https://slurm.schedmd.com/quickstart_admin.html#upgrade

As you will find there, you do not need to drain the nodes when upgrading Slurm; you can do a "live" upgrade (with jobs running). We also often have notes about upgrading in our "Field Notes" slides from our SLUG conferences. Those can be found on our "publications" page:
https://slurm.schedmd.com/publications.html

And here's a link to our latest "Field Notes" presentation (upgrading starts on slide 21):
https://slurm.schedmd.com/SLUG21/Field_Notes_5.pdf

There are multiple ways to do an upgrade - some sites like to do "live" upgrades to make it appear as if the cluster never goes down, while other sites like to schedule a maintenance period. Some sites prefer building Slurm from source while others prefer to use RPMs. Some sites will upgrade all daemons and Slurm user commands at once while others will incrementally upgrade each one separately. If you have any specific questions about upgrading, I suggest opening a new ticket. If you have never upgraded Slurm before, then I suggest opening a ticket with us with your proposed upgrade plan.

Xing Huang (2021-10-18):

Marshall,

Thanks again for your reply! You can close the ticket now.
Best,
Xing
Marshall Garey (2021-10-18):

You're welcome! I'm closing this as a duplicate of bug 9229.

*** This ticket has been marked as a duplicate of ticket 9229 ***

Xing Huang (2021-10-27):

Hi Marshall,

This is a reopening of ticket 12683. I have upgraded Slurm from 20.02 to 21.08. However, we still have problems using --mem-per-gpu.

[xinghuang@login01 ~]$ srun -N 1 --gres=gpu:1 --mem-per-gpu=38000M --time=00:30:00 --pty bash
[xinghuang@gpu04 ~]$ nvidia-smi
Wed Oct 27 09:33:27 2021
...
| 0 Tesla V100S-PCI... Off | 00000000:3D:00.0 Off | 0 |
| N/A 29C P0 24W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |

This should have dropped me onto an A100 with 40 GB of VRAM; instead it dropped me onto a V100S with 32 GB of VRAM. What is the proper way to use --mem-per-gpu? Do I need to make some special configurations?

Best,
Xing

Comment 8, Marshall Garey (2021-10-27):

--mem-per-gpu requests *node* memory, not GPU memory. For example:

sbatch --mem-per-gpu=1000 --gpus=4 job.sh

This job will be allocated 4 GPUs and 4000 MB of memory. Slurm does not know about GPU memory. If you want a specific type of GPU, then you can define GPU types in gres.conf, then you can request that type.
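[Editor's note: the memory arithmetic Marshall describes can be sketched as below. The helper name is hypothetical; Slurm performs this multiplication internally when scheduling.]

```python
def job_node_memory_mb(mem_per_gpu_mb: int, gpu_count: int) -> int:
    """--mem-per-gpu requests host (node) RAM per allocated GPU;
    the job's total = per-GPU amount * number of GPUs.
    GPU VRAM is never consulted."""
    return mem_per_gpu_mb * gpu_count

# Marshall's example: sbatch --mem-per-gpu=1000 --gpus=4 job.sh
assert job_node_memory_mb(1000, 4) == 4000  # 4000 MB of node RAM

# Xing's srun above (--mem-per-gpu=38000M with 1 GPU) asked for 38000 MB
# of *host* RAM - which a V100S node with RealMemory=385000 easily
# satisfies, so Slurm had no reason to prefer the A100 node.
assert job_node_memory_mb(38000, 1) <= 385000
```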
On a node with 2 a100 GPUs, in gres.conf you might have this:

Name=gpu Type=a100 File=/path/to/gpu/file[0-1]

Then, request two a100 GPUs:

sbatch --gres=gpu:a100:2 job.sh

Comment 9, Xing Huang (2021-10-27):

So is there a way to request GPU VRAM via Slurm? What we are trying to achieve is to ask Slurm to give the job a GPU node with 38 GB of VRAM, and hopefully Slurm would be smart enough to figure out that a node with an A100 should be assigned to the job. Is this not possible in Slurm?

Best,
Xing
Comment 10, Marshall Garey (2021-10-27):

(In reply to Xing Huang from comment #9)
> So is there a way to request gpu vram memory via slurm? What we are trying
> to achieve is to ask slurm give the job a gpu node with vram of 38g and
> hopefully slurm would smartly enough to figure out gpu node with a100 will
> be assigned to the job. Is this not possible in slurm?

It's not possible to do it that way because Slurm doesn't know about GPU memory. However, there are workarounds that can get you close. Here are a few options:

(1) Request the GPU type:

sbatch --gres=gpu:a100:<number_of_gpus> job.sh

I showed this in my last comment - if a user wants a specific type of GPU, they should really just request that type.

(2) Use node features (https://slurm.schedmd.com/slurm.conf.html#OPT_Features and https://slurm.schedmd.com/sbatch.html#OPT_constraint). Maybe you have different types of GPUs that have the required GPU memory that could satisfy a job request, and the user doesn't want to specify a specific type of GPU. You could define node features describing what types of GPUs are available on different sets of nodes:

NodeName=n[1-5] Features=gpu_32gb
NodeName=n[6-10] Features=gpu_64gb
NodeName=n[11-15] Features=gpu_40gb

Then you could request a node with at least 40 GB of GPU memory:

sbatch --constraint="gpu_40gb|gpu_64gb" --gpus=2 job.sh

You may want to define your node features differently, but I hope this example gives you an idea of how you can use node features.

(3) Make GPU memory another GRES. You can define anything as a GRES in gres.conf. You could define "gpumemory" as a GRES, and then a job could request --gres=gpumemory:40G.
Example:

# gres.conf on nodes n[1-5]
Name=gpu Type=a100 File=/path/to/file
Name=gpumemory Count=42949672960

# gres.conf on nodes n[6-10]
Name=gpu Type=v100s File=/path/to/file
Name=gpumemory Count=34359738368

# slurm.conf
GresTypes=gpu,gpumemory
NodeName=n[1-5] Gres=gpu:a100:1,gpumemory:42949672960
NodeName=n[6-10] Gres=gpu:v100s:1,gpumemory:34359738368

$ srun --gres=gpumemory:40G --gpus=1 printenv SLURMD_NODENAME
n1
$ srun --gres=gpumemory:32G --gpus=1 printenv SLURMD_NODENAME
n1

Notice that I was allocated node n1 both times. That's because --gres=gpumemory:32G will give me a node with *at least* that much available GRES. But if nodes n[1-5] are all busy, then I still get allocated an available node:

$ srun --gres=gpumemory:32G --gpus=1 printenv SLURMD_NODENAME
n6

WARNING: Slurm does NOT know about GPU memory. So even if you define a "gpumemory" GRES, Slurm still doesn't know anything about GPU memory. Also, a user does not have to request the "gpumemory" GRES to be allocated a GPU; if you want to enforce that, then you'll need to do it with a job_submit.lua script. Also, you will need to define enough "gpumemory" to cover all the GPUs on the node. Also, if a single node has different types of GPUs, then you are not guaranteed to be given the GPU that you want by requesting "gpumemory" - you will have to request the specific GPU type (such as a100 or v100s).

For these reasons, I recommend trying options (1) and (2) first. Users may not understand all the "gotchas" of option (3). But it's up to you.

Xing Huang (2021-10-27):

Thanks for your comment! We have quite a complicated case. Would the definitions I made in gres.conf and slurm.conf below work? vmem is a GRES that defines GPU memory in MB. Looking forward to your help!
####### Define Gres in gres.conf #######
NodeName=gpu01 Name=gpu Count=4 File=/dev/nvidia[0-3] Type=tesla_a100
NodeName=gpu02 Name=gpu Count=4 File=/dev/nvidia[0-3] Type=tesla_v100S
NodeName=gpu03 Name=gpu Count=2 File=/dev/nvidia[0-1] Type=tesla_v100S
NodeName=gpu[04-05] Name=gpu Count=2 File=/dev/nvidia[0-1] Type=tesla_v100S
NodeName=gpu06 Name=gpu Count=4 File=/dev/nvidia[0-3] Type=tesla_v100
NodeName=gpu07 Name=gpu Count=3 File=/dev/nvidia[0-2] Type=tesla_v100
NodeName=gpu08 Name=gpu Count=2 File=/dev/nvidia[0-1] Type=tesla_t4
NodeName=gpu01 Name=vmem Count=40536
NodeName=gpu02 Name=vmem Count=32510
NodeName=gpu03 Name=vmem Count=32510
NodeName=gpu[04-05] Name=vmem Count=32510
NodeName=gpu06 Name=vmem Count=32510
NodeName=gpu07 Name=vmem Count=32510
NodeName=gpu08 Name=vmem Count=15109

####### Define Gres in slurm.conf #######
GresTypes=gpu,vmem
NodeName=gpu01 CoresPerSocket=16 RealMemory=385000 Sockets=2 Weight=1000 State=UNKNOWN Gres=gpu:tesla_a100:4,vmem:40536
NodeName=gpu02 CoresPerSocket=16 RealMemory=770000 Sockets=2 Weight=900 State=UNKNOWN Gres=gpu:tesla_v100S:4,vmem:32510
NodeName=gpu03 CoresPerSocket=16 RealMemory=770000 Sockets=2 Weight=900 State=UNKNOWN Gres=gpu:tesla_v100S:2,vmem:32510
NodeName=gpu[04-05] CoresPerSocket=16 RealMemory=385000 Sockets=2 Weight=800 State=UNKNOWN Gres=gpu:tesla_v100S:2,vmem:32510
NodeName=gpu06 CoresPerSocket=12 RealMemory=385000 Sockets=2 Weight=700 State=UNKNOWN Gres=gpu:tesla_v100:4,vmem:32510
NodeName=gpu07 CoresPerSocket=12 RealMemory=385000 Sockets=2 Weight=700 State=UNKNOWN Gres=gpu:tesla_v100:3,vmem:32510
NodeName=gpu08 CoresPerSocket=12 RealMemory=385000 Sockets=2 Weight=600 State=UNKNOWN Gres=gpu:tesla_t4:2,vmem:15109

Best,
Xing
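[Editor's note: one unit-handling detail worth flagging. Per the gres.conf documentation, a GRES Count (and a job's GRES request) may carry a K/M/G/T/P suffix, each a binary multiple of 1024 - which is why Marshall's byte counts pair up with requests like --gres=gpumemory:40G. A sketch of that expansion, with a hypothetical helper name:]

```python
# Slurm-style GRES count suffixes: K/M/G/T/P multiply by powers of 1024
# (per the gres.conf man page). Helper name is illustrative only.
_SUFFIX = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}

def parse_gres_count(spec: str) -> int:
    spec = spec.strip()
    if spec and spec[-1].upper() in _SUFFIX:
        return int(spec[:-1]) * _SUFFIX[spec[-1].upper()]
    return int(spec)

# Marshall's byte counts line up with the suffixed requests:
assert parse_gres_count("40G") == 42949672960  # a100 nodes' gpumemory
assert parse_gres_count("32G") == 34359738368  # v100s nodes' gpumemory
```

Note that Xing's vmem counts are bare MB figures (e.g. Count=32510), so users would have to request --gres=vmem:32510 rather than --gres=vmem:32G; a suffixed request would expand to a byte-scale number that no node's vmem count could ever satisfy.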
Comment 12, Marshall Garey (2021-10-28):

Is the value you set for vmem the total GPU memory for *all* GPUs on the node, or just the amount of GPU memory for one GPU? If you want users to request the total amount of GPU memory that they want, then you should define it as the total amount of GPU memory on the node, not the amount for a single GPU.

I just want to repeat my warning from my last comment: Slurm does NOT know about GPU memory. So even if you define a "gpumemory" GRES, Slurm still doesn't know anything about GPU memory. Also, a user does not have to request the "gpumemory" GRES to be allocated a GPU; if you want to enforce that, then you'll need to do it with a job_submit.lua script. Also, you will need to define enough "gpumemory" to cover all the GPUs on the node. Also, if a single node has different types of GPUs, then you are not guaranteed to be given the GPU that you want by requesting "gpumemory" - you will have to request the specific GPU type (such as a100 or v100s).

Another warning: outside of MPS (https://slurm.schedmd.com/gres.html#MPS_Management) and MIG (https://slurm.schedmd.com/gres.html#MIG_Management), GPUs can NOT be shared. And even with MPS or MIG, it doesn't really make sense to have a user request a subset of the GPU's memory.

What is your use case for wanting to request the exact amount of GPU memory? Is there a reason that you don't want to try the first and second recommendations in my last comment? If a user wants a specific type of GPU then they should request that type. If they are okay with some types of GPUs but not others, then node features work really well.

Xing Huang (2021-10-28):

Marshall,

The reason is that we are already using a Lua script to handle jobs, and method 3 is much simpler for us compared to method 2. I just did a test run, and my way of implementing your 3rd suggestion worked. Thanks for your advice, and you can close the ticket now.
Best,
Xing
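[Editor's note: the enforcement Marshall mentions - rejecting GPU jobs that omit the vmem GRES - would be implemented in job_submit.lua; the core check is sketched here in Python for illustration, with hypothetical names throughout:]

```python
def validate_gres(gres: str):
    """Mirror of the check a job_submit.lua script could perform:
    a job that requests a gpu GRES must also request the vmem GRES.
    Returns an error message, or None if the request passes."""
    names = {item.split(":")[0] for item in gres.split(",") if item}
    if "gpu" in names and "vmem" not in names:
        return "jobs requesting GPUs must also request --gres=vmem:<MB>"
    return None

assert validate_gres("gpu:tesla_a100:1") is not None  # rejected
assert validate_gres("gpu:1,vmem:32510") is None      # accepted
assert validate_gres("") is None                      # non-GPU jobs unaffected
```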
Comment 14, Marshall Garey (2021-10-28):

Okay, just be really careful with this. I really want to stress my warnings about this method. Another potential problem that I thought of: what if a user requests 16 GB of vram with the new GRES, but then gets allocated a GPU with 32 GB of vram? If another user then requests 32 GB of vram, they won't be able to be allocated the other GPU on that node because the vram GRES isn't available on that node.

You can also enforce node features with the job_submit/lua plugin, so I don't see why already using a job_submit/lua plugin locks you out of that method. In short, we definitely recommend using node features first, but it's up to you. I'm afraid that users will think they can request a subset of GPU memory when they actually can't. In addition, if you open a ticket on this later, another support engineer might be confused about why you aren't using node features instead. For now I'll close this as infogiven.

Xing Huang (2021-10-28):

Marshall,

This is a very good warning. We will definitely watch for this. Currently, we are using node weights (priority) to deal with the situation you are mentioning: a GPU node with higher memory is less likely to be allocated to a user than one with lower memory. This is overridden if a user specifically requests a GPU with higher memory.
Best,
Xing

Marshall Garey (2021-10-28):

Sounds good. Closing this now.
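[Editor's note: the stranding scenario Marshall warns about in comment 14 can be made concrete with a toy accounting model - not Slurm code. The vmem GRES is a per-node counter decoupled from physical GPUs; numbers below come from gpu03 in Xing's config, which defines only one GPU's worth of vmem (32510 MB) for a node with two 32 GB V100S cards:]

```python
# Toy per-node accounting: "vmem" is one counter for the whole node,
# not tied to any physical GPU (illustration only, not Slurm internals).
node = {"gpus_free": 2, "vmem_free": 32510}

def try_alloc(node, gpus, vmem):
    # An allocation succeeds only if both counters can cover the request.
    if node["gpus_free"] >= gpus and node["vmem_free"] >= vmem:
        node["gpus_free"] -= gpus
        node["vmem_free"] -= vmem
        return True
    return False

assert try_alloc(node, gpus=1, vmem=16384)      # job A: asks for "16 GB" of a 32 GB GPU
assert not try_alloc(node, gpus=1, vmem=32510)  # job B: blocked on vmem...
assert node["gpus_free"] == 1                   # ...even though a whole GPU is idle
```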