Ticket 12251

Summary:	GRES GPU details
Product:	Slurm	Reporter:	Torkil Svensgaard <torkil>
Component:	Documentation	Assignee:	Marcin Stolarek <cinek>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	4 - Minor Issue
Priority:	---	CC:	bas.vandervlies, cinek, rkv
Version:	20.11.7
Hardware:	Linux
OS:	Linux
See Also:	https://bugs.schedmd.com/show_bug.cgi?id=9567
Site:	DRCMR	Slinky Site:	---

Description Torkil Svensgaard 2021-08-11 01:57:02 MDT

Hi

We have some different types of GPUs on different compute nodes and are using Autodetect=nvml.

How do I as a user see which GPUs are available and what their capabilities are, like RAM?

And how do I target GPUs with more than 6GB RAM as an example?

Thanks,

Torkil

Comment 1 Marcin Stolarek 2021-08-11 02:25:49 MDT

Torkil,

>How do I as a user see which GPUs are available and what their capabilities are, like RAM?
In our experience, this is most often part of the site documentation, together with the partition names and the resources available in each. Users can check the utilization of specific partitions/nodes using Slurm commands, but to learn more about the GPU (or CPU/platform) architecture they need an external source (think of CPU topology details, RAM frequency, etc.).

>And how do I target GPUs with more than 6GB RAM as an example?
If you don't mix different GPU types on the same node, one approach is to use NodeFeatures to indicate the memory available per GPU on the node. End users can then specify those like:
sbatch --gres=gpu:1 -C '[GPU_RAM_6GB|GPU_RAM_12GB]' 
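
To request GPUs with strictly more than a given amount of RAM, the constraint has to list only the tags above that threshold. A minimal sketch of building such an expression follows. It assumes the site has defined feature tags named GPU_RAM_<N>GB; the tag names, the helper name gpu_ram_constraint, the node line in the comment, and job.sh are illustrative assumptions, not something Slurm provides.

```shell
# Assumed slurm.conf node line (hypothetical feature name, set by the admin):
#   NodeName=bigger2 Gres=gpu:1 Features=GPU_RAM_6GB

# Build a --constraint expression from feature tags of the form GPU_RAM_<N>GB,
# keeping only tags whose <N> is strictly greater than the requested minimum.
gpu_ram_constraint() {
    min_gb=$1
    shift
    expr_body=""
    for tag in "$@"; do
        gb=${tag#GPU_RAM_}      # strip the prefix, e.g. "12GB"
        gb=${gb%GB}             # strip the suffix, e.g. "12"
        if [ "$gb" -gt "$min_gb" ]; then
            expr_body="${expr_body:+$expr_body|}$tag"
        fi
    done
    [ -n "$expr_body" ] || return 1   # no tag satisfies the minimum
    printf '[%s]\n' "$expr_body"
}

# Prints the expression for "more than 6 GB" given three hypothetical tags.
gpu_ram_constraint 6 GPU_RAM_6GB GPU_RAM_12GB GPU_RAM_24GB

# Intended use on a cluster (not run here):
#   sbatch --gres=gpu:1 -C "$(gpu_ram_constraint 6 GPU_RAM_6GB GPU_RAM_12GB GPU_RAM_24GB)" job.sh
```

The list of tags still has to come from the site, since Slurm does not know what the tag names mean.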

Let me know if that helps.

cheers,
Marcin

Comment 2 Torkil Svensgaard 2021-08-11 02:46:44 MDT

Hi Marcin

Thanks. Too bad this information isn't available through sinfo but alas.

Feel free to close the ticket.

Mvh.

Torkil

Comment 3 Bas van der Vlies 2021-08-11 05:14:37 MDT

Torkil,

 sinfo -N --format="%N %30f"

Regards 

Bas

Comment 4 Torkil Svensgaard 2021-08-11 05:40:19 MDT

(In reply to Bas van der Vlies from comment #3)
> Torkil,
> 
>  sinfo -N --format="%N %30f"

torkil@averell:/tmp$ sinfo -N --format="%N %30f"
NODELIST AVAIL_FEATURES                
big20 (null)                        
big21 (null)                        
big22 (null)                        
big27 (null)                        
big28 (null)                        
bigger2 (null)                        
bigger3 (null)                        
bigger4 (null)                        
bigger6 (null)                        
bigger7 (null)                        
bigger9 (null)                        
bigger10 (null)                        
bigger11 (null)                        
bigger12 (null)                        
bigger13 (null)                        
chimera (null)                        
drakkisath (null)                        
fenrir (null)                        
gojira (null)                        
ix1 (null)                        
kong (null)                        
rivendare (null)                        
small1 (null)                        
small2 (null)                        
small19 (null)                        
small24 (null)                        
small25 (null)                        
small29 (null)                        
small30 (null)                        
small31 (null)                        
small32 (null)                        
small33 (null)                        
small34 (null)                        
small35 (null)                        
smaug (null)                        

I guess that would list NodeFeatures, if I had any of those?
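
Once features are defined, a listing in this format can be filtered per feature. The sketch below assumes features appear comma-separated in the second column; the node names and feature tags in the sample input are hypothetical, and nodes_with_feature is an illustrative helper, not a Slurm command.

```shell
# List nodes whose AVAIL_FEATURES column contains the given feature tag.
# Reads `sinfo -N --format="%N %30f"` style output on stdin.
nodes_with_feature() {
    awk -v want="$1" '
        NR == 1 { next }                    # skip the NODELIST header line
        {
            n = split($2, feats, ",")       # features are comma-separated
            for (i = 1; i <= n; i++)
                if (feats[i] == want) { print $1; break }
        }'
}

# Hypothetical sample of what the listing might look like with features set.
printf '%s\n' \
    'NODELIST AVAIL_FEATURES' \
    'node01 (null)' \
    'node02 GPU_RAM_6GB' \
    'node03 GPU_RAM_24GB,bigmem' |
    nodes_with_feature GPU_RAM_24GB

# Against a live cluster (not run here):
#   sinfo -N --format="%N %30f" | nodes_with_feature GPU_RAM_24GB
```

With -N, a node that belongs to several partitions may be listed more than once, so piping the result through sort -u may be useful.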

Comment 5 Torkil Svensgaard 2021-08-11 05:52:03 MDT

You have this:

"
sinfo -N --format="%N %G"
NODELIST GRES
big20 (null)
bigger2 gpu:1(S:0)
chimera gpu:2(S:0)
"

Had hoped for something like this:

NODELIST GRES
big20 (null)
bigger2 (gpu1 RTX2060 6GB)
chimera (gpu1 RTX2060 6GB) (gpu2 RTX3090 24GB)

That would be cool *wink* *wink*

Mvh.

Torkil

Comment 6 Marcin Stolarek 2021-08-11 05:53:56 MDT

>Feel free to close the ticket.
OK. I'll check with the team if there are any plans/other requests to make some of those details available over Slurm cli.

Comment 7 Torkil Svensgaard 2021-08-11 06:01:00 MDT

(In reply to Marcin Stolarek from comment #6)
> >Feel free to close the ticket.
> OK. I'll check with the team if there are any plans/other requests to make
> some of those details available over Slurm cli.

Thanks.

I could write it in my site documentation, sure, but it would be more user-friendly and less work for the poor overworked sysadmins if the users could just go:

sinfo -N --format="%N %G" | grep gpu
bigger2 gpu:1(S:0) (RTX2060/6GB)
chimera gpu:2(S:0) (RTX2060/6GB, RTX3090/24GB)

And easily decide where to put jobs with special requirements, like GPU memory.
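
Until something like that exists in sinfo, a site could approximate it with a small wrapper. The following is only a sketch: it assumes a site-maintained lookup file with one "<node> <description>" line per GPU node, and annotate_gres is a hypothetical helper name. The sample data is taken from the wished-for output above, not from real hardware detection.

```shell
# Annotate `sinfo -N --format="%N %G"` output (read on stdin) with GPU details
# from a lookup file given as the first argument.
annotate_gres() {
    lookup_file=$1
    awk '
        NR == FNR {                          # first input: the lookup table
            node = $1
            sub(/^[^ ]+ +/, "")              # drop the node name, keep the rest
            info[node] = $0
            next
        }
        FNR == 1 { next }                    # skip the sinfo header line
        $2 ~ /^gpu:/ {                       # keep only nodes with a GPU GRES
            if ($1 in info) print $1, $2, "(" info[$1] ")"
            else            print $1, $2
        }' "$lookup_file" -
}

# Hypothetical lookup file, maintained by the site.
lookup=$(mktemp)
printf '%s\n' 'bigger2 RTX2060/6GB' 'chimera RTX2060/6GB, RTX3090/24GB' > "$lookup"

# Sample sinfo output; on a cluster this would be: sinfo -N --format="%N %G"
printf '%s\n' 'NODELIST GRES' 'big20 (null)' 'bigger2 gpu:1(S:0)' 'chimera gpu:2(S:0)' |
    annotate_gres "$lookup"
rm -f "$lookup"
```

This keeps the hardware details in one file instead of prose documentation, but the file still has to be kept in sync with the nodes by hand.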

Mvh.

Torkil

Comment 8 Marcin Stolarek 2021-08-11 09:39:10 MDT

In fact, there is a new development for Slurm 21.08 in Bug 9567 introducing the node_features/helpers plugin, which may be used to set features more automatically.

The bug is public, so you can check the details there.
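
As a rough illustration of what "more automatically" could mean for GPU memory, the sketch below derives a feature tag from a memory size in MiB. The tag naming scheme and the function name mib_to_feature are assumptions. How such a tag would be handed to the plugin is not shown here and should be taken from the plugin's own documentation.

```shell
# Convert a memory size in MiB to a feature tag, rounding to the nearest GiB.
mib_to_feature() {
    printf 'GPU_RAM_%dGB\n' $(( ($1 + 512) / 1024 ))
}

# Sample values only; these are not measurements from the nodes in this ticket.
mib_to_feature 6144
mib_to_feature 24268

# On a GPU node (not run here), the input could come from:
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
```

Rounding to the nearest GiB is a design choice so that cards reporting slightly less than their nominal size still map to the expected tag.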

I'm closing this bug report now. If you have any questions, please reopen it.

cheers,
Marcin