| Summary: | Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2060`. Setting system GRES type to NULL | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Torkil Svensgaard <torkil> |
| Component: | Build System and Packaging | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | cinek, rkv |
| Version: | 20.11.0 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | DRCMR | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Log from rpmbuild, gres.conf, slurm.conf, slurmctld.log, slurmd.log, debugging patch(v1), debugging patch(v2) | | |
Solved that, but now looking at this:

"
[2020-11-09T08:20:07.501] Message aggregation disabled
[2020-11-09T08:20:08.041] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2020-11-09T08:20:08.106] 1 GPU system device(s) detected
[2020-11-09T08:20:08.106] WARNING: The following autodetected GPUs are being ignored:
[2020-11-09T08:20:08.106] GRES[gpu] Type:geforce_rtx_2060 Count:1 Cores(48):0-47 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
"

What seems to be the problem?

Mvh. Torkil

Added Gres=gpu:1 to slurm.conf and am now getting this:

"
[2020-11-09T08:37:16.139] Message aggregation disabled
[2020-11-09T08:37:16.658] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2020-11-09T08:37:16.727] 1 GPU system device(s) detected
[2020-11-09T08:37:16.727] Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2060`. Setting system GRES type to NULL
[2020-11-09T08:37:16.727] error: Discarding the following config-only GPU due to lack of File specification:
[2020-11-09T08:37:16.727] error: GRES[gpu] Type:(null) Count:1 Cores(48):(null) Links:(null) Flags: File:(null)
[2020-11-09T08:37:34.953] Message aggregation disabled
[2020-11-09T08:37:35.472] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2020-11-09T08:37:35.541] 1 GPU system device(s) detected
[2020-11-09T08:37:35.541] Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2060`. Setting system GRES type to NULL
"

Mvh. Torkil

Torkil,

Could you please add a type to the device definition in gres.conf, like:
>AutoDetect=nvml
>Name=gpu Type=geforce
and add the type to slurm.conf, like:
>Gres=gpu:geforce:1
and check whether that gets you going?

cheers,
Marcin

Hi Marcin

Thanks, that seemed to work:

"
[2020-11-09T11:01:52.936] debug: init: Gres GPU plugin loaded
[2020-11-09T11:01:52.936] debug: Gres GPU plugin: Resetting gres_devices
[2020-11-09T11:01:53.442] debug: Systems Graphics Driver Version: 450.80.02
[2020-11-09T11:01:53.442] debug: NVML Library Version: 11.450.80.02
[2020-11-09T11:01:53.450] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2020-11-09T11:01:53.518] 1 GPU system device(s) detected
[2020-11-09T11:01:53.518] debug: Gres GPU plugin: Normalizing gres.conf with system GPUs
[2020-11-09T11:01:53.518] debug: Including the following GPU matched between system and configuration:
[2020-11-09T11:01:53.518] debug: GRES[gpu] Type:geforce Count:1 Cores(48):0-47 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2020-11-09T11:01:53.518] debug: Gres GPU plugin: Final normalized gres.conf list:
[2020-11-09T11:01:53.518] debug: GRES[gpu] Type:geforce Count:1 Cores(48):0-47 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2020-11-09T11:01:53.518] Gres Name=gpu Type=geforce Count=1
[2020-11-09T11:01:53.521] debug: AcctGatherEnergy NONE plugin loaded
[2020-11-09T11:01:53.521] debug: AcctGatherProfile NONE plugin loaded
[2020-11-09T11:01:53.521] debug: AcctGatherInterconnect NONE plugin loaded
[2020-11-09T11:01:53.521] debug: AcctGatherFilesystem NONE plugin loaded
"

"
# sinfo -o "%20N %10c %10m %25f %10G "
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES
bigger9              48         257552     (null)                    gpu:1(S:0)
"

It was my impression that "Autodetect=nvml" should be sufficient in the gres.conf though?

Mvh. Torkil

>It was my impression that "Autodetect=nvml" should be sufficient in the gres.conf though?
I'm looking into the code to check how we can better handle it. I agree that `Gres=gpu:1` in slurm.conf should be enough according to our docs, so either documentation or gpu/nvml should be improved.
Are you OK with a severity decrease to 4 for the time being?
cheers,
Marcin
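For readers hitting the same message: the log line above describes a substring match between the configured GRES type and the NVML-autodetected device name. A minimal sketch of that rule (a simplified illustration only, not Slurm's actual implementation) shows why a near-miss spelling fails:

```python
# Simplified illustration of the matching rule the slurmd log describes:
# a configured GRES type is only kept if it is a substring of the
# NVML-autodetected device name. This is NOT Slurm's real code, just a
# sketch of the documented behavior.
def type_matches(configured_type: str, system_device: str) -> bool:
    return configured_type.lower() in system_device.lower()

device = "geforce_rtx_2060"  # name reported by NVML in this ticket

print(type_matches("geforce", device))           # True: substring, type kept
print(type_matches("geforce_rtx_2060", device))  # True: exact name also works
print(type_matches("gforce_rtx_2060", device))   # False: missing 'e', type set to NULL
```

Note in particular that `gforce_rtx_2060` (as spelled later in this thread) is not a substring of `geforce_rtx_2060`, so under this rule it would be rejected just like `tesla`.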
(In reply to Marcin Stolarek from comment #5)
> >It was my impression that "Autodetect=nvml" should be sufficient in the gres.conf though?
>
> I'm looking into the code to check how we can better handle it. I agree that
> `Gres=gpu:1` in slurm.conf should be enough according to our docs, so either
> documentation or gpu/nvml should be improved.

Roger. There's also the "error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported", which seems to be because we have an older GPU architecture.

> Are you OK with a severity decrease to 4 for the time being?

Yep, that's fine.

Mvh. Torkil

>There's also the "error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported" which seems to be because we have an older GPU architecture.
I've noticed it, and you're reading it correctly - it's just not supported in your configuration and the error is not critical. I'll check whether we should decrease the log level for the "NVML_ERROR_NOT_SUPPORTED" errno.
I'll keep you posted.
cheers,
Marcin
A somewhat related question, if you don't mind. I'm running configless, which is pretty cool, but this also means that gres.conf is copied to all nodes. Is there some way to have different flavours of gres.conf so only nodes with a GPU get the "Autodetect=nvml" parameter, or how do I solve that?

Thanks, Torkil

>Is there some way to have different flavours of the gres.conf so only nodes with a GPU get the "Autodetect=nvml" parameter, or how do I solve that?

Generally we prefer those (not really related) questions to be separate tickets, to keep the history clear. That said, it's something addressed by 2b3d8a168b3[1], which will be released in Slurm 20.11.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/2b3d8a168b36611d6112c8f2056514bf3bfa4e47

(In reply to Marcin Stolarek from comment #9)
> >Is there some way to have different flavours of the gres.conf so only nodes with a GPU get the "Autodetect=nvml" parameter, or how do I solve that?
> Generally we prefer those (not really related) questions to be separate
> tickets, to keep the history clear. That said, it's something addressed by
> 2b3d8a168b3[1], which will be released in Slurm 20.11.

Appreciated, thanks. Is 20.11 due this month?

Mvh. Torkil

>Appreciated, thanks. Is 20.11 due this month?

Yes, slurm-20.11rc1 (a release candidate) has already been tagged and can be downloaded from our web page[1]. Please keep in mind that it's rc1, not even a .0 release.

cheers,
Marcin

[1]https://www.schedmd.com/downloads.php

Torkil,

I can't reproduce the behavior you described in comment 0 using another NVML GPU. Are you able to reproduce it again and share your slurm.conf, gres.conf and debug3-level slurmd logs?

cheers,
Marcin

(In reply to Marcin Stolarek from comment #15)
> I can't reproduce the behavior you described in comment 0 using another NVML
> GPU. Are you able to reproduce it again and share your slurm.conf,
> gres.conf and debug3-level slurmd logs?
My setup is currently broken (ticket 10193) but I'll try to reproduce as soon as I am able. I am pretty sure I was trying to make the configuration as minimal as possible, so I only had "Autodetect=nvml" in gres.conf and no gres configuration for the node in slurm.conf.

How do I get debug3-level logs?

Mvh. Torkil

Hi

I'm now on 20.11 and seeing this:

"
[2020-12-01T10:53:08.072] got reconfigure request
[2020-12-01T10:53:08.073] all threads complete
[2020-12-01T10:53:08.073] debug: Reading slurm.conf file: /var/spool/slurm/d/conf-cache/slurm.conf
[2020-12-01T10:53:08.073] debug: NodeNames=slurmhpc1 setting Sockets=Boards(1)
[2020-12-01T10:53:08.073] debug: Reading cgroup.conf file /var/spool/slurm/d/conf-cache/cgroup.conf
[2020-12-01T10:53:08.073] debug: Log file re-opened
[2020-12-01T10:53:08.074] debug: CPUs:48 Boards:1 Sockets:1 CoresPerSocket:24 ThreadsPerCore:2
[2020-12-01T10:53:08.074] debug: Resource spec: No specialized cores configured by default on this node
[2020-12-01T10:53:08.074] debug: Resource spec: Reserved system memory limit not configured for this node
[2020-12-01T10:53:08.075] debug: gres/gpu: init: loaded
[2020-12-01T10:53:08.075] debug: gres/gpu: node_config_load: Gres GPU plugin: Resetting gres_devices
[2020-12-01T10:53:08.579] debug: gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 455.45.01
[2020-12-01T10:53:08.579] debug: gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 11.455.45.01
[2020-12-01T10:53:08.595] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2020-12-01T10:53:08.659] gpu/nvml: _get_system_gpu_list_nvml: 1 GPU system device(s) detected
[2020-12-01T10:53:08.659] debug: Gres GPU plugin: Normalizing gres.conf with system GPUs
[2020-12-01T10:53:08.659] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2060`. Setting system GRES type to NULL
[2020-12-01T10:53:08.659] debug: gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2020-12-01T10:53:08.659] debug: GRES[gpu] Type:(null) Count:1 Cores(48):0-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2020-12-01T10:53:08.659] debug: Gres GPU plugin: Final normalized gres.conf list:
[2020-12-01T10:53:08.659] debug: GRES[gpu] Type:(null) Count:1 Cores(48):0-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2020-12-01T10:53:08.659] Gres Name=gpu Type=(null) Count=1
[2020-12-01T10:53:08.668] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2020-12-01T10:53:08.668] debug: acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2020-12-01T10:53:08.668] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2020-12-01T10:53:08.668] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
"

From slurm.conf:

"
NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=257552 Gres=gpu:1
"

From gres.conf:

"
NodeName=bigger9 AutoDetect=nvml
"

Changing gres.conf to this makes it possible to use the gres for a while:

"
NodeName=bigger9 Type=gforce_rtx_2060 AutoDetect=nvml
"

But eventually this happens:

"
[2020-12-01T12:53:28.912] error: Setting node bigger9 state to DRAIN with reason:gres/gpu count reported lower than configured (0 < 1)
[2020-12-01T12:53:28.912] drain_nodes: node bigger9 state set to DRAIN
[2020-12-01T12:53:28.912] error: _slurm_rpc_node_registration node=bigger9: Invalid argument
"

Torkil,

Did you restart all slurmd and slurmctld daemons, or just execute `scontrol reconfigure`?

cheers,
Marcin

(In reply to Marcin Stolarek from comment #21)
> Did you restart all slurmd and slurmctld daemons or just executed `scontrol
> reconfigure`?

I restarted everything and even rebooted the boxes.

Could we perhaps raise the priority on this, since the box going into drain is a bit of a showstopper?

Mvh. Torkil

Do you have a GRES type set in slurm.conf too? (Like Gres=gpu:gforce_rtx_2060:1 in the NodeName=... line)

Please share your current slurm.conf, gres.conf, slurmctld and slurmd logs.

cheers,
Marcin

Created attachment 16914 [details]
gres.conf
Created attachment 16915 [details]
slurm.conf
Created attachment 16916 [details]
slurmctld.log
Created attachment 16917 [details]
slurmd.log
(In reply to Marcin Stolarek from comment #23)
> Do you have a GRES type set in slurm.conf too? (Like
> Gres=gpu:gforce_rtx_2060:1 in the NodeName=... line)

Currently I have this:

"
NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=257552 Gres=gpu:1
"

In gres.conf either of these seems to work, but the node ends up draining:

"
NodeName=bigger9 Type=gforce_rtx_2060 AutoDetect=nvml
NodeName=bigger9 AutoDetect=nvml
"

Mvh. Torkil

Torkil,

I'm looking into the details now. Did you try my suggestion from comment 23:
>Do you have a GRES type set in slurm.conf too? (Like Gres=gpu:gforce_rtx_2060:1 in the NodeName=... line)

cheers,
Marcin

(In reply to Marcin Stolarek from comment #29)
> I'm looking into the details now. Did you try my suggestion from comment 23:
> >Do you have a GRES type set in slurm.conf too? (Like Gres=gpu:gforce_rtx_2060:1 in the NodeName=... line)

Yes, if I do:

"
NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=257552 Gres=gpu:gforce_rtx_2060:1
"

it fails with this error:

"
[2020-12-02T13:51:36.143] error: _slurm_rpc_node_registration node=bigger9: Invalid argument
"

and the node goes drain instantly.

Mvh. Torkil

Just to be 100% sure that we're on the same page: you have Type=gforce_rtx_2060 in gres.conf now, and all daemons were restarted (and nodes eventually resumed) after the configuration changes?

cheers,
Marcin

(In reply to Marcin Stolarek from comment #31)
> Just to be 100% sure that we're on the same page you have
> Type=gforce_rtx_2060 in gres.conf now and all daemons were restarted
> (eventually nodes were resumed) after configuration changes?

I have the gforce_rtx_2060 bit in both slurm.conf and gres.conf now and still:

"
[2020-12-02T13:55:51.586] error: _slurm_rpc_node_registration node=bigger9: Invalid argument
"

Mvh. Torkil

If you take a look a few lines above the drain_nodes entry in your slurmctld.log you'll notice:
>[2020-11-03T22:58:33.856] error: Node bigger9 has low socket*core*thread count (1 < 48)
>[2020-11-03T22:58:33.856] error: Node bigger9 has low cpu count (1 < 48)
>[2020-11-03T22:58:33.856] error: Setting node bigger9 state to DRAIN

Looking at the slurmd log we see:
>[2020-10-29T08:37:01.372] CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=257552 TmpDisk=226773 Uptime=398 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

Is it a one-CPU node (a VM?)? If that's the case you should adjust its configuration in slurm.conf; currently it's:
>NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=257552 Gres=gpu:1

cheers,
Marcin

(In reply to Marcin Stolarek from comment #33)
> If you'll take a look at the few lines above the drain_nodes in your
> slurmctld.log you'll notice:
> >[2020-11-03T22:58:33.856] error: Node bigger9 has low socket*core*thread count (1 < 48)
> >[2020-11-03T22:58:33.856] error: Node bigger9 has low cpu count (1 < 48)
> >[2020-11-03T22:58:33.856] error: Setting node bigger9 state to DRAIN

That log line is from 3/11 though; that was sorted back then.

> Looking to slurmd log we see:
> >[2020-10-29T08:37:01.372] CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=257552 TmpDisk=226773 Uptime=398 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
>
> Is it one CPU node (VM?)? If that's the case you should adjust its
> configuration in slurm.conf, currently it's:
> >NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=257552 Gres=gpu:1

The node has a 48-core EPYC CPU.

This line in slurm.conf:

"
NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=257552 Gres=gpu:gforce_rtx_2060:1
"

gives this error:

"
[2020-12-03T07:18:27.350] error: _slurm_rpc_node_registration node=bigger9: Invalid argument
"

This line in slurm.conf:

"
NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=257552 Gres=gpu:1
"

works for a bit and then goes DRAIN due to:

"
error: Setting node bigger9 state to DRAIN with reason:gres/gpu count reported lower than configured (0 < 1)
"

Both daemons were restarted and "scontrol reconfigure" was run in each case.

Mvh. Torkil

Created attachment 16945 [details]
debugging patch(v1)
Sorry for missing the dates.
Are you able to apply the attached debugging patch and share the slurmd-side logs with it?
cheers,
Marcin
Created attachment 16946 [details]
debugging patch(v2)
I've pressed enter too early - please ignore the previous attachment.
cheers,
Marcin
(In reply to Torkil Svensgaard from comment #34)
> This line in slurm.conf:
>
> "
> NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24
> ThreadsPerCore=2 RealMemory=257552 Gres=gpu:1
> "

Actually, it has been running all morning with this version of the configuration without going into drain. gres.conf is:

"
NodeName=bigger9 Type=gforce_rtx_2060 AutoDetect=nvml
"

I'm pretty sure I also had that combination yesterday, but I might be mistaken. I've also changed the partitions and made other changes, so it could be that. Either way, please reduce the severity to 4 and I'll get back in a couple of days if it keeps being stable?

Mvh. Torkil

>Either way, please reduce to severity to 4 and I'll get back in a couple days if it keeps being stable?
OK. Let's take that path. When you're sure it works correctly, please remove Type from gres.conf, restart slurmd/slurmctld, and let me know if that works as well.
cheers,
Marcin
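For clarity, these are the two gres.conf variants being compared in this thread (lines copied from earlier comments). One detail worth double-checking - my observation, not something confirmed in the thread: `gforce_rtx_2060` is not actually a substring of the NVML-reported name `geforce_rtx_2060` (it is missing an 'e'), which would be consistent with the "Setting system GRES type to NULL" lines in the logs.

```ini
# Variant used so far in this thread -- note the spelling "gforce",
# which is NOT a substring of the autodetected "geforce_rtx_2060":
NodeName=bigger9 Type=gforce_rtx_2060 AutoDetect=nvml

# Variant comment 38 asks to test once things are stable -- no Type,
# pure NVML autodetection:
NodeName=bigger9 AutoDetect=nvml
```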
(In reply to Marcin Stolarek from comment #38)
> >Either way, please reduce to severity to 4 and I'll get back in a couple days if it keeps being stable?
>
> OK. Let's take that path. When you're sure it works correctly please remove
> Type from gres.conf restart slurmd/slurmctld and let me know if that works
> as well.

Hrm, it had been running just fine until now, but I just added a new HPC node and had to restart slurmctld, and bang:

"
[2020-12-10T08:28:08.992] error: Setting node bigger9 state to DRAIN with reason:gres/gpu count reported lower than configured (0 < 1)
[2020-12-10T08:28:08.992] drain_nodes: node bigger9 state set to DRAIN
"

I could resume it with no problem; will see if it sticks.

Mvh. Torkil

(In reply to Marcin Stolarek from comment #38)
> OK. Let's take that path. When you're sure it works correctly please remove
> Type from gres.conf restart slurmd/slurmctld and let me know if that works
> as well.

The previous error was due to the upgrade/driver, I think. Running fine now without the Type in gres.conf:

"
[2020-12-14T08:25:43.135] debug: gres/gpu: init: loaded
[2020-12-14T08:25:43.135] debug: gres/gpu: node_config_load: Gres GPU plugin: Resetting gres_devices
[2020-12-14T08:25:43.643] debug: gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 455.45.01
[2020-12-14T08:25:43.643] debug: gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 11.455.45.01
[2020-12-14T08:25:43.655] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2020-12-14T08:25:43.718] gpu/nvml: _get_system_gpu_list_nvml: 1 GPU system device(s) detected
[2020-12-14T08:25:43.718] debug: Gres GPU plugin: Normalizing gres.conf with system GPUs
[2020-12-14T08:25:43.718] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2060`. Setting system GRES type to NULL
[2020-12-14T08:25:43.718] debug: gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2020-12-14T08:25:43.718] debug: GRES[gpu] Type:(null) Count:1 Cores(48):0-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2020-12-14T08:25:43.718] debug: Gres GPU plugin: Final normalized gres.conf list:
[2020-12-14T08:25:43.718] debug: GRES[gpu] Type:(null) Count:1 Cores(48):0-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2020-12-14T08:25:43.718] Gres Name=gpu Type=(null) Count=1
"

Mvh. Torkil

Do you want to keep the bug report open for a few days so you can verify that it works fine for you, or should we go ahead and close the case as info given?

You can always reopen it if it happens again.

cheers,
Marcin

(In reply to Marcin Stolarek from comment #41)
> Do you want to keep the bug report open for a few days so you'll verify that
> it works fine for you or should we go ahead and close the case as info given?
>
> You can always reopen if it happens again.

"
[2020-12-14T08:25:43.718] Gres Name=gpu Type=(null) Count=1
"

If that just means we won't be able to target the type specifically, and it will otherwise work just fine, feel free to close the ticket.

Mvh. Torkil

Torkil,

Just to make sure we're on the same page. When in gres.conf you have only:
>Autodetect=nvml

1) In case of the node definition (in slurm.conf) being like:
>NodeName=bigger9 CPUs=1 State=... Sockets=... Gres=gpu:1
the GPU is discovered correctly and the type for the GPU is not set:
>Gres=gpu:1(S:0)

2) In case of the node definition in slurm.conf:
>NodeName=bigger9 CPUs=1 State=... Sockets=... Gres=gpu:geforce:1
the GPU is discovered correctly and the type is set to "geforce"; scontrol show node shows:
>Gres=gpu:geforce:1(S:0)

3) If you have a type set in slurm.conf that is not a substring of the Type from NVML, like:
>NodeName=bigger9 CPUs=1 State=... Sockets=... Gres=gpu:tesla:1
the node gets drained with "gres/gpu count reported lower than configured (0 < 1)".

Those three are currently expected and sound reasonable, don't you agree? Do you see similar behavior, or are you seeing something we can do to make things more intuitive?

cheers,
Marcin

(In reply to Marcin Stolarek from comment #43)
> Torkil,
>
> Just to make sure we're on the same page. When in gres.conf you have only:
> >Autodetect=nvml

I actually had a NodeName line earlier for bigger9, but now only "Autodetect=nvml".

> 1) In case of the node definition(in slurm.conf) being like:
> >NodeName=bigger9 CPUs=1 State=... Sockets=... Gres=gpu:1
> The GPU is discovered correctly and the type for GPU is not set.
> >Gres=gpu:1(S:0)

Gres=gpu:1(S:0-23)

> 2) In case of node definition in slurm.conf:
> >NodeName=bigger9 CPUs=1 State=... Sockets=... Gres=gpu:geforce:1
> The GPU is discovered correctly and the type is set to "geforce", scontrol
> show node shows:
> >Gres=gpu:geforce:1(S:0)

The first time I changed it from 1) to 2) it went "Drain" after restarting slurmctld and doing "scontrol reconfigure":

"
[2020-12-21T09:57:44.166] debug: gres/gpu: init: loaded
[2020-12-21T09:57:44.166] debug: gres/gpu: node_config_load: Gres GPU plugin: Resetting gres_devices
[2020-12-21T09:57:44.683] debug: gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 455.45.01
[2020-12-21T09:57:44.683] debug: gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 11.455.45.01
[2020-12-21T09:57:44.689] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2020-12-21T09:57:44.690] gpu/nvml: _get_system_gpu_list_nvml: 1 GPU system device(s) detected
[2020-12-21T09:57:44.690] debug: Gres GPU plugin: Normalizing gres.conf with system GPUs
[2020-12-21T09:57:44.690] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2060`. Setting system GRES type to NULL
[2020-12-21T09:57:44.690] error: Discarding the following config-only GPU due to lack of File specification:
[2020-12-21T09:57:44.690] error: GRES[gpu] Type:gforce Count:1 Cores(48):(null) Links:(null) Flags:HAS_TYPE File:(null)
[2020-12-21T09:57:44.690] gres/gpu: _normalize_gres_conf: WARNING: The following autodetected GPUs are being ignored:
[2020-12-21T09:57:44.690] GRES[gpu] Type:(null) Count:1 Cores(48):0-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
"

After resuming once it works, though.

> 3) If you have a type that is not a substring of Type from NVML set in
> slurm.conf, like:
> >NodeName=bigger9 CPUs=1 State=... Sockets=... Gres=gpu:tesla:1
> The node gets drained with "gres/gpu count reported lower than configured
> (0 < 1)"

Hmm, it also works with tesla. It goes drain after restart/reconfigure, but I can resume it and run jobs on it. Seems the same as 2). Node info is "Gres=gpu:tesla:1", and there's this in the log:

"
[2020-12-21T10:37:34.024] [1814.batch] error: We should had got gres_devices, but for some reason none were set in the plugin.
[2020-12-21T10:37:34.100] [1814.batch] debug level is 'error'
"

Mvh. Torkil

Torkil,

>Hmm also works with tesla. It goes drain after restart/reconfigure but I can resume it and run jobs on it.

Yes, that's correct. However, this also applies to other cases of automatically drained nodes. For instance, if you have a lower number of cores on the node than configured, slurmctld can only verify that when the node registration RPC comes in, and it's an unavoidable race condition between admin action and the actual information from slurmd. I see that it may be confusing, but at the current state of the code the error message:
>[2020-12-21T10:37:34.024] [1814.batch] error: We should had got gres_devices, but for some reason none were set in the plugin.
is all we can do. In general, we don't recommend resuming automatically drained nodes without resolving the underlying issue.
I'm closing the case as "information given". Should you have any questions, please don't hesitate to reopen.

cheers,
Marcin
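To summarize the three cases discussed above as concrete config pairs (a sketch assembled from values in this ticket, not verified against any particular Slurm release; the `...` elisions mirror the thread's own shorthand):

```ini
# gres.conf (same for all three cases -- pure NVML autodetection):
NodeName=bigger9 AutoDetect=nvml

# slurm.conf, case 1 -- untyped GRES; the GPU registers with no type:
NodeName=bigger9 ... Gres=gpu:1

# slurm.conf, case 2 -- typed GRES; "geforce" is a substring of the
# NVML name geforce_rtx_2060, so the type is adopted:
NodeName=bigger9 ... Gres=gpu:geforce:1

# slurm.conf, case 3 -- "tesla" is not a substring of the NVML name,
# so registration under-counts the GRES and the node is drained:
NodeName=bigger9 ... Gres=gpu:tesla:1
```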
Created attachment 16561 [details]
Log from rpmbuild

Hi

Trying to get "Autodetect=nvml" to work, so I built new Slurm packages on a box with the CUDA libraries installed and reinstalled slurmd. The build seems to find CUDA and build NVML support just fine, and it installs, but I'm still getting that error when starting slurmd. What am I missing?

Mvh. Torkil