| Summary: | Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2060`. Setting system GRES type to NULL | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Torkil Svensgaard <torkil> |
| Component: | Build System and Packaging | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | cinek, rkv |
| Version: | 20.11.0 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | DRCMR | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Log from rpmbuild, gres.conf, slurm.conf, slurmctld.log, slurmd.log, debugging patch(v1), debugging patch(v2) | | |
Solved that, but now looking at this:

"
[2020-11-09T08:20:07.501] Message aggregation disabled
[2020-11-09T08:20:08.041] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2020-11-09T08:20:08.106] 1 GPU system device(s) detected
[2020-11-09T08:20:08.106] WARNING: The following autodetected GPUs are being ignored:
[2020-11-09T08:20:08.106] GRES[gpu] Type:geforce_rtx_2060 Count:1 Cores(48):0-47 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
"

What seems to be the problem?

Mvh. Torkil

Added Gres=gpu:1 to slurm.conf and am now getting this:

"
[2020-11-09T08:37:16.139] Message aggregation disabled
[2020-11-09T08:37:16.658] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2020-11-09T08:37:16.727] 1 GPU system device(s) detected
[2020-11-09T08:37:16.727] Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2060`. Setting system GRES type to NULL
[2020-11-09T08:37:16.727] error: Discarding the following config-only GPU due to lack of File specification:
[2020-11-09T08:37:16.727] error: GRES[gpu] Type:(null) Count:1 Cores(48):(null) Links:(null) Flags: File:(null)
[2020-11-09T08:37:34.953] Message aggregation disabled
[2020-11-09T08:37:35.472] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2020-11-09T08:37:35.541] 1 GPU system device(s) detected
[2020-11-09T08:37:35.541] Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2060`. Setting system GRES type to NULL
"

Mvh. Torkil

Torkil,

Could you please add a type to the device definition in gres.conf, like:
>AutoDetect=nvml
>Name=gpu Type=geforce
and add the type to slurm.conf, like:
>Gres=gpu:geforce:1
and check whether that gets you going?

cheers,
Marcin

Hi Marcin

Thanks, that seemed to work:

"
[2020-11-09T11:01:52.936] debug: init: Gres GPU plugin loaded
[2020-11-09T11:01:52.936] debug: Gres GPU plugin: Resetting gres_devices
[2020-11-09T11:01:53.442] debug: Systems Graphics Driver Version: 450.80.02
[2020-11-09T11:01:53.442] debug: NVML Library Version: 11.450.80.02
[2020-11-09T11:01:53.450] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2020-11-09T11:01:53.518] 1 GPU system device(s) detected
[2020-11-09T11:01:53.518] debug: Gres GPU plugin: Normalizing gres.conf with system GPUs
[2020-11-09T11:01:53.518] debug: Including the following GPU matched between system and configuration:
[2020-11-09T11:01:53.518] debug: GRES[gpu] Type:geforce Count:1 Cores(48):0-47 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2020-11-09T11:01:53.518] debug: Gres GPU plugin: Final normalized gres.conf list:
[2020-11-09T11:01:53.518] debug: GRES[gpu] Type:geforce Count:1 Cores(48):0-47 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2020-11-09T11:01:53.518] Gres Name=gpu Type=geforce Count=1
[2020-11-09T11:01:53.521] debug: AcctGatherEnergy NONE plugin loaded
[2020-11-09T11:01:53.521] debug: AcctGatherProfile NONE plugin loaded
[2020-11-09T11:01:53.521] debug: AcctGatherInterconnect NONE plugin loaded
[2020-11-09T11:01:53.521] debug: AcctGatherFilesystem NONE plugin loaded
"

"
# sinfo -o "%20N %10c %10m %25f %10G "
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES
bigger9              48         257552     (null)                    gpu:1(S:0)
"

It was my impression that "Autodetect=nvml" should be sufficient in the gres.conf though?

Mvh. Torkil

>It was my impression that "Autodetect=nvml" should be sufficient in the gres.conf though?
I'm looking into the code to check how we can better handle it. I agree that `Gres=gpu:1` in slurm.conf should be enough according to our docs, so either documentation or gpu/nvml should be improved.
Are you OK with a severity decrease to 4 for the time being?
cheers,
Marcin
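For readers hitting the same message: the log line above describes a substring match between the configured GRES type and the NVML-autodetected device name. A minimal sketch of that rule (a simplified illustration only, not Slurm's actual implementation) shows why a near-miss spelling fails:

```python
# Simplified illustration of the matching rule the slurmd log describes:
# a configured GRES type is only kept if it is a substring of the
# NVML-autodetected device name. This is NOT Slurm's real code, just a
# sketch of the documented behavior.
def type_matches(configured_type: str, system_device: str) -> bool:
    return configured_type.lower() in system_device.lower()

device = "geforce_rtx_2060"  # name reported by NVML in this ticket

print(type_matches("geforce", device))           # True: substring, type kept
print(type_matches("geforce_rtx_2060", device))  # True: exact name also works
print(type_matches("gforce_rtx_2060", device))   # False: missing 'e', type set to NULL
```

Note in particular that `gforce_rtx_2060` (as spelled later in this thread) is not a substring of `geforce_rtx_2060`, so under this rule it would be rejected just like `tesla`.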
(In reply to Marcin Stolarek from comment #5)
> >It was my impression that "Autodetect=nvml" should be sufficient in the gres.conf though?
>
> I'm looking into the code to check how we can better handle it. I agree that
> `Gres=gpu:1` in slurm.conf should be enough according to our docs, so either
> documentation or gpu/nvml should be improved.

Roger. There's also the "error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported", which seems to be because we have an older GPU architecture.

> Are you OK with a severity decrease to 4 for the time being?

Yep, that's fine.

Mvh. Torkil

>There's also the "error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported" which seems to be because we have an older GPU architecture.
I've noticed it, and you're reading it correctly - it's just not supported in your configuration and the error is not critical. I'll check whether we should decrease the log level for the "NVML_ERROR_NOT_SUPPORTED" errno.
I'll keep you posted.
cheers,
Marcin
A somewhat related question, if you don't mind. I'm running configless, which is pretty cool, but this also means that gres.conf is copied to all nodes. Is there some way to have different flavours of gres.conf so only nodes with a GPU get the "Autodetect=nvml" parameter, or how do I solve that?

Thanks, Torkil

>Is there some way to have different flavours of the gres.conf so only nodes with a GPU get the "Autodetect=nvml" parameter, or how do I solve that?

Generally we prefer those (not really related) questions to be separate tickets, to keep the history clear. That said, it's something addressed by 2b3d8a168b3[1], which will be released in Slurm 20.11.

cheers,
Marcin

[1]https://github.com/SchedMD/slurm/commit/2b3d8a168b36611d6112c8f2056514bf3bfa4e47

(In reply to Marcin Stolarek from comment #9)
> >Is there some way to have different flavours of the gres.conf so only nodes with a GPU get the "Autodetect=nvml" parameter, or how do I solve that?
> Generally we prefer those (not really related) questions to be separate
> tickets, to keep the history clear. That said, it's something addressed by
> 2b3d8a168b3[1], which will be released in Slurm 20.11.

Appreciated, thanks. Is 20.11 due this month?

Mvh. Torkil

>Appreciated, thanks. Is 20.11 due this month?

Yes, slurm-20.11rc1 (a release candidate) has already been tagged and can be downloaded from our web page[1]. Please keep in mind that it's rc1, not even a .0 release.

cheers,
Marcin

[1]https://www.schedmd.com/downloads.php

Torkil,

I can't reproduce the behavior you described in comment 0 using another NVML GPU. Are you able to reproduce it again and share your slurm.conf, gres.conf and debug3-level slurmd logs?

cheers,
Marcin

(In reply to Marcin Stolarek from comment #15)
> I can't reproduce the behavior you described in comment 0 using another NVML
> GPU. Are you able to reproduce it again and share your slurm.conf,
> gres.conf and debug3-level slurmd logs?
My setup is currently broken (ticket 10193) but I'll try to reproduce as soon as I am able. I am pretty sure I was trying to make the configuration as minimal as possible, so I only had "Autodetect=nvml" in gres.conf and no gres configuration for the node in slurm.conf.

How do I get debug3-level logs?

Mvh. Torkil

Hi

I'm now on 20.11 and seeing this:

"
[2020-12-01T10:53:08.072] got reconfigure request
[2020-12-01T10:53:08.073] all threads complete
[2020-12-01T10:53:08.073] debug: Reading slurm.conf file: /var/spool/slurm/d/conf-cache/slurm.conf
[2020-12-01T10:53:08.073] debug: NodeNames=slurmhpc1 setting Sockets=Boards(1)
[2020-12-01T10:53:08.073] debug: Reading cgroup.conf file /var/spool/slurm/d/conf-cache/cgroup.conf
[2020-12-01T10:53:08.073] debug: Log file re-opened
[2020-12-01T10:53:08.074] debug: CPUs:48 Boards:1 Sockets:1 CoresPerSocket:24 ThreadsPerCore:2
[2020-12-01T10:53:08.074] debug: Resource spec: No specialized cores configured by default on this node
[2020-12-01T10:53:08.074] debug: Resource spec: Reserved system memory limit not configured for this node
[2020-12-01T10:53:08.075] debug: gres/gpu: init: loaded
[2020-12-01T10:53:08.075] debug: gres/gpu: node_config_load: Gres GPU plugin: Resetting gres_devices
[2020-12-01T10:53:08.579] debug: gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 455.45.01
[2020-12-01T10:53:08.579] debug: gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 11.455.45.01
[2020-12-01T10:53:08.595] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2020-12-01T10:53:08.659] gpu/nvml: _get_system_gpu_list_nvml: 1 GPU system device(s) detected
[2020-12-01T10:53:08.659] debug: Gres GPU plugin: Normalizing gres.conf with system GPUs
[2020-12-01T10:53:08.659] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2060`. Setting system GRES type to NULL
[2020-12-01T10:53:08.659] debug: gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2020-12-01T10:53:08.659] debug: GRES[gpu] Type:(null) Count:1 Cores(48):0-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2020-12-01T10:53:08.659] debug: Gres GPU plugin: Final normalized gres.conf list:
[2020-12-01T10:53:08.659] debug: GRES[gpu] Type:(null) Count:1 Cores(48):0-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2020-12-01T10:53:08.659] Gres Name=gpu Type=(null) Count=1
[2020-12-01T10:53:08.668] debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2020-12-01T10:53:08.668] debug: acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2020-12-01T10:53:08.668] debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2020-12-01T10:53:08.668] debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
"

From slurm.conf:

"
NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=257552 Gres=gpu:1
"

From gres.conf:

"
NodeName=bigger9 AutoDetect=nvml
"

Changing gres.conf to this makes it possible to use the gres for a while:

"
NodeName=bigger9 Type=gforce_rtx_2060 AutoDetect=nvml
"

But eventually this happens:

"
[2020-12-01T12:53:28.912] error: Setting node bigger9 state to DRAIN with reason:gres/gpu count reported lower than configured (0 < 1)
[2020-12-01T12:53:28.912] drain_nodes: node bigger9 state set to DRAIN
[2020-12-01T12:53:28.912] error: _slurm_rpc_node_registration node=bigger9: Invalid argument
"

Torkil,

Did you restart all slurmd and slurmctld daemons, or just execute `scontrol reconfigure`?

cheers,
Marcin

(In reply to Marcin Stolarek from comment #21)
> Did you restart all slurmd and slurmctld daemons or just executed `scontrol
> reconfigure`?

I restarted everything and even rebooted the boxes.

Could we perhaps raise the priority on this, since the box going into drain is a bit of a showstopper?

Mvh. Torkil

Do you have a GRES type set in slurm.conf too? (Like Gres=gpu:gforce_rtx_2060:1 in the NodeName=... line)

Please share your current slurm.conf, gres.conf, slurmctld and slurmd logs.

cheers,
Marcin

Created attachment 16914 [details]
gres.conf
Created attachment 16915 [details]
slurm.conf
Created attachment 16916 [details]
slurmctld.log
Created attachment 16917 [details]
slurmd.log
(In reply to Marcin Stolarek from comment #23)
> Do you have a GRES type set in slurm.conf too? (Like
> Gres=gpu:gforce_rtx_2060:1 in the NodeName=... line)

Currently I have this:

"
NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=257552 Gres=gpu:1
"

In gres.conf either of these seems to work, but the node ends up draining:

"
NodeName=bigger9 Type=gforce_rtx_2060 AutoDetect=nvml
NodeName=bigger9 AutoDetect=nvml
"

Mvh. Torkil

Torkil,

I'm looking into the details now. Did you try my suggestion from comment 23:
>Do you have a GRES type set in slurm.conf too? (Like Gres=gpu:gforce_rtx_2060:1 in the NodeName=... line)

cheers,
Marcin

(In reply to Marcin Stolarek from comment #29)
> I'm looking into the details now. Did you try my suggestion from comment 23:
> >Do you have a GRES type set in slurm.conf too? (Like Gres=gpu:gforce_rtx_2060:1 in the NodeName=... line)

Yes, if I do:

"
NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=257552 Gres=gpu:gforce_rtx_2060:1
"

it fails with this error:

"
[2020-12-02T13:51:36.143] error: _slurm_rpc_node_registration node=bigger9: Invalid argument
"

and the node goes drain instantly.

Mvh. Torkil

Just to be 100% sure that we're on the same page: you have Type=gforce_rtx_2060 in gres.conf now, and all daemons were restarted (and nodes eventually resumed) after the configuration changes?

cheers,
Marcin

(In reply to Marcin Stolarek from comment #31)
> Just to be 100% sure that we're on the same page you have
> Type=gforce_rtx_2060 in gres.conf now and all daemons were restarted
> (eventually nodes were resumed) after configuration changes?

I have the gforce_rtx_2060 bit in both slurm.conf and gres.conf now and still:

"
[2020-12-02T13:55:51.586] error: _slurm_rpc_node_registration node=bigger9: Invalid argument
"

Mvh. Torkil

If you take a look a few lines above the drain_nodes entry in your slurmctld.log you'll notice:
>[2020-11-03T22:58:33.856] error: Node bigger9 has low socket*core*thread count (1 < 48)
>[2020-11-03T22:58:33.856] error: Node bigger9 has low cpu count (1 < 48)
>[2020-11-03T22:58:33.856] error: Setting node bigger9 state to DRAIN

Looking at the slurmd log we see:
>[2020-10-29T08:37:01.372] CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=257552 TmpDisk=226773 Uptime=398 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

Is it a one-CPU node (a VM?)? If that's the case you should adjust its configuration in slurm.conf; currently it's:
>NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=257552 Gres=gpu:1

cheers,
Marcin

(In reply to Marcin Stolarek from comment #33)
> If you'll take a look at the few lines above the drain_nodes in your
> slurmctld.log you'll notice:
> >[2020-11-03T22:58:33.856] error: Node bigger9 has low socket*core*thread count (1 < 48)
> >[2020-11-03T22:58:33.856] error: Node bigger9 has low cpu count (1 < 48)
> >[2020-11-03T22:58:33.856] error: Setting node bigger9 state to DRAIN

That log line is from 3/11 though; that was sorted back then.

> Looking to slurmd log we see:
> >[2020-10-29T08:37:01.372] CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=257552 TmpDisk=226773 Uptime=398 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
>
> Is it one CPU node (VM?)? If that's the case you should adjust its
> configuration in slurm.conf, currently it's:
> >NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=257552 Gres=gpu:1

The node has a 48-core EPYC CPU.

This line in slurm.conf:

"
NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=257552 Gres=gpu:gforce_rtx_2060:1
"

gives this error:

"
[2020-12-03T07:18:27.350] error: _slurm_rpc_node_registration node=bigger9: Invalid argument
"

This line in slurm.conf:

"
NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=257552 Gres=gpu:1
"

works for a bit and then goes DRAIN due to:

"
error: Setting node bigger9 state to DRAIN with reason:gres/gpu count reported lower than configured (0 < 1)
"

Both daemons were restarted and "scontrol reconfigure" was run in each case.

Mvh. Torkil

Created attachment 16945 [details]
debugging patch(v1)
Sorry for missing the dates.
Are you able to apply the attached debugging patch and share the slurmd-side logs with it?
cheers,
Marcin
Created attachment 16946 [details]
debugging patch(v2)
I've pressed enter too early - please ignore the previous attachment.
cheers,
Marcin
(In reply to Torkil Svensgaard from comment #34)
> This line in slurm.conf:
>
> "
> NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24
> ThreadsPerCore=2 RealMemory=257552 Gres=gpu:1
> "

Actually, it has been running all morning with this version of the configuration without going into drain. gres.conf is:

"
NodeName=bigger9 Type=gforce_rtx_2060 AutoDetect=nvml
"

I'm pretty sure I also had that combination yesterday, but I might be mistaken. I've also changed the partitions and made other changes, so it could be that. Either way, please reduce the severity to 4 and I'll get back in a couple of days if it keeps being stable?

Mvh. Torkil

>Either way, please reduce to severity to 4 and I'll get back in a couple days if it keeps being stable?
OK. Let's take that path. When you're sure it works correctly, please remove Type from gres.conf, restart slurmd/slurmctld, and let me know if that works as well.
cheers,
Marcin
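For clarity, these are the two gres.conf variants being compared in this thread (lines copied from earlier comments). One detail worth double-checking - my observation, not something confirmed in the thread: `gforce_rtx_2060` is not actually a substring of the NVML-reported name `geforce_rtx_2060` (it is missing an 'e'), which would be consistent with the "Setting system GRES type to NULL" lines in the logs.

```ini
# Variant used so far in this thread -- note the spelling "gforce",
# which is NOT a substring of the autodetected "geforce_rtx_2060":
NodeName=bigger9 Type=gforce_rtx_2060 AutoDetect=nvml

# Variant comment 38 asks to test once things are stable -- no Type,
# pure NVML autodetection:
NodeName=bigger9 AutoDetect=nvml
```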
(In reply to Marcin Stolarek from comment #38)
> >Either way, please reduce to severity to 4 and I'll get back in a couple days if it keeps being stable?
>
> OK. Let's take that path. When you're sure it works correctly please remove
> Type from gres.conf restart slurmd/slurmctld and let me know if that works
> as well.

Hrm, it had been running just fine until now, but I just added a new HPC node and had to restart slurmctld, and bang:

"
[2020-12-10T08:28:08.992] error: Setting node bigger9 state to DRAIN with reason:gres/gpu count reported lower than configured (0 < 1)
[2020-12-10T08:28:08.992] drain_nodes: node bigger9 state set to DRAIN
"

I could resume it with no problem; will see if it sticks.

Mvh. Torkil

(In reply to Marcin Stolarek from comment #38)
> OK. Let's take that path. When you're sure it works correctly please remove
> Type from gres.conf restart slurmd/slurmctld and let me know if that works
> as well.

The previous error was due to the upgrade/driver, I think. Running fine now without the Type in gres.conf:

"
[2020-12-14T08:25:43.135] debug: gres/gpu: init: loaded
[2020-12-14T08:25:43.135] debug: gres/gpu: node_config_load: Gres GPU plugin: Resetting gres_devices
[2020-12-14T08:25:43.643] debug: gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 455.45.01
[2020-12-14T08:25:43.643] debug: gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 11.455.45.01
[2020-12-14T08:25:43.655] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2020-12-14T08:25:43.718] gpu/nvml: _get_system_gpu_list_nvml: 1 GPU system device(s) detected
[2020-12-14T08:25:43.718] debug: Gres GPU plugin: Normalizing gres.conf with system GPUs
[2020-12-14T08:25:43.718] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2060`. Setting system GRES type to NULL
[2020-12-14T08:25:43.718] debug: gres/gpu: _normalize_gres_conf: Including the following GPU matched between system and configuration:
[2020-12-14T08:25:43.718] debug: GRES[gpu] Type:(null) Count:1 Cores(48):0-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2020-12-14T08:25:43.718] debug: Gres GPU plugin: Final normalized gres.conf list:
[2020-12-14T08:25:43.718] debug: GRES[gpu] Type:(null) Count:1 Cores(48):0-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
[2020-12-14T08:25:43.718] Gres Name=gpu Type=(null) Count=1
"

Mvh. Torkil

Do you want to keep the bug report open for a few days so you can verify that it works fine for you, or should we go ahead and close the case as info given?

You can always reopen it if it happens again.

cheers,
Marcin

(In reply to Marcin Stolarek from comment #41)
> Do you want to keep the bug report open for a few days so you'll verify that
> it works fine for you or should we go ahead and close the case as info given?
>
> You can always reopen if it happens again.

"
[2020-12-14T08:25:43.718] Gres Name=gpu Type=(null) Count=1
"

If that just means we won't be able to target the type specifically, and it will otherwise work just fine, feel free to close the ticket.

Mvh. Torkil

Torkil,

Just to make sure we're on the same page. When in gres.conf you have only:
>Autodetect=nvml

1) In case of the node definition (in slurm.conf) being like:
>NodeName=bigger9 CPUs=1 State=... Sockets=... Gres=gpu:1
the GPU is discovered correctly and the type for the GPU is not set:
>Gres=gpu:1(S:0)

2) In case of the node definition in slurm.conf:
>NodeName=bigger9 CPUs=1 State=... Sockets=... Gres=gpu:geforce:1
the GPU is discovered correctly and the type is set to "geforce"; scontrol show node shows:
>Gres=gpu:geforce:1(S:0)

3) If you have a type set in slurm.conf that is not a substring of the Type from NVML, like:
>NodeName=bigger9 CPUs=1 State=... Sockets=... Gres=gpu:tesla:1
the node gets drained with "gres/gpu count reported lower than configured (0 < 1)".

Those three are currently expected and sound reasonable, don't you agree? Do you see similar behavior, or are you seeing something we can do to make things more intuitive?

cheers,
Marcin

(In reply to Marcin Stolarek from comment #43)
> Torkil,
>
> Just to make sure we're on the same page. When in gres.conf you have only:
> >Autodetect=nvml

I actually had a NodeName line earlier for bigger9, but now only "Autodetect=nvml".

> 1) In case of the node definition(in slurm.conf) being like:
> >NodeName=bigger9 CPUs=1 State=... Sockets=... Gres=gpu:1
> The GPU is discovered correctly and the type for GPU is not set.
> >Gres=gpu:1(S:0)

Gres=gpu:1(S:0-23)

> 2) In case of node definition in slurm.conf:
> >NodeName=bigger9 CPUs=1 State=... Sockets=... Gres=gpu:geforce:1
> The GPU is discovered correctly and the type is set to "geforce", scontrol
> show node shows:
> >Gres=gpu:geforce:1(S:0)

The first time I changed it from 1) to 2) it went "Drain" after restarting slurmctld and doing "scontrol reconfigure":

"
[2020-12-21T09:57:44.166] debug: gres/gpu: init: loaded
[2020-12-21T09:57:44.166] debug: gres/gpu: node_config_load: Gres GPU plugin: Resetting gres_devices
[2020-12-21T09:57:44.683] debug: gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 455.45.01
[2020-12-21T09:57:44.683] debug: gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 11.455.45.01
[2020-12-21T09:57:44.689] error: _nvml_get_mem_freqs: Failed to get supported memory frequencies for the GPU : Not Supported
[2020-12-21T09:57:44.690] gpu/nvml: _get_system_gpu_list_nvml: 1 GPU system device(s) detected
[2020-12-21T09:57:44.690] debug: Gres GPU plugin: Normalizing gres.conf with system GPUs
[2020-12-21T09:57:44.690] gres/gpu: _normalize_sys_gres_types: Could not find an unused configuration record with a GRES type that is a substring of system device `geforce_rtx_2060`. Setting system GRES type to NULL
[2020-12-21T09:57:44.690] error: Discarding the following config-only GPU due to lack of File specification:
[2020-12-21T09:57:44.690] error: GRES[gpu] Type:gforce Count:1 Cores(48):(null) Links:(null) Flags:HAS_TYPE File:(null)
[2020-12-21T09:57:44.690] gres/gpu: _normalize_gres_conf: WARNING: The following autodetected GPUs are being ignored:
[2020-12-21T09:57:44.690] GRES[gpu] Type:(null) Count:1 Cores(48):0-23 Links:-1 Flags:HAS_FILE,HAS_TYPE File:/dev/nvidia0
"

After resuming once it works, though.

> 3) If you have a type that is not a substring of Type from NVML set in
> slurm.conf, like:
> >NodeName=bigger9 CPUs=1 State=... Sockets=... Gres=gpu:tesla:1
> The node gets drained with "gres/gpu count reported lower than configured
> (0 < 1)"

Hmm, it also works with tesla. It goes drain after restart/reconfigure, but I can resume it and run jobs on it. Seems the same as 2). Node info is "Gres=gpu:tesla:1", and there's this in the log:

"
[2020-12-21T10:37:34.024] [1814.batch] error: We should had got gres_devices, but for some reason none were set in the plugin.
[2020-12-21T10:37:34.100] [1814.batch] debug level is 'error'
"

Mvh. Torkil

Torkil,

>Hmm also works with tesla. It goes drain after restart/reconfigure but I can resume it and run jobs on it.

Yes, that's correct. However, this also applies to other cases of automatically drained nodes. For instance, if you have a lower number of cores on the node than configured, slurmctld can only verify that when the node registration RPC comes in, and it's an unavoidable race condition between admin action and the actual information from slurmd. I see that it may be confusing, but at the current state of the code the error message:
>[2020-12-21T10:37:34.024] [1814.batch] error: We should had got gres_devices, but for some reason none were set in the plugin.
is all we can do. In general, we don't recommend resuming automatically drained nodes without resolving the underlying issue.
I'm closing the case as "information given". Should you have any questions, please don't hesitate to reopen.

cheers,
Marcin
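To summarize the three cases discussed above as concrete config pairs (a sketch assembled from values in this ticket, not verified against any particular Slurm release; the `...` elisions mirror the thread's own shorthand):

```ini
# gres.conf (same for all three cases -- pure NVML autodetection):
NodeName=bigger9 AutoDetect=nvml

# slurm.conf, case 1 -- untyped GRES; the GPU registers with no type:
NodeName=bigger9 ... Gres=gpu:1

# slurm.conf, case 2 -- typed GRES; "geforce" is a substring of the
# NVML name geforce_rtx_2060, so the type is adopted:
NodeName=bigger9 ... Gres=gpu:geforce:1

# slurm.conf, case 3 -- "tesla" is not a substring of the NVML name,
# so registration under-counts the GRES and the node is drained:
NodeName=bigger9 ... Gres=gpu:tesla:1
```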
Created attachment 16561 [details]
Log from rpmbuild

Hi

Trying to get "Autodetect=nvml" to work, so I built new Slurm packages on a box with the CUDA libraries installed and reinstalled slurmd. The build seems to find CUDA and build NVML support just fine, and it installs, but I'm still getting that error when starting slurmd. What am I missing?

Mvh. Torkil