Ticket 16131 - fatal: can't stat gres.conf file /dev/nvidia9
Summary: fatal: can't stat gres.conf file /dev/nvidia9
Status: RESOLVED WONTFIX
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 22.05.8
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-02-27 03:39 MST by Ole.H.Nielsen@fysik.dtu.dk
Modified: 2023-03-07 00:40 MST

See Also:
Site: DTU Physics


Description Ole.H.Nielsen@fysik.dtu.dk 2023-02-27 03:39:14 MST
We have a few nodes with 10 GPUs each, and gres.conf has just this line:

Nodename=s[001-008] Name=gpu Type=RTX3090 File=/dev/nvidia[0-9]

Now one GPU in one of those nodes has failed and is not visible to the BIOS or OS, so this node now has only 9 GPUs instead of 10.  When the node starts up, slurmd logs errors and exits with a fatal error:

[2023-02-27T11:01:42.484] error: Waiting for gres.conf file /dev/nvidia3
[2023-02-27T11:01:43.484] gres.conf file /dev/nvidia3 now exists
[2023-02-27T11:01:43.484] error: Waiting for gres.conf file /dev/nvidia4
[2023-02-27T11:01:44.485] gres.conf file /dev/nvidia4 now exists
[2023-02-27T11:01:44.485] error: Waiting for gres.conf file /dev/nvidia5
[2023-02-27T11:01:45.485] gres.conf file /dev/nvidia5 now exists
[2023-02-27T11:01:45.485] error: Waiting for gres.conf file /dev/nvidia6
[2023-02-27T11:01:46.485] gres.conf file /dev/nvidia6 now exists
[2023-02-27T11:01:46.485] error: Waiting for gres.conf file /dev/nvidia7
[2023-02-27T11:01:47.485] gres.conf file /dev/nvidia7 now exists
[2023-02-27T11:01:47.486] error: Waiting for gres.conf file /dev/nvidia8
[2023-02-27T11:01:49.486] gres.conf file /dev/nvidia8 now exists
[2023-02-27T11:01:49.486] error: Waiting for gres.conf file /dev/nvidia9
[2023-02-27T11:02:08.489] fatal: can't stat gres.conf file /dev/nvidia9: No such file or directory

The OS sees only these 9 devices:

$ ls -la /dev/nvidia?
crw-rw-rw- 1 root root 195, 0 Feb 27 11:01 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Feb 27 11:01 /dev/nvidia1
crw-rw-rw- 1 root root 195, 2 Feb 27 11:01 /dev/nvidia2
crw-rw-rw- 1 root root 195, 3 Feb 27 11:01 /dev/nvidia3
crw-rw-rw- 1 root root 195, 4 Feb 27 11:01 /dev/nvidia4
crw-rw-rw- 1 root root 195, 5 Feb 27 11:01 /dev/nvidia5
crw-rw-rw- 1 root root 195, 6 Feb 27 11:01 /dev/nvidia6
crw-rw-rw- 1 root root 195, 7 Feb 27 11:01 /dev/nvidia7
crw-rw-rw- 1 root root 195, 8 Feb 27 11:01 /dev/nvidia8

IMHO, slurmd should not crash with a fatal error in this case.

It would make sense for slurmd to set the node state to FAIL or FAILING when a device listed in gres.conf is missing, but slurmd should not crash.
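
A possible interim workaround until the GPU is repaired (a sketch; singling out s003 as the failed node is an assumption) is to split the gres.conf line so the affected node lists only the device files that actually exist:

```
# Healthy nodes keep all 10 GPUs (s003 shown as the hypothetical failed node):
Nodename=s[001-002,004-008] Name=gpu Type=RTX3090 File=/dev/nvidia[0-9]
# The node that lost /dev/nvidia9 lists only 9 devices:
Nodename=s003 Name=gpu Type=RTX3090 File=/dev/nvidia[0-8]
```

The corresponding Gres= count for that node in slurm.conf would have to be lowered to match.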

Thanks,
Ole
Comment 2 Marcin Stolarek 2023-02-28 02:25:16 MST
Ole,

This way of handling a missing GRES device file was introduced almost 10 years ago (Slurm 14.03.10), by commit d097df2c699. I wouldn't call it a "crash"; it's a fatal error that happens only in very rare circumstances, such as, in your case, a physical hardware failure, or a misconfiguration.

We haven't noticed other customers complaining about this behavior since it was introduced. In the cases where customers consulted us about this fatal error, it was related to unexpected hardware-side behavior (for instance, needing to wait for NVIDIA devices to become available, Bug 8222).

From the code perspective, it won't be trivial to rewrite this so that slurmd still registers and sets its node state to invalid. Like every substantial change, it carries a risk of introducing new bugs.

Taking all of that into consideration, I don't think we should change the current behavior. Do you agree?

cheers,
Marcin
Comment 3 Ole.H.Nielsen@fysik.dtu.dk 2023-02-28 03:14:01 MST
Hi Marcin,

(In reply to Marcin Stolarek from comment #2)
> This way of handling a missing GRES device file was introduced almost 10
> years ago (Slurm 14.03.10), by commit d097df2c699. I wouldn't call it a
> "crash"; it's a fatal error that happens only in very rare circumstances,
> such as, in your case, a physical hardware failure, or a misconfiguration.
> 
> We haven't noticed other customers complaining about this behavior since it
> was introduced. In the cases where customers consulted us about this fatal
> error, it was related to unexpected hardware-side behavior (for instance,
> needing to wait for NVIDIA devices to become available, Bug 8222).

I noticed that slurmd had stopped on the node and was not reporting to slurmctld.  This is high impact for the node, but low impact for the entire cluster.

> From the code perspective it won't be trivial to rewrite to register the
> slurmd and set its state to invalid. As every substantial change it's
> related with a risk of bug introduction.

It seems to me that this case of a missing GRES device is similar to a node that has a failed DIMM module and was rebooted, so that its RAM is now less than the amount configured in slurm.conf.  In that case the node state becomes INVAL.  The same behavior would make sense for a missing GRES file.

> Taking all of that into consideration I don't think we should change current
> behavior. Do you agree?

If the necessary fix to slurmd is too complex to implement, and there is not a lot of interest from SchedMD's customers, then I think we can abandon the suggested improvement.

Sites will discover the problem of missing GRES files in any case, because the node's slurmd will not be responding.

Best regards,
Ole
Comment 4 Marcin Stolarek 2023-02-28 04:57:12 MST
>[...]so that the RAM memory is now less than the configuration in slurm.conf.  In this case the node state becomes INVAL.  This behavior would make sense also for a missing GRES file.

I agree that it would be more elegant to handle them the same way.

>If the necessary fix to slurmd is too complex to implement,[...]
I'll double-check the code for a possible improvement here, but if we address this, it will be a change introduced in a major version release.

Are you OK with lowering the case severity to 4?

cheers,
Marcin
Comment 5 Ole.H.Nielsen@fysik.dtu.dk 2023-02-28 07:03:51 MST
(In reply to Marcin Stolarek from comment #4)
> >[...]so that the RAM memory is now less than the configuration in slurm.conf.  In this case the node state becomes INVAL.  This behavior would make sense also for a missing GRES file.
> 
> I agree that it would be more elegant to handle them the same way.
> 
> >If the necessary fix to slurmd is too complex to implement,[...]
> I'll double-check the code for a possible improvement here, but if we
> address this, it will be a change introduced in a major version release.
> 
> Are you OK with lowering the case severity to 4?

Yes, that's fine.

/Ole
Comment 6 Marcin Stolarek 2023-03-06 07:45:47 MST
Ole,

Looking at the code of _parse_gres_config[1], there are a few cases where GRES configuration issues end up in a fatal error, for instance:
- a wrong combination of flags;
- a single file given for a "MultipleFiles" GRES (this relies on the earlier check of file existence, which is reported as fatal, and further complicates a potential behavior change);
- a "Count" outside the supported range.

Since this is the standard approach in that part of the code, I don't think we can justify a change here without a clear bug, merely for more elegant behavior.

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/blob/slurm-22-05-8-1/src/common/gres.c#L1275-L1495
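
The restructuring discussed above would amount to turning such fatal() validations into checks whose verdict slurmd can report at registration instead of exiting. A minimal sketch of one of them, with invented names (not the actual gres.c code):

```c
/* Sketch of a "Count out of supported range" validation. gres.c currently
 * calls fatal() when such a check fails; returning the verdict instead, as
 * here, is the shape a "mark the node invalid and keep running" variant of
 * slurmd would need. */
int gres_count_valid(long long count, long long max_count)
{
    return count > 0 && count <= max_count;
}
```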
Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2023-03-07 00:29:53 MST
Hi Marcin,

(In reply to Marcin Stolarek from comment #6)
> Looking at the code of _parse_gres_config[1], there are a few cases where
> GRES configuration issues end up in a fatal error, for instance:
> - a wrong combination of flags;
> - a single file given for a "MultipleFiles" GRES (this relies on the earlier
> check of file existence, which is reported as fatal, and further complicates
> a potential behavior change);
> - a "Count" outside the supported range.
> 
> Since this is the standard approach in that part of the code, I don't think
> we can justify a change here without a clear bug, merely for more elegant
> behavior.

OK, I agree with your analysis of the problem.  Please close this case.

Best regards,
Ole