| Summary: | fatal: can't stat gres.conf file /dev/nvidia9 | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Ole.H.Nielsen <Ole.H.Nielsen> |
| Component: | slurmd | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED WONTFIX | QA Contact: | |
| Severity: | 4 - Minor Issue | CC: | cinek |
| Priority: | --- | Version: | 22.05.8 |
| Hardware: | Linux | OS: | Linux |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=14754, https://bugs.schedmd.com/show_bug.cgi?id=8222 | | |
| Site: | DTU Physics | | |
Description
Ole.H.Nielsen@fysik.dtu.dk
2023-02-27 03:39:14 MST
Comment 2 (Marcin Stolarek):

Ole,

This way of handling a missing GRES device file was introduced almost 10 years ago (Slurm 14.03.10), by commit d097df2c699. I wouldn't call that a "crash"; it's a fatal error that happens only in very rare circumstances, like in your case a physical failure of the hardware or a wrong configuration.

We haven't noticed other customers complaining about this behavior since it was introduced. In the cases where they raised this fatal error with us, it was related to unexpected hardware-side behavior (for instance, the need to wait for NVIDIA devices to become available, Bug 8222).

From the code perspective it won't be trivial to rewrite this so that slurmd still registers and sets its state to invalid. Like every substantial change, it carries a risk of introducing bugs.

Taking all of that into consideration, I don't think we should change the current behavior. Do you agree?

cheers,
Marcin

Comment 3 (Ole H. Nielsen):

Hi Marcin,

(In reply to Marcin Stolarek from comment #2)
> This way of handling a missing GRES device file was introduced almost 10
> years ago (Slurm 14.03.10), by commit d097df2c699. I wouldn't call that a
> "crash"; it's a fatal error that happens only in very rare circumstances,
> like in your case a physical failure of the hardware or a wrong
> configuration.
>
> We haven't noticed other customers complaining about this behavior since it
> was introduced. In the cases where they raised this fatal error with us, it
> was related to unexpected hardware-side behavior (for instance, the need to
> wait for NVIDIA devices to become available, Bug 8222).

I noticed that slurmd had stopped on the node and was not reporting to slurmctld. This is high impact for the node, but low impact for the entire cluster.

> From the code perspective it won't be trivial to rewrite this so that slurmd
> still registers and sets its state to invalid. Like every substantial
> change, it carries a risk of introducing bugs.
It seems to me that this case of a missing GRES device is similar to a node that has a failed DIMM module and was rebooted, so that its RAM is now less than what is configured in slurm.conf. In that case the node state becomes INVAL. The same behavior would make sense for a missing GRES file.

> Taking all of that into consideration, I don't think we should change the
> current behavior. Do you agree?

If the necessary fix to slurmd is too complex to implement, and there is not a lot of interest from SchedMD's customers, then I think we can abandon the suggested improvement. Sites will in any case discover the problem of missing GRES files because the node's slurmd will not be responding.

Best regards,
Ole

Comment 4 (Marcin Stolarek):

> [...] so that its RAM is now less than what is configured in slurm.conf. In
> that case the node state becomes INVAL. The same behavior would make sense
> for a missing GRES file.

I agree that it would be more elegant to handle both the same way.

> If the necessary fix to slurmd is too complex to implement, [...]

I'll double-check the code for a possible improvement here, but if we address it, the change will be introduced in a major version release.

Are you OK with lowering the case severity to 4?

cheers,
Marcin

Comment 5 (Ole H. Nielsen):

(In reply to Marcin Stolarek from comment #4)
> I'll double-check the code for a possible improvement here, but if we
> address it, the change will be introduced in a major version release.
>
> Are you OK with lowering the case severity to 4?

Yes, that's fine.

/Ole

Comment 6 (Marcin Stolarek):

Ole,

Looking at the code of _parse_gres_config [1], there are a few cases where GRES configuration issues end up with a fatal error.
Those are, for instance:
- a wrong combination of flags;
- a single file given for a "MultipleFiles" GRES; this relies on the earlier check of file existence (reported as a fatal error), which further complicates a potential behavior change;
- a "Count" outside the supported ranges.

Since this is more or less the standard approach in that part of the code, I don't think we can justify changes here without a clear bug, just for more elegant behavior.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/blob/slurm-22-05-8-1/src/common/gres.c#L1275-L1495

Comment 7 (Ole H. Nielsen):

Hi Marcin,

(In reply to Marcin Stolarek from comment #6)
> Since this is more or less the standard approach in that part of the code, I
> don't think we can justify changes here without a clear bug, just for more
> elegant behavior.

OK, I agree with your analysis of the problem. Please close this case.

Best regards,
Ole
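For context, the fatal error in the summary is triggered by a File= entry in gres.conf that names a device node which no longer exists. A gres.conf along these illustrative lines (an assumed 10-GPU node; not the reporter's actual file) would make slurmd check each listed device at startup:

```
# Hypothetical gres.conf for a node with ten GPUs; /dev/nvidia9 is the
# device named in this report's fatal message. If any listed device node
# is absent when slurmd starts, slurmd exits with
# "fatal: can't stat gres.conf file <path>".
Name=gpu File=/dev/nvidia[0-9]
```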