We have a few nodes with 10 GPUs each, and gres.conf has just this line:

Nodename=s[001-008] Name=gpu Type=RTX3090 File=/dev/nvidia[0-9]

Now one GPU in one of those nodes has failed and is no longer visible to the BIOS or the OS, so this node has only 9 GPUs instead of 10. When the node starts up, slurmd logs a series of errors and then stops with a fatal error (from slurmd.log):

[2023-02-27T11:01:42.484] error: Waiting for gres.conf file /dev/nvidia3
[2023-02-27T11:01:43.484] gres.conf file /dev/nvidia3 now exists
[2023-02-27T11:01:43.484] error: Waiting for gres.conf file /dev/nvidia4
[2023-02-27T11:01:44.485] gres.conf file /dev/nvidia4 now exists
[2023-02-27T11:01:44.485] error: Waiting for gres.conf file /dev/nvidia5
[2023-02-27T11:01:45.485] gres.conf file /dev/nvidia5 now exists
[2023-02-27T11:01:45.485] error: Waiting for gres.conf file /dev/nvidia6
[2023-02-27T11:01:46.485] gres.conf file /dev/nvidia6 now exists
[2023-02-27T11:01:46.485] error: Waiting for gres.conf file /dev/nvidia7
[2023-02-27T11:01:47.485] gres.conf file /dev/nvidia7 now exists
[2023-02-27T11:01:47.486] error: Waiting for gres.conf file /dev/nvidia8
[2023-02-27T11:01:49.486] gres.conf file /dev/nvidia8 now exists
[2023-02-27T11:01:49.486] error: Waiting for gres.conf file /dev/nvidia9
[2023-02-27T11:02:08.489] fatal: can't stat gres.conf file /dev/nvidia9: No such file or directory

The OS sees only these 9 devices:

$ ls -la /dev/nvidia?
crw-rw-rw- 1 root root 195, 0 Feb 27 11:01 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Feb 27 11:01 /dev/nvidia1
crw-rw-rw- 1 root root 195, 2 Feb 27 11:01 /dev/nvidia2
crw-rw-rw- 1 root root 195, 3 Feb 27 11:01 /dev/nvidia3
crw-rw-rw- 1 root root 195, 4 Feb 27 11:01 /dev/nvidia4
crw-rw-rw- 1 root root 195, 5 Feb 27 11:01 /dev/nvidia5
crw-rw-rw- 1 root root 195, 6 Feb 27 11:01 /dev/nvidia6
crw-rw-rw- 1 root root 195, 7 Feb 27 11:01 /dev/nvidia7
crw-rw-rw- 1 root root 195, 8 Feb 27 11:01 /dev/nvidia8

IMHO, slurmd should not crash with a fatal error in this case.
It would make sense for slurmd to set the node state to FAIL or FAILING when a device listed in gres.conf is missing, but slurmd itself should not crash.

Thanks,
Ole
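Until the failed card is replaced, one possible interim workaround (a sketch only; the degraded node is assumed here to be s003, which is not stated in the report) is to split the node range in gres.conf so that the affected node lists only its nine surviving device files, with a matching Gres= count in its slurm.conf node definition:

```
# gres.conf sketch: the degraded node (assumed s003) gets its own line
NodeName=s[001-002,004-008] Name=gpu Type=RTX3090 File=/dev/nvidia[0-9]
NodeName=s003 Name=gpu Type=RTX3090 File=/dev/nvidia[0-8]

# slurm.conf must agree, e.g. Gres=gpu:RTX3090:9 for s003
```

Both daemons need a reconfigure/restart after such a change, and the original line can be restored once the GPU is replaced.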
Ole,

This way of handling a missing GRES device file was introduced almost 10 years ago (Slurm 14.03.10), by commit d097df2c699. I wouldn't call it a "crash": it is a deliberate fatal error that occurs only in very rare circumstances, such as a physical hardware failure (as in your case) or a misconfiguration.

We haven't seen other customers complain about this behavior since it was introduced. In the cases where customers did raise this fatal error with us, it was related to unexpected hardware-side behavior (for instance, needing to wait for NVIDIA devices to become available, Bug 8222).

From the code perspective, it would not be trivial to rewrite this so that slurmd still registers and sets its node state to invalid. As with every substantial change, it carries a risk of introducing new bugs. Taking all of that into consideration, I don't think we should change the current behavior. Do you agree?

cheers,
Marcin
Hi Marcin,

(In reply to Marcin Stolarek from comment #2)
> This way of handling a missing GRES device file was introduced almost 10
> years ago (Slurm 14.03.10), by commit d097df2c699. I wouldn't call it a
> "crash": it is a deliberate fatal error that occurs only in very rare
> circumstances, such as a physical hardware failure (as in your case) or a
> misconfiguration.
>
> We haven't seen other customers complain about this behavior since it was
> introduced. In the cases where customers did raise this fatal error with
> us, it was related to unexpected hardware-side behavior (for instance,
> needing to wait for NVIDIA devices to become available, Bug 8222).

I noticed that slurmd had stopped on the node and was not reporting to slurmctld. This is a high impact for the node, but a low impact for the cluster as a whole.

> From the code perspective, it would not be trivial to rewrite this so that
> slurmd still registers and sets its node state to invalid. As with every
> substantial change, it carries a risk of introducing new bugs.

It seems to me that a missing GRES device is similar to the case of a node with a failed DIMM module that was rebooted, so that its RAM is now less than what is configured in slurm.conf. In that case the node state becomes INVAL. The same behavior would make sense for a missing GRES file.

> Taking all of that into consideration, I don't think we should change the
> current behavior. Do you agree?

If the necessary fix to slurmd is too complex to implement, and there is not much interest from SchedMD's customers, then I think we can abandon the suggested improvement. Sites will discover the problem of missing GRES files anyway, because the node's slurmd will not be responding.

Best regards,
Ole
> [...] so that its RAM is now less than what is configured in slurm.conf. In that case the node state becomes INVAL. The same behavior would make sense for a missing GRES file.

I agree that it would be more elegant to handle these the same way.

> If the necessary fix to slurmd is too complex to implement, [...]

I'll double-check the code for a possible improvement here, but if we address this, it will be a change introduced in a major version release.

Are you OK with lowering the case severity to 4?

cheers,
Marcin
(In reply to Marcin Stolarek from comment #4)
> > [...] so that its RAM is now less than what is configured in slurm.conf. In that case the node state becomes INVAL. The same behavior would make sense for a missing GRES file.
>
> I agree that it would be more elegant to handle these the same way.
>
> > If the necessary fix to slurmd is too complex to implement, [...]
> I'll double-check the code for a possible improvement here, but if we
> address this, it will be a change introduced in a major version release.
>
> Are you OK with lowering the case severity to 4?

Yes, that's fine.

/Ole
Ole,

Looking at the code of _parse_gres_config [1], there are a few more cases where GRES configuration issues end in a fatal error, for instance:
- a wrong combination of flags;
- a single file given for a "MultipleFiles" GRES; this relies on the earlier check of file existence (reported as fatal), which further complicates a potential behavior change;
- a "Count" outside the supported range.

Since this is the standard approach in that part of the code, I don't think we can justify changes here without a clear bug, just for the sake of more elegant behavior.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/blob/slurm-22-05-8-1/src/common/gres.c#L1275-L1495
Hi Marcin,

(In reply to Marcin Stolarek from comment #6)
> Looking at the code of _parse_gres_config [1], there are a few more cases
> where GRES configuration issues end in a fatal error, for instance:
> - a wrong combination of flags;
> - a single file given for a "MultipleFiles" GRES; this relies on the
> earlier check of file existence (reported as fatal), which further
> complicates a potential behavior change;
> - a "Count" outside the supported range.
>
> Since this is the standard approach in that part of the code, I don't
> think we can justify changes here without a clear bug, just for the sake
> of more elegant behavior.

OK, I agree with your analysis of the problem. Please close this case.

Best regards,
Ole