| Summary: | No error message when node is INVALID_REG due to /tmp size |
|---|---|
| Product: | Slurm |
| Component: | slurmd |
| Version: | 21.08.4 |
| Hardware: | Linux |
| OS: | Linux |
| Status: | RESOLVED FIXED |
| Severity: | 4 - Minor Issue |
| Reporter: | Gordon Dexter <gmdexter> |
| Assignee: | Director of Support <support> |
| Site: | Johns Hopkins Univ. HLTCOE |
| Version Fixed: | 22.05.0 23.02.0pre1 |
Description (Gordon Dexter, 2022-03-21 12:44:21 MDT)
Comment 1 (SchedMD Support)

Just to make sure we understand your concern here. You have something like the following configured:

> NodeName=node1 TmpDisk=82345

https://slurm.schedmd.com/slurm.conf.html#OPT_TmpDisk

When you tried to bring the node back up with a new image (with a smaller /tmp), it drained with "INVALID_REG" due to the lower TmpDisk value? I will have one of our support staff look into this and see what would be involved in making this situation clearer.

Comment 2 (Gordon Dexter)

Yes, that's correct. The node was drained (and couldn't be resumed), but gave no indication as to why. A log message in slurmd.log would have been enough to figure it out quickly, rather than bang my head against it for hours.

Comment 3 (SchedMD Support)

> A log message in slurmd.log would have been enough to figure it out
> quickly, rather than bang my head against it for hours.
Understood. I will have Oriol see what options we may have to help with this situation.
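
For context, here is a minimal sketch of the configuration pattern discussed in comment 1. The node name and sizes are illustrative, not taken from this site's actual configuration:

```
# Illustrative slurm.conf excerpt. TmpDisk (in MB) is the amount of
# temporary disk space the node is expected to report; TmpFS names the
# filesystem slurmd measures at registration (default /tmp).
TmpFS=/tmp
NodeName=node1 CPUs=16 RealMemory=64000 TmpDisk=82345
```

If the node later registers with a /tmp smaller than the configured TmpDisk, for example after booting a new image, slurmctld rejects the registration and drains the node with the INVALID_REG flag.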
Comment 4 (Oriol, SchedMD Support)

(In reply to Gordon Dexter from comment #2)
> Yes, that's correct. The node was drained (and couldn't be resumed), but
> gave no indication as to why. A log message in slurmd.log would have been
> enough to figure it out quickly, rather than bang my head against it for
> hours.

Hello Gordon,

The error message you are looking for is in the controller log; the reason you did not see it is that it is emitted at the debug log level. If you had been running with the log level at debug, you would have seen something like this:

[2022-03-23T12:18:00.472] debug: Node node1 has low tmp_disk size (5 < 15)
[2022-03-23T12:18:00.472] Node node1 now responding
[2022-03-23T12:18:00.472] error: _slurm_rpc_node_registration node=node1: Invalid argument

Also, the reason your drain message did not reflect the low tmp space is that whenever we get a configuration error (such as an incorrect tmp size) while the node is already drained, down, or in a fail state (similar to drain), we do not set a new reason, so as not to override the original message.

We'll discuss internally whether to raise the log level of those messages so that debug is not needed for them to be visible. Either way, my recommendation is to always raise SlurmctldDebug and SlurmdDebug to "debug" whenever you face any issue; those levels give a lot of useful detail.

Greetings.

Comment 5 (SchedMD Support)

Hello Gordon, thanks for logging the bug. We found a way to put the error-level message in the logs just once, when the state goes from valid to invalid, without reversing Bug 9035 (the issue that caused those error messages to be downgraded to debug messages). We always need to preserve any "reason" a user sets, so we chose to show an error in the logs rather than modify their reason. Commit dffc3dd2ba contains the fix.

Oriol mentioned that you could see the message by setting the log level to debug, but with the commit above a helpful error message now appears in the logs for the specific circumstance you encountered (a manually set "reason" on the node), without having to change the log level.

If you have any other questions, feel free to open a new bug.
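
Following the recommendation in comment 4, a sketch of how the log levels might be raised while troubleshooting. These are standard slurm.conf parameters; revert them once done, since debug output is verbose:

```
# slurm.conf: raise both daemons to the "debug" log level, then apply
# the change with "scontrol reconfigure" or by restarting the daemons.
SlurmctldDebug=debug
SlurmdDebug=debug
```

The slurmctld level can also be raised at runtime with `scontrol setdebug debug`, which takes effect immediately and is not persisted across restarts.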
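
And since the original complaint was that the drained node gave no indication of why, a short sketch of the usual inspection and recovery commands (the node name is assumed for illustration):

```
# List the reason strings recorded for all drained/down nodes.
sinfo -R

# Show the full state of one node, including State= and Reason=.
scontrol show node node1

# Try to return the node to service. While the registration problem
# (here, /tmp smaller than the configured TmpDisk) persists, the node
# can be expected to drain again at its next registration.
scontrol update NodeName=node1 State=RESUME
```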