Ticket 13668 - No error message when node is INVALID_REG due to /tmp size
Summary: No error message when node is INVALID_REG due to /tmp size
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 21.08.4
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
 
Reported: 2022-03-21 12:44 MDT by Gordon Dexter
Modified: 2022-05-20 14:22 MDT
Site: Johns Hopkins Univ. HLTCOE
Version Fixed: 22.05.0 23.02.0pre1


Description Gordon Dexter 2022-03-21 12:44:21 MDT
I set a node as down in order to re-image it.

The new image included a slightly smaller /tmp partition than before, and as a result the node came up in DOWN+DRAIN+INVALID_REG state, but there wasn't a single indicator that TmpDisk was the cause of this state.

I reviewed the Reason= field in scontrol show node and it only showed the reason that we set manually before draining the node.

I checked the slurmd.log but it showed nothing about this issue, even with -vvv.

I checked slurmctld.log, and it only showed the message "_slurm_rpc_node_registration Node=mynode02: Invalid argument" but gave no indication which argument was invalid.

I manually verified CPU, memory, and GRES, but didn't think about TmpDisk, and as a result I wasted an awful lot of time on this issue, which could have been solved with a simple error message in slurmd.log or slurmctld.log.

Is there supposed to be a more helpful error message in this case?  If so, where should it appear?  Was it somewhere that I didn't look?  Or is there a reason it didn't appear in this case?
Comment 1 Jason Booth 2022-03-21 13:30:00 MDT
Just to make sure we understand your concern here: you have something like the following configured:

> NodeName=node1 TmpDisk=82345

https://slurm.schedmd.com/slurm.conf.html#OPT_TmpDisk

When you tried to bring a node back up with a new image (smaller /tmp), it drained with "INVALID_REG" due to the lower TmpDisk value?

I will have one of our support staff look into this and see what would be involved in making this situation more clear.
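
For anyone hitting the same state, the comparison in question can be sketched roughly as below. This is a minimal illustration, not Slurm's own code: it assumes TmpDisk is expressed in megabytes and that the temporary filesystem is /tmp (the TmpFS default), and the helper names `tmp_disk_mb` and `check_tmp_disk` are hypothetical. The message wording mirrors the controller debug line quoted later in this ticket.

```python
import shutil

MIB = 1024 * 1024


def tmp_disk_mb(path="/tmp"):
    """Total size, in whole MiB, of the filesystem that holds `path`.

    Approximation only: slurmd's own calculation may round differently.
    """
    return shutil.disk_usage(path).total // MIB


def check_tmp_disk(configured_mb, actual_mb):
    """Return a diagnostic string if the node has less tmp space than
    configured, otherwise None. (Hypothetical helper, not a Slurm API.)"""
    if actual_mb < configured_mb:
        return f"low tmp_disk size ({actual_mb} < {configured_mb})"
    return None


if __name__ == "__main__":
    configured = 82345  # TmpDisk value from the example NodeName line
    print(check_tmp_disk(configured, tmp_disk_mb()) or "TmpDisk OK")
```

Running this on the re-imaged node with the TmpDisk value from slurm.conf would show whether the new /tmp partition is smaller than the configured size.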
Comment 2 Gordon Dexter 2022-03-22 08:01:22 MDT
Yes, that's correct.  The node was drained (and couldn't be resumed), but gave no indication as to why.  A log message in slurmd.log would have been enough to figure it out quickly, rather than bang my head against it for hours.
Comment 3 Jason Booth 2022-03-22 12:04:34 MDT
> A log message in slurmd.log would have been enough to figure it out quickly, rather than bang my head against it for hours.

Understood. I will have Oriol see what options we may have to help with this situation.
Comment 4 Oriol Vilarrubi 2022-03-23 06:58:45 MDT
(In reply to Gordon Dexter from comment #2)
> Yes, that's correct.  The node was drained (and couldn't be resumed), but
> gave no indication as to why.  A log message in slurmd.log would have been
> enough to figure it out quickly, rather than bang my head against it for
> hours.

Hello Gordon,

The error message you are looking for is in the controller log. You did not see it because it is logged at the debug level. If the log level had been set to debug, you would have seen something like this:

[2022-03-23T12:18:00.472] debug:  Node node1 has low tmp_disk size (5 < 15)
[2022-03-23T12:18:00.472] Node node1 now responding
[2022-03-23T12:18:00.472] error: _slurm_rpc_node_registration node=node1: Invalid argument

Also, your drain reason did not reflect the low tmp space because whenever we get a configuration error (such as an incorrect tmp size) and the node is already drained, down, or in the fail state (similar to drain), we do not set a new reason, so as not to override the original message.

We'll discuss internally whether to increase the log level of those messages so that debug is not needed for them to be visible.

Either way, my recommendation is to always increase SlurmctldDebug and SlurmdDebug to "debug" whenever you face any issues. Those levels give a lot of useful details.

Greetings.
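
Once the controller is logging at debug level, lines like the one quoted in this comment can be pulled out of slurmctld.log with a small script. The following is a sketch under stated assumptions: the regular expression is written against the single `tmp_disk` line shown in this ticket, matching other fields through `\w+` is an assumption about how similar messages are worded, and `find_low_resource` is a hypothetical helper name.

```python
import re

# Pattern modeled on the debug line quoted in this ticket, e.g.
# "[ts] debug:  Node node1 has low tmp_disk size (5 < 15)".
# Accepting "error:" as well is an assumption, for logs written after the fix.
LOW_RESOURCE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?:debug\d*|error):\s+"
    r"Node (?P<node>\S+) has low (?P<field>\w+) size "
    r"\((?P<actual>\d+) < (?P<configured>\d+)\)"
)


def find_low_resource(lines):
    """Return (node, field, actual, configured) for each matching log line."""
    hits = []
    for line in lines:
        m = LOW_RESOURCE.match(line)
        if m:
            hits.append(
                (m["node"], m["field"], int(m["actual"]), int(m["configured"]))
            )
    return hits


sample = [
    "[2022-03-23T12:18:00.472] debug:  Node node1 has low tmp_disk size (5 < 15)",
    "[2022-03-23T12:18:00.472] Node node1 now responding",
    "[2022-03-23T12:18:00.472] error: _slurm_rpc_node_registration node=node1: Invalid argument",
]
print(find_low_resource(sample))  # [('node1', 'tmp_disk', 5, 15)]
```

In practice the `lines` argument would be an open file handle on the controller log rather than the in-memory sample.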
Comment 8 Caden Ellis 2022-05-20 14:22:53 MDT
Hello Gordon, thanks for logging the bug. 

We found a way to log the message at error level just once, when the state goes from valid to invalid, without reversing Bug 9035 (the issue that caused those error messages to be downgraded to debug messages). We always need to preserve any "reason" a user sets, so we chose to show an error in the logs rather than overwrite their reason.

Commit dffc3dd2ba contains this fix. Oriol mentioned that you could see the message by setting the log level to debug. With this commit, a helpful error message should now appear in the logs for the circumstance you encountered (a node with a manually set "reason"), without having to change the log level.
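
The "log once on the valid to invalid transition" idea can be illustrated with a short sketch. This is not the code from the commit: `RegistrationTracker` and its method are hypothetical names, and the per-node set of invalid nodes is an assumed stand-in for whatever state the controller actually keeps.

```python
import logging

log = logging.getLogger("slurmctld_sketch")  # hypothetical logger name


class RegistrationTracker:
    """Illustrative only: remembers which nodes last registered as invalid."""

    def __init__(self):
        self._invalid = set()

    def record(self, node, configured_mb, reported_mb):
        """Log a low tmp_disk registration and return the level used.

        Returns None for a valid registration. The first invalid
        registration after a valid state logs at ERROR; repeats log at
        DEBUG so the log is not flooded on every registration.
        """
        if reported_mb >= configured_mb:
            self._invalid.discard(node)  # back to valid, re-arm the error
            return None
        level = logging.DEBUG if node in self._invalid else logging.ERROR
        self._invalid.add(node)
        log.log(level, "Node %s has low tmp_disk size (%d < %d)",
                node, reported_mb, configured_mb)
        return level
```

The user-set reason is never touched here, which matches the stated design choice of reporting through the log instead.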

If you have any other questions, feel free to open a new bug.