| Summary: | Registration Invalid Argument | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
| Component: | slurmctld | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | | |
| Priority: | --- | CC: | rod |
| Version: | 20.11.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=10641 | | |
| Site: | Harvard University | Slinky Site: | --- |
| CLE Version: | | Version Fixed: | 21.08pre1 |
| Attachments: | slurm.comf | | |
Description
Paul Edmon
2021-01-14 11:06:19 MST

Jan 14 13:04:32 holy-slurm02 slurmctld[48717]: error: _slurm_rpc_node_registration node=holy7c16102: Invalid argument

I'm aware that these nodes are misconfigured and I will not be fixing them as they are legitimate hardware problems that I cannot resolve immediately. I don't want to change my slurm config nor do I wish to remove them from the scheduler.
Nate Rini

Paul, can you please provide your current slurm.conf and then 'slurmd -C' from all of these nodes?

Thanks,
--Nate

Paul Edmon

Created attachment 17509 [details]
slurm.comf
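For reference, one way to collect the requested 'slurmd -C' output from all of the affected nodes in a single pass, assuming a parallel shell such as ClusterShell's clush is available on the admin host (the node list below is just the set of hostnames reported in this ticket):

    clush -w holy7c16102,holy7c12112,aagk80gpu51,holy7c10505,holy2b09101,holy7c04108 'slurmd -C'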
Paul Edmon (comment #4)

[root@holy7c16102 ~]# slurmd -C
NodeName=holy7c16102 CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=176791
UpTime=3-04:14:59

[root@holy7c12112 ~]# slurmd -C
NodeName=holy7c12112 CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=176799
UpTime=3-04:17:25

[root@aagk80gpu51 ~]# slurmd -C
NodeName=aagk80gpu51 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=128666
UpTime=3-04:33:07

[root@holy7c10505 ~]# slurmd -C
NodeName=holy7c10505 CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=176791
UpTime=3-04:28:15

[root@holy2b09101 ~]# slurmd -C
slurmd: Considering each NUMA node as a socket
NodeName=holy2b09101 CPUs=64 Boards=1 SocketsPerBoard=8 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=257754
UpTime=29-04:25:45

[root@holy7c04108 ~]# slurmd -C
NodeName=holy7c04108 CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=176791
UpTime=3-04:46:28

Nate Rini (comment #6)

(In reply to Paul Edmon from comment #4)
> [root@holy7c16102 ~]# slurmd -C
> NodeName=holy7c16102 CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=176791

Is hyperthreading disabled in the BIOS on these nodes (when working)?

Paul Edmon

Yes. By default we have hyperthreading off on all of our nodes except a few.

In this case the reason it is complaining is that RealMemory=176791 is smaller than what is defined in slurm.conf:

[root@holy-slurm02 log]# scontrol show node holy7c16102
NodeName=holy7c16102 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUTot=48 CPULoad=0.01
   AvailableFeatures=intel,holyhdr,cascadelake,avx,avx2,avx512
   ActiveFeatures=intel,holyhdr,cascadelake,avx,avx2,avx512
   Gres=(null)
   NodeAddr=holy7c16102 NodeHostName=holy7c16102 Version=20.11.2
   OS=Linux 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020
   RealMemory=192892 AllocMem=0 FreeMem=169105 Sockets=2 Boards=1
   MemSpecLimit=4096
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=70265 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=emergency,serial_requeue,shared
   BootTime=2021-01-12T09:16:06 SlurmdStartTime=2021-01-15T09:51:46
   CfgTRES=cpu=48,mem=192892M,billing=95
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root@2021-01-13T15:43:07]
   Comment=(null)

This is due to some of the DIMMs being busted. Slurm correctly closed the node, but when I restart Slurm it now starts throwing that registration error all over the logs. The only ways to get rid of that error are:

a. Correct the size of the RAM in slurm.conf: not going to do that, as the DIMM is busted, is known to be busted, and will be repaired.

b. Remove the node from slurm.conf: not going to do that, as the DIMM will be repaired and I don't want to keep removing and adding nodes every time they break.

c. Keep the node down: not going to do that, because we need the node to remain up to maintain its configuration for when we do repair it, and also to be able to troubleshoot and run diagnostics.

So the node needs to remain in slurm.conf as is. The node registration error, though, needs to stop landing in the logs, as it happens every couple of seconds and makes the log unreadable. It would be best to downgrade that error to a different debug level so that it is not spamming all over the place.
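For context on the mismatch: based on the scontrol output above, the slurm.conf definition for this node presumably looks roughly like the sketch below (the real definition is in the attached slurm.comf; this line is reconstructed from the scontrol fields and is illustrative only). The configured RealMemory=192892 is now larger than the 176791 that slurmd -C reports with the failed DIMM, so slurmctld rejects the registration and drains the node with Reason=Low RealMemory:

    # Illustrative reconstruction -- the actual values live in the attached slurm.comf
    NodeName=holy7c16102 Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=192892 MemSpecLimit=4096 Feature=intel,holyhdr,cascadelake,avx,avx2,avx512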
Paul Edmon

Another option would be to have the registration error only be reported once, with subsequent occurrences demoted in severity. Marking the node as down or not responding would have the same effect for me, since I would see it in the node report.

Nate Rini (comment #11)

(In reply to Paul Edmon from comment #0)
> Jan 14 13:04:32 holy-slurm02 slurmctld[48717]: error: _slurm_rpc_node_registration node=holy7c16102: Invalid argument

If the node is set to STATE=DOWN, then this error should not be printed.

Looks like they are currently set to this:
> State=IDLE+DRAIN

Is there a reason to not set them as DOWN?

Paul Edmon (comment #12)

Not at all, other than if there are extant jobs on the node. If that is the case then we typically let them drain. However, if we want to consider this sort of mismatch a node failure, then it should be set to DOWN.

I will note that it was Slurm that put the node in the drain state; it was nothing that I or our other scripts did. So it may be good to change the functionality to set a node to DOWN when it hits this sort of error.

Nate Rini (comment #13)

(In reply to Paul Edmon from comment #12)
> I will note that it was slurm that put it in drain state, it was nothing that I or our other scripts did. So it may be good to change the functionality to set a node to DOWN when it hits this sort of error.

Slurm places the node in DRAIN because a node set to DOWN will kill any running jobs, and Slurm does everything it can to avoid killing (running) jobs.

> Not at all other than if there are extant jobs on the node. If that is the case then typically we let them drain. However if we want to consider this sort of mismatch a node failure then it should be set to DOWN.

Is there any issue with just setting these nodes down when your monitoring alerts you to these trouble nodes?

(In reply to Paul Edmon from comment #0)
> I'm aware that these nodes are misconfigured and I will not be fixing them as they are legitimate hardware problems that I cannot resolve immediately. I don't want to change my slurm config nor do I wish to remove them from the scheduler.

Setting the nodes to DOWN would stop the errors from getting logged when an admin has taken notice of the issue.
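The suggestion above amounts to marking the node down by hand once monitoring has flagged it, for example (the reason string here is arbitrary):

    scontrol update NodeName=holy7c16102 State=DOWN Reason="Low RealMemory / failed DIMM"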
Paul Edmon (comment #14)

No issue with that. I wasn't aware that setting the node to DOWN would stop the alert. If that is the case then that's a workable solution.

Nate Rini (comment #15)

(In reply to Paul Edmon from comment #14)
> No issue with that. I wasn't aware that setting the node to DOWN would stop the alert. If that is the case then that's a workable solution.

I'm going to close the ticket per your response. Please reply if you have any more questions/issues or that doesn't work for some reason.

--Nate

Paul Edmon (comment #16)

Actually, even when set to down the error is still thrown:

Jan 15 15:40:04 holy-slurm02 slurmctld[22097]: error: _slurm_rpc_node_registration node=holy7c16102: Invalid argument

[root@holy7c22501 ~]# scontrol show node holy7c16102
NodeName=holy7c16102 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUTot=48 CPULoad=0.01
   AvailableFeatures=intel,holyhdr,cascadelake,avx,avx2,avx512
   ActiveFeatures=intel,holyhdr,cascadelake,avx,avx2,avx512
   Gres=(null)
   NodeAddr=holy7c16102 NodeHostName=holy7c16102 Version=20.11.2
   OS=Linux 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020
   RealMemory=192892 AllocMem=0 FreeMem=169105 Sockets=2 Boards=1
   MemSpecLimit=4096
   State=DOWN+DRAIN ThreadsPerCore=1 TmpDisk=70265 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=emergency,serial_requeue,shared
   BootTime=2021-01-12T09:16:01 SlurmdStartTime=2021-01-15T15:39:48
   CfgTRES=cpu=48,mem=192892M,billing=95
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root@2021-01-15T15:38:09]
   Comment=(null)

So that did not work. I take it that it is supposed to, so that's a bug that should be fixed.
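A quick way to confirm what state the controller actually holds for the node, and whether the error is still recurring, would be along these lines (assuming the stock slurmctld systemd unit name, as the journal-style log lines above suggest):

    scontrol show node holy7c16102 | grep -E 'State=|Reason='
    journalctl -u slurmctld --since '10 minutes ago' | grep _slurm_rpc_node_registration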
Nate Rini (comment #17)

(In reply to Paul Edmon from comment #16)
> Actually even when set to down the error still is thrown:
> So that did not work. I take it is supposed to, so that's a bug that should be fixed.

Looking into it.

Nate Rini (comment #18)

(In reply to Nate Rini from comment #17)
> Looking into it.

Is the node sticking in the "DRAIN" state even after assigning it to down?

Paul Edmon (comment #20)

Yes, even if I set state=down it says DOWN+DRAIN instead of just plain DOWN.

Nate Rini

(In reply to Paul Edmon from comment #20)
> Yes, even if I set state=down it says DOWN+DRAIN instead of just plain DOWN.

Staring at the source code, it looks like I misread it when replying with comment #13. There currently doesn't appear to be any way to avoid the error without config_override. We are going to discuss internally what is preferred.

We would like our vote to have this fixed. Ideally, NHC drains the Slurm node due to the memory mismatch; no other action should be needed to avoid the logs filling.

Nate Rini

(In reply to Nate Rini from comment #22)
> We are going to discuss internally what is preferred.

Slurm 21.08 will now avoid the issue of attempting to revalidate a node non-stop:
> https://github.com/SchedMD/slurm/compare/c89853e00b0c1cb1a44d85400f6e3be31b7c0f8b...8276648733e6d3cda9053590c7032b422b8bb0bd

Please reply if there are any more related questions or issues.

Thanks,
--Nate
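The config_override mentioned above presumably refers to the SlurmdParameters=config_overrides option in slurm.conf, which makes Slurm trust the configured node definition instead of what slurmd reports, at the cost of masking genuine hardware loss. Sites that instead catch this class of problem with LBNL NHC, as suggested above, would typically use a physical-memory check. Both lines below are sketches only; the exact option and check names and argument formats should be confirmed against the slurm.conf and NHC documentation for the versions in use:

    # slurm.conf (sketch): accept the configured node layout even if slurmd reports less
    SlurmdParameters=config_overrides

    # nhc.conf (sketch): flag the node when physical memory falls outside the expected range
    * || check_hw_physmem 192GB 192GB 2%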