| Summary: | Registration Invalid Argument | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
| Component: | slurmctld | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | | |
| Priority: | --- | CC: | rod |
| Version: | 20.11.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=10641 | | |
| Site: | Harvard University | Slinky Site: | --- |
| CLE Version: | | Version Fixed: | 21.08pre1 |
| Attachments: | slurm.comf | | |
Description
Paul Edmon
2021-01-14 11:06:19 MST

Jan 14 13:04:32 holy-slurm02 slurmctld[48717]: error: _slurm_rpc_node_registration node=holy7c16102: Invalid argument

I'm aware that these nodes are misconfigured and I will not be fixing them as they are legitimate hardware problems that I cannot resolve immediately. I don't want to change my slurm config nor do I wish to remove them from the scheduler.
Nate Rini

Paul, can you please provide your current slurm.conf and then 'slurmd -C' from all of these nodes?

Thanks,
--Nate

Paul Edmon

Created attachment 17509 [details]
slurm.comf
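For reference, one way to collect the requested 'slurmd -C' output from all of the affected nodes in a single pass, assuming a parallel shell such as ClusterShell's clush is available on the admin host (the node list below is just the set of hostnames reported in this ticket):

    clush -w holy7c16102,holy7c12112,aagk80gpu51,holy7c10505,holy2b09101,holy7c04108 'slurmd -C'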
Paul Edmon (comment #4)

[root@holy7c16102 ~]# slurmd -C
NodeName=holy7c16102 CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=176791
UpTime=3-04:14:59

[root@holy7c12112 ~]# slurmd -C
NodeName=holy7c12112 CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=176799
UpTime=3-04:17:25

[root@aagk80gpu51 ~]# slurmd -C
NodeName=aagk80gpu51 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=128666
UpTime=3-04:33:07

[root@holy7c10505 ~]# slurmd -C
NodeName=holy7c10505 CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=176791
UpTime=3-04:28:15

[root@holy2b09101 ~]# slurmd -C
slurmd: Considering each NUMA node as a socket
NodeName=holy2b09101 CPUs=64 Boards=1 SocketsPerBoard=8 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=257754
UpTime=29-04:25:45

[root@holy7c04108 ~]# slurmd -C
NodeName=holy7c04108 CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=176791
UpTime=3-04:46:28

Nate Rini (comment #6)

(In reply to Paul Edmon from comment #4)
> [root@holy7c16102 ~]# slurmd -C
> NodeName=holy7c16102 CPUs=48 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=176791

Is hyperthreading disabled in the BIOS on these nodes (when working)?

Paul Edmon

Yes. By default we have hyperthreading off on all of our nodes except a few.

In this case the reason it is complaining is that RealMemory=176791 is smaller than what is defined in slurm.conf:

[root@holy-slurm02 log]# scontrol show node holy7c16102
NodeName=holy7c16102 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUTot=48 CPULoad=0.01
   AvailableFeatures=intel,holyhdr,cascadelake,avx,avx2,avx512
   ActiveFeatures=intel,holyhdr,cascadelake,avx,avx2,avx512
   Gres=(null)
   NodeAddr=holy7c16102 NodeHostName=holy7c16102 Version=20.11.2
   OS=Linux 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020
   RealMemory=192892 AllocMem=0 FreeMem=169105 Sockets=2 Boards=1
   MemSpecLimit=4096
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=70265 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=emergency,serial_requeue,shared
   BootTime=2021-01-12T09:16:06 SlurmdStartTime=2021-01-15T09:51:46
   CfgTRES=cpu=48,mem=192892M,billing=95
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root@2021-01-13T15:43:07]
   Comment=(null)

This is due to some of the DIMMs being busted. Slurm correctly closed the node, but when I restart Slurm it now starts throwing that registration error all over the logs. The only ways to get rid of that error are:

a. Correct the size of the RAM in slurm.conf: not going to do that, as the DIMM is busted, is known to be busted, and will be repaired.

b. Remove the node from slurm.conf: not going to do that, as the DIMM will be repaired and I don't want to keep removing and adding nodes every time they break.

c. Keep the node down: not going to do that, because we need the node to remain up to maintain its configuration for when we do repair it, and also to be able to troubleshoot and run diagnostics.

So the node needs to remain in slurm.conf as is. The node registration error, though, needs to stop landing in the logs, as it happens every couple of seconds and makes the log unreadable. It would be best to downgrade that error to a different debug level so that it is not spamming all over the place.
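For context on the mismatch: based on the scontrol output above, the slurm.conf definition for this node presumably looks roughly like the sketch below (the real definition is in the attached slurm.comf; this line is reconstructed from the scontrol fields and is illustrative only). The configured RealMemory=192892 is now larger than the 176791 that slurmd -C reports with the failed DIMM, so slurmctld rejects the registration and drains the node with Reason=Low RealMemory:

    # Illustrative reconstruction -- the actual values live in the attached slurm.comf
    NodeName=holy7c16102 Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=192892 MemSpecLimit=4096 Feature=intel,holyhdr,cascadelake,avx,avx2,avx512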
Paul Edmon

Another option would be to have the registration error only be reported once, with subsequent occurrences demoted in severity. Marking the node as down or not responding would have the same effect for me, since I would see it in the node report.

Nate Rini (comment #11)

(In reply to Paul Edmon from comment #0)
> Jan 14 13:04:32 holy-slurm02 slurmctld[48717]: error: _slurm_rpc_node_registration node=holy7c16102: Invalid argument

If the node is set to STATE=DOWN, then this error should not be printed.

Looks like they are currently set to this:
> State=IDLE+DRAIN

Is there a reason to not set them as DOWN?

Paul Edmon (comment #12)

Not at all, other than if there are extant jobs on the node. If that is the case then we typically let them drain. However, if we want to consider this sort of mismatch a node failure, then it should be set to DOWN.

I will note that it was Slurm that put the node in the drain state; it was nothing that I or our other scripts did. So it may be good to change the functionality to set a node to DOWN when it hits this sort of error.

Nate Rini (comment #13)

(In reply to Paul Edmon from comment #12)
> I will note that it was slurm that put it in drain state, it was nothing that I or our other scripts did. So it may be good to change the functionality to set a node to DOWN when it hits this sort of error.

Slurm places the node in DRAIN because a node set to DOWN will kill any running jobs, and Slurm does everything it can to avoid killing (running) jobs.

> Not at all other than if there are extant jobs on the node. If that is the case then typically we let them drain. However if we want to consider this sort of mismatch a node failure then it should be set to DOWN.

Is there any issue with just setting these nodes down when your monitoring alerts you to these trouble nodes?

(In reply to Paul Edmon from comment #0)
> I'm aware that these nodes are misconfigured and I will not be fixing them as they are legitimate hardware problems that I cannot resolve immediately. I don't want to change my slurm config nor do I wish to remove them from the scheduler.

Setting the nodes to DOWN would stop the errors from getting logged when an admin has taken notice of the issue.
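The suggestion above amounts to marking the node down by hand once monitoring has flagged it, for example (the reason string here is arbitrary):

    scontrol update NodeName=holy7c16102 State=DOWN Reason="Low RealMemory / failed DIMM"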
Paul Edmon (comment #14)

No issue with that. I wasn't aware that setting the node to DOWN would stop the alert. If that is the case then that's a workable solution.

Nate Rini (comment #15)

(In reply to Paul Edmon from comment #14)
> No issue with that. I wasn't aware that setting the node to DOWN would stop the alert. If that is the case then that's a workable solution.

I'm going to close the ticket per your response. Please reply if you have any more questions/issues or that doesn't work for some reason.

--Nate

Paul Edmon (comment #16)

Actually, even when set to down the error is still thrown:

Jan 15 15:40:04 holy-slurm02 slurmctld[22097]: error: _slurm_rpc_node_registration node=holy7c16102: Invalid argument

[root@holy7c22501 ~]# scontrol show node holy7c16102
NodeName=holy7c16102 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUTot=48 CPULoad=0.01
   AvailableFeatures=intel,holyhdr,cascadelake,avx,avx2,avx512
   ActiveFeatures=intel,holyhdr,cascadelake,avx,avx2,avx512
   Gres=(null)
   NodeAddr=holy7c16102 NodeHostName=holy7c16102 Version=20.11.2
   OS=Linux 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020
   RealMemory=192892 AllocMem=0 FreeMem=169105 Sockets=2 Boards=1
   MemSpecLimit=4096
   State=DOWN+DRAIN ThreadsPerCore=1 TmpDisk=70265 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=emergency,serial_requeue,shared
   BootTime=2021-01-12T09:16:01 SlurmdStartTime=2021-01-15T15:39:48
   CfgTRES=cpu=48,mem=192892M,billing=95
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root@2021-01-15T15:38:09]
   Comment=(null)

So that did not work. I take it that it is supposed to, so that's a bug that should be fixed.
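A quick way to confirm what state the controller actually holds for the node, and whether the error is still recurring, would be along these lines (assuming the stock slurmctld systemd unit name, as the journal-style log lines above suggest):

    scontrol show node holy7c16102 | grep -E 'State=|Reason='
    journalctl -u slurmctld --since '10 minutes ago' | grep _slurm_rpc_node_registration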
Nate Rini (comment #17)

(In reply to Paul Edmon from comment #16)
> Actually even when set to down the error still is thrown:
> So that did not work. I take it is supposed to, so that's a bug that should be fixed.

Looking into it.

Nate Rini (comment #18)

(In reply to Nate Rini from comment #17)
> Looking into it.

Is the node sticking in the "DRAIN" state even after assigning it to down?

Paul Edmon (comment #20)

Yes, even if I set state=down it says DOWN+DRAIN instead of just plain DOWN.

Nate Rini

(In reply to Paul Edmon from comment #20)
> Yes, even if I set state=down it says DOWN+DRAIN instead of just plain DOWN.

Staring at the source code, it looks like I misread it when replying with comment #13. There currently doesn't appear to be any way to avoid the error without config_override. We are going to discuss internally what is preferred.

We would like our vote to have this fixed. Ideally, NHC drains the Slurm node due to the memory mismatch; no other action should be needed to avoid the logs filling.

Nate Rini

(In reply to Nate Rini from comment #22)
> We are going to discuss internally what is preferred.

Slurm 21.08 will now avoid the issue of attempting to revalidate a node non-stop:
> https://github.com/SchedMD/slurm/compare/c89853e00b0c1cb1a44d85400f6e3be31b7c0f8b...8276648733e6d3cda9053590c7032b422b8bb0bd

Please reply if there are any more related questions or issues.

Thanks,
--Nate
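The config_override mentioned above presumably refers to the SlurmdParameters=config_overrides option in slurm.conf, which makes Slurm trust the configured node definition instead of what slurmd reports, at the cost of masking genuine hardware loss. Sites that instead catch this class of problem with LBNL NHC, as suggested above, would typically use a physical-memory check. Both lines below are sketches only; the exact option and check names and argument formats should be confirmed against the slurm.conf and NHC documentation for the versions in use:

    # slurm.conf (sketch): accept the configured node layout even if slurmd reports less
    SlurmdParameters=config_overrides

    # nhc.conf (sketch): flag the node when physical memory falls outside the expected range
    * || check_hw_physmem 192GB 192GB 2%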