Ticket 13674

Summary: Please document node state INVALID_REG
Product: Slurm
Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: Documentation
Assignee: Ben Roberts <ben>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Version: 21.08.6
Hardware: Linux
OS: Linux
Site: DTU Physics
Version Fixed: 21.08.7, 22.05.pre1

Description Ole.H.Nielsen@fysik.dtu.dk 2022-03-22 02:47:36 MDT
Please document in more detail the meaning of a node state INVALID_REG.

Our observations: We have a compute node with hardware errors, and we have removed a processor and DIMMs so that we're down to 1 processor and 1 DIMM:

$ slurmd -C
NodeName=b008 CPUs=20 Boards=1 SocketsPerBoard=1 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=31469

In its normal state, b008 is configured in slurm.conf identically to this node:

$ slurmd -C
NodeName=b007 slurmd: Considering each NUMA node as a socket
CPUs=40 Boards=1 SocketsPerBoard=4 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=772298
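
To make the difference between the two nodes explicit, here is a minimal sketch that compares the two `slurmd -C` outputs above. The helper names `parse_slurmd_c` and `diff_hardware` are hypothetical and not part of Slurm, and this is only a plain textual diff of the reported values; it does not reproduce the checks that slurmctld actually performs at registration.

```python
import re


def parse_slurmd_c(text: str) -> dict[str, str]:
    """Collect KEY=VALUE tokens from `slurmd -C` output.

    Words without '=' (such as the interleaved "slurmd: Considering each
    NUMA node as a socket" message) are ignored.
    """
    return dict(re.findall(r"(\w+)=(\S+)", text))


def diff_hardware(actual: dict[str, str],
                  reference: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {key: (actual, reference)} for every key whose values differ."""
    skip = {"NodeName"}  # the node names are expected to differ
    return {
        key: (actual.get(key, "?"), ref_value)
        for key, ref_value in reference.items()
        if key not in skip and actual.get(key) != ref_value
    }


# Outputs copied from the two `slurmd -C` runs in this ticket.
b008 = parse_slurmd_c(
    "NodeName=b008 CPUs=20 Boards=1 SocketsPerBoard=1 CoresPerSocket=20 "
    "ThreadsPerCore=1 RealMemory=31469"
)
b007 = parse_slurmd_c(
    "NodeName=b007 slurmd: Considering each NUMA node as a socket\n"
    "CPUs=40 Boards=1 SocketsPerBoard=4 CoresPerSocket=10 "
    "ThreadsPerCore=1 RealMemory=772298"
)

for key, (got, expected) in diff_hardware(b008, b007).items():
    print(f"{key}: b008 reports {got}, b007 reports {expected}")
```

The values are compared as strings on purpose, so the sketch makes no assumption about which differences Slurm treats as invalid.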

In the current state, Slurm assigns a node state INVALID_REG, see these outputs:

$ sinfo -N -n b008
NODELIST   NODES  PARTITION STATE 
b008           1     xeon40 inval 
b008           1 xeon40_768 inval 

$ scontrol show node b008
NodeName=b008 Arch=x86_64 CoresPerSocket=20 
   CPUAlloc=0 CPUTot=40 CPULoad=0.01
   AvailableFeatures=xeon6148v5,opa,xeon40
   ActiveFeatures=xeon6148v5,opa,xeon40
   Gres=(null)
   NodeAddr=b008 NodeHostName=b008 Version=21.08.6
   OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022 
   RealMemory=768000 AllocMem=0 FreeMem=30403 Sockets=2 Boards=1
   State=DOWN+INVALID_REG ThreadsPerCore=1 TmpDisk=140000 Weight=10735 Owner=N/A MCS_label=N/A
   Partitions=xeon40,xeon40_768 
   BootTime=2022-03-21T15:14:28 SlurmdStartTime=2022-03-21T20:44:37
   LastBusyTime=2022-03-21T10:56:56
   CfgTRES=cpu=40,mem=750G,billing=66
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=49 AveWatts=41
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Testing [root@2022-03-21T21:05:01]
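
For monitoring scripts it may be useful to pick the flag out of the `State=` field shown above. The following is a rough sketch, assuming the base state and its flags are joined with `+` exactly as in `State=DOWN+INVALID_REG`; the function name `node_state` is hypothetical, and other decorations that may appear in the state string are not handled.

```python
import re


def node_state(scontrol_output: str) -> tuple[str, set[str]]:
    """Split the State= field of `scontrol show node` into (base, flags)."""
    match = re.search(r"\bState=(\S+)", scontrol_output)
    if match is None:
        raise ValueError("no State= field found")
    # Assumption: the first '+'-separated item is the base state,
    # everything after it is a flag.
    base, *flags = match.group(1).split("+")
    return base, set(flags)


# Line copied from the `scontrol show node b008` output in this ticket.
sample = (
    "   State=DOWN+INVALID_REG ThreadsPerCore=1 TmpDisk=140000 "
    "Weight=10735 Owner=N/A MCS_label=N/A"
)

base, flags = node_state(sample)
print(base, sorted(flags))
if "INVALID_REG" in flags:
    print("node registered with an invalid configuration")
```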

In the scontrol man-page there is a really terse explanation:

* INVALID_REG The node did not register correctly with the controller.

The sinfo man-page is a little more detailed:

* INVAL The node registered with an invalid configuration. The node will clear from this state with a valid registration (ie. a slurmd restart is required).

Request: 

1. Please define the meaning of "invalid configuration" and "valid registration". In the present case the cause is related to processors and memory, but there may be other causes, such as the one shown in Bug 13668.

2. Please bring the scontrol man-page in sync with the sinfo man-page.

Thanks,
Ole
Comment 2 Ben Roberts 2022-03-22 13:41:58 MDT
Hi Ole,

Thanks for bringing this to our attention. I've put together a patch that expands the definition of an invalid registration in the scontrol and sinfo man pages. I'll let you know as this change progresses.

Thanks,
Ben
Comment 6 Ben Roberts 2022-03-23 14:02:26 MDT
Hi Ole,

An update to the documentation has been checked in to clarify what these states mean.  You can view the commit here:
https://github.com/SchedMD/slurm/commit/2955f497de9a67840ef3eed401f12dfa0b313812

This change will be available in our online docs with the release of 21.08.7.

Thanks,
Ben
Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2022-03-23 14:14:07 MDT
Hi Ben,

(In reply to Ben Roberts from comment #6)
> Hi Ole,
> 
> An update to the documentation has been checked in to clarify what these
> states mean.  You can view the commit here:
> https://github.com/SchedMD/slurm/commit/2955f497de9a67840ef3eed401f12dfa0b313812
> 
> This change will be available in our online docs with the release of 21.08.7.

Thanks a lot, the addition to the man-pages is very clear now.

Best regards,
Ole