Please document in more detail the meaning of a node state INVALID_REG.

Our observations: We have a compute node with hardware errors, and we have removed a processor and DIMMs so that we're down to 1 processor and 1 DIMM:

$ slurmd -C
NodeName=b008 CPUs=20 Boards=1 SocketsPerBoard=1 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=31469

The normal state configured in slurm.conf would be the same as for this node:

$ slurmd -C
NodeName=b007 slurmd: Considering each NUMA node as a socket
CPUs=40 Boards=1 SocketsPerBoard=4 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=772298

In the current state, Slurm assigns a node state INVALID_REG, see these outputs:

$ sinfo -N -n b008
NODELIST   NODES   PARTITION  STATE
b008           1      xeon40  inval
b008           1  xeon40_768  inval

$ scontrol show node b008
NodeName=b008 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=0 CPUTot=40 CPULoad=0.01
   AvailableFeatures=xeon6148v5,opa,xeon40
   ActiveFeatures=xeon6148v5,opa,xeon40
   Gres=(null)
   NodeAddr=b008 NodeHostName=b008 Version=21.08.6
   OS=Linux 3.10.0-1160.59.1.el7.x86_64 #1 SMP Wed Feb 23 16:47:03 UTC 2022
   RealMemory=768000 AllocMem=0 FreeMem=30403 Sockets=2 Boards=1
   State=DOWN+INVALID_REG ThreadsPerCore=1 TmpDisk=140000 Weight=10735 Owner=N/A MCS_label=N/A
   Partitions=xeon40,xeon40_768
   BootTime=2022-03-21T15:14:28 SlurmdStartTime=2022-03-21T20:44:37
   LastBusyTime=2022-03-21T10:56:56
   CfgTRES=cpu=40,mem=750G,billing=66
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=49 AveWatts=41
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Testing [root@2022-03-21T21:05:01]

In the scontrol man-page there is a really terse explanation:

* INVALID_REG
  The node did not register correctly with the controller.

The sinfo man-page is a little more detailed:

* INVAL
  The node registered with an invalid configuration. The node will clear from this state with a valid registration (ie. a slurmd restart is required).

Request:

1. Please define the meaning of "invalid configuration" and "valid registration". In the present case it would be related to processors and memory, but there may be other reasons such as shown in Bug 13668.

2. Please make the scontrol man-page in sync with the sinfo man-page.

Thanks,
Ole
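The mismatch described above (what `slurmd -C` detects on b008 versus what the controller has configured) can be made explicit with a small comparison script. This is only a sketch for illustration, not Slurm's registration logic: the helper names `parse_kv` and `find_shortfalls` are hypothetical, the rule "detected value lower than configured value" is an assumption about what makes a registration invalid, and treating `Sockets` in the `scontrol` output as `Boards * SocketsPerBoard` is likewise an assumption.

```python
import re

# Pairs of (field in `slurmd -C` output, field in `scontrol show node` output).
# The pairing is an assumption made for this sketch.
FIELD_MAP = [
    ("CPUs", "CPUTot"),
    ("Boards", "Boards"),
    ("CoresPerSocket", "CoresPerSocket"),
    ("ThreadsPerCore", "ThreadsPerCore"),
    ("RealMemory", "RealMemory"),
]


def parse_kv(text):
    """Collect KEY=VALUE tokens from whitespace-separated Slurm output."""
    return dict(re.findall(r"(\w+)=(\S+)", text))


def find_shortfalls(detected_text, configured_text):
    """Return {field: (detected, configured)} where detected < configured."""
    detected = parse_kv(detected_text)
    configured = parse_kv(configured_text)
    shortfalls = {}
    for det_key, cfg_key in FIELD_MAP:
        if det_key in detected and cfg_key in configured:
            det, cfg = int(detected[det_key]), int(configured[cfg_key])
            if det < cfg:
                shortfalls[det_key] = (det, cfg)
    # Assumption: the configured Sockets value is the node-wide socket count.
    if {"Boards", "SocketsPerBoard"} <= detected.keys() and "Sockets" in configured:
        det = int(detected["Boards"]) * int(detected["SocketsPerBoard"])
        cfg = int(configured["Sockets"])
        if det < cfg:
            shortfalls["Sockets"] = (det, cfg)
    return shortfalls


if __name__ == "__main__":
    # Values copied from the outputs for b008 shown in this report.
    slurmd_c = ("NodeName=b008 CPUs=20 Boards=1 SocketsPerBoard=1 "
                "CoresPerSocket=20 ThreadsPerCore=1 RealMemory=31469")
    scontrol = ("NodeName=b008 CoresPerSocket=20 CPUTot=40 RealMemory=768000 "
                "Sockets=2 Boards=1 ThreadsPerCore=1")
    for field, (det, cfg) in sorted(find_shortfalls(slurmd_c, scontrol).items()):
        print(f"{field}: detected {det}, configured {cfg}")
```

For b008 this flags the CPU count, socket count and memory, which matches the hardware that was removed. Other causes of an invalid registration, such as the one referenced in Bug 13668, are outside what this comparison covers.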
Hi Ole,

Thanks for bringing this to our attention. I've put together a patch that expands the definition of an invalid registration in the scontrol and sinfo man pages. I'll let you know when there is progress on this change.

Thanks,
Ben
Hi Ole,

An update to the documentation has been checked in to clarify what these states mean. You can view the commit here:
https://github.com/SchedMD/slurm/commit/2955f497de9a67840ef3eed401f12dfa0b313812

This change will be available in our online docs with the release of 21.08.7.

Thanks,
Ben
Hi Ben,

(In reply to Ben Roberts from comment #6)
> Hi Ole,
>
> An update to the documentation has been checked in to clarify what these
> states mean. You can view the commit here:
> https://github.com/SchedMD/slurm/commit/2955f497de9a67840ef3eed401f12dfa0b313812
>
> This change will be available in our online docs with the release of 21.08.7.

Thanks a lot, the addition to the man-pages is very clear now.

Best regards,
Ole