Ticket 17148

Summary:	Node state shows Invalid and cannot be changed
Product:	Slurm	Reporter:	Arjun <ajg17>
Component:	slurmd	Assignee:	Jason Booth <jbooth>
Status:	RESOLVED FIXED	QA Contact:
Severity:	3 - Medium Impact
Priority:	---
Version:	22.05.0
Hardware:	Linux
OS:	Linux
Site:	Queen's University	Slinky Site:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	22.05.0
Target Release:	---	DevPrio:	---
Attachments:	Slurmctld log

Description Arjun 2023-07-07 08:13:17 MDT

Created attachment 31135
Slurmctld log

Hi, 

I had to fix a GPU driver issue on one of the compute nodes, northstar. I changed the state of the compute node using:

scontrol update NodeName=northstar State=DOWN

Once the driver issue was fixed, I tried to set the state to Resume using:

scontrol update NodeName=northstar State=Resume

It gives the following error:
---------------------------------------
root@aurora:/# scontrol update NodeName=northstar State=Resume
slurm_update error: Invalid node state specified

----------------------------------------
Output of sinfo:
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Aurora       up   infinite      1    mix aurora
External     up   infinite      1    mix aurora
Sasquatch    up   infinite      1    mix sasquatch
Northstar    up   infinite      1  inval northstar
Combined*    up   infinite      2    mix aurora,sasquatch

------------------------------------------

--> slurmd status on the northstar compute node is active

But on the control node (sl-controller) the following errors can be seen:
-----------------------------------------------------
Jul 07 09:41:31 sl-controller slurmctld[2363225]: slurmctld: error: Node northstar appears to have a different slurm.conf than the slurmctld.  This could cause issues with c>
Jul 07 09:41:31 sl-controller slurmctld[2363225]: slurmctld: debug:  Node northstar has low real_memory size (1031861 / 1031873) < 100.00%
Jul 07 09:41:31 sl-controller slurmctld[2363225]: slurmctld: error: _slurm_rpc_node_registration node=northstar: Invalid argument
Jul 07 09:41:32 sl-controller slurmctld[2363225]: slurmctld: debug:  sched/backfill: _attempt_backfill: beginning
Jul 07 09:41:32 sl-controller slurmctld[2363225]: slurmctld: debug:  sched/backfill: _attempt_backfill: no jobs to backfill
---------------------------------------------------------
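
The debug line in the log above carries the two numbers that matter. A minimal Python sketch for pulling them out, assuming the first number is the memory the node reported at registration and the second is the value configured on the controller (an interpretation that agrees with the slurmd -C output in this ticket, but is not confirmed here). The helper name parse_low_memory is made up for illustration.

```python
import re

# Log line copied from the slurmctld output in this ticket.
LOG_LINE = (
    "slurmctld: debug:  Node northstar has low real_memory size "
    "(1031861 / 1031873) < 100.00%"
)

# Hypothetical helper pattern: node name plus both memory figures.
# ASSUMPTION: first figure = reported by the node, second = configured.
_PATTERN = re.compile(
    r"Node (?P<node>\S+) has low real_memory size "
    r"\((?P<reported>\d+) / (?P<configured>\d+)\)"
)


def parse_low_memory(line):
    """Return (node, reported_mb, configured_mb), or None if the line does not match."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    return match["node"], int(match["reported"]), int(match["configured"])


node, reported, configured = parse_low_memory(LOG_LINE)
shortfall = configured - reported
print(f"{node}: reported {reported} MB, configured {configured} MB, short by {shortfall} MB")
```

Under that reading, the node registers with 12 MB less than the controller expects.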

slurmd -C output on northstar:
--------------------------------------------------------------------------
NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031861
--------------------------------------------------------------------------


Can you please help me bring the compute node back up?


Thanks,
Arjun

Comment 1 Jason Booth 2023-07-07 10:36:32 MDT

> NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031861


Please verify that "RealMemory=1031861" matches the RealMemory setting for this 
node in your slurm.conf. The slurm.conf should also be in sync across the cluster, 
so the compute node and controller should both show "RealMemory=1031861" for this node.
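
One way to do that check is to compare the line printed by slurmd -C with the NodeName line in slurm.conf, key by key. A rough Python sketch follows. The slurm.conf line in it is hypothetical (the ticket does not show the real one); its RealMemory=1031873 is taken from the second figure in the slurmctld debug message. The helper names are made up for illustration.

```python
# Output of `slurmd -C` on northstar, copied from the ticket.
SLURMD_C = (
    "NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 "
    "CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031861"
)

# ASSUMPTION: hypothetical slurm.conf entry; the real one is not shown in the ticket.
# RealMemory=1031873 is the second figure in the slurmctld debug message.
SLURM_CONF_LINE = (
    "NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 "
    "CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031873"
)


def parse_node_line(line):
    """Split a 'Key=Value Key=Value ...' node definition into a dict of strings."""
    return dict(token.split("=", 1) for token in line.split())


def mismatched_keys(detected, configured):
    """Return {key: (detected, configured)} for keys present in both lines that differ."""
    return {
        key: (detected[key], configured[key])
        for key in detected
        if key in configured and detected[key] != configured[key]
    }


diff = mismatched_keys(parse_node_line(SLURMD_C), parse_node_line(SLURM_CONF_LINE))
for key, (seen, conf) in diff.items():
    print(f"{key}: slurmd -C reports {seen}, slurm.conf has {conf}")
```

With these inputs only RealMemory differs. Running the same comparison against the slurm.conf on the controller and on the node would also show whether the two copies agree with each other.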

Comment 2 Jason Booth 2023-07-11 09:36:28 MDT

Arjun, was my comment 1 helpful, and were you able to correct the RealMemory in your configuration file?

Comment 3 Arjun 2023-07-11 11:01:08 MDT

Hi Jason,

Thanks for the suggestion. It worked.

You can close this ticket. 

Thanks,
Arjun