Ticket 17148

Summary: Node state shows Invalid and cannot be changed
Product: Slurm Reporter: Arjun <ajg17>
Component: slurmd Assignee: Jason Booth <jbooth>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: 22.05.0   
Hardware: Linux   
OS: Linux   
Site: Queen's University
Version Fixed: 22.05.0
Attachments: Slurmctld log

Description Arjun 2023-07-07 08:13:17 MDT
Created attachment 31135
Slurmctld log

Hi, 

I had to fix a GPU driver issue on one of our compute nodes, northstar. I set the node's state to DOWN using:

scontrol update NodeName=northstar State=DOWN

Once the driver issue was fixed, I tried to set the state to RESUME using:

scontrol update NodeName=northstar State=Resume

It gives the following error:
---------------------------------------
root@aurora:/# scontrol update NodeName=northstar State=Resume
slurm_update error: Invalid node state specified

----------------------------------------
Output of sinfo:
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Aurora       up   infinite      1    mix aurora
External     up   infinite      1    mix aurora
Sasquatch    up   infinite      1    mix sasquatch
Northstar    up   infinite      1  inval northstar
Combined*    up   infinite      2    mix aurora,sasquatch

------------------------------------------

--> The slurmd service on the northstar compute node is active.

But on the control node (sl-controller), the following errors can be seen:
-----------------------------------------------------
Jul 07 09:41:31 sl-controller slurmctld[2363225]: slurmctld: error: Node northstar appears to have a different slurm.conf than the slurmctld.  This could cause issues with c>
Jul 07 09:41:31 sl-controller slurmctld[2363225]: slurmctld: debug:  Node northstar has low real_memory size (1031861 / 1031873) < 100.00%
Jul 07 09:41:31 sl-controller slurmctld[2363225]: slurmctld: error: _slurm_rpc_node_registration node=northstar: Invalid argument
Jul 07 09:41:32 sl-controller slurmctld[2363225]: slurmctld: debug:  sched/backfill: _attempt_backfill: beginning
Jul 07 09:41:32 sl-controller slurmctld[2363225]: slurmctld: debug:  sched/backfill: _attempt_backfill: no jobs to backfill
---------------------------------------------------------

Output of slurmd -C on northstar:
--------------------------------------------------------------------------
NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031861
--------------------------------------------------------------------------


Can you please help me bring the compute node back up?


Thanks,
Arjun
Comment 1 Jason Booth 2023-07-07 10:36:32 MDT
> NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031861


Please verify that "RealMemory=1031861" matches the RealMemory value configured
for this node in your slurm.conf. The slurm.conf should also be in sync across the
cluster, so the compute node and the controller should both show
"RealMemory=1031861" for this node.
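
For reference, the check described above can be sketched in a few lines of Python. This is only an illustration, not a Slurm tool: the helper names real_memory and memory_shortfall are made up for this example, and the slurm.conf entry shown is hypothetical. Only its RealMemory value (1031873) comes from the "low real_memory size (1031861 / 1031873)" message in the slurmctld log, on the assumption that the second number in that message is the configured value.

```python
import re


def real_memory(node_line: str) -> int:
    """Return the RealMemory value (in MB) from a NodeName=... line."""
    match = re.search(r"\bRealMemory=(\d+)", node_line)
    if match is None:
        raise ValueError("no RealMemory field in line")
    return int(match.group(1))


def memory_shortfall(detected_line: str, configured_line: str) -> int:
    """Configured minus detected RealMemory; positive means the node reports too little."""
    return real_memory(configured_line) - real_memory(detected_line)


# Output of `slurmd -C` on northstar, as quoted in this ticket.
detected = (
    "NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 "
    "CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031861"
)

# Hypothetical slurm.conf entry (assumption); only the RealMemory value
# is taken from the slurmctld log in this ticket.
configured = "NodeName=northstar CPUs=256 RealMemory=1031873"

shortfall = memory_shortfall(detected, configured)
if shortfall > 0:
    print(f"slurm.conf RealMemory exceeds the detected value by {shortfall} MB")
```

If the shortfall is positive, lowering RealMemory in slurm.conf to the value reported by slurmd -C (or below it) and distributing the same file to every node should remove the mismatch.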
Comment 2 Jason Booth 2023-07-11 09:36:28 MDT
Arjun, was my comment 1 helpful, and were you able to correct the RealMemory in your configuration file?
Comment 3 Arjun 2023-07-11 11:01:08 MDT
Hi Jason,

Thanks for the suggestion. It worked.

You can close this ticket. 

Thanks,
Arjun