Ticket 17148 - Node state shows Invalid and cannot be changed
Summary: Node state shows Invalid and cannot be changed
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd (show other tickets)
Version: 22.05.0
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Jason Booth
 
Reported: 2023-07-07 08:13 MDT by Arjun
Modified: 2023-07-11 11:01 MDT (History)
Site: Queen's University
Version Fixed: 22.05.0


Attachments
Slurmctld log (7.03 KB, text/plain) - 2023-07-07 08:13 MDT, Arjun

Description Arjun 2023-07-07 08:13:17 MDT
Created attachment 31135
Slurmctld log

Hi, 

I had to fix a GPU driver issue on one of the Northstar compute nodes. I set the node down using: 

scontrol update NodeName=northstar State=DOWN 

Once the driver issue was fixed, I tried to resume the node using: 

scontrol update NodeName=northstar State=RESUME 

It gives the following error: 
---------------------------------------
root@aurora:/# scontrol update NodeName=northstar State=Resume
slurm_update error: Invalid node state specified

----------------------------------------
Output of sinfo: 
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Aurora       up   infinite      1    mix aurora
External     up   infinite      1    mix aurora
Sasquatch    up   infinite      1    mix sasquatch
Northstar    up   infinite      1  inval northstar
Combined*    up   infinite      2    mix aurora,sasquatch

------------------------------------------

--> The slurmd service on the Northstar compute node is active.

But on the control node (sl-controller) the following errors can be seen: 
-----------------------------------------------------
Jul 07 09:41:31 sl-controller slurmctld[2363225]: slurmctld: error: Node northstar appears to have a different slurm.conf than the slurmctld.  This could cause issues with c>
Jul 07 09:41:31 sl-controller slurmctld[2363225]: slurmctld: debug:  Node northstar has low real_memory size (1031861 / 1031873) < 100.00%
Jul 07 09:41:31 sl-controller slurmctld[2363225]: slurmctld: error: _slurm_rpc_node_registration node=northstar: Invalid argument
Jul 07 09:41:32 sl-controller slurmctld[2363225]: slurmctld: debug:  sched/backfill: _attempt_backfill: beginning
Jul 07 09:41:32 sl-controller slurmctld[2363225]: slurmctld: debug:  sched/backfill: _attempt_backfill: no jobs to backfill
---------------------------------------------------------

Northstar slurmd -C output:
--------------------------------------------------------------------------
NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031861
--------------------------------------------------------------------------
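The slurmctld log above already points at the root cause: slurmd reports RealMemory=1031861, while the registration check compares it against 1031873 (presumably the value in slurm.conf). A minimal Python sketch of that comparison, using the sample lines from this ticket (the parsing is illustrative, not Slurm's own code):

```python
import re

def parse_kv(line):
    """Parse a 'Key=Value Key=Value ...' line (slurmd -C / slurm.conf NodeName style)."""
    return dict(re.findall(r"(\w+)=(\S+)", line))

# What the node itself reports (slurmd -C output from this ticket):
slurmd_c = ("NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 "
            "CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031861")
# What slurm.conf apparently carried (the 1031873 figure is from the slurmctld log):
slurm_conf = ("NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 "
              "CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031873")

reported = int(parse_kv(slurmd_c)["RealMemory"])
configured = int(parse_kv(slurm_conf)["RealMemory"])

if reported < configured:
    # slurmctld rejects the node registration, which is why sinfo shows "inval"
    # and scontrol refuses the state change.
    print(f"mismatch: node reports {reported} MB but slurm.conf expects {configured} MB")
```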


Can you please help me bring up the compute node? 


Thanks,
Arjun
Comment 1 Jason Booth 2023-07-07 10:36:32 MDT
> NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031861


Please verify that "RealMemory=1031861" matches the RealMemory setting for this 
node in your slurm.conf. The slurm.conf should also be in sync across the cluster, 
so the compute node and the controller should both show "RealMemory=1031861" for this node.
Comment 2 Jason Booth 2023-07-11 09:36:28 MDT
Arjun, was my comment #1 helpful, and were you able to correct the RealMemory in your configuration file?
Comment 3 Arjun 2023-07-11 11:01:08 MDT
Hi Jason,

Thanks for the suggestion. It worked. 

You can close this ticket. 

Thanks,
Arjun