Created attachment 31135
Slurmctld log

Hi,

I had to fix a GPU driver issue on one of the compute nodes (northstar). I changed the state of the compute node using:

scontrol update NodeName=northstar State=DOWN

Once the driver issue was fixed, I tried to set the node back to Resume using:

scontrol update NodeName=northstar State=Resume

It gives the following error:

---------------------------------------
root@aurora:/# scontrol update NodeName=northstar State=Resume
slurm_update error: Invalid node state specified
----------------------------------------

Output of sinfo:

PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
Aurora     up     infinite   1      mix    aurora
External   up     infinite   1      mix    aurora
Sasquatch  up     infinite   1      mix    sasquatch
Northstar  up     infinite   1      inval  northstar
Combined*  up     infinite   2      mix    aurora,sasquatch
------------------------------------------

--> The slurmd status on the northstar compute node is active.

But on the control node (sl-controller) the following errors can be seen:

-----------------------------------------------------
Jul 07 09:41:31 sl-controller slurmctld[2363225]: slurmctld: error: Node northstar appears to have a different slurm.conf than the slurmctld. This could cause issues with c>
Jul 07 09:41:31 sl-controller slurmctld[2363225]: slurmctld: debug: Node northstar has low real_memory size (1031861 / 1031873) < 100.00%
Jul 07 09:41:31 sl-controller slurmctld[2363225]: slurmctld: error: _slurm_rpc_node_registration node=northstar: Invalid argument
Jul 07 09:41:32 sl-controller slurmctld[2363225]: slurmctld: debug: sched/backfill: _attempt_backfill: beginning
Jul 07 09:41:32 sl-controller slurmctld[2363225]: slurmctld: debug: sched/backfill: _attempt_backfill: no jobs to backfill
---------------------------------------------------------

Output of slurmd -C on northstar:

--------------------------------------------------------------------------
NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031861
--------------------------------------------------------------------------

Can you please help me bring the compute node back up?

Thanks,
Arjun
> NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031861

Please verify that "RealMemory=1031861" matches the RealMemory setting for this node in your slurm.conf. The slurmctld log line "low real_memory size (1031861 / 1031873)" suggests the configured value is currently higher than what the node reports. The slurm.conf should also be in sync across the cluster, so both the compute node and the controller should show "RealMemory=1031861" for this node.
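The check described above can be sketched as a small shell script. This is only an illustration, not part of Slurm: the helper names `real_memory` and `memory_ok` are hypothetical, and the `conf_line` entry is an assumed slurm.conf line whose RealMemory value is inferred from the "1031861 / 1031873" log message rather than taken from the actual configuration file.

```shell
#!/bin/sh
# Sketch: compare the RealMemory a node reports with the value in slurm.conf.

# Print the RealMemory value found in a node definition line (empty if absent).
real_memory() {
    printf '%s\n' "$1" | sed -n 's/.*RealMemory=\([0-9][0-9]*\).*/\1/p'
}

# Return 0 when the configured value does not exceed what the node reports.
memory_ok() {
    reported=$(real_memory "$1")
    configured=$(real_memory "$2")
    [ -n "$reported" ] && [ -n "$configured" ] && [ "$configured" -le "$reported" ]
}

# Output of `slurmd -C` on northstar, copied from this ticket.
slurmd_line='NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031861'
# Assumed slurm.conf entry; only the RealMemory value is inferred from the log.
conf_line='NodeName=northstar RealMemory=1031873'

if memory_ok "$slurmd_line" "$conf_line"; then
    echo "RealMemory OK"
else
    echo "RealMemory mismatch: node reports $(real_memory "$slurmd_line"), slurm.conf has $(real_memory "$conf_line")"
fi
```

If the values differ, the usual follow-up (assuming a standard setup) is to correct RealMemory in slurm.conf, copy the file to every node, apply it with `scontrol reconfigure` or by restarting the daemons, and then retry `scontrol update NodeName=northstar State=RESUME`.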
Arjun, was my comment #1 helpful, and were you able to correct the RealMemory in your configuration file?
Hi Jason,

Thanks for the suggestion. It worked. You can close this ticket.

Thanks,
Arjun