| Summary: | Node state states Invalid and does not allow to change | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Arjun <ajg17> |
| Component: | slurmd | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 22.05.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Queen's University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | 22.05.0 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Slurmctld log | ||
> NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031861
Please verify that "RealMemory=1031861" matches the RealMemory value set for this
node in your slurm.conf. The slurm.conf should also be in sync across the cluster,
so both the compute node and the controller should show "RealMemory=1031861" for this node.
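
The check described above can be sketched in a few lines of Python. This is only an illustration, not part of Slurm: the helper names `parse_node_line` and `real_memory_mismatch` are hypothetical, and the controller-side line is an assumption (it reuses the `slurmd -C` output with the larger value that appears in the slurmctld log).

```python
def parse_node_line(line: str) -> dict:
    """Split a 'Key=Value Key=Value ...' node definition into a dict of strings."""
    return dict(token.split("=", 1) for token in line.split() if "=" in token)


def real_memory_mismatch(conf_line: str, detected_line: str):
    """Return (configured, detected) if RealMemory differs, else None."""
    configured = int(parse_node_line(conf_line)["RealMemory"])
    detected = int(parse_node_line(detected_line)["RealMemory"])
    return None if configured == detected else (configured, detected)


# Output of `slurmd -C` on the compute node, as quoted in this ticket.
detected_line = (
    "NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 "
    "CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031861"
)

# ASSUMPTION: what the controller's slurm.conf may have contained, inferred
# from the "(1031861 / 1031873)" pair in the slurmctld log.
controller_line = detected_line.replace("RealMemory=1031861", "RealMemory=1031873")

if __name__ == "__main__":
    print(real_memory_mismatch(controller_line, detected_line))
```

If the function returns a pair rather than `None`, the slurm.conf entry and the hardware-detected value disagree and the configuration needs to be corrected and redistributed.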
Arjun, was my comment #1 helpful, and were you able to correct the RealMemory in your configuration file?

Hi Jason,

Thanks for the suggestion. It worked. You can close this ticket.

Thanks,
Arjun
Created attachment 31135
Slurmctld log

Hi,

I had to fix a GPU driver issue on one of the compute nodes, northstar. I changed the state of the compute node using:

scontrol update NodeName=northstar State=DOWN

Once the driver issue was fixed, I tried to return the node to service using:

scontrol update NodeName=northstar State=Resume

It gives the following error:

---------------------------------------
root@aurora:/# scontrol update NodeName=northstar State=Resume
slurm_update error: Invalid node state specified
----------------------------------------

Result of sinfo:

PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
Aurora     up     infinite   1      mix    aurora
External   up     infinite   1      mix    aurora
Sasquatch  up     infinite   1      mix    sasquatch
Northstar  up     infinite   1      inval  northstar
Combined*  up     infinite   2      mix    aurora,sasquatch

------------------------------------------

--> The slurmd status on the northstar compute node is active.

But on the control node (sl-controller) the following errors can be seen:

-----------------------------------------------------
Jul 07 09:41:31 sl-controller slurmctld[2363225]: slurmctld: error: Node northstar appears to have a different slurm.conf than the slurmctld. This could cause issues with c>
Jul 07 09:41:31 sl-controller slurmctld[2363225]: slurmctld: debug: Node northstar has low real_memory size (1031861 / 1031873) < 100.00%
Jul 07 09:41:31 sl-controller slurmctld[2363225]: slurmctld: error: _slurm_rpc_node_registration node=northstar: Invalid argument
Jul 07 09:41:32 sl-controller slurmctld[2363225]: slurmctld: debug: sched/backfill: _attempt_backfill: beginning
Jul 07 09:41:32 sl-controller slurmctld[2363225]: slurmctld: debug: sched/backfill: _attempt_backfill: no jobs to backfill
---------------------------------------------------------

Output of slurmd -C on northstar:

--------------------------------------------------------------------------
NodeName=northstar CPUs=256 Boards=1 SocketsPerBoard=2 CoresPerSocket=64 ThreadsPerCore=2 RealMemory=1031861
--------------------------------------------------------------------------

Can you please help me bring up the compute node?

Thanks,
Arjun
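
The "low real_memory size" line in the controller log already contains both numbers involved in the mismatch. As a rough sketch, the Python snippet below pulls them out of such a line. The function name `parse_low_memory` and the field labels are hypothetical. Reading the first number as the value reported by the node and the second as the value configured on the controller is an assumption, based on the first number matching the `slurmd -C` output in this ticket.

```python
import re

# Matches e.g. "Node northstar has low real_memory size (1031861 / 1031873)".
# ASSUMPTION: first number = reported by the node, second = configured value.
LOW_MEM = re.compile(
    r"Node (?P<node>\S+) has low real_memory size "
    r"\((?P<reported>\d+) / (?P<configured>\d+)\)"
)


def parse_low_memory(line: str):
    """Return (node, reported, configured), or None if the line does not match."""
    match = LOW_MEM.search(line)
    if match is None:
        return None
    return match["node"], int(match["reported"]), int(match["configured"])


# Log line copied from this ticket.
log_line = (
    "Jul 07 09:41:31 sl-controller slurmctld[2363225]: slurmctld: debug: "
    "Node northstar has low real_memory size (1031861 / 1031873) < 100.00%"
)

if __name__ == "__main__":
    node, reported, configured = parse_low_memory(log_line)
    print(f"{node}: reported={reported} configured={configured} "
          f"difference={configured - reported}")
```

Under that reading, the node reports less memory than the controller's configuration expects, which is consistent with the registration being rejected.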