Hello,

Last night our Slurm packages were updated via EPEL, without our knowledge, from our current version 19.05.6 to 20.11.2. We only found out this morning. Slurm was updated on the DB node and head node of one of our clusters, as well as on some client nodes. The compute nodes were not updated. I can see that slurmctld and slurmdbd were unable to start after the update. I downgraded immediately when I saw this because Slurm was not working, and everything seems to be operational at the moment. However, I still see this error message: "error: _rm_usage_node_bitmap: grp_node_bitmap is NULL"

Is there anything in particular that needs to be checked because of this incident? It would be good to know as soon as possible.

Best regards,
Hjalti Sveinsson
It seems that "error: _rm_usage_node_bitmap: grp_node_bitmap is NULL" went away after "systemctl restart slurmctld". However, we now have a lot of nodes with zombie processes. We are draining them at the moment and will then reboot and resume them.
Hjalti - The automated update (19.05.6 to 20.11.2) is unfortunate. Although we support upgrading, there is no graceful way to downgrade the cluster without losing jobs.

Can you let us know the current status of the cluster? Which services are running, and did rebooting your nodes clear their bad state?

For future updates you may want to consider blacklisting packages, and upgrading from source or by setting up your own repository with a higher priority.

> "error: _rm_usage_node_bitmap: grp_node_bitmap is NULL"

This appears to be fixed in more recent versions of Slurm.

commit aafe360ff097379b6f61613654692848409d9749
Author: Albert Gil <albert.gil@schedmd.com>
Commit: Danny Auble <da@schedmd.com>

    Fix grp_node_bitmap error when slurmctld started before slurmdbd

    Fix _addto_used_info() to also update the grp_node_bitmap and
    grp_node_job_cnt.

    Bug 8067

    Co-authored-by: Felip Moll <felip.moll@schedmd.com>
    Co-authored-by: Danny Auble <da@schedmd.com>

We have also improved the related code for 20.11 onward:

commit 025e79f6e7ea1796b2b811fd3abe0ebdaef307d9
Author: Danny Auble <da@schedmd.com>
Commit: Danny Auble <da@schedmd.com>

    Merge like code created in commit aafe360ff0973

    Bug 8067

We strongly suggest that you plan an upgrade to 20.11.3. 19.05 has reached its end of life for support and has several issues that have been fixed in later versions of Slurm.
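For reference, the package blacklisting mentioned above could look roughly like the following on a yum-based system. This is only a sketch: it assumes that the EPEL repository is defined in /etc/yum.repos.d/epel.repo under an [epel] section (the usual default, but please check your own files), and that every package you want to hold back matches the glob slurm*.

```ini
# /etc/yum.repos.d/epel.repo (assumed path and section name)
[epel]
# ... existing settings for this repository stay unchanged ...
# Never install or update Slurm packages from this repository.
exclude=slurm*
```

With a line like this in place, yum should skip EPEL's Slurm builds while the rest of EPEL keeps updating normally. The exclusion would need to be present on every node that has EPEL enabled.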
It has been a few days since you logged this issue. I am downgrading it to a sev3. Please let me know if you still need this at a high severity. If so, please let me know the current status of the cluster.
I am timing this bug out since I have not heard from you in some time. You are welcome to re-open this bug at any time by replying to it.