Ticket 10759

Summary: Slurm was automatically updated by mistake
Product: Slurm Reporter: Hjalti Sveinsson <hjalti.sveinsson>
Component: slurmdbd Assignee: Jason Booth <jbooth>
Status: RESOLVED TIMEDOUT QA Contact:
Severity: 3 - Medium Impact    
Priority: ---    
Version: - Unsupported Older Versions   
Hardware: Linux   
OS: Linux   
Site: deCODE Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Hjalti Sveinsson 2021-02-02 02:33:20 MST
Hello,

Last night our Slurm packages were updated via EPEL without our knowledge, from our current version 19.05.6 to 20.11.2. We noticed this morning.
Slurm had been updated on the DB node and head node of one of our clusters, as well as on some client nodes. Compute nodes were not updated.

I can see that slurmctld and slurmdbd were unable to start after the update. 

I downgraded immediately when I saw this because Slurm was not working; everything seems to be operational at the moment.

I however see these error messages: "error: _rm_usage_node_bitmap: grp_node_bitmap is NULL"

Is there anything that needs to be checked specifically because of this incident? It would be good to know as soon as possible.

best regards,
Hjalti Sveinsson
Comment 1 Hjalti Sveinsson 2021-02-02 08:03:09 MST
It seems that "error: _rm_usage_node_bitmap: grp_node_bitmap is NULL" went away after "systemctl restart slurmctld".

However, we have a lot of nodes with zombie processes after this. We are draining them at the moment and will then reboot and resume them.
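The drain, reboot, resume cycle described above can be sketched with scontrol. The node names below are placeholders, and the Reason string is just illustrative:

```shell
# Sketch: mark the affected nodes so no new jobs land on them
# (node[01-10] is a hypothetical node range).
scontrol update NodeName=node[01-10] State=DRAIN Reason="zombie processes after upgrade"

# ...once the nodes are drained and rebooted, return them to service:
scontrol update NodeName=node[01-10] State=RESUME
```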
Comment 2 Jason Booth 2021-02-02 09:17:08 MST
Hjalti - The automated update (19.05.6 to 20.11.2) is unfortunate. Although we support upgrading, there is no graceful way to downgrade the cluster without losing jobs. Can you let us know the current status of the cluster? Which services are running, and did rebooting your nodes clear their bad state?

For future updates you may want to consider blacklisting the Slurm packages, building from source, or setting up your own repository with a higher priority.
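One way to implement the blacklisting on a RHEL/EPEL system is an `exclude` line in the repo definition. This is only a sketch; the repo file path and the availability of the versionlock plugin depend on your distribution:

```shell
# Keep yum/dnf from ever touching slurm* packages from EPEL
# (assumes the EPEL repo file lives at this path).
echo 'exclude=slurm*' >> /etc/yum.repos.d/epel.repo

# Alternatively, pin the currently installed versions with the
# dnf versionlock plugin:
#   dnf install python3-dnf-plugin-versionlock
#   dnf versionlock add 'slurm*'
```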


> "error: _rm_usage_node_bitmap: grp_node_bitmap is NULL"

This appears to be fixed in more recent versions of Slurm.

commit aafe360ff097379b6f61613654692848409d9749
Author: Albert Gil <albert.gil@schedmd.com>
Commit: Danny Auble <da@schedmd.com>

    Fix grp_node_bitmap error when slurmctld started before slurmdbd
    
    Fix _addto_used_info() to also update the grp_node_bitmap and
    grp_node_job_cnt.
    
    Bug 8067
    
    Co-authored-by: Felip Moll <felip.moll@schedmd.com>
    Co-authored-by: Danny Auble <da@schedmd.com>

Also we have improved the related code for the future 20.11 on:

commit 025e79f6e7ea1796b2b811fd3abe0ebdaef307d9
Author: Danny Auble <da@schedmd.com>
Commit: Danny Auble <da@schedmd.com>

    Merge like code created in commit aafe360ff0973
    
    Bug 8067


We highly suggest that you plan an upgrade to 20.11.3. 19.05 has reached its end of life for support, and several issues present in it have been fixed in later versions of Slurm.
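For reference, the supported upgrade order is accounting daemon first, then controller, then compute daemons. A rough sketch, assuming systemd and dnf, with site-specific package names and paths:

```shell
# 0. Back up the StateSaveLocation and the accounting database first
#    (paths and DB name below are site-specific examples).
mysqldump slurm_acct_db > slurm_acct_db.sql

# 1. Upgrade slurmdbd on the DB node, then restart it:
systemctl stop slurmdbd
dnf upgrade slurm-slurmdbd        # package name varies by build
systemctl start slurmdbd

# 2. Then slurmctld on the head node:
systemctl stop slurmctld
dnf upgrade slurm slurm-slurmctld
systemctl start slurmctld

# 3. Finally upgrade and restart slurmd on the compute nodes.
```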
Comment 3 Jason Booth 2021-02-04 15:20:26 MST
It has been a few days since you logged this issue, so I am downgrading it to a sev3. Please let me know if you still need this at a high severity; if so, please include the current status of the cluster.
Comment 4 Jason Booth 2021-02-25 11:44:24 MST
I am timing this bug out since I have not heard from you in some time. You are welcome to re-open this bug at any time by replying to it.