| Summary: | Slurm was automatically updated by error | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Hjalti Sveinsson <hjalti.sveinsson> |
| Component: | slurmdbd | Assignee: | Jason Booth <jbooth> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | deCODE | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Hjalti Sveinsson
2021-02-02 02:33:20 MST
It seems that "error: _rm_usage_node_bitmap: grp_node_bitmap is NULL" went away after "systemctl restart slurmctld". However, we have a lot of nodes with zombie processes after this. We are draining them at the moment and will then reboot and resume them.

Hjalti - The automated update (19.05.6 to 20.11.2) is unfortunate. Although we support upgrading, there is no graceful way to downgrade the cluster without losing jobs.

Can you let us know the current status of the cluster? Which services are running, and did rebooting your nodes clear their bad state?

For future updates you may want to consider blacklisting packages, and upgrading either from source or by setting up your own repository with a higher priority.

> "error: _rm_usage_node_bitmap: grp_node_bitmap is NULL"

This appears to be fixed in more recent versions of Slurm.

    commit aafe360ff097379b6f61613654692848409d9749
    Author: Albert Gil <albert.gil@schedmd.com>
    Commit: Danny Auble <da@schedmd.com>

    Fix grp_node_bitmap error when slurmctld started before slurmdbd

    Fix _addto_used_info() to also update the grp_node_bitmap and grp_node_job_cnt.

    Bug 8067

    Co-authored-by: Felip Moll <felip.moll@schedmd.com>
    Co-authored-by: Danny Auble <da@schedmd.com>

Also, we have improved the related code for the future 20.11 on:

    commit 025e79f6e7ea1796b2b811fd3abe0ebdaef307d9
    Author: Danny Auble <da@schedmd.com>
    Commit: Danny Auble <da@schedmd.com>

    Merge like code created in commit aafe360ff0973

    But 8067

We strongly suggest that you plan an upgrade to 20.11.3. 19.05 has reached its end of life for support and has several issues that have been fixed in later versions of Slurm.

It has been a few days since you logged this issue. I am downgrading it to a sev3. Please let me know if you still need this at a high severity. If so, please let me know the current status of the cluster.

I am timing this bug out since I have not heard from you in some time. You are welcome to re-open this bug at any time by replying to it.
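As a complement to blacklisting packages, a site could run a small check before any unattended update and refuse to proceed when the candidate Slurm package belongs to a different release series than the installed one. The following is only a minimal sketch under stated assumptions: it treats the first two version components (for example 19.05 or 20.11) as the release series, the function names `release_series` and `is_major_upgrade` are hypothetical, and nothing here is part of Slurm or of any package manager.

```python
from typing import Tuple


def release_series(version: str) -> Tuple[int, int]:
    """Return the release series of a Slurm version string.

    Assumption: the series is the first two dot-separated components,
    so "19.05.6" belongs to series (19, 5) and "20.11.2" to (20, 11).
    """
    parts = version.strip().split(".")
    if len(parts) < 2:
        raise ValueError(f"unexpected version string: {version!r}")
    return int(parts[0]), int(parts[1])


def is_major_upgrade(installed: str, candidate: str) -> bool:
    """True when the candidate is in a different release series.

    A hypothetical update wrapper could use this to block the
    transaction and ask for a manual, planned upgrade instead.
    """
    return release_series(installed) != release_series(candidate)


# The jump reported in this ticket crosses release series and would be blocked,
# while a maintenance update inside 20.11 would be allowed through.
blocked = is_major_upgrade("19.05.6", "20.11.2")
allowed = not is_major_upgrade("20.11.2", "20.11.3")
```

How the installed and candidate versions are obtained depends on the distribution and package manager in use, so that part is deliberately left out.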