| Summary: | "grp_node_bitmap is NULL" errors if slurmctld and slurmdbd is restarted at the same time | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Pär Lindfors <par.lindfors> |
| Component: | slurmctld | Assignee: | Albert Gil <albert.gil> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | felip.moll |
| Version: | 19.05.3 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=8066 | ||
| Site: | SNIC | Slinky Site: | --- |
| SNIC sites: | UPPMAX | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 19.05.7 20.02.2 20.11.0pre1 |
Description
Pär Lindfors
2019-11-07 08:18:25 MST
Hi Pär,

Could you do this?

1) Increase the debug log level of both servers (slurmctld and slurmdbd) to debug2
2) Reproduce the issue
3) Restore the debug log level to your usual setting
4) Attach both slurmctld and slurmdbd logs
5) Also attach the slurmd log for the nodes where the failing jobs run

Thanks,
Albert

Hi Pär,

As Felip and Danny discussed with you at SC19, we think that we have an idea about how to handle this.

I'll let you know,
Albert

Hi Pär,

I just want to give you a quick update and let you know that we have a first version of a patch that fixes the error.

It is not exactly what you discussed at SC19, because the issue is not related to SaveState, which is OK, but to how some info is updated when slurmctld is started without slurmdbd and slurmdbd is started later on. It also does not happen if slurmdbd is restarted while slurmctld keeps running. It only happens when the controller is started while slurmdbd is down and slurmdbd is started afterwards.

The patch still needs to go through our review and testing workflow, but I wanted to update you.

Regards, and Happy New Year! ;-)
Albert

Hi Pär,

I'm glad to let you know that this has been fixed in the following commit and will be released as part of version 19.05.7, as well as 20.02.2:

commit aafe360ff097379b6f61613654692848409d9749
Author: Albert Gil <albert.gil@schedmd.com>
Commit: Danny Auble <da@schedmd.com>

    Fix grp_node_bitmap error when slurmctld started before slurmdbd

    Fix _addto_used_info() to also update the grp_node_bitmap and
    grp_node_job_cnt.

    Bug 8067

    Co-authored-by: Felip Moll <felip.moll@schedmd.com>
    Co-authored-by: Danny Auble <da@schedmd.com>

We have also improved the related code for the future 20.11 in:

commit 025e79f6e7ea1796b2b811fd3abe0ebdaef307d9
Author: Danny Auble <da@schedmd.com>
Commit: Danny Auble <da@schedmd.com>

    Merge like code created in commit aafe360ff0973

    Bug 8067

Thanks,
Albert

Hi Albert,

(In reply to Albert Gil from comment #28)
> I'm glad to let you know that this has been fixed on the following commit
> and will be released as part of the 19.05.7 version, as well as for 20.02.2:
>
> commit aafe360ff097379b6f61613654692848409d9749
> Author: Albert Gil <albert.gil@schedmd.com>
> Commit: Danny Auble <da@schedmd.com>
...
> Co-authored-by: Felip Moll <felip.moll@schedmd.com>
> Co-authored-by: Danny Auble <da@schedmd.com>

Nice team effort. :-)

I just did a quick build of the latest 19.05 branch, including this commit, on my test cluster. With this build I can no longer reproduce the problem, so the fix seems to work.

Thanks,
Pär

I'm glad to hear that. Thanks!
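
For readers who want a picture of what "also update the grp_node_bitmap and grp_node_job_cnt" might look like, here is a minimal sketch in C. It is not the code from commit aafe360ff0973. The `usage_t` struct, the `NODE_CNT` constant, the byte-per-node bitmap, and the `addto_used_info()` function are all hypothetical simplifications, and treating a missing source counter array as one job per node is an assumption made only for the example. The idea it illustrates is that merging one usage record into another should carry over the per-node data as well, creating the destination arrays if they are still NULL.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NODE_CNT 8 /* hypothetical cluster size, for illustration only */

/* Hypothetical, simplified stand-in for a usage record (not Slurm's struct). */
typedef struct {
	uint32_t used_jobs;          /* running jobs charged to this record */
	uint8_t *grp_node_bitmap;    /* 1 byte per node, 1 = node in use; may be NULL */
	uint16_t *grp_node_job_cnt;  /* jobs per node; may be NULL */
} usage_t;

/*
 * Merge src into dst, including the per-node data.
 * Returns 0 on success, -1 if an allocation fails.
 */
static int addto_used_info(usage_t *dst, const usage_t *src)
{
	assert(dst != NULL && src != NULL);

	dst->used_jobs += src->used_jobs;

	/* Nothing per-node to merge. */
	if (!src->grp_node_bitmap)
		return 0;

	/* Create the destination arrays on demand instead of leaving them NULL. */
	if (!dst->grp_node_bitmap) {
		dst->grp_node_bitmap = calloc(NODE_CNT, sizeof(*dst->grp_node_bitmap));
		if (!dst->grp_node_bitmap)
			return -1;
	}
	if (!dst->grp_node_job_cnt) {
		dst->grp_node_job_cnt = calloc(NODE_CNT, sizeof(*dst->grp_node_job_cnt));
		if (!dst->grp_node_job_cnt)
			return -1;
	}

	for (size_t i = 0; i < NODE_CNT; i++) {
		if (!src->grp_node_bitmap[i])
			continue;
		dst->grp_node_bitmap[i] = 1;
		/* Assumption: without source counters, count one job per set node. */
		uint16_t add = src->grp_node_job_cnt ? src->grp_node_job_cnt[i] : 1;
		dst->grp_node_job_cnt[i] = (uint16_t)(dst->grp_node_job_cnt[i] + add);
	}
	return 0;
}
```

In this sketch, merging a child record that uses nodes 1 and 3 into an empty parent leaves the parent with a non-NULL bitmap and matching per-node job counts, which is the state the error message in the summary suggests was missing.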