If slurmctld and slurmdbd are restarted at the same time, race conditions during startup (bug 8066) cause communication problems between the daemons. This causes errors when jobs finish, even after communication is restored. All jobs that were RUNNING during the restart cause "grp_node_bitmap is NULL" errors to be logged when they finish. I have not seen any ill effects other than the log messages. Jobs started after the slurmctld restart do not log errors when finishing. Restarting slurmctld (without restarting slurmdbd at the same time) makes the problem go away.

Tentatively setting this as medium impact, as it is triggered every single time the server running both slurmctld and slurmdbd is rebooted. The bug is also triggered if both services are restarted, manually or automatically, when Slurm RPM packages are upgraded. For us the systemd overrides (see bug 8066) will hopefully prevent this from happening.

Output from slurmctld.log, after restarting slurmdbd and slurmctld at the same time:

[2019-11-07T15:32:49.263] slurmctld version 19.05.3-2 started on cluster rackham
[2019-11-07T15:32:49.351] job_submit.lua: initialized
[2019-11-07T15:32:49.378] error: slurm_persist_conn_open_without_init: failed to open persistent connection to rackham-q:7031: Connection refused
[2019-11-07T15:32:49.378] error: slurmdbd: Sending PersistInit msg: Connection refused
[2019-11-07T15:32:49.378] error: Association database appears down, reading from state file.
[2019-11-07T15:32:49.379] error: slurmdbd: Sending PersistInit msg: Connection refused
...
[2019-11-07T15:32:50.012] Recovered information about 5731 jobs
...
[2019-11-07T15:32:50.091] Registering slurmctld at port 6817 with slurmdbd.
[2019-11-07T15:32:50.091] error: slurmdbd: Sending PersistInit msg: Connection refused
[2019-11-07T15:32:50.092] error: slurmdbd: Sending PersistInit msg: Connection refused
...

Then a successful (?) connection to slurmdbd:

[2019-11-07T15:32:51.906] Registering slurmctld at port 6817 with slurmdbd.
...

Error messages are logged when jobs finish:

[2019-11-07T15:33:13.245] _job_complete: JobId=10571170 WEXITSTATUS 0
[2019-11-07T15:33:13.249] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
[2019-11-07T15:33:13.249] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
[2019-11-07T15:33:13.249] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
[2019-11-07T15:33:13.249] _job_complete: JobId=10571170 done
[2019-11-07T15:33:19.752] _job_complete: JobId=10569593 WEXITSTATUS 0
[2019-11-07T15:33:19.756] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
[2019-11-07T15:33:19.756] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
[2019-11-07T15:33:19.756] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
[2019-11-07T15:33:19.756] _job_complete: JobId=10569593 done
[2019-11-07T15:33:22.361] _job_complete: JobId=10570092 WEXITSTATUS 0
[2019-11-07T15:33:22.364] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
[2019-11-07T15:33:22.364] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
[2019-11-07T15:33:22.364] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
[2019-11-07T15:33:22.365] _job_complete: JobId=10570092 done

Restarting slurmctld (without restarting slurmdbd) makes these errors go away.
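The systemd overrides mentioned above (from bug 8066) are not quoted in this report. As an illustration only, a plausible ordering drop-in for slurmctld could look like the fragment below; the unit name, path, and directives are assumptions based on common Slurm packaging, not taken from this bug:

```ini
# /etc/systemd/system/slurmctld.service.d/override.conf  (hypothetical path)
# Starts slurmdbd first so slurmctld never comes up against a dead dbd.
[Unit]
After=slurmdbd.service
Wants=slurmdbd.service
```

Using Wants= rather than Requires= keeps slurmctld able to start (falling back to its state file) even if slurmdbd fails, which matches the behavior seen in the log above.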
Hi Pär,

Could you do this?

1) Increase the debug log level of both daemons (slurmctld and slurmdbd) to debug2
2) Reproduce the issue
3) Decrease the debug log level back to your usual setting
4) Attach both the slurmctld and slurmdbd logs
5) Also attach the slurmd logs for the nodes where the failing jobs ran

Thanks,
Albert
Hi Pär,

As Felip and Danny discussed with you at SC19, we think we have an idea of how to handle this.

I'll let you know,
Albert
Hi Pär,

I just want to give you a quick update and let you know that we have a first version of a patch that fixes the error. It is not exactly what we discussed at SC19, because the issue is not related to SaveState (that is OK), but to how some information is updated when slurmctld is started without slurmdbd and slurmdbd is started later. It does not happen if slurmdbd is restarted while slurmctld keeps running; it happens only when the controller is started without slurmdbd, and slurmdbd is then started afterwards.

The patch still needs to go further through our review and testing workflow, but I wanted to update you.

Regards, and Happy New Year! ;-)

Albert
Hi Pär,

I'm glad to let you know that this has been fixed in the following commit and will be released as part of version 19.05.7, as well as 20.02.2:

commit aafe360ff097379b6f61613654692848409d9749
Author: Albert Gil <albert.gil@schedmd.com>
Commit: Danny Auble <da@schedmd.com>

    Fix grp_node_bitmap error when slurmctld started before slurmdbd

    Fix _addto_used_info() to also update the grp_node_bitmap and
    grp_node_job_cnt.

    Bug 8067

    Co-authored-by: Felip Moll <felip.moll@schedmd.com>
    Co-authored-by: Danny Auble <da@schedmd.com>

We have also improved the related code for the future 20.11 in:

commit 025e79f6e7ea1796b2b811fd3abe0ebdaef307d9
Author: Danny Auble <da@schedmd.com>
Commit: Danny Auble <da@schedmd.com>

    Merge like code created in commit aafe360ff0973

    But 8067

Thanks,
Albert
Hi Albert,

(In reply to Albert Gil from comment #28)
> I'm glad to let you know that this has been fixed on the following commit
> and will be released as part of the 19.05.7 version, as well as for 20.02.2:
>
> commit aafe360ff097379b6f61613654692848409d9749
> Author: Albert Gil <albert.gil@schedmd.com>
> Commit: Danny Auble <da@schedmd.com>
...
> Co-authored-by: Felip Moll <felip.moll@schedmd.com>
> Co-authored-by: Danny Auble <da@schedmd.com>

Nice team effort. :-)

I just did a quick build using the latest 19.05 branch including this commit on my test cluster. With this build I can no longer reproduce the problem, so the fix seems to work.

Thanks,
Pär
I'm glad to hear that. Thanks!