Ticket 8067

Summary:	"grp_node_bitmap is NULL" errors if slurmctld and slurmdbd is restarted at the same time
Product:	Slurm	Reporter:	Pär Lindfors <par.lindfors>
Component:	slurmctld	Assignee:	Albert Gil <albert.gil>
Status:	RESOLVED FIXED	QA Contact:
Severity:	3 - Medium Impact
Priority:	---	CC:	felip.moll
Version:	19.05.3
Hardware:	Linux
OS:	Linux
See Also:	https://bugs.schedmd.com/show_bug.cgi?id=8066
Site:	SNIC	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	UPPMAX	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:	19.05.7 20.02.2 20.11.0pre1
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---

Description Pär Lindfors 2019-11-07 08:18:25 MST

If slurmctld and slurmdbd are restarted at the same time, race
conditions during startup (bug 8066) cause communication problems
between the daemons. This cause errors when jobs finish, even
after communication is restored.

All jobs that were RUNNING during the restart
cause "grp_node_bitmap is NULL" errors to be logged when they
finish.

I have not seen any ill effects other than the log messages.

Jobs started after the slurmctld restart does not log errors when
finishing.

Restarting slurmctld (without restarting slurmdbd at the same
time) makes the problem go away.

Tentatively setting this as medium impact, as it is triggered
every single time the server running both slurmctld and slurmdbd
is rebooted. The bug is also triggered if both services are
restarted manually or automatically when Slurm RPM packages are
upgraded.

For us the systemd overrides (see bug 8066) will hopefully
prevent this from happening.


Output from slurmctld.log, after restarting slurmdbd and
slurmctld at the same time:

  [2019-11-07T15:32:49.263] slurmctld version 19.05.3-2 started on cluster rackham
  [2019-11-07T15:32:49.351] job_submit.lua: initialized
  [2019-11-07T15:32:49.378] error: slurm_persist_conn_open_without_init: failed to open persistent connection to rackham-q:7031: Connection refused
  [2019-11-07T15:32:49.378] error: slurmdbd: Sending PersistInit msg: Connection refused
  [2019-11-07T15:32:49.378] error: Association database appears down, reading from state file.
  [2019-11-07T15:32:49.379] error: slurmdbd: Sending PersistInit msg: Connection refused
  ...
  [2019-11-07T15:32:50.012] Recovered information about 5731 jobs
  ...
  [2019-11-07T15:32:50.091] Registering slurmctld at port 6817 with slurmdbd.
  [2019-11-07T15:32:50.091] error: slurmdbd: Sending PersistInit msg: Connection refused
  [2019-11-07T15:32:50.092] error: slurmdbd: Sending PersistInit msg: Connection refused
  ...

Then a successful (?) connection to slurmdbd:

  [2019-11-07T15:32:51.906] Registering slurmctld at port 6817 with slurmdbd.
  ...

Error messages are logged when jobs finish:

  [2019-11-07T15:33:13.245] _job_complete: JobId=10571170 WEXITSTATUS 0
  [2019-11-07T15:33:13.249] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
  [2019-11-07T15:33:13.249] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
  [2019-11-07T15:33:13.249] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
  [2019-11-07T15:33:13.249] _job_complete: JobId=10571170 done
  [2019-11-07T15:33:19.752] _job_complete: JobId=10569593 WEXITSTATUS 0
  [2019-11-07T15:33:19.756] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
  [2019-11-07T15:33:19.756] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
  [2019-11-07T15:33:19.756] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
  [2019-11-07T15:33:19.756] _job_complete: JobId=10569593 done
  [2019-11-07T15:33:22.361] _job_complete: JobId=10570092 WEXITSTATUS 0
  [2019-11-07T15:33:22.364] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
  [2019-11-07T15:33:22.364] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
  [2019-11-07T15:33:22.364] error: _rm_usage_node_bitmap: grp_node_bitmap is NULL
  [2019-11-07T15:33:22.365] _job_complete: JobId=10570092 done

Restarting slurmctld (without restarting slurmdbd) makes these errors go away.

Comment 1 Albert Gil 2019-11-12 12:51:05 MST

Hi Pär,

Could you do this?
1) Increase the debug log level of both servers (slurmctld and slurmdbd) to debug2
2) Reproduce the issue
3) Decrease the debug log level as you usually have it
4) Attach both slurmctld and slurmdbd logs
5) Also attach the slurmd log for the nodes where the failing jobs run

Thanks,
Albert

Comment 4 Albert Gil 2019-11-26 07:44:30 MST

Hi Pär,

As Felip and Danny discussed with you at SC19, we think that we have an idea about how to handle this.

I'll let you know,
Albert

Comment 20 Albert Gil 2019-12-30 12:53:48 MST

Hi Pär,

I just want to give you a quick update and let you know that we have a first version of a patch that fixes the error.
It was not exactly as you discussed in SC19 because the issue is not related to SaveSatate, that is OK, but about how some info is updated when slurmctld is started without the slurmdbd and later on slurmdbd is started.
It neither happen if slurmdbd is restarted while slurmctld keeps running.
It only when the controller is started without the slurmdbd, and then it is started.

Patch still needs to go further in our review and testing workflow, but I wanted to update you.

Regards,
And Happy New Year! ;-)
Albert

Comment 28 Albert Gil 2020-04-02 02:00:45 MDT

Hi Pär,

I'm glad to let you know that this has been fixed on the following commit and will be released as part of the 19.05.7 version, as well as for 20.02.2:

commit aafe360ff097379b6f61613654692848409d9749
Author: Albert Gil <albert.gil@schedmd.com>
Commit: Danny Auble <da@schedmd.com>

    Fix grp_node_bitmap error when slurmctld started before slurmdbd
    
    Fix _addto_used_info() to also update the grp_node_bitmap and
    grp_node_job_cnt.
    
    Bug 8067
    
    Co-authored-by: Felip Moll <felip.moll@schedmd.com>
    Co-authored-by: Danny Auble <da@schedmd.com>

Also we have improved the related code for the future 20.11 on:

commit 025e79f6e7ea1796b2b811fd3abe0ebdaef307d9
Author: Danny Auble <da@schedmd.com>
Commit: Danny Auble <da@schedmd.com>

    Merge like code created in commit aafe360ff0973
    
    But 8067

Thanks,
Albert

Comment 29 Pär Lindfors 2020-04-03 07:45:50 MDT

Hi Albert,

(In reply to Albert Gil from comment #28)
> I'm glad to let you know that this has been fixed on the following commit
> and will be released as part of the 19.05.7 version, as well as for 20.02.2:
> 
> commit aafe360ff097379b6f61613654692848409d9749
> Author: Albert Gil <albert.gil@schedmd.com>
> Commit: Danny Auble <da@schedmd.com>
...
>     Co-authored-by: Felip Moll <felip.moll@schedmd.com>
>     Co-authored-by: Danny Auble <da@schedmd.com>

Nice team effort. :-)

I just did a quick build using the latest 19.05 branch including this commit on my test cluster. With this build I can no longer reproduce the problem, so the fix seems to work.

Thanks,
Pär

Comment 30 Albert Gil 2020-04-03 09:48:51 MDT

I'm glad too hear that.
Thanks!