Ticket 13706

Summary: After upgrade we get "Association database appears down, reading from state file"
Product: Slurm Reporter: Ali Siavosh <Ali.Siavosh-haghighi>
Component: slurmctld Assignee: Director of Support <support>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 21.08.6   
Hardware: Linux   
OS: Linux   
Site: NYUMC Slinky Site: ---

Description Ali Siavosh 2022-03-27 11:09:06 MDT
Hi,
We are in the process of upgrading Slurm from 20.11.7 to 21.08.6.
slurmdbd is up, but slurmctld gives:

[root@bigpurple-hn1 slurm]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2022-03-27 12:49:12 EDT; 15min ago
  Process: 187525 ExecStart=/cm/shared/apps/slurm/current/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 187527 (slurmctld)
    Tasks: 20
   Memory: 10.8M
   CGroup: /system.slice/slurmctld.service
           ├─187527 /cm/shared/apps/slurm/current/sbin/slurmctld
           └─187529 slurmctld: slurmscriptd

Mar 27 12:49:12 bigpurple-hn1 systemd[1]: Started Slurm controller daemon.
Mar 27 12:49:12 bigpurple-hn1 slurmctld[187527]: slurmctld version 21.08.6 started on cluster slurm_cluster
Mar 27 12:49:12 bigpurple-hn1 slurmctld[187527]: error: _shutdown_bu_thread:send/recv bigpurple-hn2: Connection refused
Mar 27 12:49:12 bigpurple-hn1 slurmctld[187527]: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:bigpurple-hn1:7920: ...on refused
Mar 27 12:49:12 bigpurple-hn1 slurmctld[187527]: error: Sending PersistInit msg: Connection refused
Mar 27 12:49:12 bigpurple-hn1 slurmctld[187527]: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
Mar 27 12:49:12 bigpurple-hn1 slurmctld[187527]: error: Sending PersistInit msg: Connection refused
Mar 27 12:49:12 bigpurple-hn1 slurmctld[187527]: error: Sending PersistInit msg: Connection refused
Mar 27 12:49:12 bigpurple-hn1 slurmctld[187527]: error: Association database appears down, reading from state file.
Mar 27 12:49:22 bigpurple-hn1 slurmctld[187527]: error: Sending PersistInit msg: Connection refused
Hint: Some lines were ellipsized, use -l to show in full.
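
To see which endpoint is refusing connections, the journal lines above can be scanned for the persistent-connection error. The following is a minimal sketch, not a Slurm tool: the function name `refused_endpoints` is hypothetical, and the pattern assumes the error wording shown in this ticket's log, which may differ in other Slurm versions.

```python
import re

# Pattern for the slurmctld error seen above; the wording is taken from this
# ticket's log and may differ in other Slurm versions.
_REFUSED = re.compile(
    r"failed to open persistent connection to host:(?P<host>[^:\s]+):(?P<port>\d+)"
)


def refused_endpoints(lines):
    """Return the set of (host, port) pairs named in refused-connection errors."""
    found = set()
    for line in lines:
        match = _REFUSED.search(line)
        if match:
            found.add((match.group("host"), int(match.group("port"))))
    return found
```

Applied to the output above, this would report bigpurple-hn1 port 7920, presumably the slurmdbd listener that slurmctld is trying to reach.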
Comment 1 Ali Siavosh 2022-03-27 11:10:07 MDT
Please help, it is urgent.
Comment 2 Jason Booth 2022-03-27 17:49:06 MDT
Please upload a copy of your slurmdbd.log and the output of sdiag. Also, please let us know if you see any errors when you start the slurmdbd process.
Comment 3 Ali Siavosh 2022-03-28 08:02:52 MDT
Hi Jason,
slurmdbd took quite a long time to convert the database and start accepting connections on its port, which delayed slurmctld from coming up cleanly. Our database is about 10 GB, which is why the conversion was slow.
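
One way to avoid starting slurmctld while slurmdbd is still converting a large database is to wait until the slurmdbd port accepts TCP connections. The sketch below is only an illustration under stated assumptions: `wait_for_port` is a hypothetical helper, not part of Slurm, and the timeout and polling interval are arbitrary defaults that would need to be sized to the actual conversion time.

```python
import socket
import time


def wait_for_port(host, port, timeout=600.0, interval=5.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    timeout and interval are in seconds; both defaults are assumptions.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or unreachable: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

With the host and port from the log in the description, a call such as `wait_for_port("bigpurple-hn1", 7920)` could gate the slurmctld restart. A successful connect only shows that the port is open, not that the conversion finished cleanly, so slurmdbd.log should still be checked.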
Comment 4 Jason Booth 2022-03-28 08:16:33 MDT
I am lowering the severity based on your last reply. Do you require any further assistance?
Comment 5 Ali Siavosh 2022-03-28 08:25:31 MDT
I don't think so. Thank you.

Comment 6 Jason Booth 2022-03-28 08:27:21 MDT
Resolved