Hi,

We observed some malfunctioning in the Slurm DB.

$ sacct -u mv2640
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
25473996      lib2st3-5       dsi2        dsi         24    RUNNING      0:0
25485488      himem_st5  dsi1,dsi2        dsi          1    PENDING      0:0
25485489        lib2st5 dsi1,dsi2+        dsi         24    PENDING      0:0
25486925        lib2st5 dsi1,dsi2+        dsi         24    PENDING      0:0

$ squeue -u mv2640
   JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
25485488 dsi1,dsi2 himem_st   mv2640 PD  0:00      1 (Priority)

So: job 25473996 is no longer running, and jobs 25485489 and 25486925 are no longer pending (they already ran and ended as well). For example, 25473996 is not running:

[ar2667@holmes ~]$ sacct -j 25473996 -u mv2640 --format=User,JobID,jobname,state,time,start,end,elapsed,ReqTRES,nodelist
     User        JobID    JobName      State  Timelimit               Start                 End    Elapsed    ReqTRES        NodeList
--------- ------------ ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------------
   mv2640 25473996      lib2st3-5    RUNNING 5-00:00:00 2021-10-25T14:01:10             Unknown   22:11:17 cpu=24,me+         node215

[ar2667@node215 ~]$ sudo less /var/log/slurmd |grep 25473996
[2021-10-25T14:01:13.115] _run_prolog: prolog with lock for job 25473996 ran for 0 seconds
[2021-10-25T14:01:13.115] Launching batch job 25473996 for UID 453243
[2021-10-25T14:01:13.177] [25473996.batch] task/cgroup: /slurm/uid_453243/job_25473996: alloc=96000MB mem.limit=96000MB memsw.limit=96000MB
[2021-10-25T14:01:13.179] [25473996.batch] task/cgroup: /slurm/uid_453243/job_25473996/step_batch: alloc=96000MB mem.limit=96000MB memsw.limit=96000MB
[2021-10-26T05:16:28.459] [25473996.batch] error: Step 25473996.4294967294 hit memory+swap limit at least once during execution. This may or may not result in some failure.
[2021-10-26T05:16:30.928] [25473996.batch] Defering sending signal, processes in job are currently core dumping
[2021-10-26T05:17:01.986] [25473996.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 34560
[2021-10-26T05:17:01.989] [25473996.batch] done with job

Do you have any suggestions on how to reset the DB?
Based on what you have posted, there seems to be a communication issue between the two. Would you please attach the following to this bug?

> slurmdbd.log
> slurmctld.log
> "sacctmgr show cluster format=Cluster,ControlHost,ControlPort,RPC"

If you restart the slurmctld, does the issue resolve?

Do you have any runaways on this cluster?

> "sacctmgr show runaway"
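For reference, "sacctmgr show runaway" lists jobs that are still marked as running in the database but are no longer known to slurmctld, and offers to fix them interactively. A minimal sketch for checking the count non-interactively ("-n" suppresses the header; the count_runaways helper name is ours, not part of Slurm):

```shell
# Hypothetical helper: count runaway entries from `sacctmgr -n show runaway`
# output fed on stdin; each non-blank line is one runaway job.
count_runaways() {
  grep -c '[^[:space:]]'
}

# On a live cluster (assumes sacctmgr is in PATH):
#   sacctmgr -n show runaway | count_runaways
```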
slurmctld <https://drive.google.com/file/d/1XSHV4XQnFLm5RmP0nKlsKwHMFszbcTjR/view?usp=drive_web>
slurmdbd-20210612 <https://drive.google.com/file/d/1vjnPIqj_qb4LZf3oZfVK__hbz432kl94/view?usp=drive_web>

Hi,

Thank you for your reply. The slurmctld logs and the slurmdbd logs up to June 12th are attached. We are still gathering the slurmdbd logs; however, the slurmdbd log file currently has size 0. The slurmdbd.service has been running since June 8th and the log file was created on June 12.

[ar2667@roll ~]$ systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-06-08 18:36:21 EDT; 4 months 18 days ago
  Process: 32682 ExecStart=/cm/shared/apps/slurm/17.11.2/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 32837 (slurmdbd)
   Memory: 200.1M
   CGroup: /system.slice/slurmdbd.service
           └─32837 /cm/shared/apps/slurm/17.11.2/sbin/slurmdbd

[ar2667@roll ~]$ sudo ls -ltr /var/log/slurmdbd
-rw-r----- 1 slurm root 0 Jun 12 03:35 /var/log/slurmdbd

We restarted slurmctld.service on September 30th, but I do not think this solved the issue.

[ar2667@roll ~]$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2021-09-30 12:33:10 EDT; 3 weeks 5 days ago
  Process: 10661 ExecStart=/cm/shared/apps/slurm/17.11.2/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 10663 (slurmctld)
   Memory: 3.1G
   CGroup: /system.slice/slurmctld.service
           └─10663 /cm/shared/apps/slurm/17.11.2/sbin/slurmctld

[ar2667@roll ~]$ sacctmgr show cluster format=Cluster,ControlHost,ControlPort,RPC
   Cluster     ControlHost  ControlPort   RPC
---------- --------------- ------------ -----
  habanero       10.43.4.2         6817  8192
slurm_clu+                            0  7680

The "sacctmgr show runaway" command hangs.
[ar2667@roll ~]$ sacctmgr show runaway

Would restarting the slurmdbd service be helpful in this situation, and what would be the effect on the currently running jobs?

Thank you,
Axinia

---
Axinia Radeva
Manager, Research Computing Services
Hi Axinia,

> slurmctld <https://drive.google.com/file/d/1XSHV4XQnFLm5RmP0nKlsKwHMFszbcTjR/view?usp=drive_web>
> slurmdbd-20210612 <https://drive.google.com/file/d/1vjnPIqj_qb4LZf3oZfVK__hbz432kl94/view?usp=drive_web>

I tried to access them, but wasn't able to. I requested access through gdocs, but you can attach them here if you want.

> [ar2667@roll ~]$ sacctmgr show runaway

So, no output at all? What version are you using?

Can you also attach the output of:

$ sdiag
$ sacctmgr show stats

> Would restarting slurmdbd service be helpful in this situation

It depends on the actual problem you are facing, but it's a simple test.

> and what will be the effect on the currently running jobs?

No effect at all. Note that Slurm is designed in a fault-tolerant way, so it is able to keep working fine even when slurmdbd is down (slurmctld caches the accounting data for a while; only after that is accounting data discarded). Jobs that are already running also keep running when slurmctld is down, but no more jobs can be submitted or launched.

So, you can restart the slurmdbd without any issue.

Regards,
Albert
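Since a slurmdbd restart is safe for running jobs, the restart-and-verify steps can be sketched as follows (unit name and log path taken from this thread; the status_is_active helper is ours, not part of Slurm or systemd):

```shell
# On the slurmdbd host (run as root / via sudo):
#   systemctl restart slurmdbd
#   systemctl is-active slurmdbd     # expect: active
#   tail -f /var/log/slurmdbd        # new log lines should start appearing

# Hypothetical helper so a script can assert on the is-active answer:
status_is_active() {
  [ "$1" = "active" ]
}

# Usage on the cluster:
#   status_is_active "$(systemctl is-active slurmdbd)" && echo "slurmdbd is up"
```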
Created attachment 21980 [details]
slurmctld

Hi Albert,

I just gave you permissions to the slurmctld and slurmdbd-20210612 logs. Can you please try again and let me know if you have any issues accessing the files?

~axinia
Can you please see my responses below?

> > [ar2667@roll ~]$ sacctmgr show runaway
>
> So, no output at all?

No, no output at all.

> What version are you using?
We have slurm 17.11.2.

> Can you also attach the output of:
>
> $ sdiag

[ar2667@roll ~]$ sdiag
*******************************************************
sdiag output at Wed Oct 27 16:36:15 2021 (1635366975)
Data since Tue Oct 26 20:00:00 2021 (1635292800)
*******************************************************
Server thread count: 3
Agent queue size: 0
DBD Agent queue size: 40553

Jobs submitted: 245
Jobs started: 516
Jobs completed: 1403
Jobs canceled: 6
Jobs failed: 0
Jobs running: 297
Jobs running ts: Wed Oct 27 16:35:51 2021 (1635366951)

Main schedule statistics (microseconds):
 Last cycle:   1739
 Max cycle:    173709
 Total cycles: 3049
 Mean cycle:   3888
 Mean depth cycle:  22
 Cycles per minute: 2
 Last queue length: 4

Backfilling stats
 Total backfilled jobs (since last slurm start): 22707
 Total backfilled jobs (since last stats cycle start): 27
 Total backfilled heterogeneous job components: 0
 Total cycles: 2247
 Last cycle when: Wed Oct 27 16:36:05 2021 (1635366965)
 Last cycle: 21708
 Max cycle:  17654554
 Mean cycle: 1728779
 Last depth cycle: 5
 Last depth cycle (try sched): 4
 Depth Mean: 36
 Depth Mean (try depth): 33
 Last queue length: 4
 Queue length mean: 13

Remote Procedure Call statistics by message type
 REQUEST_PARTITION_INFO ( 2009) count:813036 ave_time:1967 total_time:1599431435
 REQUEST_JOB_INFO ( 2003) count:396657 ave_time:171717 total_time:68113042699
 REQUEST_NODE_INFO_SINGLE ( 2040) count:315594 ave_time:181072 total_time:57145431559
 MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:283896 ave_time:57581 total_time:16347141542
 MESSAGE_EPILOG_COMPLETE ( 6012) count:206419 ave_time:168857 total_time:34855343419
 REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:195367 ave_time:814519 total_time:159130204413
 REQUEST_SUBMIT_BATCH_JOB ( 4003) count:144407 ave_time:89256 total_time:12889214894
 REQUEST_FED_INFO ( 2049) count:103471 ave_time:445 total_time:46063494
 REQUEST_JOB_USER_INFO ( 2039) count:93247 ave_time:66961 total_time:6243924165
 REQUEST_PING ( 1008) count:83486
ave_time:587 total_time:49082340 REQUEST_NODE_INFO ( 2007) count:49710 ave_time:10925 total_time:543102582 REQUEST_KILL_JOB ( 5032) count:27008 ave_time:63347 total_time:1710881768 REQUEST_UPDATE_JOB ( 3001) count:17744 ave_time:13718 total_time:243417555 REQUEST_JOB_INFO_SINGLE ( 2021) count:10143 ave_time:41705 total_time:423019514 REQUEST_JOB_STEP_CREATE ( 5001) count:2562 ave_time:1508 total_time:3865672 REQUEST_STEP_COMPLETE ( 5016) count:2042 ave_time:103350 total_time:211041984 REQUEST_JOB_READY ( 4019) count:1550 ave_time:52569 total_time:81482015 REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:1433 ave_time:118185 total_time:169360333 REQUEST_SHARE_INFO ( 2022) count:1343 ave_time:6551 total_time:8799096 REQUEST_CANCEL_JOB_STEP ( 5005) count:739 ave_time:41977 total_time:31021520 REQUEST_RESOURCE_ALLOCATION ( 4001) count:673 ave_time:189879 total_time:127788965 REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:671 ave_time:175800 total_time:117961935 REQUEST_JOB_ALLOCATION_INFO ( 4014) count:47 ave_time:237879 total_time:11180339 ACCOUNTING_UPDATE_MSG (10001) count:21 ave_time:3054223 total_time:64138693 REQUEST_RECONFIGURE ( 1003) count:19 ave_time:4520842 total_time:85895999 REQUEST_JOB_NOTIFY ( 4022) count:13 ave_time:490 total_time:6376 REQUEST_UPDATE_NODE ( 3002) count:12 ave_time:176752 total_time:2121031 REQUEST_PRIORITY_FACTORS ( 2026) count:10 ave_time:75375 total_time:753750 REQUEST_JOB_STEP_INFO ( 2005) count:4 ave_time:814 total_time:3259 REQUEST_TOP_JOB ( 5038) count:1 ave_time:636 total_time:636 REQUEST_STATS_INFO ( 2035) count:0 ave_time:0 total_time:0 Remote Procedure Call statistics by user root ( 0) count:1528109 ave_time:181229 total_time:276938293972 rec2111 ( 169996) count:776364 ave_time:79997 total_time:62107356803 de2356 ( 466630) count:262940 ave_time:19786 total_time:5202663824 rcc2167 ( 497658) count:97623 ave_time:78925 total_time:7704912315 mrd2165 ( 456704) count:39139 ave_time:90827 total_time:3554897920 hzz2000 ( 476402) count:10550 
ave_time:96118 total_time:1014046590 rl2226 ( 124636) count:6533 ave_time:100609 total_time:657278646 kk3291 ( 496278) count:2068 ave_time:150099 total_time:310405049 mam2556 ( 497945) count:1949 ave_time:96117 total_time:187333844 zp2221 ( 546238) count:1920 ave_time:46497 total_time:89275799 mi2493 ( 546357) count:1669 ave_time:50929 total_time:85001409 ls3759 ( 544216) count:1491 ave_time:49866 total_time:74351335 adp2164 ( 534798) count:1093 ave_time:59129 total_time:64628798 aeh2213 ( 525442) count:1025 ave_time:67385 total_time:69070417 yx2625 ( 544217) count:1014 ave_time:56383 total_time:57172456 mv2640 ( 453243) count:931 ave_time:80170 total_time:74638895 ls3326 ( 441245) count:926 ave_time:28078 total_time:26000676 dl2860 ( 358507) count:900 ave_time:97916 total_time:88125149 lmk2202 ( 476448) count:885 ave_time:85170 total_time:75375897 msd2202 ( 545520) count:834 ave_time:83247 total_time:69428080 pcd2120 ( 475692) count:823 ave_time:107029 total_time:88085136 ab4689 ( 495992) count:784 ave_time:112730 total_time:88380903 ma3631 ( 462767) count:655 ave_time:76656 total_time:50209911 az2604 ( 532427) count:601 ave_time:211855 total_time:127325085 nobody ( 486473) count:503 ave_time:124392 total_time:62569554 zx2250 ( 495594) count:479 ave_time:48664 total_time:23310421 al4188 ( 545556) count:477 ave_time:63677 total_time:30374144 sb4601 ( 546356) count:446 ave_time:97455 total_time:43465066 kmx2000 ( 538398) count:432 ave_time:55456 total_time:23957062 lma2197 ( 545978) count:420 ave_time:93863 total_time:39422850 ad3395 ( 457738) count:412 ave_time:141428 total_time:58268672 rh2845 ( 472268) count:376 ave_time:78339 total_time:29455533 tk2757 ( 475100) count:373 ave_time:148094 total_time:55239112 yva2000 ( 446751) count:366 ave_time:63237 total_time:23144945 yw3376 ( 521854) count:296 ave_time:557442 total_time:165003015 slh2181 ( 485956) count:275 ave_time:77101 total_time:21202791 taf2109 ( 250023) count:263 ave_time:239869 total_time:63085689 
zw2105 ( 80893) count:256 ave_time:64065 total_time:16400820 mcb2270 ( 485162) count:247 ave_time:44855 total_time:11079282 ar2667 ( 217300) count:238 ave_time:76016 total_time:18091848 ll3450 ( 546422) count:227 ave_time:63897 total_time:14504685 jic2121 ( 463630) count:216 ave_time:39114 total_time:8448678 yg2811 ( 545936) count:214 ave_time:72923 total_time:15605609 yy2865 ( 492680) count:207 ave_time:57179 total_time:11836154 mt3197 ( 473089) count:179 ave_time:64292 total_time:11508437 jdn2133 ( 470892) count:165 ave_time:77947 total_time:12861324 os2328 ( 493602) count:159 ave_time:39265 total_time:6243245 am5328 ( 525450) count:150 ave_time:49954 total_time:7493181 am5284 ( 519168) count:148 ave_time:397578 total_time:58841596 mts2188 ( 546099) count:145 ave_time:83144 total_time:12055925 ab2080 ( 110745) count:139 ave_time:98768 total_time:13728832 flw2113 ( 489455) count:132 ave_time:110236 total_time:14551159 yp2602 ( 546118) count:129 ave_time:109131 total_time:14077980 mea2200 ( 524843) count:127 ave_time:30659 total_time:3893818 hmm2183 ( 543254) count:127 ave_time:81463 total_time:10345822 sh3972 ( 524988) count:125 ave_time:89528 total_time:11191059 mad2314 ( 545522) count:124 ave_time:87364 total_time:10833220 gt2453 ( 545916) count:123 ave_time:8074 total_time:993106 arr47 ( 543516) count:119 ave_time:56511 total_time:6724840 bc212 ( 30094) count:109 ave_time:7083 total_time:772063 st3107 ( 474846) count:104 ave_time:19534 total_time:2031568 htr2104 ( 411716) count:96 ave_time:54954 total_time:5275619 pfm2119 ( 470991) count:93 ave_time:76138 total_time:7080912 jb4493 ( 546296) count:90 ave_time:46526 total_time:4187423 ca2783 ( 488827) count:85 ave_time:149016 total_time:12666414 kz2303 ( 477679) count:84 ave_time:74047 total_time:6220030 as4525 ( 365586) count:81 ave_time:1029267 total_time:83370670 cx2204 ( 477023) count:75 ave_time:187986 total_time:14099019 gjc14 ( 488676) count:75 ave_time:18452 total_time:1383904 ik2496 ( 543457) count:74 
ave_time:604775 total_time:44753384 ja3170 ( 463425) count:71 ave_time:64624 total_time:4588329 kat2193 ( 496066) count:68 ave_time:147765 total_time:10048080 sj2787 ( 453295) count:65 ave_time:165285 total_time:10743552 hl2902 ( 414743) count:49 ave_time:490264 total_time:24022960 nt2560 ( 544457) count:41 ave_time:161329 total_time:6614525 sw3203 ( 479850) count:39 ave_time:178515 total_time:6962087 mgz2110 ( 546421) count:36 ave_time:75339 total_time:2712235 fw2366 ( 546279) count:27 ave_time:125658 total_time:3392777 sb3378 ( 314424) count:23 ave_time:8064164 total_time:185475792 wz2543 ( 545958) count:21 ave_time:71778 total_time:1507344 slurm ( 450) count:21 ave_time:3054223 total_time:64138693 pab2170 ( 423948) count:20 ave_time:82432 total_time:1648658 ns3316 ( 498343) count:18 ave_time:62053 total_time:1116960 el2545 ( 261905) count:16 ave_time:6983 total_time:111735 jab2443 ( 496774) count:14 ave_time:7292 total_time:102099 jeg2228 ( 497608) count:14 ave_time:97146 total_time:1360048 qz2280 ( 451169) count:12 ave_time:352823 total_time:4233881 kl2792 ( 389785) count:12 ave_time:131466 total_time:1577598 ms5924 ( 533600) count:12 ave_time:11046 total_time:132552 iu2153 ( 465048) count:12 ave_time:59375 total_time:712510 ia2337 ( 423640) count:11 ave_time:4349 total_time:47843 rl3149 ( 546516) count:11 ave_time:2815 total_time:30967 as5460 ( 478898) count:11 ave_time:3335 total_time:36688 pab2163 ( 363846) count:9 ave_time:21269 total_time:191421 yr2322 ( 470274) count:9 ave_time:13798 total_time:124185 mz2778 ( 527313) count:8 ave_time:144421 total_time:1155371 xl3041 ( 546419) count:8 ave_time:167814 total_time:1342519 fg2465 ( 498193) count:7 ave_time:251875 total_time:1763125 sx2220 ( 476094) count:7 ave_time:270912 total_time:1896390 rc3362 ( 545696) count:7 ave_time:8925 total_time:62478 jnt2136 ( 518724) count:7 ave_time:77231 total_time:540618 mc4138 ( 545636) count:6 ave_time:15008 total_time:90051 ags2198 ( 502018) count:5 ave_time:7624 
total_time:38121 yg2607 ( 496576) count:4 ave_time:37494 total_time:149977 reg2171 ( 542795) count:4 ave_time:5529 total_time:22116 nb2869 ( 494059) count:3 ave_time:14663 total_time:43990 hwp2108 ( 527947) count:3 ave_time:1534 total_time:4603 new2128 ( 546417) count:3 ave_time:3694 total_time:11083 da2709 ( 446758) count:2 ave_time:27576 total_time:55153 yj2650 ( 545416) count:2 ave_time:197 total_time:394 dp264 ( 36357) count:1 ave_time:4085 total_time:4085

> $ sacctmgr show stats

[root@roll ar2667]# sacctmgr show stats
Rollup statistics
 Hour  count:0 ave_time:0 max_time:0 total_time:0
 Day   count:0 ave_time:0 max_time:0 total_time:0
 Month count:0 ave_time:0 max_time:0 total_time:0

Remote Procedure Call statistics by message type
 DBD_JOB_COMPLETE ( 1424) count:13260 ave_time:1023 total_time:13573082
 DBD_STEP_START ( 1442) count:10062 ave_time:1079 total_time:10865749
 DBD_STEP_COMPLETE ( 1441) count:9984 ave_time:1185 total_time:11836405
 DBD_JOB_START ( 1425) count:3744 ave_time:1107 total_time:4145054
 DBD_NODE_STATE ( 1432) count:1989 ave_time:783 total_time:1558136
 DBD_SEND_MULT_JOB_START ( 1472) count:142 ave_time:5390 total_time:765402
 DBD_SEND_MULT_MSG ( 1474) count:39 ave_time:1086964 total_time:42391599
 MsgType=6500 ( 6500) count:3 ave_time:4563 total_time:13691
 DBD_CLUSTER_TRES ( 1407) count:2 ave_time:1713 total_time:3426
 DBD_FINI ( 1401) count:2 ave_time:163 total_time:326
 DBD_REGISTER_CTLD ( 1434) count:1 ave_time:15990 total_time:15990
 DBD_GET_STATS ( 1489) count:1 ave_time:178 total_time:178

Remote Procedure Call statistics by user
 slurm ( 450) count:39224 ave_time:2171 total_time:85155669
 ar2667 ( 217300) count:3 ave_time:521 total_time:1565
 root ( 0) count:2 ave_time:5902 total_time:11804
[root@roll ar2667]#

> > Would restarting slurmdbd service be helpful in this situation
>
> It depends on the actual problem you are facing, but it's a simple test.
>
> > and what will be the effect on the currently running jobs?
>
> No effect at all.
> Note that Slurm is designed in a fault-tolerant way, so it's able to keep
> working even when slurmdbd is down.
>
> So, you can restart the slurmdbd without any issue.

We restarted slurmdbd, and now we can see logs in the slurmdbd log file.

Best,
Axinia
Hi Axinia,

> I just gave you permissions to slurmctld and slurmdbd-20210612 logs.
>
> Can you please try again and let me know if you have any issues accessing
> the files.

Yes, I've been able to access and digest them.

> > What version are you using?
>
> We have slurm 17.11.2

Right. I don't think your problem is related to being on an old version, but once we fix your current issue I would recommend upgrading to get a better Slurm, and also better Slurm support.

> [ar2667@roll ~]$ sdiag
> *******************************************************
> sdiag output at Wed Oct 27 16:36:15 2021 (1635366975)
> Data since Tue Oct 26 20:00:00 2021 (1635292800)
> *******************************************************
> Server thread count: 3
> Agent queue size: 0
> DBD Agent queue size: 40553

This 40553 is bad; it should be close to 0. It means that slurmctld has not been able to communicate with slurmdbd for a long while.

> > So, you can restart the slurmdbd without any issue.
>
> We restarted slurmdbd and now we can see logs in the slurmdbd log file.

Good!

From the logs it's quite clear that you are facing some communication issues. There are several issues, but I think the most important one is related to the SQL backend (MariaDB). It seems that slurmdbd has not been able to connect to it since 2021-10-25T09:48:

[2021-10-25T09:48:55.248] error: It looks like the storage has gone away trying to reconnect

Since that point slurmdbd has not been able to interact properly with the SQL backend, so it returns errors to slurmctld, and the two get out of sync as you saw (while slurmctld tries to keep the right info in the DBD agent queue mentioned above). Once slurmdbd is able to interact with MariaDB again, slurmctld will be able to send the pending info on to slurmdbd and MariaDB, the DBD Agent queue size should drain at a good rate, and the info between slurmctld and slurmdbd will be in sync again.
But we need slurmdbd to be able to interact with MariaDB properly again for all that to happen. The last slurmdbd log that I see is from 2021-06-08, so I cannot tell whether things are going better after your last restart.

Could you attach a newer slurmdbd.log, plus a new output of "sdiag", so we can see if the DBD Agent queue size is going down or not? Could you also attach your slurm.conf and slurmdbd.conf (without the password)?

And finally, could you verify that your SQL backend is up and running normally?

Thanks,
Albert
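Both follow-up checks can be scripted. A small sketch, assuming the sdiag output format shown above and the standard MariaDB client tools (dbd_queue_size and sql_alive are our helper names, not Slurm commands):

```shell
# Hypothetical helper: extract the "DBD Agent queue size" value from sdiag
# output on stdin; the number should trend toward 0 once slurmdbd recovers.
dbd_queue_size() {
  awk -F': *' '/DBD Agent queue size/ {print $2}'
}

# Hypothetical helper: succeed only if `mysqladmin ping` reported a live server.
sql_alive() {
  grep -q 'is alive'
}

# On the cluster:
#   watch -n 60 'sdiag | grep "DBD Agent queue size"'   # watch the backlog drain
#   mysqladmin ping | sql_alive && echo "SQL backend answering"
```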
Created attachment 21993 [details]
slurmdbd

Hi Albert,

Thank you so much for looking into this. The queue size is still 44365, and the logs do not look good :( Please find the slurmdbd logs attached.

[ar2667@roll ~]$ sdiag
*******************************************************
sdiag output at Thu Oct 28 10:23:54 2021 (1635431034)
Data since Wed Oct 27 20:00:00 2021 (1635379200)
*******************************************************
Server thread count: 3
Agent queue size: 0
DBD Agent queue size: 44365

Jobs submitted: 3085
Jobs started: 1288
Jobs completed: 249
Jobs canceled: 7
Jobs failed: 0
Jobs running: 1270
Jobs running ts: Thu Oct 28 10:23:51 2021 (1635431031)

Main schedule statistics (microseconds):
 Last cycle:   1317
 Max cycle:    36750
 Total cycles: 1221
 Mean cycle:   2633
 Mean depth cycle: 5
 Cycles per minute: 1
 Last queue length: 12

Backfilling stats
 Total backfilled jobs (since last slurm start): 23715
 Total backfilled jobs (since last stats cycle start): 1006
 Total backfilled heterogeneous job components: 0
 Total cycles: 1720
 Last cycle when: Thu Oct 28 10:23:50 2021 (1635431030)
 Last cycle: 93291
 Max cycle: 1431332
 Mean cycle: 37931
 Last depth cycle: 11
 Last depth cycle (try sched): 9
 Depth Mean: 8
 Depth Mean (try depth): 7
 Last queue length: 12
 Queue length mean: 4

Remote Procedure Call statistics by message type
 REQUEST_PARTITION_INFO ( 2009) count:831552 ave_time:1952 total_time:1623594330
 REQUEST_JOB_INFO ( 2003) count:406512 ave_time:171044 total_time:69531675747
 REQUEST_NODE_INFO_SINGLE ( 2040) count:323927 ave_time:176427 total_time:57149598015
 MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:291516 ave_time:56131 total_time:16363286259
 MESSAGE_EPILOG_COMPLETE ( 6012) count:206864 ave_time:168497 total_time:34856149931
 REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:195699 ave_time:813156 total_time:159133881639
 REQUEST_SUBMIT_BATCH_JOB ( 4003) count:144479 ave_time:89942 total_time:12994735290
 REQUEST_FED_INFO ( 2049) count:103740 ave_time:457
total_time:47453433 REQUEST_JOB_USER_INFO ( 2039) count:93403 ave_time:68546 total_time:6402433658 REQUEST_PING ( 1008) count:85760 ave_time:591 total_time:50726628 REQUEST_NODE_INFO ( 2007) count:50925 ave_time:10755 total_time:547729035 REQUEST_KILL_JOB ( 5032) count:27028 ave_time:63304 total_time:1711003402 REQUEST_UPDATE_JOB ( 3001) count:17924 ave_time:13591 total_time:243614165 REQUEST_JOB_STEP_CREATE ( 5001) count:10733 ave_time:5657 total_time:60717172 REQUEST_JOB_INFO_SINGLE ( 2021) count:10254 ave_time:56508 total_time:579434129 REQUEST_STEP_COMPLETE ( 5016) count:3400 ave_time:71048 total_time:241564287 REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:2772 ave_time:61473 total_time:170404510 REQUEST_JOB_READY ( 4019) count:1637 ave_time:50788 total_time:83141324 REQUEST_SHARE_INFO ( 2022) count:1379 ave_time:6501 total_time:8965464 REQUEST_CANCEL_JOB_STEP ( 5005) count:770 ave_time:40307 total_time:31036992 REQUEST_RESOURCE_ALLOCATION ( 4001) count:710 ave_time:180220 total_time:127956586 REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:709 ave_time:166439 total_time:118005616 REQUEST_JOB_ALLOCATION_INFO ( 4014) count:52 ave_time:215023 total_time:11181207 ACCOUNTING_UPDATE_MSG (10001) count:21 ave_time:3054223 total_time:64138693 REQUEST_RECONFIGURE ( 1003) count:19 ave_time:4520842 total_time:85895999 REQUEST_UPDATE_NODE ( 3002) count:13 ave_time:163184 total_time:2121392 REQUEST_JOB_NOTIFY ( 4022) count:13 ave_time:490 total_time:6376 REQUEST_PRIORITY_FACTORS ( 2026) count:10 ave_time:75375 total_time:753750 REQUEST_JOB_STEP_INFO ( 2005) count:6 ave_time:913 total_time:5482 REQUEST_TOP_JOB ( 5038) count:1 ave_time:636 total_time:636 REQUEST_STATS_INFO ( 2035) count:1 ave_time:407 total_time:407 ACCOUNTING_REGISTER_CTLD (10003) count:1 ave_time:99160 total_time:99160 Remote Procedure Call statistics by user root ( 0) count:1560149 ave_time:177771 total_time:277349410174 rec2111 ( 169996) count:794324 ave_time:79564 total_time:63199851716 de2356 ( 466630) 
count:262940 ave_time:19786 total_time:5202663824 rcc2167 ( 497658) count:97623 ave_time:78925 total_time:7704912315 mrd2165 ( 456704) count:39142 ave_time:90820 total_time:3554899599 zp2221 ( 546238) count:11424 ave_time:26717 total_time:305219161 hzz2000 ( 476402) count:10550 ave_time:96118 total_time:1014046590 rl2226 ( 124636) count:6711 ave_time:98085 total_time:658254663 kk3291 ( 496278) count:2080 ave_time:149244 total_time:310429166 mi2493 ( 546357) count:2074 ave_time:41405 total_time:85875958 mam2556 ( 497945) count:1949 ave_time:96117 total_time:187333844 ls3759 ( 544216) count:1491 ave_time:49866 total_time:74351335 adp2164 ( 534798) count:1093 ave_time:59129 total_time:64628798 yx2625 ( 544217) count:1065 ave_time:53826 total_time:57325529 aeh2213 ( 525442) count:1025 ave_time:67385 total_time:69070417 ls3326 ( 441245) count:953 ave_time:27497 total_time:26205520 mv2640 ( 453243) count:937 ave_time:79661 total_time:74642907 dl2860 ( 358507) count:900 ave_time:97916 total_time:88125149 lmk2202 ( 476448) count:885 ave_time:85170 total_time:75375897 msd2202 ( 545520) count:836 ave_time:271654 total_time:227102845 pcd2120 ( 475692) count:823 ave_time:107029 total_time:88085136 ab4689 ( 495992) count:784 ave_time:112730 total_time:88380903 ma3631 ( 462767) count:714 ave_time:70916 total_time:50634248 az2604 ( 532427) count:606 ave_time:210642 total_time:127649628 nobody ( 486473) count:503 ave_time:124392 total_time:62569554 zx2250 ( 495594) count:482 ave_time:48364 total_time:23311688 al4188 ( 545556) count:477 ave_time:63677 total_time:30374144 sb4601 ( 546356) count:446 ave_time:97455 total_time:43465066 kmx2000 ( 538398) count:432 ave_time:55456 total_time:23957062 lma2197 ( 545978) count:420 ave_time:93863 total_time:39422850 yva2000 ( 446751) count:416 ave_time:57401 total_time:23878992 ad3395 ( 457738) count:412 ave_time:141428 total_time:58268672 rh2845 ( 472268) count:376 ave_time:78339 total_time:29455533 tk2757 ( 475100) count:373 ave_time:148094 
total_time:55239112 taf2109 ( 250023) count:306 ave_time:206495 total_time:63187509 yw3376 ( 521854) count:296 ave_time:557442 total_time:165003015 mcb2270 ( 485162) count:276 ave_time:40250 total_time:11109249 slh2181 ( 485956) count:275 ave_time:77101 total_time:21202791 zw2105 ( 80893) count:256 ave_time:64065 total_time:16400820 ar2667 ( 217300) count:251 ave_time:72292 total_time:18145345 ll3450 ( 546422) count:227 ave_time:63897 total_time:14504685 jic2121 ( 463630) count:216 ave_time:39114 total_time:8448678 yg2811 ( 545936) count:214 ave_time:72923 total_time:15605609 yy2865 ( 492680) count:207 ave_time:57179 total_time:11836154 am5328 ( 525450) count:207 ave_time:36530 total_time:7561807 mt3197 ( 473089) count:179 ave_time:64292 total_time:11508437 jdn2133 ( 470892) count:165 ave_time:77947 total_time:12861324 os2328 ( 493602) count:159 ave_time:39265 total_time:6243245 am5284 ( 519168) count:148 ave_time:397578 total_time:58841596 mts2188 ( 546099) count:145 ave_time:83144 total_time:12055925 ab2080 ( 110745) count:139 ave_time:98768 total_time:13728832 flw2113 ( 489455) count:132 ave_time:110236 total_time:14551159 yp2602 ( 546118) count:129 ave_time:109131 total_time:14077980 mea2200 ( 524843) count:127 ave_time:30659 total_time:3893818 hmm2183 ( 543254) count:127 ave_time:81463 total_time:10345822 sh3972 ( 524988) count:125 ave_time:89528 total_time:11191059 mad2314 ( 545522) count:124 ave_time:87364 total_time:10833220 gt2453 ( 545916) count:123 ave_time:8074 total_time:993106 arr47 ( 543516) count:119 ave_time:56511 total_time:6724840 bc212 ( 30094) count:109 ave_time:7083 total_time:772063 ca2783 ( 488827) count:104 ave_time:122669 total_time:12757626 st3107 ( 474846) count:104 ave_time:19534 total_time:2031568 cx2204 ( 477023) count:98 ave_time:144847 total_time:14195088 htr2104 ( 411716) count:96 ave_time:54954 total_time:5275619 pfm2119 ( 470991) count:93 ave_time:76138 total_time:7080912 jb4493 ( 546296) count:90 ave_time:46526 
total_time:4187423 kz2303 ( 477679) count:84 ave_time:74047 total_time:6220030 as4525 ( 365586) count:81 ave_time:1029267 total_time:83370670 ik2496 ( 543457) count:78 ave_time:573887 total_time:44763197 gjc14 ( 488676) count:75 ave_time:18452 total_time:1383904 ja3170 ( 463425) count:71 ave_time:64624 total_time:4588329 kat2193 ( 496066) count:68 ave_time:147765 total_time:10048080 sj2787 ( 453295) count:65 ave_time:165285 total_time:10743552 hl2902 ( 414743) count:49 ave_time:490264 total_time:24022960 nt2560 ( 544457) count:41 ave_time:161329 total_time:6614525 sw3203 ( 479850) count:39 ave_time:178515 total_time:6962087 mgz2110 ( 546421) count:36 ave_time:75339 total_time:2712235 fw2366 ( 546279) count:27 ave_time:125658 total_time:3392777 sb3378 ( 314424) count:26 ave_time:11173851 total_time:290520130 slurm ( 450) count:22 ave_time:2919902 total_time:64237853 wz2543 ( 545958) count:21 ave_time:71778 total_time:1507344 pab2170 ( 423948) count:20 ave_time:82432 total_time:1648658 ns3316 ( 498343) count:18 ave_time:62053 total_time:1116960 el2545 ( 261905) count:16 ave_time:6983 total_time:111735 fg2465 ( 498193) count:15 ave_time:120308 total_time:1804628 jab2443 ( 496774) count:14 ave_time:7292 total_time:102099 jeg2228 ( 497608) count:14 ave_time:97146 total_time:1360048 kl2792 ( 389785) count:12 ave_time:131466 total_time:1577598 iu2153 ( 465048) count:12 ave_time:59375 total_time:712510 ms5924 ( 533600) count:12 ave_time:11046 total_time:132552 qz2280 ( 451169) count:12 ave_time:352823 total_time:4233881 ia2337 ( 423640) count:11 ave_time:4349 total_time:47843 rl3149 ( 546516) count:11 ave_time:2815 total_time:30967 as5460 ( 478898) count:11 ave_time:3335 total_time:36688 yr2322 ( 470274) count:9 ave_time:13798 total_time:124185 pab2163 ( 363846) count:9 ave_time:21269 total_time:191421 xl3041 ( 546419) count:8 ave_time:167814 total_time:1342519 mz2778 ( 527313) count:8 ave_time:144421 total_time:1155371 sx2220 ( 476094) count:7 ave_time:270912 
total_time:1896390 rc3362 ( 545696) count:7 ave_time:8925 total_time:62478 jnt2136 ( 518724) count:7 ave_time:77231 total_time:540618 mc4138 ( 545636) count:6 ave_time:15008 total_time:90051 ags2198 ( 502018) count:5 ave_time:7624 total_time:38121 yg2607 ( 496576) count:4 ave_time:37494 total_time:149977 reg2171 ( 542795) count:4 ave_time:5529 total_time:22116 nb2869 ( 494059) count:3 ave_time:14663 total_time:43990 hwp2108 ( 527947) count:3 ave_time:1534 total_time:4603 new2128 ( 546417) count:3 ave_time:3694 total_time:11083 da2709 ( 446758) count:2 ave_time:27576 total_time:55153 yj2650 ( 545416) count:2 ave_time:197 total_time:394 dp264 ( 36357) count:1 ave_time:4085 total_time:4085 Best, Axinia *---* Axinia Radeva Manager, Research Computing Services On Thu, Oct 28, 2021 at 7:05 AM <bugs@schedmd.com> wrote: > *Comment # 11 > <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D12747-23c11&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=4v6XNOMLkOlZVNYSZZEHUddWVGsteDK-7RNrHFN7nyY&m=tqQlpaN0udlsX775V5GRVrj8U3SNw_Wcq4U0Rz5gkmU&s=c6SbDjEL80emGfjUC-YzL3-Q6u6zn_e91KMGBBmYHFw&e=> > on bug 12747 > <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D12747&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=4v6XNOMLkOlZVNYSZZEHUddWVGsteDK-7RNrHFN7nyY&m=tqQlpaN0udlsX775V5GRVrj8U3SNw_Wcq4U0Rz5gkmU&s=bWBmBfhTvTyQ6twg_Zba_MNmQMckk0cyOXQQWD7Tow4&e=> > from Albert Gil <albert.gil@schedmd.com> * > > Hi Axinia, > > I just gave you permissions to slurmctld and slurmdbd-20210612 logs. > > > > Can you please try again and let me know if you have any issues accessing > > the files. > > Yes, I've been able to access and digest them. > > > What version are you using? 
> > We have slurm 17.11.2
>
> Right. I don't think that your problem is related to being on an old
> version, but once we fix your current issue I would recommend upgrading to
> get a better Slurm, and also better Slurm support.
>
> > [ar2667@roll ~]$ sdiag
> > *******************************************************
> > sdiag output at Wed Oct 27 16:36:15 2021 (1635366975)
> > Data since      Tue Oct 26 20:00:00 2021 (1635292800)
> > *******************************************************
> > Server thread count: 3
> > Agent queue size:    0
> > DBD Agent queue size: 40553
>
> This 40553 is bad; it should be close to 0.
> This means that slurmctld has not been able to communicate with slurmdbd
> for a long while.
>
> > > So, you can restart the slurmdbd without any issue.
> >
> > We restarted slurmdbd and now we can see the logs in the slurmdbd log file.
>
> Good!
> From the logs it's quite clear that you are facing some communication
> issues. There are several, but I think the most important one is related
> to the SQL backend (MariaDB). It seems that slurmdbd has not been able to
> connect to it since 2021-10-25T09:48:
>
> [2021-10-25T09:48:55.248] error: It looks like the storage has gone away
> trying to reconnect
>
> Since that point slurmdbd has not been able to interact properly with the
> SQL backend, so it returns errors to slurmctld, and they get out of sync
> as you saw (while slurmctld tries to keep the right info in that DBD
> Agent queue mentioned above). Once slurmdbd can interact with MariaDB
> again, slurmctld will be able to send the updated info to slurmdbd ->
> MariaDB, that DBD Agent queue size should shrink at a good rate, and the
> info between slurmctld and slurmdbd will be in sync again.
> But we need slurmdbd to be able to interact with MariaDB properly again
> for all that to happen.
>
> The last slurmdbd log that I see in your attachments is from 2021-06-08,
> so I cannot tell whether, after your last restart, things are going better
> or not. Could you attach a newer slurmdbd.log, plus a new output of
> "sdiag", so we can see if the DBD Agent queue size is going down?
>
> Could you also attach your slurm.conf and slurmdbd.conf (without the
> password)? And finally, could you verify that your SQL backend is up and
> running normally?
>
> Thanks,
> Albert
>
> ------------------------------
> You are receiving this mail because:
>
> - You reported the bug.
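Albert's request to watch whether the DBD Agent queue size is going down can be scripted for repeated monitoring. A minimal sketch, assuming the sdiag output format shown in this ticket; a sample excerpt stands in for a live `sdiag` call, which is only available on the cluster itself:

```shell
# Extract the "DBD Agent queue size" value from sdiag output.
# In production you would pipe real `sdiag` output instead of the sample.
sdiag_sample='Server thread count: 3
Agent queue size:    0
DBD Agent queue size: 44365'

# Split each line on "colon plus spaces"; print the value field of the
# line that mentions the DBD agent queue.
queue_size=$(printf '%s\n' "$sdiag_sample" | awk -F': *' '/DBD Agent queue size/ {print $2}')
echo "DBD Agent queue size: $queue_size"
```

Run periodically (e.g. from cron or `watch sdiag`), a falling value confirms slurmdbd is draining the backlog again.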
Hi Axinia,

I think that I know where the problem is, and how to fix it.
The origin of the issue is that job 25482027 was launched in this work_dir:

/rigel/sscc/projects/measure_pov/temp cluster_infos/2_Code/5_country datasets creation/Country scripts/Cote d'Ivoire

Note that this directory contains some special characters, like spaces, and especially the apostrophe ('). On your old Slurm version those special characters were an issue; in your case the apostrophe makes MariaDB return an error, which makes slurmdbd think the database is not running properly and retry forever.

Actually, that issue was a CVE:
https://lists.schedmd.com/pipermail/slurm-announce/2018/000006.html

In your case you are not hitting any security problem, but your old version has that vulnerability too.

Fortunately, as you can see, it was already fixed in 17.11.5, so you don't need a major version upgrade (for the moment) to fix it, just a minor one.

Therefore, to fix your issue you should upgrade at least your slurmdbd to the latest 17.11 release (17.11.13).
Please read the upgrade notes first:
https://slurm.schedmd.com/quickstart_admin.html#upgrade

I recommend that you don't hesitate too long over this upgrade, because slurmdbd is really stuck and that is bad. Note that this is a minor-release upgrade, so fortunately you won't need to change any config or similar.

Hope this helps,
Albert

PS: For more details, this is the commit fixing the CVE/your issue:
- https://github.com/SchedMD/slurm/commit/db468895240ad6817628d07054fe54e71273b2fe
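To see why the apostrophe breaks the INSERT, here is a small illustrative sketch. The table and statement are hypothetical stand-ins (the real fix lives in the linked slurmdbd commit, in C); doubling the quote shown here is the SQL-standard escape for an apostrophe inside a string literal:

```shell
# The directory name from job 25482027, apostrophe included:
work_dir="/rigel/sscc/projects/measure_pov/temp cluster_infos/2_Code/5_country datasets creation/Country scripts/Cote d'Ivoire"

# Naive interpolation: the apostrophe in "Cote d'Ivoire" closes the SQL
# string early, leaving "Ivoire')" as a syntax error for the database.
echo "INSERT INTO job_table VALUES (25482027, '$work_dir');"

# Escaping the apostrophe (here by doubling it, the SQL-standard form)
# keeps the whole path inside one string literal.
escaped=$(printf '%s' "$work_dir" | sed "s/'/''/g")
echo "INSERT INTO job_table VALUES (25482027, '$escaped');"
```

The second statement parses cleanly; ensuring slurmdbd always produces the escaped form is essentially what the 17.11.5 fix guarantees.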
Axinia,

I'm downgrading the severity of this issue, as we have already found the problem and the solution. Let me know how the upgrade goes, and please try to keep your Slurm version among the supported releases so you can have a better Slurm and we can provide better support.

Regards,
Albert
Hi Albert,

Thank you so much for figuring out the cause of the slurmdbd issue.

This cluster was supposed to be retired last year but, because of the pandemic, it was extended for another year. We were not planning to perform any upgrades, but it looks like we need to upgrade at least the DB.

I have the following questions:

1) Do we need downtime to upgrade the DB, or can we do it on an active cluster?

2) Do we need to do anything in addition to clear the DBD Agent queue?

In the documentation that you sent I can see:

"The slurmctld daemon must also be upgraded before or at the same time as the slurmd daemons on the compute nodes. Generally, upgrading Slurm on all of the login and compute nodes is recommended, although rolling upgrades are also possible (i.e. upgrading the head node(s) first then upgrading the compute and login nodes later at various times). Also see the note above about reverse compatibility."

3) Do we need to upgrade the slurmctld daemon and the slurmd daemons on the compute nodes?

Best,
Axinia

*---*
Axinia Radeva
Manager, Research Computing Services
Hi Albert,

I have another question. Can we use any of the backups that we have from before Oct. 25th to restore the slurmdb?

$ sudo ls -l /var/spool/cmd/backup/
total 19816
-rw------- 1 root root 2244656 Oct 22 03:19 backup-Fri.sql.gz
-rw------- 1 root root  338859 Oct 22 03:19 backup-monitor-Fri.sql.gz
-rw------- 1 root root  338873 Oct 25 03:45 backup-monitor-Mon.sql.gz
-rw------- 1 root root  338858 Oct 23 03:12 backup-monitor-Sat.sql.gz
-rw------- 1 root root  338873 Oct 24 03:24 backup-monitor-Sun.sql.gz
-rw------- 1 root root  339541 Oct 28 03:45 backup-monitor-Thu.sql.gz
-rw------- 1 root root  339429 Oct 26 03:08 backup-monitor-Tue.sql.gz
-rw------- 1 root root  339505 Oct 27 03:09 backup-monitor-Wed.sql.gz
-rw------- 1 root root 2244656 Oct 25 03:45 backup-Mon.sql.gz
-rw------- 1 root root 2244656 Oct 23 03:12 backup-Sat.sql.gz
-rw------- 1 root root 2244656 Oct 24 03:24 backup-Sun.sql.gz
-rw------- 1 root root 2244655 Oct 28 03:45 backup-Thu.sql.gz
-rw------- 1 root root 2244658 Oct 26 03:08 backup-Tue.sql.gz
-rw------- 1 root root 2244656 Oct 27 03:09 backup-Wed.sql.gz
drwxr-xr-x 5 root root      84 Nov 15  2017 certificates
-rw------- 1 root root 2108333 Feb 13  2017 pre-upgrade-17-02-13_01-49-20_Mon.sql.gz
-rw------- 1 root root   57946 Feb 13  2017 pre-upgrade-monitor-17-02-13_01-49-20_Mon.sql.gz

Best,
Axinia

*---*
Axinia Radeva
Manager, Research Computing Services
Hi Axinia,

> Can we use any of the backups that we have before Oct. 25th to restore the
> slurmdb?

I'm not sure if I fully understand your question.
First of all, let me clarify that the information in MariaDB is correct, so you don't need to use any backup.
The problem is that slurmctld is trying to send the information of one job (with special characters) to slurmdbd to be saved to MariaDB, but due to the special characters slurmdbd cannot do so, because MariaDB returns an error. Nothing else is happening; the data in MariaDB is fine.

Therefore:
If you mean that restoring a backup could solve your problem, then no, it won't solve your current situation.
If you mean that you are unsure whether the upgraded slurmdbd will be able to load and work with old backups, then don't worry, it will.
But again, you don't need any backup to solve your problem. You only need to upgrade slurmdbd to the latest 17.11.

Once this is done and your current issue is fixed, we strongly recommend that you plan a path to upgrade to the current release, 21.08. Please note that you won't be able to jump from 17.11 to 21.08 directly; you will need intermediate upgrades. We can help you with this if you open a specific ticket for it.

Regards,
Albert
Hi Albert,

Thank you so much for your detailed explanation.

We are using Bright Computing as cluster management software for the cluster. In the past, we upgraded Slurm through Bright; Bright provided all the Slurm RPMs for the upgrade. As I mentioned, the cluster was supposed to be retired last year, and at the moment it does not have Bright support. We have just asked for a quote to extend the Bright support.

The slurmdbd upgrade is time sensitive, and I will open another ticket with SchedMD to get help with it. Would you be able to provide the slurmdbd RPM for the upgrade, and do you believe that this will not interfere with the Bright integration or have any negative impact on the cluster?

Best,
Axinia

*---*
Axinia Radeva
Manager, Research Computing Services
Hi Axinia,

> We are using Bright Computing as cluster management software for the
> cluster.

I see.

> In the past, we upgraded Slurm through Bright. Bright provided all the
> Slurm RPMs for the upgrade. As I mentioned, the cluster was supposed to be
> retired last year and at the moment the cluster does not have Bright
> support. We just asked for a quote to extend the Bright support.

I'm not so familiar with Bright's upgrade mechanism.

> The slurmdbd upgrade is time sensitive and I will open another ticket with
> SchedMD to get help with the slurmdbd upgrade.

For the necessary minor update to fix your current issue (17.11.2 -> 17.11.13), I can help you here. For the major upgrade (17.11.13 -> intermediates -> 21.08.x), yes, better to open a new ticket.

But please note that the minor update is simpler and important for you; right now your slurmdbd is simply stuck. The major upgrade can wait a bit longer and will be more complex, but it is also important.

> Would you be able to provide
> the slurmdbd RPM for the upgrade

SchedMD doesn't provide .rpm files, only a way to create them.
See: https://slurm.schedmd.com/quickstart_admin.html#quick_start

Actually, right now we don't even provide the .tar files to build Slurm for versions prior to 20.11.7 or 20.02.7, for security reasons.
See: https://www.schedmd.com/archives.php

But you can always clone the code from GitHub, check out the tag/version that you want to build (slurm-17-11-13-2), and compile it.
See: https://github.com/SchedMD/slurm

> and do you believe that this will not
> interfere with Bright integration and will not have any negative impact on
> the cluster?

I don't know about Bright. But I'm sure that updating the slurmdbd binary from 17.11.2 to 17.11.13 won't have any impact on the cluster, meaning that your current slurmctld 17.11.2 will be able to communicate with it (right now it can't, due to that one bad job record), none of your slurmd daemons will have any issue either, and it will be able to read the current MariaDB information.

Please check this for further details on upgrades:
https://slurm.schedmd.com/quickstart_admin.html#upgrade

Hope it helps,
Albert
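The clone-and-build route can be sketched as a small script. This is a sketch under assumptions: the tag name is the one mentioned in the ticket, and DRY_RUN=1 (the default here) only prints each command so the plan can be reviewed before running it for real. Packaging the result into RPMs for Bright is site-specific and not covered.

```shell
# Hedged sketch: fetch and build the slurm-17-11-13-2 tag from GitHub.
# With DRY_RUN=1 (default) the commands are printed, not executed.
TAG=slurm-17-11-13-2
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "+ $*"        # show what would run
  else
    "$@"
  fi
}

run git clone https://github.com/SchedMD/slurm.git
run git -C slurm checkout "$TAG"
# Classic autotools build, as in the quickstart guide.
run sh -c 'cd slurm && ./configure && make && make install'
```

Setting DRY_RUN=0 executes the same sequence for real on the build host.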
Hi Alber, Thank you for the information. Our team is reduced to the minimum at the moment but we are planning to do the upgrade tomorrow. In case if we need your help, what is your availability tomorrow? *---* Axinia Radeva Manager, Research Computing Services On Fri, Oct 29, 2021 at 12:29 PM <bugs@schedmd.com> wrote: > *Comment # 22 > <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D12747-23c22&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=4v6XNOMLkOlZVNYSZZEHUddWVGsteDK-7RNrHFN7nyY&m=76gJ4dNmYvMFSJGxYJbk-umSvHtRWOCAMZxRQTkW_Uw&s=_Ggr--ybAbghbqvVpXwMQPegPWkXqYm16o2jlyWy0BI&e=> > on bug 12747 > <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D12747&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=4v6XNOMLkOlZVNYSZZEHUddWVGsteDK-7RNrHFN7nyY&m=76gJ4dNmYvMFSJGxYJbk-umSvHtRWOCAMZxRQTkW_Uw&s=0OtF6nbH1EXPzkEhzOZlgEydlsQ0OAahfho70XYjQas&e=> > from Albert Gil <albert.gil@schedmd.com> * > > Hi Axinia, > > We are using Bright Computing as a cluster managment software for the > > cluster. > > I see. > > In the past, we upgraded slurm through Bright. Bright provided all the > > slurm rpms for the upgrade. As I mentioned the cluster was supposed to be > > retired last year and at the moment the cluster does not have Bright > > support. We just asked for a quote to extend the Bright support. > > I'm not so familiar with Bright upgrade mechanism. > > The slurmdb upgrade is time sensitive and I will open another ticket with > > schedmd to get help with the slurmdb upgrade. > > For the necessary minor update to fix your current issue (17.11.2 -> 17.11.13), > I may help you here. > For the major upgrade (17.11.13 -> intermediates -> 21.08.x), yes, better open > a new ticket. > > But please note that the minor update is simpler and important for you, right > now your slurmdbd is just stuck. 
> The major upgrade could wait a bit more, will be more complex, but it's also > important. > > Would you be able to provide > > the slurmdb RPM for the upgrade > > SchedMD doesn't provide .rpm files, only a way to create them. > See: https://slurm.schedmd.com/quickstart_admin.html#quick_start <https://urldefense.proofpoint.com/v2/url?u=https-3A__slurm.schedmd.com_quickstart-5Fadmin.html-23quick-5Fstart&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=4v6XNOMLkOlZVNYSZZEHUddWVGsteDK-7RNrHFN7nyY&m=76gJ4dNmYvMFSJGxYJbk-umSvHtRWOCAMZxRQTkW_Uw&s=0uanAaljvKfW5bRxEzdg5fPdYJe3q_j7cnalbCGOdf0&e=> > > Actually, right now we don't even provide the .tar files to build Slurm for > versions prior to 20.11.7 or 20.02.7 for security reasons. > See: https://www.schedmd.com/archives.php <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.schedmd.com_archives.php&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=4v6XNOMLkOlZVNYSZZEHUddWVGsteDK-7RNrHFN7nyY&m=76gJ4dNmYvMFSJGxYJbk-umSvHtRWOCAMZxRQTkW_Uw&s=xK2rUQ9MkDPHa7-7hohCeNtvkceQoE3UehY84dD0qgE&e=> > > But you can always clone the code form github, checkout the tag/version that > you want to build (slurm-17-11-13-2), and compile it. > See: https://github.com/SchedMD/slurm <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_SchedMD_slurm&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=4v6XNOMLkOlZVNYSZZEHUddWVGsteDK-7RNrHFN7nyY&m=76gJ4dNmYvMFSJGxYJbk-umSvHtRWOCAMZxRQTkW_Uw&s=FcUOZx0EB1WafxNjUo8By1puyfQhqu_6m0oQrv7Whio&e=> > > and do you believe that this will not > > interfere with Bright integration and will not have any negative impact on > > the cluster? > > I don't know about Bright. 
> But I'm sure that updating a slurmdbd binary from 17.11.2 to a 17.11.13 one
> won't have any impact on the cluster, meaning that your current slurmctld
> 17.11.2 will be able to communicate with it (right now it can't, due to a wrong
> job), none of your slurmd daemons will have any issues either, and it will be
> able to read the current MariaDB information.
>
> Please check this for further details on upgrades:
> https://slurm.schedmd.com/quickstart_admin.html#upgrade
>
> Hope it helps,
> Albert
>
> ------------------------------
> You are receiving this mail because:
>
> - You reported the bug.
Hi Axinia,

> Thank you for the information. Our team is reduced to the minimum at
> the moment but we are planning to do the upgrade tomorrow.

Good!

> In case if we need your help, what is your availability tomorrow?

I'm working normally today, but I'm personally located in Europe. Don't worry about it, though: the SchedMD support team is spread all over the world. I'll ask other members of the team in your timezone to keep an eye on this ticket too.

Anyway, if you want to post some sort of summary of what you are planning to do today, I'll double-check that your plan is what I also have in mind, just to avoid any confusion.

Regards,
Albert
Hi Albert,

Thank you for your prompt reply; I appreciate your assistance in this matter. I want to confirm with you that we can do the upgrade on a live cluster and we do not need downtime. Here are the steps that I identified:

1. Shut down the slurmdbd daemon from the head node as root from cmsh (Bright management software):

   - Stop the slurmdbd service:
     [roll->device[roll]->services]% stop slurmdbd
   - Ensure that slurmdbd is not running anymore:
     [roll->device[roll]->services]% status slurmdbd
   - slurmctld might remain running while the database daemon is down. During this time, requests intended for slurmdbd are queued internally. The DBD Agent queue size is limited, however, and should therefore be monitored with sdiag. The current value of the DBD Agent queue size is 146160:

*******************************************************
sdiag output at Mon Nov 01 09:45:07 2021 (1635774307)
Data since      Sun Oct 31 20:00:00 2021 (1635724800)
*******************************************************
Server thread count:  3
Agent queue size:     0
DBD Agent queue size: 146160

Jobs submitted: 7274
Jobs started:   4820
Jobs completed: 4595
Jobs canceled:  10
Jobs failed:    0
Jobs running:   19
Jobs running ts: Mon Nov 01 09:45:00 2021 (1635774300)

2. Back up the Slurm database using mysqldump (or a similar tool), e.g. mysqldump --databases slurm_acct_db > backup.sql. You may also want to take this opportunity to verify that the innodb_buffer_pool_size in my.cnf is at least 128M.

   - Create a backup of the slurm_acct_db database:
     DBnode # mysqldump -p slurm_acct_db > slurm_acct_db.sql
   - In preparation for the conversion, ensure that the variable innodb_buffer_pool_size is set to a value of 128 MB or more. On the database server, run the following command:
     DBnode # echo 'SELECT @@innodb_buffer_pool_size/1024/1024;' | \
              mysql --password --batch

I checked the innodb_buffer_pool_size and it is set to 128 MB. Do we need to increase innodb_buffer_pool_size?
[root@roll ar2667]# echo 'SELECT @@innodb_buffer_pool_size/1024/1024;' | mysql -uslurm --password --batch
Enter password:
@@innodb_buffer_pool_size/1024/1024
128.00000000

   - To permanently change the size, edit the /etc/my.cnf file, set innodb_buffer_pool_size to 128 MB, then restart the database:
     DBnode # rcmysql restart

3. Upgrade the slurmdbd daemon

We got the source code from here: https://github.com/SchedMD/slurm/tree/slurm-17.02

[ar2667@holmes slurmdb]$ cd /rigel/rcs/projects/downloads/slurmdb/slurm-slurm-17.11
[ar2667@holmes slurm-slurm-17.11]$ pwd
/rigel/rcs/projects/downloads/slurmdb/slurm-slurm-17.11
[ar2667@holmes slurm-slurm-17.11]$ ls -ltr
total 1888
-rw-r--r--  1 ar2667 habarcs  18277 Oct 31 10:35 configure.ac
-rw-r--r--  1 ar2667 habarcs  16358 Oct 31 10:35 RELEASE_NOTES
-rw-r--r--  1 ar2667 habarcs   9555 Oct 31 10:35 INSTALL
-rw-r--r--  1 ar2667 habarcs   6369 Oct 31 10:35 DISCLAIMER
-rwxr-xr-x  1 ar2667 habarcs 893466 Oct 31 10:35 configure
-rw-r--r--  1 ar2667 habarcs    119 Oct 31 10:35 AUTHORS
-rw-r--r--  1 ar2667 habarcs   8543 Oct 31 10:35 LICENSE.OpenSSL
drwxrwxr-x  2 ar2667 habarcs   4096 Oct 31 10:35 slurm
drwxrwxr-x  5 ar2667 habarcs   4096 Oct 31 10:35 testsuite
drwxrwxr-x  2 ar2667 habarcs   4096 Oct 31 10:36 etc
drwxrwxr-x 18 ar2667 habarcs   4096 Oct 31 10:36 contribs
-rw-r--r--  1 ar2667 habarcs   1068 Oct 31 10:36 META
-rw-r--r--  1 ar2667 habarcs  16064 Oct 31 10:36 config.h.in
-rw-r--r--  1 ar2667 habarcs   1666 Oct 31 10:36 Makefile.am
-rw-r--r--  1 ar2667 habarcs  12429 Oct 31 10:36 BUILD.NOTES
-rw-r--r--  1 ar2667 habarcs  20474 Oct 31 10:36 COPYING
-rw-r--r--  1 ar2667 habarcs 530522 Oct 31 10:36 NEWS
-rw-r--r--  1 ar2667 habarcs   2761 Oct 31 10:36 CONTRIBUTING.md
-rw-r--r--  1 ar2667 habarcs  21601 Oct 31 10:36 slurm.spec
drwxrwxr-x  4 ar2667 habarcs   4096 Oct 31 10:36 doc
-rw-r--r--  1 ar2667 habarcs   3428 Oct 31 10:36 README.rst
drwxrwxr-x  2 ar2667 habarcs   4096 Oct 31 10:37 auxdir
-rw-r--r--  1 ar2667 habarcs  36672 Oct 31 10:37 Makefile.in
-rw-r--r--  1 ar2667 habarcs  71382 Oct 31 10:37 aclocal.m4
-rwxr-xr-x  1 ar2667 habarcs   2993 Oct 31 10:37 autogen.sh
drwxrwxr-x 33 ar2667 habarcs   4096 Oct 31 10:38 src

I checked the version:

[ar2667@holmes slurm-slurm-17.11]$ ./configure --version
slurm configure 17.11
generated by GNU Autoconf 2.69

Copyright (C) 2012 Free Software Foundation, Inc.
This configure script is free software; the Free Software Foundation
gives unlimited permission to copy, distribute and modify it

[ar2667@holmes slurm-slurm-17.11]$ id slurm
uid=450(slurm) gid=450(slurm) groups=450(slurm)

Building and installing Slurm from source: the Slurm root directory is currently /cm/shared/apps/slurm/17.11.2. We will back up the current /cm/shared/apps/slurm/17.11.2 directory and install the new slurmdbd in the same directory (/cm/shared/apps/slurm/17.11.2). I am not sure whether we need to specify CFLAGS and LDFLAGS explicitly.

$ ./configure --prefix=/cm/shared/apps/slurm/17.11.2 --sysconfdir=/etc/slurm/slurm.conf --cache-file=config.cache --enable-debug
$ make
$ make install
$ ldconfig -n /cm/shared/apps/slurm/17.11.2/lib64

Rebuild the database:

/rigel/cm/shared/apps/slurm/17.11.2/sbin/slurmdbd -D -v

Once you see the following message, you can shut down slurmdbd by pressing Ctrl-C:

Conversion done: success!

Restart slurmdbd.

Please let me know if we missed something.
Best,
Axinia

---
Axinia Radeva
Manager, Research Computing Services

On Mon, Nov 1, 2021 at 7:25 AM <bugs@schedmd.com> wrote:

> Comment #24 <https://bugs.schedmd.com/show_bug.cgi?id=12747#c24> on bug 12747 <https://bugs.schedmd.com/show_bug.cgi?id=12747> from Albert Gil <albert.gil@schedmd.com>
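As a quick way to do the sdiag monitoring described in step 1 above, the "DBD Agent queue size" line can be extracted and compared against a threshold in a short shell snippet. This is only a sketch: the sample text stands in for live `sdiag` output (the 146160 figure is the one reported in this ticket), and the 10000 threshold is the documented minimum default of MaxDBDMsgs, not necessarily this cluster's effective limit.

```shell
#!/bin/sh
# Sketch: pull the DBD Agent queue size out of sdiag output and warn when it
# approaches a MaxDBDMsgs-style limit. In practice, replace $sample with
# the output of the real `sdiag` command.
sample='Server thread count: 3
Agent queue size: 0
DBD Agent queue size: 146160'

threshold=10000   # documented minimum default of MaxDBDMsgs in slurm.conf

queue=$(printf '%s\n' "$sample" | awk -F': *' '/^DBD Agent queue size/ {print $2}')
echo "DBD Agent queue size: $queue"
if [ "$queue" -ge "$threshold" ]; then
  echo "WARNING: queue at or above $threshold; messages to slurmdbd may soon be dropped"
fi
```

With live data, the same check could run in a loop (e.g. every minute) while slurmdbd is stopped for the upgrade.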
Hi Axinia,

> I want to confirm with you that we can do the upgrade on a live cluster and
> we do not need downtime.

Yes, I can confirm this.

> Shutdown the slurmdbd daemon from the head node as root from cmsh
> (Bright management software):
> - Stop the slurmdbd service:
>   [roll->device[roll]->services]% stop slurmdbd
> - Ensure that slurmdbd is not running anymore:
>   [roll->device[roll]->services]% status slurmdbd

Looks fine.

> - slurmctld might remain running while the database daemon is down. During
>   this time, requests intended for slurmdbd are queued internally. The DBD
>   Agent Queue size is limited, however, and should therefore be monitored
>   with sdiag.
>
> The current value of the Agent Queue size is 146160

Yes, but your DBD Agent queue is probably full or close to full, as slurmdbd hasn't been able to perform properly for a long time now.

Note that the maximum size of this cache queue on slurmctld can be controlled by the MaxDBDMsgs parameter in slurm.conf:

MaxDBDMsgs
  When communication to the SlurmDBD is not possible the slurmctld will queue messages meant to be processed when the SlurmDBD is available again. In order to avoid running out of memory the slurmctld will only queue so many messages. The default value is 10000, or MaxJobCount * 2 + Node Count * 4, whichever is greater. The value can not be less than 10000.

> 2. Backup the Slurm database using mysqldump (or similar tool), e.g.
>    mysqldump --databases slurm_acct_db > backup.sql. You may also want to take
>    this opportunity to verify that the innodb_buffer_pool_size in my.cnf is at
>    least 128M.
> - Create a backup of the slurm_acct_db database:
>   DBnode # mysqldump -p slurm_acct_db > slurm_acct_db.sql

Good.

> In preparation for the conversion, ensure that the variable
> innodb_buffer_pool_size is set to a value of 128 MB or more:
>
> I checked the innodb_buffer_pool_size and it is set to 128 MB.
> Do we need to increase innodb_buffer_pool_size?
There are other MariaDB variables that you should also check:
https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-before-build

Note that we recommend at least 1024M for innodb_buffer_pool_size:

$ cat my.cnf
...
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900

> - To permanently change the size, edit the /etc/my.cnf file, set
>   innodb_buffer_pool_size to 128 MB, then restart the database:
>   DBnode # rcmysql restart

Better change it to 1024M.

> 3. Upgrade the slurmdbd daemon
>
> We got the source code from here:
> https://github.com/SchedMD/slurm/tree/slurm-17.02

Careful! That is 17.02, which is WRONG. You need 17.11!

> [ar2667@holmes slurmdb]$ cd /rigel/rcs/projects/downloads/slurmdb/slurm-slurm-17.11
> [ar2667@holmes slurm-slurm-17.11]$ pwd
> /rigel/rcs/projects/downloads/slurmdb/slurm-slurm-17.11
>
> I checked the version:
> [ar2667@holmes slurm-slurm-17.11]$ ./configure --version
> slurm configure 17.11

OK, this looks better. I assume you cloned the git repo, so you need to:

$ git checkout slurm-17-11-13-2

> Slurm root directory is currently /cm/shared/apps/slurm/17.11.2.
>
> We will backup the current /cm/shared/apps/slurm/17.11.2 directory and
> install the new slurm db in the same directory
> (/cm/shared/apps/slurm/17.11.2).

Seems good, but the "17.11.2" in the path may be a bit confusing later.

> I am not sure whether we need to specify CFLAGS and LDFLAGS explicitly

It shouldn't be necessary in general.

> $ ./configure --prefix=/cm/shared/apps/slurm/17.11.2
>   --sysconfdir=/etc/slurm/slurm.conf --cache-file=config.cache --enable-debug

I think that Bright does some special configuration for the location of the config files, and it seems that you are handling it properly, but you may need to ask them to be sure. I also don't know whether you need PMIx support or other Slurm features.
Also note that you'll need some packages on the system to build slurmdbd properly with munge and SQL support, like libmunge-dev and libmariadbclient-dev. They depend on your Linux version.

> $ make
> $ make install

Yes. I recommend using -jN to speed it up a bit... ;-)

> Rebuild database
>
> /rigel/cm/shared/apps/slurm/17.11.2/sbin/slurmdbd -D -v
>
> Once you see the following message, you can shut down slurmdbd by pressing
> Ctrl-C:
>
> Conversion done: success!
>
> Restart slurmdbd.

Well, as you are doing a minor upgrade, I don't expect any DB conversion to happen. Please post errors if you face them.

> Please let me know if we missed something.

In general it looks fine. I'll still be working for some time, but my teammates in your timezone are aware of this ticket.

Regards,
Albert
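The minor-versus-major distinction running through this exchange comes down to whether the major.minor component of the version changes: 17.11.2 -> 17.11.13 keeps the 17.11 prefix (a drop-in slurmdbd replacement, no DB conversion expected), while 17.11 -> 21.08 does not (staged upgrade needed). A toy sketch of that comparison, using the version strings from the thread (`classify` is just an illustrative helper, not a Slurm tool):

```shell
#!/bin/sh
# Sketch: classify an upgrade as "minor" (same major.minor, binary swap is
# enough) or "major" (needs the staged upgrade path through intermediates).
classify() {
  from_mm=$(echo "$1" | cut -d. -f1,2)   # e.g. 17.11.2  -> 17.11
  to_mm=$(echo "$2" | cut -d. -f1,2)     # e.g. 17.11.13 -> 17.11
  if [ "$from_mm" = "$to_mm" ]; then
    echo "minor"
  else
    echo "major"
  fi
}

echo "17.11.2 -> 17.11.13: $(classify 17.11.2 17.11.13)"   # minor
echo "17.11.13 -> 21.08.x: $(classify 17.11.13 21.08.x)"   # major
```

Note this toy check is stricter than Slurm's real compatibility rule (slurmdbd generally supports slurmctld/slurmd up to two major releases older), but it captures why the in-place 17.11.13 swap is safe here.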
Hi,

We did the slurmdbd upgrade from slurm-17.11.2 to slurm-17-11-13-2.

ldconfig -n <library_location>

Do we need to link the newly created slurmdbd libraries in order for slurmdbd to run smoothly? If we need to link them, do we need to stop slurmdbd first?

The DBD Agent queue size is 0:

[ar2667@roll slurm-slurm-17-11-13-2]$ sdiag
*******************************************************
sdiag output at Thu Nov 04 16:27:34 2021 (1636057654)
Data since      Wed Nov 03 20:00:00 2021 (1635984000)
*******************************************************
Server thread count:  3
Agent queue size:     0
DBD Agent queue size: 0

Jobs submitted: 5261
Jobs started:   2579
Jobs completed: 2727
Jobs canceled:  649
Jobs failed:    0
Jobs running:   9
Jobs running ts: Thu Nov 04 16:27:24 2021 (1636057644)

Main schedule statistics (microseconds):
  Last cycle:        10075
  Max cycle:         1466764
  Total cycles:      4530
  Mean cycle:        14439
  Mean depth cycle:  114
  Cycles per minute: 3
  Last queue length: 48

Backfilling stats
  Total backfilled jobs (since last slurm start): 36646
  Total backfilled jobs (since last stats cycle start): 1649
  Total backfilled heterogeneous job components: 0
  Total cycles: 2431
  Last cycle when: Thu Nov 04 16:27:13 2021 (1636057633)
  Last cycle:  106221
  Max cycle:   10510947
  Mean cycle:  272490
  Last depth cycle: 128
  Last depth cycle (try sched): 125
  Depth Mean: 185
  Depth Mean (try depth): 131
  Last queue length: 48
  Queue length mean: 150

Remote Procedure Call statistics by message type
  REQUEST_PARTITION_INFO ( 2009) count:1010681 ave_time:2101 total_time:2124106863
  REQUEST_JOB_INFO ( 2003) count:498878 ave_time:159082 total_time:79362635362
  REQUEST_NODE_INFO_SINGLE ( 2040) count:407431 ave_time:150493 total_time:61315783059
  MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:366622 ave_time:50236 total_time:18417720567
  MESSAGE_EPILOG_COMPLETE ( 6012) count:262028 ave_time:149176 total_time:39088453342
  REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:249805 ave_time:873169 total_time:218122125025
  REQUEST_SUBMIT_BATCH_JOB ( 4003)
count:183054 ave_time:89474 total_time:16378755965 REQUEST_PING ( 1008) count:108050 ave_time:670 total_time:72422636 REQUEST_FED_INFO ( 2049) count:105534 ave_time:495 total_time:52299235 REQUEST_JOB_USER_INFO ( 2039) count:94936 ave_time:73919 total_time:7017583272 REQUEST_NODE_INFO ( 2007) count:61971 ave_time:10029 total_time:621519495 REQUEST_KILL_JOB ( 5032) count:35859 ave_time:70541 total_time:2529555102 REQUEST_JOB_STEP_CREATE ( 5001) count:21987 ave_time:19904 total_time:437649341 REQUEST_UPDATE_JOB ( 3001) count:18175 ave_time:13511 total_time:245579280 REQUEST_JOB_INFO_SINGLE ( 2021) count:10486 ave_time:56976 total_time:597460110 REQUEST_STEP_COMPLETE ( 5016) count:4860 ave_time:82818 total_time:402500244 REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:4106 ave_time:142568 total_time:585385921 REQUEST_JOB_READY ( 4019) count:2086 ave_time:48635 total_time:101452740 REQUEST_SHARE_INFO ( 2022) count:1734 ave_time:7757 total_time:13451288 REQUEST_CANCEL_JOB_STEP ( 5005) count:926 ave_time:34256 total_time:31721572 REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:924 ave_time:135138 total_time:124867815 REQUEST_RESOURCE_ALLOCATION ( 4001) count:912 ave_time:151121 total_time:137823173 REQUEST_JOB_ALLOCATION_INFO ( 4014) count:83 ave_time:134873 total_time:11194528 REQUEST_RECONFIGURE ( 1003) count:23 ave_time:4378081 total_time:100695867 ACCOUNTING_UPDATE_MSG (10001) count:23 ave_time:3081122 total_time:70865821 REQUEST_JOB_NOTIFY ( 4022) count:14 ave_time:513 total_time:7190 REQUEST_UPDATE_NODE ( 3002) count:13 ave_time:163184 total_time:2121392 REQUEST_PRIORITY_FACTORS ( 2026) count:10 ave_time:75375 total_time:753750 REQUEST_STATS_INFO ( 2035) count:9 ave_time:350 total_time:3153 REQUEST_JOB_STEP_INFO ( 2005) count:6 ave_time:913 total_time:5482 ACCOUNTING_REGISTER_CTLD (10003) count:3 ave_time:99144 total_time:297433 REQUEST_CREATE_RESERVATION ( 3006) count:3 ave_time:1722 total_time:5167 REQUEST_RESERVATION_INFO ( 2024) count:2 ave_time:297 total_time:595 
REQUEST_TOP_JOB ( 5038) count:1 ave_time:636 total_time:636 REQUEST_BUILD_INFO ( 2001) count:1 ave_time:992 total_time:992 Remote Procedure Call statistics by user root ( 0) count:1968045 ave_time:177435 total_time:349201813840 rec2111 ( 169996) count:957935 ave_time:74371 total_time:71243449580 de2356 ( 466630) count:262940 ave_time:19786 total_time:5202663824 rcc2167 ( 497658) count:111894 ave_time:80291 total_time:8984176222 mrd2165 ( 456704) count:60636 ave_time:90680 total_time:5498526572 zp2221 ( 546238) count:23312 ave_time:31225 total_time:727930989 hzz2000 ( 476402) count:12953 ave_time:133267 total_time:1726212175 tk2757 ( 475100) count:9867 ave_time:36060 total_time:355808405 rl2226 ( 124636) count:8450 ave_time:88802 total_time:750377404 ls3759 ( 544216) count:2462 ave_time:219621 total_time:540708722 kk3291 ( 496278) count:2367 ave_time:137171 total_time:324685378 mi2493 ( 546357) count:2249 ave_time:39402 total_time:88615117 mam2556 ( 497945) count:1949 ave_time:96117 total_time:187333844 dl2860 ( 358507) count:1593 ave_time:156732 total_time:249675512 yx2625 ( 544217) count:1273 ave_time:58559 total_time:74545771 ls3326 ( 441245) count:1162 ave_time:37463 total_time:43533040 adp2164 ( 534798) count:1093 ave_time:59129 total_time:64628798 msd2202 ( 545520) count:1042 ave_time:220796 total_time:230069625 aeh2213 ( 525442) count:1025 ave_time:67385 total_time:69070417 mv2640 ( 453243) count:963 ave_time:79477 total_time:76536782 lmk2202 ( 476448) count:943 ave_time:82737 total_time:78021827 az2604 ( 532427) count:888 ave_time:338792 total_time:300847531 pcd2120 ( 475692) count:823 ave_time:107029 total_time:88085136 ab4689 ( 495992) count:791 ave_time:111803 total_time:88436721 ma3631 ( 462767) count:721 ave_time:70244 total_time:50646299 yg2811 ( 545936) count:713 ave_time:52872 total_time:37698056 mcb2270 ( 485162) count:638 ave_time:58953 total_time:37612491 taf2109 ( 250023) count:614 ave_time:133794 total_time:82149615 zx2250 ( 495594) count:597 
ave_time:93204 total_time:55643075 nobody ( 486473) count:503 ave_time:124392 total_time:62569554 nobody ( 545556) count:477 ave_time:63677 total_time:30374144 yva2000 ( 446751) count:476 ave_time:50400 total_time:23990449 sb4601 ( 546356) count:446 ave_time:97455 total_time:43465066 kmx2000 ( 538398) count:432 ave_time:55456 total_time:23957062 lma2197 ( 545978) count:420 ave_time:93863 total_time:39422850 ad3395 ( 457738) count:412 ave_time:141428 total_time:58268672 yw3376 ( 521854) count:397 ave_time:429944 total_time:170687883 rh2845 ( 472268) count:376 ave_time:78339 total_time:29455533 ar2667 ( 217300) count:336 ave_time:68430 total_time:22992503 rh2883 ( 487394) count:322 ave_time:15159 total_time:4881271 as5460 ( 478898) count:302 ave_time:73032 total_time:22055729 slh2181 ( 485956) count:282 ave_time:75242 total_time:21218504 am5328 ( 525450) count:279 ave_time:66455 total_time:18540974 zw2105 ( 80893) count:256 ave_time:64065 total_time:16400820 htr2104 ( 411716) count:254 ave_time:32973 total_time:8375243 jb4493 ( 546296) count:246 ave_time:57979 total_time:14262871 tma2145 ( 493491) count:229 ave_time:140148 total_time:32094053 ll3450 ( 546422) count:227 ave_time:63897 total_time:14504685 jic2121 ( 463630) count:216 ave_time:39114 total_time:8448678 yy2865 ( 492680) count:207 ave_time:57179 total_time:11836154 os2328 ( 493602) count:200 ave_time:31509 total_time:6301955 mt3197 ( 473089) count:179 ave_time:64292 total_time:11508437 jdn2133 ( 470892) count:165 ave_time:77947 total_time:12861324 ab2080 ( 110745) count:162 ave_time:85449 total_time:13842774 cx2204 ( 477023) count:162 ave_time:101575 total_time:16455282 yz4047 ( 543996) count:155 ave_time:2191 total_time:339624 ca2783 ( 488827) count:151 ave_time:85713 total_time:12942746 am5284 ( 519168) count:148 ave_time:397578 total_time:58841596 mts2188 ( 546099) count:145 ave_time:83144 total_time:12055925 gt2453 ( 545916) count:140 ave_time:7515 total_time:1052177 pfm2119 ( 470991) count:138 
ave_time:67934 total_time:9374942 flw2113 ( 489455) count:132 ave_time:110236 total_time:14551159 mad2314 ( 545522) count:132 ave_time:82226 total_time:10853953 yp2602 ( 546118) count:129 ave_time:109131 total_time:14077980 mea2200 ( 524843) count:127 ave_time:30659 total_time:3893818 hmm2183 ( 543254) count:127 ave_time:81463 total_time:10345822 arr47 ( 543516) count:126 ave_time:53569 total_time:6749815 sh3972 ( 524988) count:125 ave_time:89528 total_time:11191059 bc212 ( 30094) count:109 ave_time:7083 total_time:772063 st3107 ( 474846) count:105 ave_time:19529 total_time:2050639 ia2337 ( 423640) count:102 ave_time:2957 total_time:301701 ik2496 ( 543457) count:87 ave_time:514581 total_time:44768622 kz2303 ( 477679) count:84 ave_time:74047 total_time:6220030 gjc14 ( 488676) count:84 ave_time:93202 total_time:7829028 as4525 ( 365586) count:81 ave_time:1029267 total_time:83370670 yr2322 ( 470274) count:71 ave_time:7979 total_time:566568 ja3170 ( 463425) count:71 ave_time:64624 total_time:4588329 kat2193 ( 496066) count:68 ave_time:147765 total_time:10048080 sj2787 ( 453295) count:65 ave_time:165285 total_time:10743552 el2545 ( 261905) count:58 ave_time:14282 total_time:828390 sw3203 ( 479850) count:51 ave_time:138129 total_time:7044590 hl2902 ( 414743) count:49 ave_time:490264 total_time:24022960 nt2560 ( 544457) count:41 ave_time:161329 total_time:6614525 mgz2110 ( 546421) count:36 ave_time:75339 total_time:2712235 sb3378 ( 314424) count:29 ave_time:10018869 total_time:290547226 fw2366 ( 546279) count:27 ave_time:125658 total_time:3392777 jj3134 ( 545521) count:27 ave_time:123118 total_time:3324189 slurm ( 450) count:26 ave_time:2737048 total_time:71163254 wz2543 ( 545958) count:21 ave_time:71778 total_time:1507344 aso2125 ( 528824) count:20 ave_time:2716 total_time:54328 pab2170 ( 423948) count:20 ave_time:82432 total_time:1648658 iu2153 ( 465048) count:18 ave_time:113845 total_time:2049210 ns3316 ( 498343) count:18 ave_time:62053 total_time:1116960 fg2465 ( 
498193) count:17 ave_time:107099 total_time:1820688 jeg2228 ( 497608) count:14 ave_time:97146 total_time:1360048 jab2443 ( 496774) count:14 ave_time:7292 total_time:102099 ms5924 ( 533600) count:12 ave_time:11046 total_time:132552 kl2792 ( 389785) count:12 ave_time:131466 total_time:1577598 qz2280 ( 451169) count:12 ave_time:352823 total_time:4233881 rl3149 ( 546516) count:11 ave_time:2815 total_time:30967 pab2163 ( 363846) count:9 ave_time:21269 total_time:191421 mz2778 ( 527313) count:8 ave_time:144421 total_time:1155371 xl3041 ( 546419) count:8 ave_time:167814 total_time:1342519 kn2536 ( 544496) count:8 ave_time:7088 total_time:56705 sx2220 ( 476094) count:7 ave_time:270912 total_time:1896390 rc3362 ( 545696) count:7 ave_time:8925 total_time:62478 jnt2136 ( 518724) count:7 ave_time:77231 total_time:540618 mc4138 ( 545636) count:6 ave_time:15008 total_time:90051 ags2198 ( 502018) count:5 ave_time:7624 total_time:38121 jv2575 ( 443748) count:5 ave_time:2155 total_time:10775 xl2727 ( 477843) count:5 ave_time:3102 total_time:15510 yg2607 ( 496576) count:4 ave_time:37494 total_time:149977 reg2171 ( 542795) count:4 ave_time:5529 total_time:22116 hwp2108 ( 527947) count:3 ave_time:1534 total_time:4603 new2128 ( 546417) count:3 ave_time:3694 total_time:11083 nb2869 ( 494059) count:3 ave_time:14663 total_time:43990 lef2150 ( 524806) count:3 ave_time:7479 total_time:22438 yj2650 ( 545416) count:2 ave_time:197 total_time:394 da2709 ( 446758) count:2 ave_time:27576 total_time:55153 dp264 ( 36357) count:1 ave_time:4085 total_time:4085

However, we see the following errors in the slurmdbd logs:

[ar2667@roll slurm-slurm-17-11-13-2]$ sudo cat /var/log/slurmdbd
[2021-11-04T15:37:02.775] Accounting storage MYSQL plugin loaded
[2021-11-04T15:37:02.855] error: chdir(/var/log): Permission denied
[2021-11-04T15:37:02.855] chdir to /var/tmp
[2021-11-04T15:37:18.050] slurmdbd version 17.11.13-2 started
[2021-11-04T15:37:19.493] error: We have more allocated time than is possible (363722400 > 26179200) for cluster habanero(7272) from 2021-11-04T14:00:00 - 2021-11-04T15:00:00 tres 1
[2021-11-04T15:37:19.493] error: We have more time than is possible (26179200+7948800+0)(34128000) > 26179200 for cluster habanero(7272) from 2021-11-04T14:00:00 - 2021-11-04T15:00:00 tres 1
[2021-11-04T15:37:19.493] error: We have more allocated time than is possible (2239812280800 > 196300800000) for cluster habanero(54528000) from 2021-11-04T14:00:00 - 2021-11-04T15:00:00 tres 2
[2021-11-04T15:37:19.493] error: We have more time than is possible (196300800000+47923200000+0)(244224000000) > 196300800000 for cluster habanero(54528000) from 2021-11-04T14:00:00 - 2021-11-04T15:00:00 tres 2
[2021-11-04T15:37:19.493] error: We have more allocated time than is possible (363711600 > 26179200) for cluster habanero(7272) from 2021-11-04T14:00:00 - 2021-11-04T15:00:00 tres 5
[2021-11-04T15:37:19.493] error: We have more time than is possible (26179200+7948800+0)(34128000) > 26179200 for cluster habanero(7272) from 2021-11-04T14:00:00 - 2021-11-04T15:00:00 tres 5
[2021-11-04T15:37:19.494] error: id_assoc 205 doesn't have any tres
[2021-11-04T16:00:08.405] error: id_assoc 205 doesn't have any tres
[2021-11-04T16:00:58.323] Warning: Note very large processing time from daily_rollup for habanero: usec=49879794 began=16:00:08.443

[ar2667@roll slurm-slurm-17-11-13-2]$ sudo tail -300 /var/log/slurmctld
[2021-11-04T15:37:04.058] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
[2021-11-04T15:37:08.367] error: slurmdbd: Sending PersistInit msg: Connection refused
[2021-11-04T15:37:09.060] error: slurmdbd: Sending PersistInit msg: Connection refused
[2021-11-04T15:37:09.061] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
[2021-11-04T15:37:14.064] error: slurmdbd: Sending PersistInit msg: Connection refused
[2021-11-04T15:37:14.065] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
[2021-11-04T15:37:18.058] Registering slurmctld at port 6817 with slurmdbd.
[2021-11-04T15:37:42.551] _job_complete: JobID=25547442 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2021-11-04T15:37:42.551] email msg to yz4047@cumc.columbia.edu: SLURM Job_id=25547442 Name=test.submit Ended, Run time 1-16:12:01, COMPLETED, ExitCode 0
[2021-11-04T15:37:42.552] _job_complete: JobID=25547442 State=0x8003 NodeCnt=1 done
[2021-11-04T15:37:48.710] slurmdbd: agent queue size 133700
[2021-11-04T15:37:52.166] error: _shutdown_backup_controller:send/recv: Connection refused
[2021-11-04T15:38:10.854] error: slurmdbd: agent queue filling (124905), RESTART SLURMDBD NOW
[2021-11-04T15:39:48.820] slurmdbd: agent queue size 61100
[2021-11-04T15:41:35.971] _job_complete: JobID=25543331 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2021-11-04T15:41:35.971] _job_complete: JobID=25543331 State=0x8003 NodeCnt=1 done
[2021-11-04T15:42:52.316] error: _shutdown_backup_controller:send/recv: Connection refused
[2021-11-04T15:47:20.062] job_submit.lua: Function slurm_job_submit called.
[2021-11-04T15:47:20.062] job_submit.lua: Account is jalab.
[2021-11-04T15:47:20.062] job_submit.lua: Regular account.
[2021-11-04T15:47:20.066] _slurm_rpc_submit_batch_job: JobId=25552998 InitPrio=3530 usec=3748
[2021-11-04T15:47:35.215] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 25552998 uid 544496
[2021-11-04T15:47:35.215] email msg to kn2536@columbia.edu: SLURM Job_id=25552998 Name=N_A_13000 Ended, Run time 00:00:00, CANCELLED, ExitCode 0
[2021-11-04T15:47:52.305] error: _shutdown_backup_controller:send/recv: Connection refused
[2021-11-04T15:52:34.952] _job_complete: JobID=25552413 State=0x0 NodeCnt=0 cancelled by interactive user
[2021-11-04T15:52:34.953] _job_complete: JobID=25552413 State=0x4 NodeCnt=0 done
[2021-11-04T15:52:34.972] error: slurm_receive_msg [10.43.4.228:48637]: Zero Bytes were transmitted or received
[2021-11-04T15:52:52.747] error: _shutdown_backup_controller:send/recv: Connection refused
[2021-11-04T15:52:58.726] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 25552427 uid 497658
[2021-11-04T15:52:58.727] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 25552419 uid 497658
[2021-11-04T15:52:58.773] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 25552423 uid 497658
....
[2021-11-04T15:53:06.622] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 25552993 uid 497658
[2021-11-04T15:53:06.622] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 25552997 uid 497658
[2021-11-04T16:32:28.766] job_submit.lua: Function slurm_job_submit called.
[2021-11-04T16:32:28.766] job_submit.lua: Account is astro.
[2021-11-04T16:32:28.766] job_submit.lua: Regular account.
[2021-11-04T16:32:28.790] sched: _slurm_rpc_allocate_resources JobId=25553001 NodeList=(null) usec=26990
[2021-11-04T16:32:28.955] _pick_best_nodes: job 25552999 never runnable in partition apam1
[2021-11-04T16:32:52.048] error: _shutdown_backup_controller:send/recv: Connection refused
[2021-11-04T16:33:03.090] _pick_best_nodes: job 25552999 never runnable in partition apam1
[2021-11-04T16:34:03.265] _pick_best_nodes: job 25552999 never runnable in partition apam1
[2021-11-04T16:34:21.513] _job_complete: JobID=25553001 State=0x0 NodeCnt=0 WTERMSIG 126
[2021-11-04T16:34:21.515] _job_complete: JobID=25553001 State=0x0 NodeCnt=0 cancelled by interactive user
[2021-11-04T16:34:21.518] _job_complete: JobID=25553001 State=0x4 NodeCnt=0 done
[2021-11-04T16:34:21.520] _slurm_rpc_complete_job_allocation: JobID=25553001 State=0x4 NodeCnt=0 error Job/step already completing or completed
[2021-11-04T16:35:03.460] _pick_best_nodes: job 25552999 never runnable in partition apam1

Axinia Radeva
Manager, Research Computing Services

On Mon, Nov 1, 2021 at 12:15 PM <bugs@schedmd.com> wrote:

> Comment #26 <https://bugs.schedmd.com/show_bug.cgi?id=12747#c26> on bug 12747 <https://bugs.schedmd.com/show_bug.cgi?id=12747> from Albert Gil <albert.gil@schedmd.com>
> > Shutdown the slurmdbd daemon from the head node as root from cmsh > > (Bright management software): > > - > > Stop the slurmdbd service: > > [roll->device[roll]->services]% stop slurmdbd > >> - > > Ensure that slurmdbd is not running anymore: > > [roll->device[roll]->services]% status slurmdbd > > Looks fine. > > - > > slurmctld might remain running while the database daemon is down. During > > this time, requests intended for slurmdbd are queued internally. The DBD > > Agent Queue size is limited, however, and should therefore be monitored > > with sdiag. > >> The current value of the Agent Queue size is 146160 > > Yes, but your DBD Agent Queue is probably full or close to be full, as slurmdbd > hasn't been able to perform properly for a long time now. > > Note that the max size of this cache queue on slurctld may be controlled by the > MaxDBDMsgs parameter on slurm,conf: > > MaxDBDMsgs > When communication to the SlurmDBD is not possible the slurmctld will queue > messages meant to processed when the SlurmDBD is available again. In order to > avoid running out of memory the slurmctld will only queue so many messages. The > default value is 10000, or MaxJobCount * 2 + Node Count * 4, whichever is > greater. The value can not be less than 10000. > > > *2. *Backup the Slurm database using mysqldump (or similar tool), e.g. > > mysqldump > > --databases slurm_acct_db > backup.sql. You may also want to take this > > opportunity to verify that the innodb_buffer_pool_size in my.cnf is at > > least 128M. > >> - > > Create a backup of the slurm_acct_db database: > > DBnode # mysqldump -p slurm_acct_db > slurm_acct_db.sql > > Good. > > In preparation for the conversion, ensure that the variable > > innodb_buffer_pool_size is set to a value of 128 Mb or more: > >> I checked the innodb_buffer_pool_size and it set to 128MB > > Do we need to increase innodb_buffer_pool_size? 
> There are other MariaDB variables that you should also check:
> https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-before-build
>
> Note that we recommend at least 1024M for innodb_buffer_pool_size:
>
> $ cat my.cnf
> ...
> [mysqld]
> innodb_buffer_pool_size=1024M
> innodb_log_file_size=64M
> innodb_lock_wait_timeout=900
>
> > - To permanently change the size, edit the /etc/my.cnf file, set
> >   innodb_buffer_pool_size to 128 MB, then restart the database:
> >   DBnode # rcmysql restart
>
> Better change it to 1024M.
>
> > 3. Upgrade the slurmdbd daemon
> >
> > We got the source code from here:
> > https://github.com/SchedMD/slurm/tree/slurm-17.02
>
> Careful!
> That is 17.02, this is WRONG.
> You need 17.11!
>
> > [ar2667@holmes slurmdb]$ cd
> > /rigel/rcs/projects/downloads/slurmdb/slurm-slurm-17.11
> >
> > [ar2667@holmes slurm-slurm-17.11]$ pwd
> > /rigel/rcs/projects/downloads/slurmdb/slurm-slurm-17.11
> >
> > I checked the version:
> > [ar2667@holmes slurm-slurm-17.11]$ ./configure --version
> >
> > slurm configure 17.11
>
> Ok, this looks better.
> I assume you cloned the git repo, so you need to:
>
> $ git checkout slurm-17-11-13-2
>
> > Slurm root directory is currently /cm/shared/apps/slurm/17.11.2.
> > We will backup the current /cm/shared/apps/slurm/17.11.2 directory and
> > install the new slurm db in the same directory
> > (/cm/shared/apps/slurm/17.11.2).
>
> Seems good, but the "17.11.2" in the path may be a bit confusing later.
>
> > I am not sure whether we need to specify CFLAGS and LDFLAGS explicitly
>
> Shouldn't be necessary in general.
>
> > $ ./configure --prefix=/cm/shared/apps/slurm/17.11.2
> > --sysconfdir=/etc/slurm/slurm.conf --cache-file=config.cache --enable-debug
>
> I think that Bright does some special configuration for the location of the
> config files, and it seems that you are handling them properly, but you may
> need to ask them to be sure.
> I also don't know whether you need PMIx support or other Slurm features.
>
> Also note that you'll need some packages on the system to build slurmdbd
> properly with munge and sql support, like libmunge-dev and
> libmariadbclient-dev. They depend on your Linux version.
>
> > $ make
> > $ make install
>
> Yes.
> I recommend using make -jN to speed it up a bit... ;-)
>
> > Rebuild database
> >
> > /rigel/cm/shared/apps/slurm/17.11.2/sbin/slurmdbd -D -v
> >
> > Once you see the following message, you can shut down slurmdbd by pressing
> > Ctrl-C:
> >
> > Conversion done: success!
> >
> > Restart slurmdbd.
>
> Well, as you are doing a minor upgrade, I don't expect any DB conversion to
> happen.
> Please post errors if you face them.
>
> > Please let me know if we missed something.
>
> In general it looks fine.
> I'll still work for some time, but my teammates in your timezone are aware of
> this ticket.
>
> Regards,
> Albert
>
> ------------------------------
> You are receiving this mail because:
>
> - You reported the bug.
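The checkout-configure-build sequence discussed in the quoted exchange can be sketched as follows. The paths, tag, and configure options are the ones quoted in the thread, not independently verified; note also that `--sysconfdir` conventionally points at the directory that contains slurm.conf rather than at the file itself, so the directory form is used here as an assumption:

```shell
# Build the 17.11.13-2 slurmdbd from a clone of the Slurm git repo
# (paths from the thread; adjust for your site).
cd /rigel/rcs/projects/downloads/slurmdb/slurm-slurm-17.11
git checkout slurm-17-11-13-2

# Prefix and options as quoted in the thread; sysconfdir given as a
# directory (an assumption -- confirm with Bright how configs are located)
./configure --prefix=/cm/shared/apps/slurm/17.11.2 \
            --sysconfdir=/etc/slurm \
            --cache-file=config.cache \
            --enable-debug

# Parallel build as suggested ("-jN"); 8 jobs is an arbitrary choice
make -j8
make install
```

This is an operational sketch for a cluster host, not something runnable outside that environment.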
Axinia Radeva,

> We did the slurmdb upgrade from slurm-17.11.2 to slurm-17-11-13-2.

Great!
I assume that you did not yet upgrade slurmctld, nor the slurmds and clients,
right?
I would recommend doing so, but just for sanity; it is not actually related to
your main issue.

> ldconfig -n <library_location>
>
> Do we need to link the newly created slurmdb libraries in order for slurmdb
> to run smoothly?

You shouldn't. If you install it, the executable will just link properly.

From the logs I can see that you are running the right slurmdbd version:

> [2021-11-04T15:37:18.050] slurmdbd version 17.11.13-2 started

> The DBD Agent queue size is 0:

That 0 is very good news! ;-)
It means that slurmctld has already been able to send all the pending messages
to slurmdbd.

> However we see the following error in slurmdbd logs:

Yes, I was expecting to still see some errors, but they are not as important
as the previous ones.

> [2021-11-04T15:37:02.775] Accounting storage MYSQL plugin loaded
> [2021-11-04T15:37:02.855] error: chdir(/var/log): Permission denied
> [2021-11-04T15:37:02.855] chdir to /var/tmp

You need to change the permissions of /var/log, or change the location of the
log files in slurmdbd.conf.

> [2021-11-04T15:37:19.493] error: We have more allocated time than is
> possible (363722400 > 26179200) for cluster habanero(7272) from
> 2021-11-04T14:00:00 - 2021-11-04T15:00:00 tres 1

Due to your original issue, I was already expecting runaway jobs caused by the
probable loss of messages between slurmctld and slurmdbd, and runaways are the
main source of this kind of error.
Actually, runaways are exactly the symptom that you were seeing initially:
jobs that are no longer running in the cluster, but that slurmdbd thinks are
still running.
We can fix runaways, but I would recommend doing it later (see below).
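For reference, the runaway jobs described above can be listed and cleaned with sacctmgr; this subcommand exists in Slurm of this era, though as noted in the thread the actual fix is better done after upgrading to a newer release:

```shell
# Show jobs that the database records as still running but the controller
# no longer knows about. sacctmgr lists them and asks for confirmation
# before fixing them (rolling them up in the database), so this is safe
# to run read-only by answering "n".
sacctmgr show runawayjobs
```

Run it as a Slurm administrator; on this site that means on a host where sacctmgr can reach slurmdbd.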
Regarding the errors on slurmctld:

> [ar2667@roll slurm-slurm-17-11-13-2]$ sudo tail -300 /var/log/slurmctld
> [2021-11-04T15:37:18.058] Registering slurmctld at port 6817 with slurmdbd.
> [2021-11-04T15:37:48.710] slurmdbd: agent queue size 133700

The queue was huge. Even though we managed to digest ~10k messages in just a
few seconds, it was still huge:

> [2021-11-04T15:38:10.854] error: slurmdbd: agent queue filling (124905),
> RESTART SLURMDBD NOW

But in just one minute we halved the pending messages (~65k messages
digested):

> [2021-11-04T15:39:48.820] slurmdbd: agent queue size 61100

And as sdiag tells us, the queue is now empty, so there are no pending
messages between slurmctld and slurmdbd. Great!

But yes, there are other errors that will need attention, like these:

> [2021-11-04T15:37:52.166] error: _shutdown_backup_controller:send/recv:
> Connection refused
> [2021-11-04T15:52:34.972] error: slurm_receive_msg [10.43.4.228:48637]:
> Zero Bytes were transmitted or received

My recommended roadmap for your site would be:

1a) I'm a bit worried about this "backup controller"; I would recommend
disabling it until your site is more stable and you have a supported version
running. Do you know how to disable the backup slurmctld?

1b) Complete the minor upgrade to 17.11.13 for the slurmctld, slurmds, and
clients.

If you need it, we can keep this bug open to help you with this first step.
In that case, please attach your slurm.conf so we have a better idea of your
config.

2a) Open a new ticket to help you plan and make the major upgrade up to 21.08.
Please note that you CANNOT make it directly; you'll need to upgrade to
intermediate versions first.

2b) Once on 21.08, fix the runaway jobs.
I recommend doing that in 21.08 because there have been significant
improvements in detecting and fixing runaway jobs since 17.11, so it's better
to do it in a newer version.
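Until the roadmap above is complete, the controller-to-database agent queue is worth keeping an eye on; sdiag reports it directly, so a simple check is:

```shell
# Spot-check the slurmctld -> slurmdbd agent queue; on a healthy system
# this should stay at or near zero. sdiag prints a line of the form
# "DBD Agent queue size: N".
sdiag | grep -i 'dbd agent queue size'
```

A persistently growing value here is the early warning sign that slurmdbd is again failing to keep up, well before the "agent queue filling" errors appear in the slurmctld log.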
And finally:

3a) Once you are running a supported version, open new tickets to study the
remaining error messages in your logs.

3b) Re-enable a backup controller if you really need it. If it's not
necessary, I would say that it's better not to set up a backup controller and
to keep your config simpler.

Regards,
Albert
Hi Axinia,

> > The DBD Agent queue size is 0:
>
> That 0 is very good news! ;-)

I hope this value still remains low.

> My recommended roadmap for your site would be:
>
> 1a) I'm a bit worried about this "backup controller"; I would recommend
> disabling it until your site is more stable and you have a supported version
> running. Do you know how to disable the backup slurmctld?

Did you disable the backup slurmctld?

> 1b) Complete the minor upgrade to 17.11.13 for the slurmctld, slurmds, and
> clients.

Have you been able to upgrade the full cluster (all daemons and clients) to
17.11.13?

> If you need it, we can keep this bug open to help you with this first step.
> In that case, please attach your slurm.conf so we have a better idea of your
> config.

Do you want to keep this open to complete this 1st step of the suggested
roadmap, or can we close it already?

Regards,
Albert
Hi Albert,

Thank you for following up on this. Unfortunately, we are spread very thin at
the moment and we do not have the resources to execute all our projects. Can
you please see my responses below? Are those tasks time-sensitive?

Best,
Axinia

---
Axinia Radeva
Manager, Research Computing Services

On Fri, Nov 12, 2021 at 6:19 AM <bugs@schedmd.com> wrote:

> *Comment # 29 <https://bugs.schedmd.com/show_bug.cgi?id=12747#c29> on bug
> 12747 <https://bugs.schedmd.com/show_bug.cgi?id=12747> from Albert Gil
> <albert.gil@schedmd.com>*
>
> Hi Axinia,
>
> > > The DBD Agent queue size is 0:
> > >
> > > That 0 is very good news! ;-)
> >
> > I hope this value still remains low.

The DBD Agent queue size is still 0.

> > My recommended roadmap for your site would be:
> >
> > 1a) I'm a bit worried about this "backup controller"; I would recommend
> > disabling it until your site is more stable and you have a supported
> > version running. Do you know how to disable the backup slurmctld?
>
> Did you disable the backup slurmctld?

No, we have not disabled the backup slurmctld. Can you please provide the
steps for how to do it?

> > 1b) Complete the minor upgrade to 17.11.13 for the slurmctld, slurmds, and
> > clients.
>
> Have you been able to upgrade the full cluster (all daemons and clients) to
> 17.11.13?

We have not been able to upgrade the full cluster. I assume we need downtime
for this.

> > If you need it, we can keep this bug open to help you with this first
> > step.
> > In that case, please attach your slurm.conf so we have a better idea of
> > your config.

The slurm.conf is attached.

> Do you want to keep this open to complete this 1st step of the suggested
> roadmap, or can we close it already?
>
> Regards,
> Albert
Hi Axinia,

> Thank you for following up on this. Unfortunately, we are spread very thin
> at the moment and we do not have the resources to execute all our projects.

Ok.

> Are those tasks time-sensitive?

Not really. You've already done the time-sensitive one.

> > Did you disable the backup slurmctld?
>
> No, we have not disabled the backup slurmctld. Can you please provide the
> steps for how to do it?

As you seem busy, and this is neither time-sensitive nor related to the
original issue of the ticket, I would recommend that you file a new ticket
whenever you have time to work on it, and we'll provide you with more
instructions there.

But basically the instructions are:
- Change the slurm.conf to disable the backup slurmctld
- Stop the backup slurmctld
- Restart the main slurmctld

> > Have you been able to upgrade the full cluster (all daemons and clients)
> > to 17.11.13?
>
> We have not been able to upgrade the full cluster. I assume we need downtime
> for this.

Not really.
You can just stop your old slurmctld running 17.11.2 and start the new one
running 17.11.13. All jobs will keep running, and users shouldn't notice it if
you do it quickly enough, i.e. within a few seconds.
The same applies to all the slurmd daemons on the nodes.
It's not a problem to run slurmdbd on version 17.11.13 and the other daemons
on 17.11.2, but it's always recommended to use the same version, just for
sanity.

> > > If you need it, we can keep this bug open to help you on this first
> > > step. In that case, please attach your slurm.conf to have a better idea
> > > of your config.
>
> The slurm.conf is attached.

It seems that it's not attached in bugzilla for some reason, but don't worry.

If this is ok for you, I'm closing this ticket as infogiven; once you have
time, please don't hesitate to file a new bug with your slurm.conf and we'll
help you with the rest of the recommended steps mentioned in comment 28.

Regards,
Albert
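The three bullet steps above can be sketched concretely. This is a hedged outline only: the parameter names are the pre-20.02 ones appropriate to 17.11 (BackupController/BackupAddr), the hostnames are placeholders, and on a Bright-managed cluster the daemon stop/start and slurm.conf distribution would normally go through cmsh rather than systemctl:

```shell
# 1) In slurm.conf, comment out the backup-controller entries, e.g.:
#
#      #BackupController=backup-ctl-host   # placeholder hostname
#      #BackupAddr=backup-ctl-addr         # placeholder address
#
# 2) On the backup controller node, stop its slurmctld
#    (via cmsh on Bright; shown here with systemd as an assumption):
#
#      systemctl stop slurmctld
#
# 3) On the primary controller, restart slurmctld so it reads the
#    updated slurm.conf, and make sure the same slurm.conf is
#    distributed to all nodes (Bright usually handles this):
#
#      systemctl restart slurmctld
```

Treat this as a checklist to adapt, not as commands to paste; the thread itself defers the exact instructions to a follow-up ticket.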
Closing as infogiven.