Hi,

We observed some malfunctioning in the Slurm DB.

$ sacct -u mv2640
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
25473996      lib2st3-5       dsi2        dsi         24    RUNNING      0:0
25485488      himem_st5  dsi1,dsi2        dsi          1    PENDING      0:0
25485489        lib2st5 dsi1,dsi2+        dsi         24    PENDING      0:0
25486925        lib2st5 dsi1,dsi2+        dsi         24    PENDING      0:0

$ squeue -u mv2640
   JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
25485488 dsi1,dsi2 himem_st   mv2640 PD  0:00      1 (Priority)

So: job 25473996 is no longer running, and jobs 25485489 and 25486925 are no longer pending (they already ran and ended as well). For example, 25473996 is not running:

[ar2667@holmes ~]$ sacct -j 25473996 -u mv2640 --format=User,JobID,jobname,state,time,start,end,elapsed,ReqTRES,nodelist
     User        JobID    JobName      State  Timelimit               Start                 End    Elapsed    ReqTRES        NodeList
--------- ------------ ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------------
   mv2640 25473996      lib2st3-5    RUNNING 5-00:00:00 2021-10-25T14:01:10             Unknown   22:11:17 cpu=24,me+         node215

[ar2667@node215 ~]$ sudo less /var/log/slurmd |grep 25473996
[2021-10-25T14:01:13.115] _run_prolog: prolog with lock for job 25473996 ran for 0 seconds
[2021-10-25T14:01:13.115] Launching batch job 25473996 for UID 453243
[2021-10-25T14:01:13.177] [25473996.batch] task/cgroup: /slurm/uid_453243/job_25473996: alloc=96000MB mem.limit=96000MB memsw.limit=96000MB
[2021-10-25T14:01:13.179] [25473996.batch] task/cgroup: /slurm/uid_453243/job_25473996/step_batch: alloc=96000MB mem.limit=96000MB memsw.limit=96000MB
[2021-10-26T05:16:28.459] [25473996.batch] error: Step 25473996.4294967294 hit memory+swap limit at least once during execution. This may or may not result in some failure.
[2021-10-26T05:16:30.928] [25473996.batch] Defering sending signal, processes in job are currently core dumping
[2021-10-26T05:17:01.986] [25473996.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 34560
[2021-10-26T05:17:01.989] [25473996.batch] done with job

Do you have any suggestions on how to reset the DB?
Based on what you have posted, there seems to be a communication issue between the two. Would you please attach the following to this bug?

> slurmdbd.log
> slurmctld.log
> "sacctmgr show cluster format=Cluster,ControlHost,ControlPort,RPC"

If you restart the slurmctld, does the issue resolve?

Do you have any runaways on this cluster?

> "sacctmgr show runaway"
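For reference, "sacctmgr show runaway" lists jobs that are still marked as running in the database but are no longer known to slurmctld, and offers to fix them interactively. A minimal sketch for checking the count non-interactively ("-n" suppresses the header; the count_runaways helper name is ours, not part of Slurm):

```shell
# Hypothetical helper: count runaway entries from `sacctmgr -n show runaway`
# output fed on stdin; each non-blank line is one runaway job.
count_runaways() {
  grep -c '[^[:space:]]'
}

# On a live cluster (assumes sacctmgr is in PATH):
#   sacctmgr -n show runaway | count_runaways
```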
slurmctld <https://drive.google.com/file/d/1XSHV4XQnFLm5RmP0nKlsKwHMFszbcTjR/view?usp=drive_web>
slurmdbd-20210612 <https://drive.google.com/file/d/1vjnPIqj_qb4LZf3oZfVK__hbz432kl94/view?usp=drive_web>

Hi,

Thank you for your reply. The slurmctld logs and the slurmdbd logs up to June 12th are attached. We are still gathering the slurmdbd logs; however, the slurmdbd log file currently has size 0. The slurmdbd.service has been running since June 8th and the log file was created on June 12.

[ar2667@roll ~]$ systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-06-08 18:36:21 EDT; 4 months 18 days ago
  Process: 32682 ExecStart=/cm/shared/apps/slurm/17.11.2/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 32837 (slurmdbd)
   Memory: 200.1M
   CGroup: /system.slice/slurmdbd.service
           └─32837 /cm/shared/apps/slurm/17.11.2/sbin/slurmdbd

[ar2667@roll ~]$ sudo ls -ltr /var/log/slurmdbd
-rw-r----- 1 slurm root 0 Jun 12 03:35 /var/log/slurmdbd

We restarted slurmctld.service on September 30th, but I do not think this solved the issue.

[ar2667@roll ~]$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2021-09-30 12:33:10 EDT; 3 weeks 5 days ago
  Process: 10661 ExecStart=/cm/shared/apps/slurm/17.11.2/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 10663 (slurmctld)
   Memory: 3.1G
   CGroup: /system.slice/slurmctld.service
           └─10663 /cm/shared/apps/slurm/17.11.2/sbin/slurmctld

[ar2667@roll ~]$ sacctmgr show cluster format=Cluster,ControlHost,ControlPort,RPC
   Cluster     ControlHost  ControlPort   RPC
---------- --------------- ------------ -----
  habanero       10.43.4.2         6817  8192
slurm_clu+                            0  7680

The "sacctmgr show runaway" command hangs.
[ar2667@roll ~]$ sacctmgr show runaway

Would restarting the slurmdbd service be helpful in this situation, and what would be the effect on the currently running jobs?

Thank you,
Axinia

---
Axinia Radeva
Manager, Research Computing Services
Hi Axinia,

> slurmctld <https://drive.google.com/file/d/1XSHV4XQnFLm5RmP0nKlsKwHMFszbcTjR/view?usp=drive_web>
> slurmdbd-20210612 <https://drive.google.com/file/d/1vjnPIqj_qb4LZf3oZfVK__hbz432kl94/view?usp=drive_web>

I tried to access them, but wasn't able to. I requested access through gdocs, but you can attach them here if you want.

> [ar2667@roll ~]$ sacctmgr show runaway

So, no output at all? What version are you using?

Can you also attach the output of:

$ sdiag
$ sacctmgr show stats

> Would restarting slurmdbd service be helpful in this situation

It depends on the actual problem you are facing, but it's a simple test.

> and what will be the effect on the currently running jobs?

No effect at all. Note that Slurm is designed in a fault-tolerant way, so it is able to keep working fine even when slurmdbd is down (slurmctld caches the accounting data for a while; only after that is accounting data discarded). Jobs that are already running also keep running when slurmctld is down, but no more jobs can be submitted or launched.

So, you can restart the slurmdbd without any issue.

Regards,
Albert
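Since a slurmdbd restart is safe for running jobs, the restart-and-verify steps can be sketched as follows (unit name and log path taken from this thread; the status_is_active helper is ours, not part of Slurm or systemd):

```shell
# On the slurmdbd host (run as root / via sudo):
#   systemctl restart slurmdbd
#   systemctl is-active slurmdbd     # expect: active
#   tail -f /var/log/slurmdbd        # new log lines should start appearing

# Hypothetical helper so a script can assert on the is-active answer:
status_is_active() {
  [ "$1" = "active" ]
}

# Usage on the cluster:
#   status_is_active "$(systemctl is-active slurmdbd)" && echo "slurmdbd is up"
```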
Created attachment 21980 [details]
slurmctld

Hi Albert,

I just gave you permissions to the slurmctld and slurmdbd-20210612 logs. Can you please try again and let me know if you have any issues accessing the files?

~axinia
Can you please see my responses below?

> > [ar2667@roll ~]$ sacctmgr show runaway
>
> So, no output at all?

No, no output at all.

> What version are you using?
We have slurm 17.11.2.

> Can you also attach the output of:
>
> $ sdiag

[ar2667@roll ~]$ sdiag
*******************************************************
sdiag output at Wed Oct 27 16:36:15 2021 (1635366975)
Data since Tue Oct 26 20:00:00 2021 (1635292800)
*******************************************************
Server thread count: 3
Agent queue size: 0
DBD Agent queue size: 40553

Jobs submitted: 245
Jobs started: 516
Jobs completed: 1403
Jobs canceled: 6
Jobs failed: 0
Jobs running: 297
Jobs running ts: Wed Oct 27 16:35:51 2021 (1635366951)

Main schedule statistics (microseconds):
 Last cycle:   1739
 Max cycle:    173709
 Total cycles: 3049
 Mean cycle:   3888
 Mean depth cycle:  22
 Cycles per minute: 2
 Last queue length: 4

Backfilling stats
 Total backfilled jobs (since last slurm start): 22707
 Total backfilled jobs (since last stats cycle start): 27
 Total backfilled heterogeneous job components: 0
 Total cycles: 2247
 Last cycle when: Wed Oct 27 16:36:05 2021 (1635366965)
 Last cycle: 21708
 Max cycle:  17654554
 Mean cycle: 1728779
 Last depth cycle: 5
 Last depth cycle (try sched): 4
 Depth Mean: 36
 Depth Mean (try depth): 33
 Last queue length: 4
 Queue length mean: 13

Remote Procedure Call statistics by message type
 REQUEST_PARTITION_INFO ( 2009) count:813036 ave_time:1967 total_time:1599431435
 REQUEST_JOB_INFO ( 2003) count:396657 ave_time:171717 total_time:68113042699
 REQUEST_NODE_INFO_SINGLE ( 2040) count:315594 ave_time:181072 total_time:57145431559
 MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:283896 ave_time:57581 total_time:16347141542
 MESSAGE_EPILOG_COMPLETE ( 6012) count:206419 ave_time:168857 total_time:34855343419
 REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:195367 ave_time:814519 total_time:159130204413
 REQUEST_SUBMIT_BATCH_JOB ( 4003) count:144407 ave_time:89256 total_time:12889214894
 REQUEST_FED_INFO ( 2049) count:103471 ave_time:445 total_time:46063494
 REQUEST_JOB_USER_INFO ( 2039) count:93247 ave_time:66961 total_time:6243924165
 REQUEST_PING ( 1008) count:83486
ave_time:587 total_time:49082340 REQUEST_NODE_INFO ( 2007) count:49710 ave_time:10925 total_time:543102582 REQUEST_KILL_JOB ( 5032) count:27008 ave_time:63347 total_time:1710881768 REQUEST_UPDATE_JOB ( 3001) count:17744 ave_time:13718 total_time:243417555 REQUEST_JOB_INFO_SINGLE ( 2021) count:10143 ave_time:41705 total_time:423019514 REQUEST_JOB_STEP_CREATE ( 5001) count:2562 ave_time:1508 total_time:3865672 REQUEST_STEP_COMPLETE ( 5016) count:2042 ave_time:103350 total_time:211041984 REQUEST_JOB_READY ( 4019) count:1550 ave_time:52569 total_time:81482015 REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:1433 ave_time:118185 total_time:169360333 REQUEST_SHARE_INFO ( 2022) count:1343 ave_time:6551 total_time:8799096 REQUEST_CANCEL_JOB_STEP ( 5005) count:739 ave_time:41977 total_time:31021520 REQUEST_RESOURCE_ALLOCATION ( 4001) count:673 ave_time:189879 total_time:127788965 REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:671 ave_time:175800 total_time:117961935 REQUEST_JOB_ALLOCATION_INFO ( 4014) count:47 ave_time:237879 total_time:11180339 ACCOUNTING_UPDATE_MSG (10001) count:21 ave_time:3054223 total_time:64138693 REQUEST_RECONFIGURE ( 1003) count:19 ave_time:4520842 total_time:85895999 REQUEST_JOB_NOTIFY ( 4022) count:13 ave_time:490 total_time:6376 REQUEST_UPDATE_NODE ( 3002) count:12 ave_time:176752 total_time:2121031 REQUEST_PRIORITY_FACTORS ( 2026) count:10 ave_time:75375 total_time:753750 REQUEST_JOB_STEP_INFO ( 2005) count:4 ave_time:814 total_time:3259 REQUEST_TOP_JOB ( 5038) count:1 ave_time:636 total_time:636 REQUEST_STATS_INFO ( 2035) count:0 ave_time:0 total_time:0 Remote Procedure Call statistics by user root ( 0) count:1528109 ave_time:181229 total_time:276938293972 rec2111 ( 169996) count:776364 ave_time:79997 total_time:62107356803 de2356 ( 466630) count:262940 ave_time:19786 total_time:5202663824 rcc2167 ( 497658) count:97623 ave_time:78925 total_time:7704912315 mrd2165 ( 456704) count:39139 ave_time:90827 total_time:3554897920 hzz2000 ( 476402) count:10550 
ave_time:96118 total_time:1014046590 rl2226 ( 124636) count:6533 ave_time:100609 total_time:657278646 kk3291 ( 496278) count:2068 ave_time:150099 total_time:310405049 mam2556 ( 497945) count:1949 ave_time:96117 total_time:187333844 zp2221 ( 546238) count:1920 ave_time:46497 total_time:89275799 mi2493 ( 546357) count:1669 ave_time:50929 total_time:85001409 ls3759 ( 544216) count:1491 ave_time:49866 total_time:74351335 adp2164 ( 534798) count:1093 ave_time:59129 total_time:64628798 aeh2213 ( 525442) count:1025 ave_time:67385 total_time:69070417 yx2625 ( 544217) count:1014 ave_time:56383 total_time:57172456 mv2640 ( 453243) count:931 ave_time:80170 total_time:74638895 ls3326 ( 441245) count:926 ave_time:28078 total_time:26000676 dl2860 ( 358507) count:900 ave_time:97916 total_time:88125149 lmk2202 ( 476448) count:885 ave_time:85170 total_time:75375897 msd2202 ( 545520) count:834 ave_time:83247 total_time:69428080 pcd2120 ( 475692) count:823 ave_time:107029 total_time:88085136 ab4689 ( 495992) count:784 ave_time:112730 total_time:88380903 ma3631 ( 462767) count:655 ave_time:76656 total_time:50209911 az2604 ( 532427) count:601 ave_time:211855 total_time:127325085 nobody ( 486473) count:503 ave_time:124392 total_time:62569554 zx2250 ( 495594) count:479 ave_time:48664 total_time:23310421 al4188 ( 545556) count:477 ave_time:63677 total_time:30374144 sb4601 ( 546356) count:446 ave_time:97455 total_time:43465066 kmx2000 ( 538398) count:432 ave_time:55456 total_time:23957062 lma2197 ( 545978) count:420 ave_time:93863 total_time:39422850 ad3395 ( 457738) count:412 ave_time:141428 total_time:58268672 rh2845 ( 472268) count:376 ave_time:78339 total_time:29455533 tk2757 ( 475100) count:373 ave_time:148094 total_time:55239112 yva2000 ( 446751) count:366 ave_time:63237 total_time:23144945 yw3376 ( 521854) count:296 ave_time:557442 total_time:165003015 slh2181 ( 485956) count:275 ave_time:77101 total_time:21202791 taf2109 ( 250023) count:263 ave_time:239869 total_time:63085689 
zw2105 ( 80893) count:256 ave_time:64065 total_time:16400820 mcb2270 ( 485162) count:247 ave_time:44855 total_time:11079282 ar2667 ( 217300) count:238 ave_time:76016 total_time:18091848 ll3450 ( 546422) count:227 ave_time:63897 total_time:14504685 jic2121 ( 463630) count:216 ave_time:39114 total_time:8448678 yg2811 ( 545936) count:214 ave_time:72923 total_time:15605609 yy2865 ( 492680) count:207 ave_time:57179 total_time:11836154 mt3197 ( 473089) count:179 ave_time:64292 total_time:11508437 jdn2133 ( 470892) count:165 ave_time:77947 total_time:12861324 os2328 ( 493602) count:159 ave_time:39265 total_time:6243245 am5328 ( 525450) count:150 ave_time:49954 total_time:7493181 am5284 ( 519168) count:148 ave_time:397578 total_time:58841596 mts2188 ( 546099) count:145 ave_time:83144 total_time:12055925 ab2080 ( 110745) count:139 ave_time:98768 total_time:13728832 flw2113 ( 489455) count:132 ave_time:110236 total_time:14551159 yp2602 ( 546118) count:129 ave_time:109131 total_time:14077980 mea2200 ( 524843) count:127 ave_time:30659 total_time:3893818 hmm2183 ( 543254) count:127 ave_time:81463 total_time:10345822 sh3972 ( 524988) count:125 ave_time:89528 total_time:11191059 mad2314 ( 545522) count:124 ave_time:87364 total_time:10833220 gt2453 ( 545916) count:123 ave_time:8074 total_time:993106 arr47 ( 543516) count:119 ave_time:56511 total_time:6724840 bc212 ( 30094) count:109 ave_time:7083 total_time:772063 st3107 ( 474846) count:104 ave_time:19534 total_time:2031568 htr2104 ( 411716) count:96 ave_time:54954 total_time:5275619 pfm2119 ( 470991) count:93 ave_time:76138 total_time:7080912 jb4493 ( 546296) count:90 ave_time:46526 total_time:4187423 ca2783 ( 488827) count:85 ave_time:149016 total_time:12666414 kz2303 ( 477679) count:84 ave_time:74047 total_time:6220030 as4525 ( 365586) count:81 ave_time:1029267 total_time:83370670 cx2204 ( 477023) count:75 ave_time:187986 total_time:14099019 gjc14 ( 488676) count:75 ave_time:18452 total_time:1383904 ik2496 ( 543457) count:74 
ave_time:604775 total_time:44753384 ja3170 ( 463425) count:71 ave_time:64624 total_time:4588329 kat2193 ( 496066) count:68 ave_time:147765 total_time:10048080 sj2787 ( 453295) count:65 ave_time:165285 total_time:10743552 hl2902 ( 414743) count:49 ave_time:490264 total_time:24022960 nt2560 ( 544457) count:41 ave_time:161329 total_time:6614525 sw3203 ( 479850) count:39 ave_time:178515 total_time:6962087 mgz2110 ( 546421) count:36 ave_time:75339 total_time:2712235 fw2366 ( 546279) count:27 ave_time:125658 total_time:3392777 sb3378 ( 314424) count:23 ave_time:8064164 total_time:185475792 wz2543 ( 545958) count:21 ave_time:71778 total_time:1507344 slurm ( 450) count:21 ave_time:3054223 total_time:64138693 pab2170 ( 423948) count:20 ave_time:82432 total_time:1648658 ns3316 ( 498343) count:18 ave_time:62053 total_time:1116960 el2545 ( 261905) count:16 ave_time:6983 total_time:111735 jab2443 ( 496774) count:14 ave_time:7292 total_time:102099 jeg2228 ( 497608) count:14 ave_time:97146 total_time:1360048 qz2280 ( 451169) count:12 ave_time:352823 total_time:4233881 kl2792 ( 389785) count:12 ave_time:131466 total_time:1577598 ms5924 ( 533600) count:12 ave_time:11046 total_time:132552 iu2153 ( 465048) count:12 ave_time:59375 total_time:712510 ia2337 ( 423640) count:11 ave_time:4349 total_time:47843 rl3149 ( 546516) count:11 ave_time:2815 total_time:30967 as5460 ( 478898) count:11 ave_time:3335 total_time:36688 pab2163 ( 363846) count:9 ave_time:21269 total_time:191421 yr2322 ( 470274) count:9 ave_time:13798 total_time:124185 mz2778 ( 527313) count:8 ave_time:144421 total_time:1155371 xl3041 ( 546419) count:8 ave_time:167814 total_time:1342519 fg2465 ( 498193) count:7 ave_time:251875 total_time:1763125 sx2220 ( 476094) count:7 ave_time:270912 total_time:1896390 rc3362 ( 545696) count:7 ave_time:8925 total_time:62478 jnt2136 ( 518724) count:7 ave_time:77231 total_time:540618 mc4138 ( 545636) count:6 ave_time:15008 total_time:90051 ags2198 ( 502018) count:5 ave_time:7624 
total_time:38121 yg2607 ( 496576) count:4 ave_time:37494 total_time:149977 reg2171 ( 542795) count:4 ave_time:5529 total_time:22116 nb2869 ( 494059) count:3 ave_time:14663 total_time:43990 hwp2108 ( 527947) count:3 ave_time:1534 total_time:4603 new2128 ( 546417) count:3 ave_time:3694 total_time:11083 da2709 ( 446758) count:2 ave_time:27576 total_time:55153 yj2650 ( 545416) count:2 ave_time:197 total_time:394 dp264 ( 36357) count:1 ave_time:4085 total_time:4085

> $ sacctmgr show stats

[root@roll ar2667]# sacctmgr show stats
Rollup statistics
 Hour  count:0 ave_time:0 max_time:0 total_time:0
 Day   count:0 ave_time:0 max_time:0 total_time:0
 Month count:0 ave_time:0 max_time:0 total_time:0

Remote Procedure Call statistics by message type
 DBD_JOB_COMPLETE ( 1424) count:13260 ave_time:1023 total_time:13573082
 DBD_STEP_START ( 1442) count:10062 ave_time:1079 total_time:10865749
 DBD_STEP_COMPLETE ( 1441) count:9984 ave_time:1185 total_time:11836405
 DBD_JOB_START ( 1425) count:3744 ave_time:1107 total_time:4145054
 DBD_NODE_STATE ( 1432) count:1989 ave_time:783 total_time:1558136
 DBD_SEND_MULT_JOB_START ( 1472) count:142 ave_time:5390 total_time:765402
 DBD_SEND_MULT_MSG ( 1474) count:39 ave_time:1086964 total_time:42391599
 MsgType=6500 ( 6500) count:3 ave_time:4563 total_time:13691
 DBD_CLUSTER_TRES ( 1407) count:2 ave_time:1713 total_time:3426
 DBD_FINI ( 1401) count:2 ave_time:163 total_time:326
 DBD_REGISTER_CTLD ( 1434) count:1 ave_time:15990 total_time:15990
 DBD_GET_STATS ( 1489) count:1 ave_time:178 total_time:178

Remote Procedure Call statistics by user
 slurm ( 450) count:39224 ave_time:2171 total_time:85155669
 ar2667 ( 217300) count:3 ave_time:521 total_time:1565
 root ( 0) count:2 ave_time:5902 total_time:11804
[root@roll ar2667]#

> > Would restarting slurmdbd service be helpful in this situation
>
> It depends on the actual problem you are facing, but it's a simple test.
>
> > and what will be the effect on the currently running jobs?
>
> No effect at all.
> Note that Slurm is designed in a fault-tolerant way, so it's able to keep
> working even when slurmdbd is down.
>
> So, you can restart the slurmdbd without any issue.

We restarted slurmdbd, and now we can see logs in the slurmdbd log file.

Best,
Axinia
Hi Axinia,

> I just gave you permissions to slurmctld and slurmdbd-20210612 logs.
>
> Can you please try again and let me know if you have any issues accessing
> the files.

Yes, I've been able to access and digest them.

> > What version are you using?
>
> We have slurm 17.11.2

Right. I don't think your problem is related to being on an old version, but once we fix your current issue I would recommend upgrading to get a better Slurm, and also better Slurm support.

> [ar2667@roll ~]$ sdiag
> *******************************************************
> sdiag output at Wed Oct 27 16:36:15 2021 (1635366975)
> Data since Tue Oct 26 20:00:00 2021 (1635292800)
> *******************************************************
> Server thread count: 3
> Agent queue size: 0
> DBD Agent queue size: 40553

This 40553 is bad; it should be close to 0. It means that slurmctld has not been able to communicate with slurmdbd for a long while.

> > So, you can restart the slurmdbd without any issue.
>
> We restarted slurmdbd and now we can see logs in the slurmdbd log file.

Good!

From the logs it's quite clear that you are facing some communication issues. There are several issues, but I think the most important one is related to the SQL backend (MariaDB). It seems that slurmdbd has not been able to connect to it since 2021-10-25T09:48:

[2021-10-25T09:48:55.248] error: It looks like the storage has gone away trying to reconnect

Since that point slurmdbd has not been able to interact properly with the SQL backend, so it returns errors to slurmctld, and the two get out of sync as you saw (while slurmctld tries to keep the right info in the DBD agent queue mentioned above). Once slurmdbd is able to interact with MariaDB again, slurmctld will be able to send the pending info on to slurmdbd and MariaDB, the DBD Agent queue size should drain at a good rate, and the info between slurmctld and slurmdbd will be in sync again.
But we need slurmdbd to be able to interact with MariaDB properly again for all that to happen. The last slurmdbd log that I see is from 2021-06-08, so I cannot tell whether things are going better after your last restart.

Could you attach a newer slurmdbd.log, plus a new output of "sdiag", so we can see if the DBD Agent queue size is going down or not? Could you also attach your slurm.conf and slurmdbd.conf (without the password)?

And finally, could you verify that your SQL backend is up and running normally?

Thanks,
Albert
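Both follow-up checks can be scripted. A small sketch, assuming the sdiag output format shown above and the standard MariaDB client tools (dbd_queue_size and sql_alive are our helper names, not Slurm commands):

```shell
# Hypothetical helper: extract the "DBD Agent queue size" value from sdiag
# output on stdin; the number should trend toward 0 once slurmdbd recovers.
dbd_queue_size() {
  awk -F': *' '/DBD Agent queue size/ {print $2}'
}

# Hypothetical helper: succeed only if `mysqladmin ping` reported a live server.
sql_alive() {
  grep -q 'is alive'
}

# On the cluster:
#   watch -n 60 'sdiag | grep "DBD Agent queue size"'   # watch the backlog drain
#   mysqladmin ping | sql_alive && echo "SQL backend answering"
```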
Created attachment 21993 [details]
slurmdbd

Hi Albert,

Thank you so much for looking into this. The queue size is still 44365, and the logs do not look good :( Please find the slurmdbd logs attached.

[ar2667@roll ~]$ sdiag
*******************************************************
sdiag output at Thu Oct 28 10:23:54 2021 (1635431034)
Data since Wed Oct 27 20:00:00 2021 (1635379200)
*******************************************************
Server thread count: 3
Agent queue size: 0
DBD Agent queue size: 44365

Jobs submitted: 3085
Jobs started: 1288
Jobs completed: 249
Jobs canceled: 7
Jobs failed: 0
Jobs running: 1270
Jobs running ts: Thu Oct 28 10:23:51 2021 (1635431031)

Main schedule statistics (microseconds):
 Last cycle:   1317
 Max cycle:    36750
 Total cycles: 1221
 Mean cycle:   2633
 Mean depth cycle: 5
 Cycles per minute: 1
 Last queue length: 12

Backfilling stats
 Total backfilled jobs (since last slurm start): 23715
 Total backfilled jobs (since last stats cycle start): 1006
 Total backfilled heterogeneous job components: 0
 Total cycles: 1720
 Last cycle when: Thu Oct 28 10:23:50 2021 (1635431030)
 Last cycle: 93291
 Max cycle: 1431332
 Mean cycle: 37931
 Last depth cycle: 11
 Last depth cycle (try sched): 9
 Depth Mean: 8
 Depth Mean (try depth): 7
 Last queue length: 12
 Queue length mean: 4

Remote Procedure Call statistics by message type
 REQUEST_PARTITION_INFO ( 2009) count:831552 ave_time:1952 total_time:1623594330
 REQUEST_JOB_INFO ( 2003) count:406512 ave_time:171044 total_time:69531675747
 REQUEST_NODE_INFO_SINGLE ( 2040) count:323927 ave_time:176427 total_time:57149598015
 MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:291516 ave_time:56131 total_time:16363286259
 MESSAGE_EPILOG_COMPLETE ( 6012) count:206864 ave_time:168497 total_time:34856149931
 REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:195699 ave_time:813156 total_time:159133881639
 REQUEST_SUBMIT_BATCH_JOB ( 4003) count:144479 ave_time:89942 total_time:12994735290
 REQUEST_FED_INFO ( 2049) count:103740 ave_time:457
total_time:47453433 REQUEST_JOB_USER_INFO ( 2039) count:93403 ave_time:68546 total_time:6402433658 REQUEST_PING ( 1008) count:85760 ave_time:591 total_time:50726628 REQUEST_NODE_INFO ( 2007) count:50925 ave_time:10755 total_time:547729035 REQUEST_KILL_JOB ( 5032) count:27028 ave_time:63304 total_time:1711003402 REQUEST_UPDATE_JOB ( 3001) count:17924 ave_time:13591 total_time:243614165 REQUEST_JOB_STEP_CREATE ( 5001) count:10733 ave_time:5657 total_time:60717172 REQUEST_JOB_INFO_SINGLE ( 2021) count:10254 ave_time:56508 total_time:579434129 REQUEST_STEP_COMPLETE ( 5016) count:3400 ave_time:71048 total_time:241564287 REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:2772 ave_time:61473 total_time:170404510 REQUEST_JOB_READY ( 4019) count:1637 ave_time:50788 total_time:83141324 REQUEST_SHARE_INFO ( 2022) count:1379 ave_time:6501 total_time:8965464 REQUEST_CANCEL_JOB_STEP ( 5005) count:770 ave_time:40307 total_time:31036992 REQUEST_RESOURCE_ALLOCATION ( 4001) count:710 ave_time:180220 total_time:127956586 REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:709 ave_time:166439 total_time:118005616 REQUEST_JOB_ALLOCATION_INFO ( 4014) count:52 ave_time:215023 total_time:11181207 ACCOUNTING_UPDATE_MSG (10001) count:21 ave_time:3054223 total_time:64138693 REQUEST_RECONFIGURE ( 1003) count:19 ave_time:4520842 total_time:85895999 REQUEST_UPDATE_NODE ( 3002) count:13 ave_time:163184 total_time:2121392 REQUEST_JOB_NOTIFY ( 4022) count:13 ave_time:490 total_time:6376 REQUEST_PRIORITY_FACTORS ( 2026) count:10 ave_time:75375 total_time:753750 REQUEST_JOB_STEP_INFO ( 2005) count:6 ave_time:913 total_time:5482 REQUEST_TOP_JOB ( 5038) count:1 ave_time:636 total_time:636 REQUEST_STATS_INFO ( 2035) count:1 ave_time:407 total_time:407 ACCOUNTING_REGISTER_CTLD (10003) count:1 ave_time:99160 total_time:99160 Remote Procedure Call statistics by user root ( 0) count:1560149 ave_time:177771 total_time:277349410174 rec2111 ( 169996) count:794324 ave_time:79564 total_time:63199851716 de2356 ( 466630) 
count:262940 ave_time:19786 total_time:5202663824 rcc2167 ( 497658) count:97623 ave_time:78925 total_time:7704912315 mrd2165 ( 456704) count:39142 ave_time:90820 total_time:3554899599 zp2221 ( 546238) count:11424 ave_time:26717 total_time:305219161 hzz2000 ( 476402) count:10550 ave_time:96118 total_time:1014046590 rl2226 ( 124636) count:6711 ave_time:98085 total_time:658254663 kk3291 ( 496278) count:2080 ave_time:149244 total_time:310429166 mi2493 ( 546357) count:2074 ave_time:41405 total_time:85875958 mam2556 ( 497945) count:1949 ave_time:96117 total_time:187333844 ls3759 ( 544216) count:1491 ave_time:49866 total_time:74351335 adp2164 ( 534798) count:1093 ave_time:59129 total_time:64628798 yx2625 ( 544217) count:1065 ave_time:53826 total_time:57325529 aeh2213 ( 525442) count:1025 ave_time:67385 total_time:69070417 ls3326 ( 441245) count:953 ave_time:27497 total_time:26205520 mv2640 ( 453243) count:937 ave_time:79661 total_time:74642907 dl2860 ( 358507) count:900 ave_time:97916 total_time:88125149 lmk2202 ( 476448) count:885 ave_time:85170 total_time:75375897 msd2202 ( 545520) count:836 ave_time:271654 total_time:227102845 pcd2120 ( 475692) count:823 ave_time:107029 total_time:88085136 ab4689 ( 495992) count:784 ave_time:112730 total_time:88380903 ma3631 ( 462767) count:714 ave_time:70916 total_time:50634248 az2604 ( 532427) count:606 ave_time:210642 total_time:127649628 nobody ( 486473) count:503 ave_time:124392 total_time:62569554 zx2250 ( 495594) count:482 ave_time:48364 total_time:23311688 al4188 ( 545556) count:477 ave_time:63677 total_time:30374144 sb4601 ( 546356) count:446 ave_time:97455 total_time:43465066 kmx2000 ( 538398) count:432 ave_time:55456 total_time:23957062 lma2197 ( 545978) count:420 ave_time:93863 total_time:39422850 yva2000 ( 446751) count:416 ave_time:57401 total_time:23878992 ad3395 ( 457738) count:412 ave_time:141428 total_time:58268672 rh2845 ( 472268) count:376 ave_time:78339 total_time:29455533 tk2757 ( 475100) count:373 ave_time:148094 
total_time:55239112 taf2109 ( 250023) count:306 ave_time:206495 total_time:63187509 yw3376 ( 521854) count:296 ave_time:557442 total_time:165003015 mcb2270 ( 485162) count:276 ave_time:40250 total_time:11109249 slh2181 ( 485956) count:275 ave_time:77101 total_time:21202791 zw2105 ( 80893) count:256 ave_time:64065 total_time:16400820 ar2667 ( 217300) count:251 ave_time:72292 total_time:18145345 ll3450 ( 546422) count:227 ave_time:63897 total_time:14504685 jic2121 ( 463630) count:216 ave_time:39114 total_time:8448678 yg2811 ( 545936) count:214 ave_time:72923 total_time:15605609 yy2865 ( 492680) count:207 ave_time:57179 total_time:11836154 am5328 ( 525450) count:207 ave_time:36530 total_time:7561807 mt3197 ( 473089) count:179 ave_time:64292 total_time:11508437 jdn2133 ( 470892) count:165 ave_time:77947 total_time:12861324 os2328 ( 493602) count:159 ave_time:39265 total_time:6243245 am5284 ( 519168) count:148 ave_time:397578 total_time:58841596 mts2188 ( 546099) count:145 ave_time:83144 total_time:12055925 ab2080 ( 110745) count:139 ave_time:98768 total_time:13728832 flw2113 ( 489455) count:132 ave_time:110236 total_time:14551159 yp2602 ( 546118) count:129 ave_time:109131 total_time:14077980 mea2200 ( 524843) count:127 ave_time:30659 total_time:3893818 hmm2183 ( 543254) count:127 ave_time:81463 total_time:10345822 sh3972 ( 524988) count:125 ave_time:89528 total_time:11191059 mad2314 ( 545522) count:124 ave_time:87364 total_time:10833220 gt2453 ( 545916) count:123 ave_time:8074 total_time:993106 arr47 ( 543516) count:119 ave_time:56511 total_time:6724840 bc212 ( 30094) count:109 ave_time:7083 total_time:772063 ca2783 ( 488827) count:104 ave_time:122669 total_time:12757626 st3107 ( 474846) count:104 ave_time:19534 total_time:2031568 cx2204 ( 477023) count:98 ave_time:144847 total_time:14195088 htr2104 ( 411716) count:96 ave_time:54954 total_time:5275619 pfm2119 ( 470991) count:93 ave_time:76138 total_time:7080912 jb4493 ( 546296) count:90 ave_time:46526 
total_time:4187423 kz2303 ( 477679) count:84 ave_time:74047 total_time:6220030 as4525 ( 365586) count:81 ave_time:1029267 total_time:83370670 ik2496 ( 543457) count:78 ave_time:573887 total_time:44763197 gjc14 ( 488676) count:75 ave_time:18452 total_time:1383904 ja3170 ( 463425) count:71 ave_time:64624 total_time:4588329 kat2193 ( 496066) count:68 ave_time:147765 total_time:10048080 sj2787 ( 453295) count:65 ave_time:165285 total_time:10743552 hl2902 ( 414743) count:49 ave_time:490264 total_time:24022960 nt2560 ( 544457) count:41 ave_time:161329 total_time:6614525 sw3203 ( 479850) count:39 ave_time:178515 total_time:6962087 mgz2110 ( 546421) count:36 ave_time:75339 total_time:2712235 fw2366 ( 546279) count:27 ave_time:125658 total_time:3392777 sb3378 ( 314424) count:26 ave_time:11173851 total_time:290520130 slurm ( 450) count:22 ave_time:2919902 total_time:64237853 wz2543 ( 545958) count:21 ave_time:71778 total_time:1507344 pab2170 ( 423948) count:20 ave_time:82432 total_time:1648658 ns3316 ( 498343) count:18 ave_time:62053 total_time:1116960 el2545 ( 261905) count:16 ave_time:6983 total_time:111735 fg2465 ( 498193) count:15 ave_time:120308 total_time:1804628 jab2443 ( 496774) count:14 ave_time:7292 total_time:102099 jeg2228 ( 497608) count:14 ave_time:97146 total_time:1360048 kl2792 ( 389785) count:12 ave_time:131466 total_time:1577598 iu2153 ( 465048) count:12 ave_time:59375 total_time:712510 ms5924 ( 533600) count:12 ave_time:11046 total_time:132552 qz2280 ( 451169) count:12 ave_time:352823 total_time:4233881 ia2337 ( 423640) count:11 ave_time:4349 total_time:47843 rl3149 ( 546516) count:11 ave_time:2815 total_time:30967 as5460 ( 478898) count:11 ave_time:3335 total_time:36688 yr2322 ( 470274) count:9 ave_time:13798 total_time:124185 pab2163 ( 363846) count:9 ave_time:21269 total_time:191421 xl3041 ( 546419) count:8 ave_time:167814 total_time:1342519 mz2778 ( 527313) count:8 ave_time:144421 total_time:1155371 sx2220 ( 476094) count:7 ave_time:270912 
total_time:1896390 rc3362 ( 545696) count:7 ave_time:8925 total_time:62478 jnt2136 ( 518724) count:7 ave_time:77231 total_time:540618 mc4138 ( 545636) count:6 ave_time:15008 total_time:90051 ags2198 ( 502018) count:5 ave_time:7624 total_time:38121 yg2607 ( 496576) count:4 ave_time:37494 total_time:149977 reg2171 ( 542795) count:4 ave_time:5529 total_time:22116 nb2869 ( 494059) count:3 ave_time:14663 total_time:43990 hwp2108 ( 527947) count:3 ave_time:1534 total_time:4603 new2128 ( 546417) count:3 ave_time:3694 total_time:11083 da2709 ( 446758) count:2 ave_time:27576 total_time:55153 yj2650 ( 545416) count:2 ave_time:197 total_time:394 dp264 ( 36357) count:1 ave_time:4085 total_time:4085 Best, Axinia *---* Axinia Radeva Manager, Research Computing Services On Thu, Oct 28, 2021 at 7:05 AM <bugs@schedmd.com> wrote: > *Comment # 11 > <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D12747-23c11&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=4v6XNOMLkOlZVNYSZZEHUddWVGsteDK-7RNrHFN7nyY&m=tqQlpaN0udlsX775V5GRVrj8U3SNw_Wcq4U0Rz5gkmU&s=c6SbDjEL80emGfjUC-YzL3-Q6u6zn_e91KMGBBmYHFw&e=> > on bug 12747 > <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D12747&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=4v6XNOMLkOlZVNYSZZEHUddWVGsteDK-7RNrHFN7nyY&m=tqQlpaN0udlsX775V5GRVrj8U3SNw_Wcq4U0Rz5gkmU&s=bWBmBfhTvTyQ6twg_Zba_MNmQMckk0cyOXQQWD7Tow4&e=> > from Albert Gil <albert.gil@schedmd.com> * > > Hi Axinia, > > I just gave you permissions to slurmctld and slurmdbd-20210612 logs. > > > > Can you please try again and let me know if you have any issues accessing > > the files. > > Yes, I've been able to access and digest them. > > > What version are you using? 
> > We have slurm 17.11.2
>
> Right. I don't think that your problem is related to being on an old
> version, but once we fix your current issue I would recommend upgrading to
> get a better Slurm, and also better Slurm support.
>
> > [ar2667@roll ~]$ sdiag
> > *******************************************************
> > sdiag output at Wed Oct 27 16:36:15 2021 (1635366975)
> > Data since      Tue Oct 26 20:00:00 2021 (1635292800)
> > *******************************************************
> > Server thread count: 3
> > Agent queue size:    0
> > DBD Agent queue size: 40553
>
> This 40553 is bad; it should be close to 0.
> This means that slurmctld has not been able to communicate with slurmdbd
> for a long while.
>
> > > So, you can restart the slurmdbd without any issue.
> >
> > We restarted slurmdbd and now we can see the logs in the slurmdbd log file.
>
> Good!
> From the logs it's quite clear that you are facing some communication
> issues. There are several, but I think the most important one is related
> to the SQL backend (MariaDB). It seems that slurmdbd has not been able to
> connect to it since 2021-10-25T09:48:
>
> [2021-10-25T09:48:55.248] error: It looks like the storage has gone away
> trying to reconnect
>
> Since that point slurmdbd has not been able to interact properly with the
> SQL backend, so it returns errors to slurmctld, and they get out of sync
> as you saw (while slurmctld tries to keep the right info in that DBD
> Agent queue mentioned above). Once slurmdbd can interact with MariaDB
> again, slurmctld will be able to send the updated info to slurmdbd ->
> MariaDB, that DBD Agent queue size should shrink at a good rate, and the
> info between slurmctld and slurmdbd will be in sync again.
> But we need slurmdbd to be able to interact with MariaDB properly again
> for all that to happen.
>
> The last slurmdbd log that I see in your attachments is from 2021-06-08,
> so I cannot tell whether, after your last restart, things are going better
> or not. Could you attach a newer slurmdbd.log, plus a new output of
> "sdiag", so we can see if the DBD Agent queue size is going down?
>
> Could you also attach your slurm.conf and slurmdbd.conf (without the
> password)? And finally, could you verify that your SQL backend is up and
> running normally?
>
> Thanks,
> Albert
>
> ------------------------------
> You are receiving this mail because:
>
> - You reported the bug.
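Albert's request to watch whether the DBD Agent queue size is going down can be scripted for repeated monitoring. A minimal sketch, assuming the sdiag output format shown in this ticket; a sample excerpt stands in for a live `sdiag` call, which is only available on the cluster itself:

```shell
# Extract the "DBD Agent queue size" value from sdiag output.
# In production you would pipe real `sdiag` output instead of the sample.
sdiag_sample='Server thread count: 3
Agent queue size:    0
DBD Agent queue size: 44365'

# Split each line on "colon plus spaces"; print the value field of the
# line that mentions the DBD agent queue.
queue_size=$(printf '%s\n' "$sdiag_sample" | awk -F': *' '/DBD Agent queue size/ {print $2}')
echo "DBD Agent queue size: $queue_size"
```

Run periodically (e.g. from cron or `watch sdiag`), a falling value confirms slurmdbd is draining the backlog again.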
Hi Axinia,

I think that I know where the problem is, and how to fix it.
The origin of the issue is that job 25482027 was launched in this work_dir:

/rigel/sscc/projects/measure_pov/temp cluster_infos/2_Code/5_country datasets creation/Country scripts/Cote d'Ivoire

Note that this directory contains some special characters, like spaces, and especially the apostrophe ('). On your old Slurm version those special characters were an issue; in your case the apostrophe makes MariaDB return an error, which makes slurmdbd think the database is not running properly and retry forever.

Actually, that issue was a CVE:
https://lists.schedmd.com/pipermail/slurm-announce/2018/000006.html

In your case you are not hitting any security problem, but your old version has that vulnerability too.

Fortunately, as you can see, it was already fixed in 17.11.5, so you don't need a major version upgrade (for the moment) to fix it, just a minor one.

Therefore, to fix your issue you should upgrade at least your slurmdbd to the latest 17.11 release (17.11.13).
Please read the upgrade notes first:
https://slurm.schedmd.com/quickstart_admin.html#upgrade

I recommend that you don't hesitate too long over this upgrade, because slurmdbd is really stuck and that is bad. Note that this is a minor-release upgrade, so fortunately you won't need to change any config or similar.

Hope this helps,
Albert

PS: For more details, this is the commit fixing the CVE/your issue:
- https://github.com/SchedMD/slurm/commit/db468895240ad6817628d07054fe54e71273b2fe
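To see why the apostrophe breaks the INSERT, here is a small illustrative sketch. The table and statement are hypothetical stand-ins (the real fix lives in the linked slurmdbd commit, in C); doubling the quote shown here is the SQL-standard escape for an apostrophe inside a string literal:

```shell
# The directory name from job 25482027, apostrophe included:
work_dir="/rigel/sscc/projects/measure_pov/temp cluster_infos/2_Code/5_country datasets creation/Country scripts/Cote d'Ivoire"

# Naive interpolation: the apostrophe in "Cote d'Ivoire" closes the SQL
# string early, leaving "Ivoire')" as a syntax error for the database.
echo "INSERT INTO job_table VALUES (25482027, '$work_dir');"

# Escaping the apostrophe (here by doubling it, the SQL-standard form)
# keeps the whole path inside one string literal.
escaped=$(printf '%s' "$work_dir" | sed "s/'/''/g")
echo "INSERT INTO job_table VALUES (25482027, '$escaped');"
```

The second statement parses cleanly; ensuring slurmdbd always produces the escaped form is essentially what the 17.11.5 fix guarantees.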
Axinia,

I'm downgrading the severity of this issue, as we have already found the problem and the solution. Let me know how the upgrade goes, and please try to keep your Slurm version among the supported releases so you can have a better Slurm and we can provide better support.

Regards,
Albert
Hi Albert,

Thank you so much for figuring out the cause of the slurmdbd issue.

This cluster was supposed to be retired last year but, because of the pandemic, it was extended for another year. We were not planning to perform any upgrades, but it looks like we need to upgrade at least the DB.

I have the following questions:

1) Do we need downtime to upgrade the DB, or can we do it on an active cluster?

2) Do we need to do anything in addition to clear the DBD Agent queue?

In the documentation that you sent I can see:

"The slurmctld daemon must also be upgraded before or at the same time as the slurmd daemons on the compute nodes. Generally, upgrading Slurm on all of the login and compute nodes is recommended, although rolling upgrades are also possible (i.e. upgrading the head node(s) first then upgrading the compute and login nodes later at various times). Also see the note above about reverse compatibility."

3) Do we need to upgrade the slurmctld daemon and the slurmd daemons on the compute nodes?

Best,
Axinia

*---*
Axinia Radeva
Manager, Research Computing Services
Hi Albert,

I have another question. Can we use any of the backups that we have from before Oct. 25th to restore the slurmdb?

$ sudo ls -l /var/spool/cmd/backup/
total 19816
-rw------- 1 root root 2244656 Oct 22 03:19 backup-Fri.sql.gz
-rw------- 1 root root  338859 Oct 22 03:19 backup-monitor-Fri.sql.gz
-rw------- 1 root root  338873 Oct 25 03:45 backup-monitor-Mon.sql.gz
-rw------- 1 root root  338858 Oct 23 03:12 backup-monitor-Sat.sql.gz
-rw------- 1 root root  338873 Oct 24 03:24 backup-monitor-Sun.sql.gz
-rw------- 1 root root  339541 Oct 28 03:45 backup-monitor-Thu.sql.gz
-rw------- 1 root root  339429 Oct 26 03:08 backup-monitor-Tue.sql.gz
-rw------- 1 root root  339505 Oct 27 03:09 backup-monitor-Wed.sql.gz
-rw------- 1 root root 2244656 Oct 25 03:45 backup-Mon.sql.gz
-rw------- 1 root root 2244656 Oct 23 03:12 backup-Sat.sql.gz
-rw------- 1 root root 2244656 Oct 24 03:24 backup-Sun.sql.gz
-rw------- 1 root root 2244655 Oct 28 03:45 backup-Thu.sql.gz
-rw------- 1 root root 2244658 Oct 26 03:08 backup-Tue.sql.gz
-rw------- 1 root root 2244656 Oct 27 03:09 backup-Wed.sql.gz
drwxr-xr-x 5 root root      84 Nov 15  2017 certificates
-rw------- 1 root root 2108333 Feb 13  2017 pre-upgrade-17-02-13_01-49-20_Mon.sql.gz
-rw------- 1 root root   57946 Feb 13  2017 pre-upgrade-monitor-17-02-13_01-49-20_Mon.sql.gz

Best,
Axinia

*---*
Axinia Radeva
Manager, Research Computing Services
Hi Axinia,

> Can we use any of the backups that we have before Oct. 25th to restore the
> slurmdb?

I'm not sure if I fully understand your question.
First of all, let me clarify that the information in MariaDB is correct, so you don't need to use any backup.
The problem is that slurmctld is trying to send the information of one job (with special characters) to slurmdbd to be saved to MariaDB, but due to the special characters slurmdbd cannot do so, because MariaDB returns an error. Nothing else is happening; the data in MariaDB is fine.

Therefore:
If you mean that restoring a backup could solve your problem, then no, it won't solve your current situation.
If you mean that you are unsure whether the upgraded slurmdbd will be able to load and work with old backups, then don't worry, it will.
But again, you don't need any backup to solve your problem. You only need to upgrade slurmdbd to the latest 17.11.

Once this is done and your current issue is fixed, we strongly recommend that you plan a path to upgrade to the current release, 21.08. Please note that you won't be able to jump from 17.11 to 21.08 directly; you will need intermediate upgrades. We can help you with this if you open a specific ticket for it.

Regards,
Albert
Hi Albert,

Thank you so much for your detailed explanation.

We are using Bright Computing as cluster management software for the cluster. In the past, we upgraded Slurm through Bright; Bright provided all the Slurm RPMs for the upgrade. As I mentioned, the cluster was supposed to be retired last year, and at the moment it does not have Bright support. We have just asked for a quote to extend the Bright support.

The slurmdbd upgrade is time sensitive, and I will open another ticket with SchedMD to get help with it. Would you be able to provide the slurmdbd RPM for the upgrade, and do you believe that this will not interfere with the Bright integration or have any negative impact on the cluster?

Best,
Axinia

*---*
Axinia Radeva
Manager, Research Computing Services
Hi Axinia,

> We are using Bright Computing as cluster management software for the
> cluster.

I see.

> In the past, we upgraded Slurm through Bright. Bright provided all the
> Slurm RPMs for the upgrade. As I mentioned, the cluster was supposed to be
> retired last year and at the moment the cluster does not have Bright
> support. We just asked for a quote to extend the Bright support.

I'm not so familiar with Bright's upgrade mechanism.

> The slurmdbd upgrade is time sensitive and I will open another ticket with
> SchedMD to get help with the slurmdbd upgrade.

For the necessary minor update to fix your current issue (17.11.2 -> 17.11.13), I can help you here. For the major upgrade (17.11.13 -> intermediates -> 21.08.x), yes, better to open a new ticket.

But please note that the minor update is simpler and important for you; right now your slurmdbd is simply stuck. The major upgrade can wait a bit longer and will be more complex, but it is also important.

> Would you be able to provide
> the slurmdbd RPM for the upgrade

SchedMD doesn't provide .rpm files, only a way to create them.
See: https://slurm.schedmd.com/quickstart_admin.html#quick_start

Actually, right now we don't even provide the .tar files to build Slurm for versions prior to 20.11.7 or 20.02.7, for security reasons.
See: https://www.schedmd.com/archives.php

But you can always clone the code from GitHub, check out the tag/version that you want to build (slurm-17-11-13-2), and compile it.
See: https://github.com/SchedMD/slurm

> and do you believe that this will not
> interfere with Bright integration and will not have any negative impact on
> the cluster?

I don't know about Bright. But I'm sure that updating the slurmdbd binary from 17.11.2 to 17.11.13 won't have any impact on the cluster, meaning that your current slurmctld 17.11.2 will be able to communicate with it (right now it can't, due to that one bad job record), none of your slurmd daemons will have any issue either, and it will be able to read the current MariaDB information.

Please check this for further details on upgrades:
https://slurm.schedmd.com/quickstart_admin.html#upgrade

Hope it helps,
Albert
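The clone-and-build route can be sketched as a small script. This is a sketch under assumptions: the tag name is the one mentioned in the ticket, and DRY_RUN=1 (the default here) only prints each command so the plan can be reviewed before running it for real. Packaging the result into RPMs for Bright is site-specific and not covered.

```shell
# Hedged sketch: fetch and build the slurm-17-11-13-2 tag from GitHub.
# With DRY_RUN=1 (default) the commands are printed, not executed.
TAG=slurm-17-11-13-2
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "+ $*"        # show what would run
  else
    "$@"
  fi
}

run git clone https://github.com/SchedMD/slurm.git
run git -C slurm checkout "$TAG"
# Classic autotools build, as in the quickstart guide.
run sh -c 'cd slurm && ./configure && make && make install'
```

Setting DRY_RUN=0 executes the same sequence for real on the build host.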
Hi Alber, Thank you for the information. Our team is reduced to the minimum at the moment but we are planning to do the upgrade tomorrow. In case if we need your help, what is your availability tomorrow? *---* Axinia Radeva Manager, Research Computing Services On Fri, Oct 29, 2021 at 12:29 PM <bugs@schedmd.com> wrote: > *Comment # 22 > <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D12747-23c22&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=4v6XNOMLkOlZVNYSZZEHUddWVGsteDK-7RNrHFN7nyY&m=76gJ4dNmYvMFSJGxYJbk-umSvHtRWOCAMZxRQTkW_Uw&s=_Ggr--ybAbghbqvVpXwMQPegPWkXqYm16o2jlyWy0BI&e=> > on bug 12747 > <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D12747&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=4v6XNOMLkOlZVNYSZZEHUddWVGsteDK-7RNrHFN7nyY&m=76gJ4dNmYvMFSJGxYJbk-umSvHtRWOCAMZxRQTkW_Uw&s=0OtF6nbH1EXPzkEhzOZlgEydlsQ0OAahfho70XYjQas&e=> > from Albert Gil <albert.gil@schedmd.com> * > > Hi Axinia, > > We are using Bright Computing as a cluster managment software for the > > cluster. > > I see. > > In the past, we upgraded slurm through Bright. Bright provided all the > > slurm rpms for the upgrade. As I mentioned the cluster was supposed to be > > retired last year and at the moment the cluster does not have Bright > > support. We just asked for a quote to extend the Bright support. > > I'm not so familiar with Bright upgrade mechanism. > > The slurmdb upgrade is time sensitive and I will open another ticket with > > schedmd to get help with the slurmdb upgrade. > > For the necessary minor update to fix your current issue (17.11.2 -> 17.11.13), > I may help you here. > For the major upgrade (17.11.13 -> intermediates -> 21.08.x), yes, better open > a new ticket. > > But please note that the minor update is simpler and important for you, right > now your slurmdbd is just stuck. 
> The major upgrade could wait a bit more, will be more complex, but it's also > important. > > Would you be able to provide > > the slurmdb RPM for the upgrade > > SchedMD doesn't provide .rpm files, only a way to create them. > See: https://slurm.schedmd.com/quickstart_admin.html#quick_start <https://urldefense.proofpoint.com/v2/url?u=https-3A__slurm.schedmd.com_quickstart-5Fadmin.html-23quick-5Fstart&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=4v6XNOMLkOlZVNYSZZEHUddWVGsteDK-7RNrHFN7nyY&m=76gJ4dNmYvMFSJGxYJbk-umSvHtRWOCAMZxRQTkW_Uw&s=0uanAaljvKfW5bRxEzdg5fPdYJe3q_j7cnalbCGOdf0&e=> > > Actually, right now we don't even provide the .tar files to build Slurm for > versions prior to 20.11.7 or 20.02.7 for security reasons. > See: https://www.schedmd.com/archives.php <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.schedmd.com_archives.php&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=4v6XNOMLkOlZVNYSZZEHUddWVGsteDK-7RNrHFN7nyY&m=76gJ4dNmYvMFSJGxYJbk-umSvHtRWOCAMZxRQTkW_Uw&s=xK2rUQ9MkDPHa7-7hohCeNtvkceQoE3UehY84dD0qgE&e=> > > But you can always clone the code form github, checkout the tag/version that > you want to build (slurm-17-11-13-2), and compile it. > See: https://github.com/SchedMD/slurm <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_SchedMD_slurm&d=DwMFaQ&c=009klHSCxuh5AI1vNQzSO0KGjl4nbi2Q0M1QLJX9BeE&r=4v6XNOMLkOlZVNYSZZEHUddWVGsteDK-7RNrHFN7nyY&m=76gJ4dNmYvMFSJGxYJbk-umSvHtRWOCAMZxRQTkW_Uw&s=FcUOZx0EB1WafxNjUo8By1puyfQhqu_6m0oQrv7Whio&e=> > > and do you believe that this will not > > interfere with Bright integration and will not have any negative impact on > > the cluster? > > I don't know about Bright. 
> But I'm sure that updating a slurmdbd binary from 17.11.2 to a 17.11.13 one
> won't have any impact on the cluster, meaning that your current slurmctld
> 17.11.2 will be able to communicate with it (right now it can't, due to a wrong
> job), none of your slurmd daemons will have any issues either, and it will be
> able to read the current MariaDB information.
>
> Please check this for further details on upgrades:
> https://slurm.schedmd.com/quickstart_admin.html#upgrade
>
> Hope it helps,
> Albert
>
> ------------------------------
> You are receiving this mail because:
>
> - You reported the bug.
Hi Axinia,

> Thank you for the information. Our team is reduced to the minimum at
> the moment but we are planning to do the upgrade tomorrow.

Good!

> In case if we need your help, what is your availability tomorrow?

I'm working normally today, but I'm personally located in Europe. Don't worry about it, though: the SchedMD support team is spread all over the world. I'll ask other members of the team in your timezone to keep an eye on this ticket too.

Anyway, if you want to post some sort of summary of what you are planning to do today, I'll double-check that your plan is what I also have in mind, just to avoid any confusion.

Regards,
Albert
Hi Albert,

Thank you for your prompt reply; I appreciate your assistance in this matter. I want to confirm with you that we can do the upgrade on a live cluster and we do not need downtime. Here are the steps that I identified:

1. Shut down the slurmdbd daemon from the head node as root from cmsh (Bright management software):

   - Stop the slurmdbd service:
     [roll->device[roll]->services]% stop slurmdbd
   - Ensure that slurmdbd is not running anymore:
     [roll->device[roll]->services]% status slurmdbd
   - slurmctld might remain running while the database daemon is down. During this time, requests intended for slurmdbd are queued internally. The DBD Agent queue size is limited, however, and should therefore be monitored with sdiag. The current value of the DBD Agent queue size is 146160:

*******************************************************
sdiag output at Mon Nov 01 09:45:07 2021 (1635774307)
Data since      Sun Oct 31 20:00:00 2021 (1635724800)
*******************************************************
Server thread count:  3
Agent queue size:     0
DBD Agent queue size: 146160

Jobs submitted: 7274
Jobs started:   4820
Jobs completed: 4595
Jobs canceled:  10
Jobs failed:    0
Jobs running:   19
Jobs running ts: Mon Nov 01 09:45:00 2021 (1635774300)

2. Back up the Slurm database using mysqldump (or a similar tool), e.g. mysqldump --databases slurm_acct_db > backup.sql. You may also want to take this opportunity to verify that the innodb_buffer_pool_size in my.cnf is at least 128M.

   - Create a backup of the slurm_acct_db database:
     DBnode # mysqldump -p slurm_acct_db > slurm_acct_db.sql
   - In preparation for the conversion, ensure that the variable innodb_buffer_pool_size is set to a value of 128 MB or more. On the database server, run the following command:
     DBnode # echo 'SELECT @@innodb_buffer_pool_size/1024/1024;' | \
              mysql --password --batch

I checked the innodb_buffer_pool_size and it is set to 128 MB. Do we need to increase innodb_buffer_pool_size?
[root@roll ar2667]# echo 'SELECT @@innodb_buffer_pool_size/1024/1024;' | mysql -uslurm --password --batch
Enter password:
@@innodb_buffer_pool_size/1024/1024
128.00000000

   - To permanently change the size, edit the /etc/my.cnf file, set innodb_buffer_pool_size to 128 MB, then restart the database:
     DBnode # rcmysql restart

3. Upgrade the slurmdbd daemon

We got the source code from here: https://github.com/SchedMD/slurm/tree/slurm-17.02

[ar2667@holmes slurmdb]$ cd /rigel/rcs/projects/downloads/slurmdb/slurm-slurm-17.11
[ar2667@holmes slurm-slurm-17.11]$ pwd
/rigel/rcs/projects/downloads/slurmdb/slurm-slurm-17.11
[ar2667@holmes slurm-slurm-17.11]$ ls -ltr
total 1888
-rw-r--r--  1 ar2667 habarcs  18277 Oct 31 10:35 configure.ac
-rw-r--r--  1 ar2667 habarcs  16358 Oct 31 10:35 RELEASE_NOTES
-rw-r--r--  1 ar2667 habarcs   9555 Oct 31 10:35 INSTALL
-rw-r--r--  1 ar2667 habarcs   6369 Oct 31 10:35 DISCLAIMER
-rwxr-xr-x  1 ar2667 habarcs 893466 Oct 31 10:35 configure
-rw-r--r--  1 ar2667 habarcs    119 Oct 31 10:35 AUTHORS
-rw-r--r--  1 ar2667 habarcs   8543 Oct 31 10:35 LICENSE.OpenSSL
drwxrwxr-x  2 ar2667 habarcs   4096 Oct 31 10:35 slurm
drwxrwxr-x  5 ar2667 habarcs   4096 Oct 31 10:35 testsuite
drwxrwxr-x  2 ar2667 habarcs   4096 Oct 31 10:36 etc
drwxrwxr-x 18 ar2667 habarcs   4096 Oct 31 10:36 contribs
-rw-r--r--  1 ar2667 habarcs   1068 Oct 31 10:36 META
-rw-r--r--  1 ar2667 habarcs  16064 Oct 31 10:36 config.h.in
-rw-r--r--  1 ar2667 habarcs   1666 Oct 31 10:36 Makefile.am
-rw-r--r--  1 ar2667 habarcs  12429 Oct 31 10:36 BUILD.NOTES
-rw-r--r--  1 ar2667 habarcs  20474 Oct 31 10:36 COPYING
-rw-r--r--  1 ar2667 habarcs 530522 Oct 31 10:36 NEWS
-rw-r--r--  1 ar2667 habarcs   2761 Oct 31 10:36 CONTRIBUTING.md
-rw-r--r--  1 ar2667 habarcs  21601 Oct 31 10:36 slurm.spec
drwxrwxr-x  4 ar2667 habarcs   4096 Oct 31 10:36 doc
-rw-r--r--  1 ar2667 habarcs   3428 Oct 31 10:36 README.rst
drwxrwxr-x  2 ar2667 habarcs   4096 Oct 31 10:37 auxdir
-rw-r--r--  1 ar2667 habarcs  36672 Oct 31 10:37 Makefile.in
-rw-r--r--  1 ar2667 habarcs  71382 Oct 31 10:37 aclocal.m4
-rwxr-xr-x  1 ar2667 habarcs   2993 Oct 31 10:37 autogen.sh
drwxrwxr-x 33 ar2667 habarcs   4096 Oct 31 10:38 src

I checked the version:

[ar2667@holmes slurm-slurm-17.11]$ ./configure --version
slurm configure 17.11
generated by GNU Autoconf 2.69

Copyright (C) 2012 Free Software Foundation, Inc.
This configure script is free software; the Free Software Foundation
gives unlimited permission to copy, distribute and modify it

[ar2667@holmes slurm-slurm-17.11]$ id slurm
uid=450(slurm) gid=450(slurm) groups=450(slurm)

Building and installing Slurm from source: the Slurm root directory is currently /cm/shared/apps/slurm/17.11.2. We will back up the current /cm/shared/apps/slurm/17.11.2 directory and install the new slurmdbd in the same directory (/cm/shared/apps/slurm/17.11.2). I am not sure whether we need to specify CFLAGS and LDFLAGS explicitly.

$ ./configure --prefix=/cm/shared/apps/slurm/17.11.2 --sysconfdir=/etc/slurm/slurm.conf --cache-file=config.cache --enable-debug
$ make
$ make install
$ ldconfig -n /cm/shared/apps/slurm/17.11.2/lib64

Rebuild the database:

/rigel/cm/shared/apps/slurm/17.11.2/sbin/slurmdbd -D -v

Once you see the following message, you can shut down slurmdbd by pressing Ctrl-C:

Conversion done: success!

Restart slurmdbd.

Please let me know if we missed something.
Best,
Axinia

---
Axinia Radeva
Manager, Research Computing Services

On Mon, Nov 1, 2021 at 7:25 AM <bugs@schedmd.com> wrote:

> Comment #24 <https://bugs.schedmd.com/show_bug.cgi?id=12747#c24> on bug 12747 <https://bugs.schedmd.com/show_bug.cgi?id=12747> from Albert Gil <albert.gil@schedmd.com>
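As a quick way to do the sdiag monitoring described in step 1 above, the "DBD Agent queue size" line can be extracted and compared against a threshold in a short shell snippet. This is only a sketch: the sample text stands in for live `sdiag` output (the 146160 figure is the one reported in this ticket), and the 10000 threshold is the documented minimum default of MaxDBDMsgs, not necessarily this cluster's effective limit.

```shell
#!/bin/sh
# Sketch: pull the DBD Agent queue size out of sdiag output and warn when it
# approaches a MaxDBDMsgs-style limit. In practice, replace $sample with
# the output of the real `sdiag` command.
sample='Server thread count: 3
Agent queue size: 0
DBD Agent queue size: 146160'

threshold=10000   # documented minimum default of MaxDBDMsgs in slurm.conf

queue=$(printf '%s\n' "$sample" | awk -F': *' '/^DBD Agent queue size/ {print $2}')
echo "DBD Agent queue size: $queue"
if [ "$queue" -ge "$threshold" ]; then
  echo "WARNING: queue at or above $threshold; messages to slurmdbd may soon be dropped"
fi
```

With live data, the same check could run in a loop (e.g. every minute) while slurmdbd is stopped for the upgrade.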
Hi Axinia,

> I want to confirm with you that we can do the upgrade on a live cluster and
> we do not need downtime.

Yes, I can confirm this.

> Shutdown the slurmdbd daemon from the head node as root from cmsh
> (Bright management software):
> - Stop the slurmdbd service:
>   [roll->device[roll]->services]% stop slurmdbd
> - Ensure that slurmdbd is not running anymore:
>   [roll->device[roll]->services]% status slurmdbd

Looks fine.

> - slurmctld might remain running while the database daemon is down. During
>   this time, requests intended for slurmdbd are queued internally. The DBD
>   Agent Queue size is limited, however, and should therefore be monitored
>   with sdiag.
>
> The current value of the Agent Queue size is 146160

Yes, but your DBD Agent queue is probably full or close to full, as slurmdbd hasn't been able to perform properly for a long time now.

Note that the maximum size of this cache queue on slurmctld can be controlled by the MaxDBDMsgs parameter in slurm.conf:

MaxDBDMsgs
  When communication to the SlurmDBD is not possible the slurmctld will queue messages meant to be processed when the SlurmDBD is available again. In order to avoid running out of memory the slurmctld will only queue so many messages. The default value is 10000, or MaxJobCount * 2 + Node Count * 4, whichever is greater. The value can not be less than 10000.

> 2. Backup the Slurm database using mysqldump (or similar tool), e.g.
>    mysqldump --databases slurm_acct_db > backup.sql. You may also want to take
>    this opportunity to verify that the innodb_buffer_pool_size in my.cnf is at
>    least 128M.
> - Create a backup of the slurm_acct_db database:
>   DBnode # mysqldump -p slurm_acct_db > slurm_acct_db.sql

Good.

> In preparation for the conversion, ensure that the variable
> innodb_buffer_pool_size is set to a value of 128 MB or more:
>
> I checked the innodb_buffer_pool_size and it is set to 128 MB.
> Do we need to increase innodb_buffer_pool_size?
There are other MariaDB variables that you should also check:
https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-before-build

Note that we recommend at least 1024M for innodb_buffer_pool_size:

$ cat my.cnf
...
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900

> - To permanently change the size, edit the /etc/my.cnf file, set
>   innodb_buffer_pool_size to 128 MB, then restart the database:
>   DBnode # rcmysql restart

Better change it to 1024M.

> 3. Upgrade the slurmdbd daemon
>
> We got the source code from here:
> https://github.com/SchedMD/slurm/tree/slurm-17.02

Careful! That is 17.02, which is WRONG. You need 17.11!

> [ar2667@holmes slurmdb]$ cd /rigel/rcs/projects/downloads/slurmdb/slurm-slurm-17.11
> [ar2667@holmes slurm-slurm-17.11]$ pwd
> /rigel/rcs/projects/downloads/slurmdb/slurm-slurm-17.11
>
> I checked the version:
> [ar2667@holmes slurm-slurm-17.11]$ ./configure --version
> slurm configure 17.11

OK, this looks better. I assume you cloned the git repo, so you need to:

$ git checkout slurm-17-11-13-2

> Slurm root directory is currently /cm/shared/apps/slurm/17.11.2.
>
> We will backup the current /cm/shared/apps/slurm/17.11.2 directory and
> install the new slurm db in the same directory
> (/cm/shared/apps/slurm/17.11.2).

Seems good, but the "17.11.2" in the path may be a bit confusing later.

> I am not sure whether we need to specify CFLAGS and LDFLAGS explicitly

It shouldn't be necessary in general.

> $ ./configure --prefix=/cm/shared/apps/slurm/17.11.2
>   --sysconfdir=/etc/slurm/slurm.conf --cache-file=config.cache --enable-debug

I think that Bright does some special configuration for the location of the config files, and it seems that you are handling it properly, but you may need to ask them to be sure. I also don't know whether you need PMIx support or other Slurm features.
Also note that you'll need some packages on the system to build slurmdbd properly with munge and SQL support, like libmunge-dev and libmariadbclient-dev. They depend on your Linux version.

> $ make
> $ make install

Yes. I recommend using -jN to speed it up a bit... ;-)

> Rebuild database
>
> /rigel/cm/shared/apps/slurm/17.11.2/sbin/slurmdbd -D -v
>
> Once you see the following message, you can shut down slurmdbd by pressing
> Ctrl-C:
>
> Conversion done: success!
>
> Restart slurmdbd.

Well, as you are doing a minor upgrade, I don't expect any DB conversion to happen. Please post errors if you face them.

> Please let me know if we missed something.

In general it looks fine. I'll still be working for some time, but my teammates in your timezone are aware of this ticket.

Regards,
Albert
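The minor-versus-major distinction running through this exchange comes down to whether the major.minor component of the version changes: 17.11.2 -> 17.11.13 keeps the 17.11 prefix (a drop-in slurmdbd replacement, no DB conversion expected), while 17.11 -> 21.08 does not (staged upgrade needed). A toy sketch of that comparison, using the version strings from the thread (`classify` is just an illustrative helper, not a Slurm tool):

```shell
#!/bin/sh
# Sketch: classify an upgrade as "minor" (same major.minor, binary swap is
# enough) or "major" (needs the staged upgrade path through intermediates).
classify() {
  from_mm=$(echo "$1" | cut -d. -f1,2)   # e.g. 17.11.2  -> 17.11
  to_mm=$(echo "$2" | cut -d. -f1,2)     # e.g. 17.11.13 -> 17.11
  if [ "$from_mm" = "$to_mm" ]; then
    echo "minor"
  else
    echo "major"
  fi
}

echo "17.11.2 -> 17.11.13: $(classify 17.11.2 17.11.13)"   # minor
echo "17.11.13 -> 21.08.x: $(classify 17.11.13 21.08.x)"   # major
```

Note this toy check is stricter than Slurm's real compatibility rule (slurmdbd generally supports slurmctld/slurmd up to two major releases older), but it captures why the in-place 17.11.13 swap is safe here.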
Hi,

We did the slurmdbd upgrade from slurm-17.11.2 to slurm-17-11-13-2.

ldconfig -n <library_location>

Do we need to link the newly created slurmdbd libraries in order for slurmdbd to run smoothly? If we need to link them, do we need to stop slurmdbd first?

The DBD Agent queue size is 0:

[ar2667@roll slurm-slurm-17-11-13-2]$ sdiag
*******************************************************
sdiag output at Thu Nov 04 16:27:34 2021 (1636057654)
Data since      Wed Nov 03 20:00:00 2021 (1635984000)
*******************************************************
Server thread count:  3
Agent queue size:     0
DBD Agent queue size: 0

Jobs submitted: 5261
Jobs started:   2579
Jobs completed: 2727
Jobs canceled:  649
Jobs failed:    0
Jobs running:   9
Jobs running ts: Thu Nov 04 16:27:24 2021 (1636057644)

Main schedule statistics (microseconds):
  Last cycle:        10075
  Max cycle:         1466764
  Total cycles:      4530
  Mean cycle:        14439
  Mean depth cycle:  114
  Cycles per minute: 3
  Last queue length: 48

Backfilling stats
  Total backfilled jobs (since last slurm start): 36646
  Total backfilled jobs (since last stats cycle start): 1649
  Total backfilled heterogeneous job components: 0
  Total cycles: 2431
  Last cycle when: Thu Nov 04 16:27:13 2021 (1636057633)
  Last cycle:  106221
  Max cycle:   10510947
  Mean cycle:  272490
  Last depth cycle: 128
  Last depth cycle (try sched): 125
  Depth Mean: 185
  Depth Mean (try depth): 131
  Last queue length: 48
  Queue length mean: 150

Remote Procedure Call statistics by message type
  REQUEST_PARTITION_INFO ( 2009) count:1010681 ave_time:2101 total_time:2124106863
  REQUEST_JOB_INFO ( 2003) count:498878 ave_time:159082 total_time:79362635362
  REQUEST_NODE_INFO_SINGLE ( 2040) count:407431 ave_time:150493 total_time:61315783059
  MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:366622 ave_time:50236 total_time:18417720567
  MESSAGE_EPILOG_COMPLETE ( 6012) count:262028 ave_time:149176 total_time:39088453342
  REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:249805 ave_time:873169 total_time:218122125025
  REQUEST_SUBMIT_BATCH_JOB ( 4003)
count:183054 ave_time:89474 total_time:16378755965 REQUEST_PING ( 1008) count:108050 ave_time:670 total_time:72422636 REQUEST_FED_INFO ( 2049) count:105534 ave_time:495 total_time:52299235 REQUEST_JOB_USER_INFO ( 2039) count:94936 ave_time:73919 total_time:7017583272 REQUEST_NODE_INFO ( 2007) count:61971 ave_time:10029 total_time:621519495 REQUEST_KILL_JOB ( 5032) count:35859 ave_time:70541 total_time:2529555102 REQUEST_JOB_STEP_CREATE ( 5001) count:21987 ave_time:19904 total_time:437649341 REQUEST_UPDATE_JOB ( 3001) count:18175 ave_time:13511 total_time:245579280 REQUEST_JOB_INFO_SINGLE ( 2021) count:10486 ave_time:56976 total_time:597460110 REQUEST_STEP_COMPLETE ( 5016) count:4860 ave_time:82818 total_time:402500244 REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:4106 ave_time:142568 total_time:585385921 REQUEST_JOB_READY ( 4019) count:2086 ave_time:48635 total_time:101452740 REQUEST_SHARE_INFO ( 2022) count:1734 ave_time:7757 total_time:13451288 REQUEST_CANCEL_JOB_STEP ( 5005) count:926 ave_time:34256 total_time:31721572 REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:924 ave_time:135138 total_time:124867815 REQUEST_RESOURCE_ALLOCATION ( 4001) count:912 ave_time:151121 total_time:137823173 REQUEST_JOB_ALLOCATION_INFO ( 4014) count:83 ave_time:134873 total_time:11194528 REQUEST_RECONFIGURE ( 1003) count:23 ave_time:4378081 total_time:100695867 ACCOUNTING_UPDATE_MSG (10001) count:23 ave_time:3081122 total_time:70865821 REQUEST_JOB_NOTIFY ( 4022) count:14 ave_time:513 total_time:7190 REQUEST_UPDATE_NODE ( 3002) count:13 ave_time:163184 total_time:2121392 REQUEST_PRIORITY_FACTORS ( 2026) count:10 ave_time:75375 total_time:753750 REQUEST_STATS_INFO ( 2035) count:9 ave_time:350 total_time:3153 REQUEST_JOB_STEP_INFO ( 2005) count:6 ave_time:913 total_time:5482 ACCOUNTING_REGISTER_CTLD (10003) count:3 ave_time:99144 total_time:297433 REQUEST_CREATE_RESERVATION ( 3006) count:3 ave_time:1722 total_time:5167 REQUEST_RESERVATION_INFO ( 2024) count:2 ave_time:297 total_time:595 
REQUEST_TOP_JOB ( 5038) count:1 ave_time:636 total_time:636 REQUEST_BUILD_INFO ( 2001) count:1 ave_time:992 total_time:992 Remote Procedure Call statistics by user root ( 0) count:1968045 ave_time:177435 total_time:349201813840 rec2111 ( 169996) count:957935 ave_time:74371 total_time:71243449580 de2356 ( 466630) count:262940 ave_time:19786 total_time:5202663824 rcc2167 ( 497658) count:111894 ave_time:80291 total_time:8984176222 mrd2165 ( 456704) count:60636 ave_time:90680 total_time:5498526572 zp2221 ( 546238) count:23312 ave_time:31225 total_time:727930989 hzz2000 ( 476402) count:12953 ave_time:133267 total_time:1726212175 tk2757 ( 475100) count:9867 ave_time:36060 total_time:355808405 rl2226 ( 124636) count:8450 ave_time:88802 total_time:750377404 ls3759 ( 544216) count:2462 ave_time:219621 total_time:540708722 kk3291 ( 496278) count:2367 ave_time:137171 total_time:324685378 mi2493 ( 546357) count:2249 ave_time:39402 total_time:88615117 mam2556 ( 497945) count:1949 ave_time:96117 total_time:187333844 dl2860 ( 358507) count:1593 ave_time:156732 total_time:249675512 yx2625 ( 544217) count:1273 ave_time:58559 total_time:74545771 ls3326 ( 441245) count:1162 ave_time:37463 total_time:43533040 adp2164 ( 534798) count:1093 ave_time:59129 total_time:64628798 msd2202 ( 545520) count:1042 ave_time:220796 total_time:230069625 aeh2213 ( 525442) count:1025 ave_time:67385 total_time:69070417 mv2640 ( 453243) count:963 ave_time:79477 total_time:76536782 lmk2202 ( 476448) count:943 ave_time:82737 total_time:78021827 az2604 ( 532427) count:888 ave_time:338792 total_time:300847531 pcd2120 ( 475692) count:823 ave_time:107029 total_time:88085136 ab4689 ( 495992) count:791 ave_time:111803 total_time:88436721 ma3631 ( 462767) count:721 ave_time:70244 total_time:50646299 yg2811 ( 545936) count:713 ave_time:52872 total_time:37698056 mcb2270 ( 485162) count:638 ave_time:58953 total_time:37612491 taf2109 ( 250023) count:614 ave_time:133794 total_time:82149615 zx2250 ( 495594) count:597 
ave_time:93204 total_time:55643075 nobody ( 486473) count:503 ave_time:124392 total_time:62569554 nobody ( 545556) count:477 ave_time:63677 total_time:30374144 yva2000 ( 446751) count:476 ave_time:50400 total_time:23990449 sb4601 ( 546356) count:446 ave_time:97455 total_time:43465066 kmx2000 ( 538398) count:432 ave_time:55456 total_time:23957062 lma2197 ( 545978) count:420 ave_time:93863 total_time:39422850 ad3395 ( 457738) count:412 ave_time:141428 total_time:58268672 yw3376 ( 521854) count:397 ave_time:429944 total_time:170687883 rh2845 ( 472268) count:376 ave_time:78339 total_time:29455533 ar2667 ( 217300) count:336 ave_time:68430 total_time:22992503 rh2883 ( 487394) count:322 ave_time:15159 total_time:4881271 as5460 ( 478898) count:302 ave_time:73032 total_time:22055729 slh2181 ( 485956) count:282 ave_time:75242 total_time:21218504 am5328 ( 525450) count:279 ave_time:66455 total_time:18540974 zw2105 ( 80893) count:256 ave_time:64065 total_time:16400820 htr2104 ( 411716) count:254 ave_time:32973 total_time:8375243 jb4493 ( 546296) count:246 ave_time:57979 total_time:14262871 tma2145 ( 493491) count:229 ave_time:140148 total_time:32094053 ll3450 ( 546422) count:227 ave_time:63897 total_time:14504685 jic2121 ( 463630) count:216 ave_time:39114 total_time:8448678 yy2865 ( 492680) count:207 ave_time:57179 total_time:11836154 os2328 ( 493602) count:200 ave_time:31509 total_time:6301955 mt3197 ( 473089) count:179 ave_time:64292 total_time:11508437 jdn2133 ( 470892) count:165 ave_time:77947 total_time:12861324 ab2080 ( 110745) count:162 ave_time:85449 total_time:13842774 cx2204 ( 477023) count:162 ave_time:101575 total_time:16455282 yz4047 ( 543996) count:155 ave_time:2191 total_time:339624 ca2783 ( 488827) count:151 ave_time:85713 total_time:12942746 am5284 ( 519168) count:148 ave_time:397578 total_time:58841596 mts2188 ( 546099) count:145 ave_time:83144 total_time:12055925 gt2453 ( 545916) count:140 ave_time:7515 total_time:1052177 pfm2119 ( 470991) count:138 
ave_time:67934 total_time:9374942 flw2113 ( 489455) count:132 ave_time:110236 total_time:14551159 mad2314 ( 545522) count:132 ave_time:82226 total_time:10853953 yp2602 ( 546118) count:129 ave_time:109131 total_time:14077980 mea2200 ( 524843) count:127 ave_time:30659 total_time:3893818 hmm2183 ( 543254) count:127 ave_time:81463 total_time:10345822 arr47 ( 543516) count:126 ave_time:53569 total_time:6749815 sh3972 ( 524988) count:125 ave_time:89528 total_time:11191059 bc212 ( 30094) count:109 ave_time:7083 total_time:772063 st3107 ( 474846) count:105 ave_time:19529 total_time:2050639 ia2337 ( 423640) count:102 ave_time:2957 total_time:301701 ik2496 ( 543457) count:87 ave_time:514581 total_time:44768622 kz2303 ( 477679) count:84 ave_time:74047 total_time:6220030 gjc14 ( 488676) count:84 ave_time:93202 total_time:7829028 as4525 ( 365586) count:81 ave_time:1029267 total_time:83370670 yr2322 ( 470274) count:71 ave_time:7979 total_time:566568 ja3170 ( 463425) count:71 ave_time:64624 total_time:4588329 kat2193 ( 496066) count:68 ave_time:147765 total_time:10048080 sj2787 ( 453295) count:65 ave_time:165285 total_time:10743552 el2545 ( 261905) count:58 ave_time:14282 total_time:828390 sw3203 ( 479850) count:51 ave_time:138129 total_time:7044590 hl2902 ( 414743) count:49 ave_time:490264 total_time:24022960 nt2560 ( 544457) count:41 ave_time:161329 total_time:6614525 mgz2110 ( 546421) count:36 ave_time:75339 total_time:2712235 sb3378 ( 314424) count:29 ave_time:10018869 total_time:290547226 fw2366 ( 546279) count:27 ave_time:125658 total_time:3392777 jj3134 ( 545521) count:27 ave_time:123118 total_time:3324189 slurm ( 450) count:26 ave_time:2737048 total_time:71163254 wz2543 ( 545958) count:21 ave_time:71778 total_time:1507344 aso2125 ( 528824) count:20 ave_time:2716 total_time:54328 pab2170 ( 423948) count:20 ave_time:82432 total_time:1648658 iu2153 ( 465048) count:18 ave_time:113845 total_time:2049210 ns3316 ( 498343) count:18 ave_time:62053 total_time:1116960 fg2465 ( 
498193) count:17 ave_time:107099 total_time:1820688 jeg2228 ( 497608) count:14 ave_time:97146 total_time:1360048 jab2443 ( 496774) count:14 ave_time:7292 total_time:102099 ms5924 ( 533600) count:12 ave_time:11046 total_time:132552 kl2792 ( 389785) count:12 ave_time:131466 total_time:1577598 qz2280 ( 451169) count:12 ave_time:352823 total_time:4233881 rl3149 ( 546516) count:11 ave_time:2815 total_time:30967 pab2163 ( 363846) count:9 ave_time:21269 total_time:191421 mz2778 ( 527313) count:8 ave_time:144421 total_time:1155371 xl3041 ( 546419) count:8 ave_time:167814 total_time:1342519 kn2536 ( 544496) count:8 ave_time:7088 total_time:56705 sx2220 ( 476094) count:7 ave_time:270912 total_time:1896390 rc3362 ( 545696) count:7 ave_time:8925 total_time:62478 jnt2136 ( 518724) count:7 ave_time:77231 total_time:540618 mc4138 ( 545636) count:6 ave_time:15008 total_time:90051 ags2198 ( 502018) count:5 ave_time:7624 total_time:38121 jv2575 ( 443748) count:5 ave_time:2155 total_time:10775 xl2727 ( 477843) count:5 ave_time:3102 total_time:15510 yg2607 ( 496576) count:4 ave_time:37494 total_time:149977 reg2171 ( 542795) count:4 ave_time:5529 total_time:22116 hwp2108 ( 527947) count:3 ave_time:1534 total_time:4603 new2128 ( 546417) count:3 ave_time:3694 total_time:11083 nb2869 ( 494059) count:3 ave_time:14663 total_time:43990 lef2150 ( 524806) count:3 ave_time:7479 total_time:22438 yj2650 ( 545416) count:2 ave_time:197 total_time:394 da2709 ( 446758) count:2 ave_time:27576 total_time:55153 dp264 ( 36357) count:1 ave_time:4085 total_time:4085

However, we see the following errors in the slurmdbd logs:

[ar2667@roll slurm-slurm-17-11-13-2]$ sudo cat /var/log/slurmdbd
[2021-11-04T15:37:02.775] Accounting storage MYSQL plugin loaded
[2021-11-04T15:37:02.855] error: chdir(/var/log): Permission denied
[2021-11-04T15:37:02.855] chdir to /var/tmp
[2021-11-04T15:37:18.050] slurmdbd version 17.11.13-2 started
[2021-11-04T15:37:19.493] error: We have more allocated time than is possible (363722400 > 26179200) for cluster habanero(7272) from 2021-11-04T14:00:00 - 2021-11-04T15:00:00 tres 1
[2021-11-04T15:37:19.493] error: We have more time than is possible (26179200+7948800+0)(34128000) > 26179200 for cluster habanero(7272) from 2021-11-04T14:00:00 - 2021-11-04T15:00:00 tres 1
[2021-11-04T15:37:19.493] error: We have more allocated time than is possible (2239812280800 > 196300800000) for cluster habanero(54528000) from 2021-11-04T14:00:00 - 2021-11-04T15:00:00 tres 2
[2021-11-04T15:37:19.493] error: We have more time than is possible (196300800000+47923200000+0)(244224000000) > 196300800000 for cluster habanero(54528000) from 2021-11-04T14:00:00 - 2021-11-04T15:00:00 tres 2
[2021-11-04T15:37:19.493] error: We have more allocated time than is possible (363711600 > 26179200) for cluster habanero(7272) from 2021-11-04T14:00:00 - 2021-11-04T15:00:00 tres 5
[2021-11-04T15:37:19.493] error: We have more time than is possible (26179200+7948800+0)(34128000) > 26179200 for cluster habanero(7272) from 2021-11-04T14:00:00 - 2021-11-04T15:00:00 tres 5
[2021-11-04T15:37:19.494] error: id_assoc 205 doesn't have any tres
[2021-11-04T16:00:08.405] error: id_assoc 205 doesn't have any tres
[2021-11-04T16:00:58.323] Warning: Note very large processing time from daily_rollup for habanero: usec=49879794 began=16:00:08.443

[ar2667@roll slurm-slurm-17-11-13-2]$ sudo tail -300 /var/log/slurmctld
[2021-11-04T15:37:04.058] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
[2021-11-04T15:37:08.367] error: slurmdbd: Sending PersistInit msg: Connection refused
[2021-11-04T15:37:09.060] error: slurmdbd: Sending PersistInit msg: Connection refused
[2021-11-04T15:37:09.061] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
[2021-11-04T15:37:14.064] error: slurmdbd: Sending PersistInit msg: Connection refused
[2021-11-04T15:37:14.065] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
[2021-11-04T15:37:18.058] Registering slurmctld at port 6817 with slurmdbd.
[2021-11-04T15:37:42.551] _job_complete: JobID=25547442 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2021-11-04T15:37:42.551] email msg to yz4047@cumc.columbia.edu: SLURM Job_id=25547442 Name=test.submit Ended, Run time 1-16:12:01, COMPLETED, ExitCode 0
[2021-11-04T15:37:42.552] _job_complete: JobID=25547442 State=0x8003 NodeCnt=1 done
[2021-11-04T15:37:48.710] slurmdbd: agent queue size 133700
[2021-11-04T15:37:52.166] error: _shutdown_backup_controller:send/recv: Connection refused
[2021-11-04T15:38:10.854] error: slurmdbd: agent queue filling (124905), RESTART SLURMDBD NOW
[2021-11-04T15:39:48.820] slurmdbd: agent queue size 61100
[2021-11-04T15:41:35.971] _job_complete: JobID=25543331 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2021-11-04T15:41:35.971] _job_complete: JobID=25543331 State=0x8003 NodeCnt=1 done
[2021-11-04T15:42:52.316] error: _shutdown_backup_controller:send/recv: Connection refused
[2021-11-04T15:47:20.062] job_submit.lua: Function slurm_job_submit called.
[2021-11-04T15:47:20.062] job_submit.lua: Account is jalab.
[2021-11-04T15:47:20.062] job_submit.lua: Regular account.
[2021-11-04T15:47:20.066] _slurm_rpc_submit_batch_job: JobId=25552998 InitPrio=3530 usec=3748
[2021-11-04T15:47:35.215] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 25552998 uid 544496
[2021-11-04T15:47:35.215] email msg to kn2536@columbia.edu: SLURM Job_id=25552998 Name=N_A_13000 Ended, Run time 00:00:00, CANCELLED, ExitCode 0
[2021-11-04T15:47:52.305] error: _shutdown_backup_controller:send/recv: Connection refused
[2021-11-04T15:52:34.952] _job_complete: JobID=25552413 State=0x0 NodeCnt=0 cancelled by interactive user
[2021-11-04T15:52:34.953] _job_complete: JobID=25552413 State=0x4 NodeCnt=0 done
[2021-11-04T15:52:34.972] error: slurm_receive_msg [10.43.4.228:48637]: Zero Bytes were transmitted or received
[2021-11-04T15:52:52.747] error: _shutdown_backup_controller:send/recv: Connection refused
[2021-11-04T15:52:58.726] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 25552427 uid 497658
[2021-11-04T15:52:58.727] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 25552419 uid 497658
[2021-11-04T15:52:58.773] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 25552423 uid 497658
....
[2021-11-04T15:53:06.622] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 25552993 uid 497658
[2021-11-04T15:53:06.622] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 25552997 uid 497658
[2021-11-04T16:32:28.766] job_submit.lua: Function slurm_job_submit called.
[2021-11-04T16:32:28.766] job_submit.lua: Account is astro.
[2021-11-04T16:32:28.766] job_submit.lua: Regular account.
[2021-11-04T16:32:28.790] sched: _slurm_rpc_allocate_resources JobId=25553001 NodeList=(null) usec=26990
[2021-11-04T16:32:28.955] _pick_best_nodes: job 25552999 never runnable in partition apam1
[2021-11-04T16:32:52.048] error: _shutdown_backup_controller:send/recv: Connection refused
[2021-11-04T16:33:03.090] _pick_best_nodes: job 25552999 never runnable in partition apam1
[2021-11-04T16:34:03.265] _pick_best_nodes: job 25552999 never runnable in partition apam1
[2021-11-04T16:34:21.513] _job_complete: JobID=25553001 State=0x0 NodeCnt=0 WTERMSIG 126
[2021-11-04T16:34:21.515] _job_complete: JobID=25553001 State=0x0 NodeCnt=0 cancelled by interactive user
[2021-11-04T16:34:21.518] _job_complete: JobID=25553001 State=0x4 NodeCnt=0 done
[2021-11-04T16:34:21.520] _slurm_rpc_complete_job_allocation: JobID=25553001 State=0x4 NodeCnt=0 error Job/step already completing or completed
[2021-11-04T16:35:03.460] _pick_best_nodes: job 25552999 never runnable in partition apam1

Axinia Radeva
Manager, Research Computing Services

On Mon, Nov 1, 2021 at 12:15 PM <bugs@schedmd.com> wrote:

> Comment #26 <https://bugs.schedmd.com/show_bug.cgi?id=12747#c26> on bug 12747 <https://bugs.schedmd.com/show_bug.cgi?id=12747> from Albert Gil <albert.gil@schedmd.com>
> > Shutdown the slurmdbd daemon from the head node as root from cmsh > > (Bright management software): > > - > > Stop the slurmdbd service: > > [roll->device[roll]->services]% stop slurmdbd > >> - > > Ensure that slurmdbd is not running anymore: > > [roll->device[roll]->services]% status slurmdbd > > Looks fine. > > - > > slurmctld might remain running while the database daemon is down. During > > this time, requests intended for slurmdbd are queued internally. The DBD > > Agent Queue size is limited, however, and should therefore be monitored > > with sdiag. > >> The current value of the Agent Queue size is 146160 > > Yes, but your DBD Agent Queue is probably full or close to be full, as slurmdbd > hasn't been able to perform properly for a long time now. > > Note that the max size of this cache queue on slurctld may be controlled by the > MaxDBDMsgs parameter on slurm,conf: > > MaxDBDMsgs > When communication to the SlurmDBD is not possible the slurmctld will queue > messages meant to processed when the SlurmDBD is available again. In order to > avoid running out of memory the slurmctld will only queue so many messages. The > default value is 10000, or MaxJobCount * 2 + Node Count * 4, whichever is > greater. The value can not be less than 10000. > > > *2. *Backup the Slurm database using mysqldump (or similar tool), e.g. > > mysqldump > > --databases slurm_acct_db > backup.sql. You may also want to take this > > opportunity to verify that the innodb_buffer_pool_size in my.cnf is at > > least 128M. > >> - > > Create a backup of the slurm_acct_db database: > > DBnode # mysqldump -p slurm_acct_db > slurm_acct_db.sql > > Good. > > In preparation for the conversion, ensure that the variable > > innodb_buffer_pool_size is set to a value of 128 Mb or more: > >> I checked the innodb_buffer_pool_size and it set to 128MB > > Do we need to increase innodb_buffer_pool_size? 
> There are other MariaDB variables that you should also check:
> https://slurm.schedmd.com/accounting.html#slurm-accounting-configuration-before-build
>
> Note that we recommend at least 1024M for innodb_buffer_pool_size:
>
> $ cat my.cnf
> ...
> [mysqld]
> innodb_buffer_pool_size=1024M
> innodb_log_file_size=64M
> innodb_lock_wait_timeout=900
>
> > - To permanently change the size, edit the /etc/my.cnf file, set
> >   innodb_buffer_pool_size to 128 MB, then restart the database:
> >   DBnode # rcmysql restart
>
> Better change it to 1024M.
>
> > 3. Upgrade the slurmdbd daemon
> >
> > We got the source code from here:
> > https://github.com/SchedMD/slurm/tree/slurm-17.02
>
> Careful!
> That is 17.02, this is WRONG.
> You need 17.11!
>
> > [ar2667@holmes slurmdb]$ cd
> > /rigel/rcs/projects/downloads/slurmdb/slurm-slurm-17.11
> >
> > [ar2667@holmes slurm-slurm-17.11]$ pwd
> > /rigel/rcs/projects/downloads/slurmdb/slurm-slurm-17.11
> >
> > I checked the version:
> > [ar2667@holmes slurm-slurm-17.11]$ ./configure --version
> >
> > slurm configure 17.11
>
> Ok, this looks better.
> I assume you cloned the git repo, so you need to:
>
> $ git checkout slurm-17-11-13-2
>
> > Slurm root directory is currently /cm/shared/apps/slurm/17.11.2.
> > We will backup the current /cm/shared/apps/slurm/17.11.2 directory and
> > install the new slurm db in the same directory
> > (/cm/shared/apps/slurm/17.11.2).
>
> Seems good, but the "17.11.2" in the path may be a bit confusing later.
>
> > I am not sure whether we need to specify CFLAGS and LDFLAGS explicitly
>
> Shouldn't be necessary in general.
>
> > $ ./configure --prefix=/cm/shared/apps/slurm/17.11.2
> > --sysconfdir=/etc/slurm/slurm.conf --cache-file=config.cache --enable-debug
>
> I think that Bright does some special configuration for the location of the
> config files, and it seems that you are handling them properly, but you may
> need to ask them to be sure.
> I also don't know whether you need PMIx support or other Slurm features.
>
> Also note that you'll need some packages on the system to build slurmdbd
> properly with munge and sql support, like libmunge-dev and
> libmariadbclient-dev. They depend on your Linux version.
>
> > $ make
> > $ make install
>
> Yes.
> I recommend using make -jN to speed it up a bit... ;-)
>
> > Rebuild database
> >
> > /rigel/cm/shared/apps/slurm/17.11.2/sbin/slurmdbd -D -v
> >
> > Once you see the following message, you can shut down slurmdbd by pressing
> > Ctrl-C:
> >
> > Conversion done: success!
> >
> > Restart slurmdbd.
>
> Well, as you are doing a minor upgrade, I don't expect any DB conversion to
> happen.
> Please post errors if you face them.
>
> > Please let me know if we missed something.
>
> In general it looks fine.
> I'll still work for some time, but my teammates in your timezone are aware of
> this ticket.
>
> Regards,
> Albert
>
> ------------------------------
> You are receiving this mail because:
>
> - You reported the bug.
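The checkout-configure-build sequence discussed in the quoted exchange can be sketched as follows. The paths, tag, and configure options are the ones quoted in the thread, not independently verified; note also that `--sysconfdir` conventionally points at the directory that contains slurm.conf rather than at the file itself, so the directory form is used here as an assumption:

```shell
# Build the 17.11.13-2 slurmdbd from a clone of the Slurm git repo
# (paths from the thread; adjust for your site).
cd /rigel/rcs/projects/downloads/slurmdb/slurm-slurm-17.11
git checkout slurm-17-11-13-2

# Prefix and options as quoted in the thread; sysconfdir given as a
# directory (an assumption -- confirm with Bright how configs are located)
./configure --prefix=/cm/shared/apps/slurm/17.11.2 \
            --sysconfdir=/etc/slurm \
            --cache-file=config.cache \
            --enable-debug

# Parallel build as suggested ("-jN"); 8 jobs is an arbitrary choice
make -j8
make install
```

This is an operational sketch for a cluster host, not something runnable outside that environment.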
Axinia Radeva,

> We did the slurmdb upgrade from slurm-17.11.2 to slurm-17-11-13-2.

Great!
I assume that you did not yet upgrade slurmctld, nor the slurmds and clients,
right?
I would recommend doing so, but just for sanity; it is not actually related to
your main issue.

> ldconfig -n <library_location>
>
> Do we need to link the newly created slurmdb libraries in order for slurmdb
> to run smoothly?

You shouldn't. If you install it, the executable will just link properly.

From the logs I can see that you are running the right slurmdbd version:

> [2021-11-04T15:37:18.050] slurmdbd version 17.11.13-2 started

> The DBD Agent queue size is 0:

That 0 is very good news! ;-)
It means that slurmctld has already been able to send all the pending messages
to slurmdbd.

> However we see the following error in slurmdbd logs:

Yes, I was expecting to still see some errors, but they are not as important
as the previous ones.

> [2021-11-04T15:37:02.775] Accounting storage MYSQL plugin loaded
> [2021-11-04T15:37:02.855] error: chdir(/var/log): Permission denied
> [2021-11-04T15:37:02.855] chdir to /var/tmp

You need to change the permissions of /var/log, or change the location of the
log files in slurmdbd.conf.

> [2021-11-04T15:37:19.493] error: We have more allocated time than is
> possible (363722400 > 26179200) for cluster habanero(7272) from
> 2021-11-04T14:00:00 - 2021-11-04T15:00:00 tres 1

Due to your original issue, I was already expecting runaway jobs caused by the
probable loss of messages between slurmctld and slurmdbd, and runaways are the
main source of this kind of error.
Actually, runaways are exactly the symptom that you were seeing initially:
jobs that are no longer running in the cluster, but that slurmdbd thinks are
still running.
We can fix runaways, but I would recommend doing it later (see below).
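For reference, the runaway jobs described above can be listed and cleaned with sacctmgr; this subcommand exists in Slurm of this era, though as noted in the thread the actual fix is better done after upgrading to a newer release:

```shell
# Show jobs that the database records as still running but the controller
# no longer knows about. sacctmgr lists them and asks for confirmation
# before fixing them (rolling them up in the database), so this is safe
# to run read-only by answering "n".
sacctmgr show runawayjobs
```

Run it as a Slurm administrator; on this site that means on a host where sacctmgr can reach slurmdbd.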
Regarding the errors on slurmctld:

> [ar2667@roll slurm-slurm-17-11-13-2]$ sudo tail -300 /var/log/slurmctld
> [2021-11-04T15:37:18.058] Registering slurmctld at port 6817 with slurmdbd.
> [2021-11-04T15:37:48.710] slurmdbd: agent queue size 133700

The queue was huge. Even though we managed to digest ~10k messages in just a
few seconds, it was still huge:

> [2021-11-04T15:38:10.854] error: slurmdbd: agent queue filling (124905),
> RESTART SLURMDBD NOW

But in just one minute we halved the pending messages (~65k messages
digested):

> [2021-11-04T15:39:48.820] slurmdbd: agent queue size 61100

And as sdiag tells us, the queue is now empty, so there are no pending
messages between slurmctld and slurmdbd. Great!

But yes, there are other errors that will need attention, like these:

> [2021-11-04T15:37:52.166] error: _shutdown_backup_controller:send/recv:
> Connection refused
> [2021-11-04T15:52:34.972] error: slurm_receive_msg [10.43.4.228:48637]:
> Zero Bytes were transmitted or received

My recommended roadmap for your site would be:

1a) I'm a bit worried about this "backup controller"; I would recommend
disabling it until your site is more stable and you have a supported version
running. Do you know how to disable the backup slurmctld?

1b) Complete the minor upgrade to 17.11.13 for the slurmctld, slurmds, and
clients.

If you need it, we can keep this bug open to help you with this first step.
In that case, please attach your slurm.conf so we have a better idea of your
config.

2a) Open a new ticket to help you plan and make the major upgrade up to 21.08.
Please note that you CANNOT make it directly; you'll need to upgrade to
intermediate versions first.

2b) Once on 21.08, fix the runaway jobs.
I recommend doing that in 21.08 because there have been significant
improvements in detecting and fixing runaway jobs since 17.11, so it's better
to do it in a newer version.
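Until the roadmap above is complete, the controller-to-database agent queue is worth keeping an eye on; sdiag reports it directly, so a simple check is:

```shell
# Spot-check the slurmctld -> slurmdbd agent queue; on a healthy system
# this should stay at or near zero. sdiag prints a line of the form
# "DBD Agent queue size: N".
sdiag | grep -i 'dbd agent queue size'
```

A persistently growing value here is the early warning sign that slurmdbd is again failing to keep up, well before the "agent queue filling" errors appear in the slurmctld log.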
And finally:

3a) Once you are running a supported version, open new tickets to study the
remaining error messages in your logs.

3b) Re-enable a backup controller if you really need it. If it's not
necessary, I would say that it's better not to set up a backup controller and
to keep your config simpler.

Regards,
Albert
Hi Axinia,

> > The DBD Agent queue size is 0:
>
> That 0 is very good news! ;-)

I hope this value still remains low.

> My recommended roadmap for your site would be:
>
> 1a) I'm a bit worried about this "backup controller"; I would recommend
> disabling it until your site is more stable and you have a supported version
> running. Do you know how to disable the backup slurmctld?

Did you disable the backup slurmctld?

> 1b) Complete the minor upgrade to 17.11.13 for the slurmctld, slurmds, and
> clients.

Have you been able to upgrade the full cluster (all daemons and clients) to
17.11.13?

> If you need it, we can keep this bug open to help you with this first step.
> In that case, please attach your slurm.conf so we have a better idea of your
> config.

Do you want to keep this open to complete this 1st step of the suggested
roadmap, or can we close it already?

Regards,
Albert
Hi Albert,

Thank you for following up on this. Unfortunately, we are spread very thin at
the moment and we do not have the resources to execute all our projects. Can
you please see my responses below? Are those tasks time-sensitive?

Best,
Axinia

---
Axinia Radeva
Manager, Research Computing Services

On Fri, Nov 12, 2021 at 6:19 AM <bugs@schedmd.com> wrote:

> *Comment # 29 <https://bugs.schedmd.com/show_bug.cgi?id=12747#c29> on bug
> 12747 <https://bugs.schedmd.com/show_bug.cgi?id=12747> from Albert Gil
> <albert.gil@schedmd.com>*
>
> Hi Axinia,
>
> > > The DBD Agent queue size is 0:
> > >
> > > That 0 is very good news! ;-)
> >
> > I hope this value still remains low.

The DBD Agent queue size is still 0.

> > My recommended roadmap for your site would be:
> >
> > 1a) I'm a bit worried about this "backup controller"; I would recommend
> > disabling it until your site is more stable and you have a supported
> > version running. Do you know how to disable the backup slurmctld?
>
> Did you disable the backup slurmctld?

No, we have not disabled the backup slurmctld. Can you please provide the
steps for how to do it?

> > 1b) Complete the minor upgrade to 17.11.13 for the slurmctld, slurmds, and
> > clients.
>
> Have you been able to upgrade the full cluster (all daemons and clients) to
> 17.11.13?

We have not been able to upgrade the full cluster. I assume we need downtime
for this.

> > If you need it, we can keep this bug open to help you with this first
> > step.
> > In that case, please attach your slurm.conf so we have a better idea of
> > your config.

The slurm.conf is attached.

> Do you want to keep this open to complete this 1st step of the suggested
> roadmap, or can we close it already?
>
> Regards,
> Albert
Hi Axinia,

> Thank you for following up on this. Unfortunately, we are spread very thin
> at the moment and we do not have the resources to execute all our projects.

Ok.

> Are those tasks time-sensitive?

Not really. You've already done the time-sensitive one.

> > Did you disable the backup slurmctld?
>
> No, we have not disabled the backup slurmctld. Can you please provide the
> steps for how to do it?

As you seem busy, and this is neither time-sensitive nor related to the
original issue of the ticket, I would recommend that you file a new ticket
whenever you have time to work on it, and we'll provide you with more
instructions there.

But basically the instructions are:
- Change the slurm.conf to disable the backup slurmctld
- Stop the backup slurmctld
- Restart the main slurmctld

> > Have you been able to upgrade the full cluster (all daemons and clients)
> > to 17.11.13?
>
> We have not been able to upgrade the full cluster. I assume we need downtime
> for this.

Not really.
You can just stop your old slurmctld running 17.11.2 and start the new one
running 17.11.13. All jobs will keep running, and users shouldn't notice it if
you do it quickly enough, i.e. within a few seconds.
The same applies to all the slurmd daemons on the nodes.
It's not a problem to run slurmdbd on version 17.11.13 and the other daemons
on 17.11.2, but it's always recommended to use the same version, just for
sanity.

> > > If you need it, we can keep this bug open to help you on this first
> > > step. In that case, please attach your slurm.conf to have a better idea
> > > of your config.
>
> The slurm.conf is attached.

It seems that it's not attached in bugzilla for some reason, but don't worry.

If this is ok for you, I'm closing this ticket as infogiven; once you have
time, please don't hesitate to file a new bug with your slurm.conf and we'll
help you with the rest of the recommended steps mentioned in comment 28.

Regards,
Albert
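The three bullet steps above can be sketched concretely. This is a hedged outline only: the parameter names are the pre-20.02 ones appropriate to 17.11 (BackupController/BackupAddr), the hostnames are placeholders, and on a Bright-managed cluster the daemon stop/start and slurm.conf distribution would normally go through cmsh rather than systemctl:

```shell
# 1) In slurm.conf, comment out the backup-controller entries, e.g.:
#
#      #BackupController=backup-ctl-host   # placeholder hostname
#      #BackupAddr=backup-ctl-addr         # placeholder address
#
# 2) On the backup controller node, stop its slurmctld
#    (via cmsh on Bright; shown here with systemd as an assumption):
#
#      systemctl stop slurmctld
#
# 3) On the primary controller, restart slurmctld so it reads the
#    updated slurm.conf, and make sure the same slurm.conf is
#    distributed to all nodes (Bright usually handles this):
#
#      systemctl restart slurmctld
```

Treat this as a checklist to adapt, not as commands to paste; the thread itself defers the exact instructions to a follow-up ticket.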
Closing as infogiven.