Created attachment 4877 [details] slurmdbd.log

We had a maintenance window today, and as we were about to start bringing our compute nodes back online we began seeing errors from slurmdbd to this effect:

Jul 05 15:29:13 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more allocated time than is possible (36178096 > 34560000) for cluster summit(9600) from 2017-02-17T14:00:00 - 2017-02-17T15:00:00 tres 1
Jul 05 15:29:13 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more time than is possible (34560000+691200+0)(35251200) > 34560000 for cluster summit(9600) from 2017-02-17T14:00:00 - 2017-02-17T15:00:00 tres 1
Jul 05 15:29:14 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more allocated time than is possible (36905760 > 34560000) for cluster summit(9600) from 2017-02-17T15:00:00 - 2017-02-17T16:00:00 tres 1
Jul 05 15:29:14 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more time than is possible (34560000+691200+0)(35251200) > 34560000 for cluster summit(9600) from 2017-02-17T15:00:00 - 2017-02-17T16:00:00 tres 1
Jul 05 15:29:16 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more time than is possible (34118928+725352+0)(34844280) > 34560000 for cluster summit(9600) from 2017-02-17T16:00:00 - 2017-02-17T17:00:00 tres 1
Jul 05 15:29:19 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more time than is possible (34070472+777600+0)(34848072) > 34560000 for cluster summit(9600) from 2017-02-17T19:00:00 - 2017-02-17T20:00:00 tres 1
Jul 05 15:29:21 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more time than is possible (33897744+777600+0)(34675344) > 34560000 for cluster summit(9600) from 2017-02-17T20:00:00 - 2017-02-17T21:00:00 tres 1
Jul 05 15:29:24 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more time than is possible (34011932+777600+0)(34789532) > 34560000 for cluster summit(9600) from 2017-02-17T23:00:00 - 2017-02-18T00:00:00 tres 1
Jul 05 15:29:27 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more time than is possible (33890880+777600+0)(34668480) > 34560000 for cluster summit(9600) from 2017-02-18T01:00:00 - 2017-02-18T02:00:00 tres 1
Jul 05 15:29:28 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more time than is possible (33987552+777600+0)(34765152) > 34560000 for cluster summit(9600) from 2017-02-18T02:00:00 - 2017-02-18T03:00:00 tres 1

We also noticed repeated messages from the slurm controller daemon nodes about the slurmdbd having its agent queue filled. We checked for runaway jobs and cleaned them up using "sacctmgr show runaway". However, even after restarting the slurmdbd and slurmctld daemons, we are still seeing these errors and jobs will not start. The database server has plenty of free space, and we have already checked for stuck database processes.

I have attached the log output for the slurmctlds and slurmdbd. Any timely advice would be appreciated.
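For scale: the 34560000 ceiling in those messages matches the cluster's CPU count times the seconds in the one-hour rollup period, assuming "tres 1" is CPU time and "summit(9600)" means 9600 CPUs. A quick sanity check:

# 9600 CPUs * 3600 seconds = the most CPU-seconds one hour can hold
echo $((9600 * 3600))    # prints 34560000, the limit in the errors above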
Created attachment 4878 [details] slurm5-ctld.log
Created attachment 4879 [details] slurm6-ctld.log
Regardless of what you do, can you please set aside a copy of the dbd.messages file in the StateSaveLocation on your systems? If you're able to attach them, we'd like to see what caused this. I believe this is likely a recurrence of the issues seen in bug 3891 and bug 3939, which led to the 17.02.6 release today.

You can either:

a) Wait for us to review the file and confirm the issue, and see if we can work around the bad message causing this problem; or

b) Stop slurmctld on the systems, move dbd.messages out of the way, restart slurmctld, and check that accounting data is moving correctly again. The only downside is that you'll lose the records cached there; the upside is that new job records will start flowing. In my experience, continuing to lose new data is likely more harmful than giving up the chance to restore whatever is cached in dbd.messages.
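A minimal sketch of option (b), assuming a systemd-managed slurmctld and /var/spool/slurmctld as the StateSaveLocation (check slurm.conf for the real path):

systemctl stop slurmctld
# set the cached message file aside rather than deleting it
mv /var/spool/slurmctld/dbd.messages /var/spool/slurmctld/dbd.messages.save
systemctl start slurmctld
# then watch the controller log to confirm records are flowing to slurmdbd again
tail -f /var/log/slurm/slurmctld.log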
Worth noting that these log messages are appearing on a completely idle system (with jobs queued, but not running).
Any chance, if we move dbd.messages out of the way, that we could process it manually later to recover what's cached there?
Did this happen during a maintenance window? Is it possible a firewall was changed that is now blocking communication between the slurmctld host and slurmdbd? The output from 'sacctmgr list cluster' would help us understand what's going on, as would additional slurmctld logs with SlurmctldDebug=debug.
(In reply to Jonathon Anderson from comment #6) > Any chance, if we move dbd.messages out of the way, that we could process it > manually later to recover what's cached there? No, that's generally not possible.
Both slurm5 (slurmctld) and slurmdb1 (slurmdbd) have empty iptables with default policies ACCEPT.

~jonathon
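One quick way to rule out a network problem between the two hosts, assuming the default DbdPort of 6819 (verify against your slurmdbd.conf):

# run from the slurmctld host (slurm5)
nc -zv slurmdb1.rc.int.colorado.edu 6819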
Actually, on further review the bug I referenced here does not apply; the issues described there will not occur in this case. Please disregard comment 1, and leave dbd.messages in place.

If you could, I'd like to see additional logs from a clean restart of slurmctld, including a newly submitted job; there should be some indication of the communication issues in that. Can you also verify that you can run "sacctmgr show clusters" from the slurmctld hosts, and attach the output?
[root@slurm5 ~]# sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
    blanca   10.225.160.46         7820  7168         1                                                                                blanca
    summit   10.225.160.83         6820  7680         1                                                                                normal
# sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
    blanca   10.225.160.46         7820  7168         1                                                                                blanca
    summit   10.225.160.83         6820  7680         1                                                                                normal
That's a good sign, but it leaves several open questions. Has anything else changed in the configuration recently? I'm guessing you're in a maintenance window at the moment, as you'd mentioned having no jobs running? Can you increase DebugLevel in slurmdbd.conf, restart it, and attach the new logs? Output from 'sacct -vvv' may also shed some light.
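For reference, the change would look like this in slurmdbd.conf, followed by a restart of slurmdbd (the LogFile path here is an assumption; keep whatever you already have set):

# slurmdbd.conf
DebugLevel=debug2
LogFile=/var/log/slurm/slurmdbd.log

systemctl restart slurmdbd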
# sacct -vvv
sacct: Jobs eligible from Wed Jul 05 00:00:00 2017 - Now
sacct: debug: Options selected: opt_completion=0 opt_dup=0 opt_field_list=(null) opt_help=0 opt_allocs=0
sacct: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
sacct: debug: auth plugin for Munge (http://code.google.com/p/munge/) loaded
sacct: debug: slurmdbd: Sent DbdInit msg
sacct: debug2: Clusters requested: summit
sacct: debug2: Userids requested: all
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
141909 NEB_Cu_Al+ shas ucb-gener+ 480 TIMEOUT 1:0
141909.batch batch ucb-gener+ 24 CANCELLED 0:15
141909.exte+ extern ucb-gener+ 480 COMPLETED 0:0
141912 NEB_Co_Al+ shas ucb-gener+ 480 PENDING 0:0
141916 NEB_Fe_Al+ shas ucb-gener+ 480 PENDING 0:0
141924 Cr2O3 shas ucb-gener+ 48 PENDING 0:0
142001 f10l20t20+ shas ucb-gener+ 11 PENDING 0:0
142314 W_H1.log shas ucb-gener+ 192 PENDING 0:0
142315 W_H2.log shas ucb-gener+ 192 PENDING 0:0
143342 N-1000-ipo shas ucb-gener+ 18 PENDING 0:0
143343 N-1000-ipo shas ucb-gener+ 18 PENDING 0:0
143400 test35 shas ucb-gener+ 1 PENDING 0:0
143447 seed_sim shas ucb-gener+ 12 PENDING 0:0
144974 exp_gen7 shas ucb-gener+ 1 PENDING 0:0
145232 mpi_equi shas ucb-gener+ 96 PENDING 0:0
145233 test37 shas ucb-gener+ 96 PENDING 0:0
145234 test38 shas ucb-gener+ 96 PENDING 0:0
145255 Tio2_ads_+ shas ucb-gener+ 72 PENDING 0:0
145256 Tio2_deh2+ shas ucb-gener+ 72 PENDING 0:0
145372 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
145373 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
145477 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
145478 4-thio_28+ shas ucb-gener+ 120 PENDING 0:0
145484 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
145503 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
145504 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
145510 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
145514 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
145519 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
145521 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
145539 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
145549 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
145551 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
145552 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
145669 raxml_boo+ smem ucb-gener+ 96 PENDING 0:0
145687 gs-riser-+ shas ucb-gener+ 24 PENDING 0:0
146366 15_15_-50+ shas ucb-gener+ 384 PENDING 0:0
146373 19_0_-50_+ shas ucb-gener+ 384 PENDING 0:0
146374 7_7_-50_1+ shas ucb-gener+ 384 PENDING 0:0
146378 8_4_-50_1+ shas ucb-gener+ 384 PENDING 0:0
147322 init2DNS shas ucb-gener+ 96 COMPLETED 0:0
147322.batch batch ucb-gener+ 23 COMPLETED 0:0
147322.exte+ extern ucb-gener+ 96 COMPLETED 0:0
147322.0 uNS ucb-gener+ 96 COMPLETED 0:0
147323 init2LES shas ucb-gener+ 96 PENDING 0:0
147693 raxml shas ucb-gener+ 35 PENDING 0:0
147785 002_x008_+ shas ucb-gener+ 384 PENDING 0:0
147843 Audef_OO shas ucb-gener+ 96 PENDING 0:0
147844 Audef_OTia shas ucb-gener+ 96 PENDING 0:0
147845 Audef_OTic shas ucb-gener+ 96 PENDING 0:0
147846 Audef_OTi+ shas ucb-gener+ 96 PENDING 0:0
147847 Audef_OTi+ shas ucb-gener+ 96 PENDING 0:0
147848 Cupv_OTiOb shas ucb-gener+ 96 PENDING 0:0
147850 Cupv_OTiOa shas ucb-gener+ 96 PENDING 0:0
151878 raxml shas ucb-gener+ 24 PENDING 0:0
157962 PLCpep7_p+ shas ucb-gener+ 1 PENDING 0:0
160675 cr2_12e_dz shas ucb-summi+ 160 PENDING 0:0
160710 TC32DNS2 shas ucb-gener+ 48 TIMEOUT 1:0
160710.batch batch ucb-gener+ 4 CANCELLED 0:15
160710.exte+ extern ucb-gener+ 48 COMPLETED 0:0
160710.0 uNS ucb-gener+ 48 CANCELLED 0:15
160711 TC32DNS shas ucb-gener+ 48 COMPLETED 0:0
160711.batch batch ucb-gener+ 16 COMPLETED 0:0
160711.exte+ extern ucb-gener+ 48 COMPLETED 0:0
160711.0 uNS ucb-gener+ 48 COMPLETED 0:0
160712 TC32LES shas ucb-gener+ 48 COMPLETED 0:0
160712.batch batch ucb-gener+ 18 COMPLETED 0:0
160712.exte+ extern ucb-gener+ 48 COMPLETED 0:0
160712.0 uNS ucb-gener+ 48 COMPLETED 0:0
160713 TC32LES2 shas ucb-gener+ 48 PENDING 0:0
166639 raxml_boo+ smem ucb-gener+ 96 PENDING 0:0
168351 S2Cpp shas ucb-gener+ 24 PENDING 0:0
176001 no_stk shas ucb-gener+ 512 CANCELLED+ 0:0
176006 La030 shas ucb-gener+ 512 CANCELLED+ 0:0
176007 no_stk shas ucb-gener+ 256 CANCELLED+ 0:0
176008 La020 shas ucb-gener+ 256 PENDING 0:0
176009 La030 shas ucb-gener+ 256 CANCELLED+ 0:0
176013 La040 shas ucb-gener+ 256 CANCELLED+ 0:0
176065 ipyrad-s3 shas ucb-gener+ 48 PENDING 0:0
176070 SH2B1-Exp+ shas ucb-gener+ 1 PENDING 0:0
176071 SH2B1-Exp+ shas ucb-gener+ 1 PENDING 0:0
176072 SH2B1-Exp+ shas ucb-gener+ 1 PENDING 0:0
176074 angsdCer smem ucb-summi+ 1 PENDING 0:0
176106 BiTeSe1 shas ucb-gener+ 288 PENDING 0:0
176107 BiTeSe2 shas ucb-gener+ 288 PENDING 0:0
176108 BiTeSe3 shas ucb-gener+ 288 PENDING 0:0
176110 BiTeSe4 shas ucb-gener+ 288 PENDING 0:0
176111 BiTeSe5 shas ucb-gener+ 288 PENDING 0:0
176112 tab_blast+ smem ucb-gener+ 1 PENDING 0:0
176113 text_blas+ smem ucb-gener+ 1 PENDING 0:0
176155 tab_pref_+ smem ucb-gener+ 1 PENDING 0:0
176158 text_pref+ smem ucb-gener+ 1 PENDING 0:0
176318 n96_03_lo+ shas ucb-summi+ 12 PENDING 0:0
176319 n64_04 shas ucb-summi+ 12 PENDING 0:0
176321 WV_MOAT_c+ shas ucb-gener+ 24 PENDING 0:0
176322 WV_MOAT_c+ shas ucb-gener+ 24 PENDING 0:0
176323 WV_MOAT_c+ shas ucb-gener+ 24 PENDING 0:0
176343 Ek5Ra2e7 shas ucb-gener+ 144 COMPLETED 0:0
176343.batch batch ucb-gener+ 24 COMPLETED 0:0
176343.exte+ extern ucb-gener+ 144 COMPLETED 0:0
176344 Ek5Ra5e7 shas ucb-gener+ 144 COMPLETED 0:0
176344.batch batch ucb-gener+ 24 COMPLETED 0:0
176344.exte+ extern ucb-gener+ 144 COMPLETED 0:0
176345 Ek5Ra1e8 shas ucb-gener+ 144 PENDING 0:0
176766 f12l32t32+ shas ucb16_sum+ 528 COMPLETED 0:0
176766.batch batch ucb16_sum+ 24 COMPLETED 0:0
176766.exte+ extern ucb16_sum+ 528 COMPLETED 0:0
176768 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
176771 gs-riser-+ shas ucb-gener+ 24 PENDING 0:0
176780 E5R8XD shas ucb-gener+ 384 TIMEOUT 1:0
176780.batch batch ucb-gener+ 24 CANCELLED 0:15
176780.exte+ extern ucb-gener+ 384 COMPLETED 0:0
176780.0 pmi_proxy ucb-gener+ 16 COMPLETED 0:0
176781 E5R10XD shas ucb-gener+ 384 TIMEOUT 1:0
176781.batch batch ucb-gener+ 24 CANCELLED 0:15
176781.exte+ extern ucb-gener+ 384 COMPLETED 0:0
176781.0 pmi_proxy ucb-gener+ 16 COMPLETED 0:0
176806 RAxML shas ucb-gener+ 20 PENDING 0:0
176856 f12l32t32+ shas ucb16_sum+ 22 PENDING 0:0
176925 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
176926 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
176927 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
176930 Pyrad smem ucb-gener+ 1 PENDING 0:0
176997 nR shas ucb-gener+ 24 PENDING 0:0
176999 gs-bo_vs_+ shas ucb-gener+ 24 PENDING 0:0
177000 ionizatio+ shas csu-testi+ 1536 PENDING 0:0
177002 mpivasp shas csu-gener+ 96 PENDING 0:0
177007 SPW_24.txt shas ucb-gener+ 120 PENDING 0:0
177011 Cleaved_1+ shas ucb-gener+ 24 PENDING 0:0
177012 Water_002+ shas ucb-gener+ 24 PENDING 0:0
177013 N-1000-O3 shas ucb-gener+ 18 PENDING 0:0
177014 N-1000-O3 shas ucb-gener+ 18 PENDING 0:0
177015 N-1000-O3 shas ucb-gener+ 18 PENDING 0:0
177016 N-1000-su+ shas ucb-gener+ 18 PENDING 0:0
177017 N-1000-su+ shas ucb-gener+ 18 PENDING 0:0
177018 N-1000-su+ shas ucb-gener+ 18 PENDING 0:0
177019 simcore shas ucb-summi+ 4 PENDING 0:0
177020 simcore shas ucb-summi+ 4 PENDING 0:0
177021 simcore shas ucb-summi+ 4 PENDING 0:0
177022 simcore shas ucb-summi+ 4 PENDING 0:0
177023 simcore shas ucb-summi+ 4 PENDING 0:0
177024 simcore shas ucb-summi+ 4 PENDING 0:0
177025 simcore shas ucb-summi+ 4 PENDING 0:0
177026 simcore shas ucb-summi+ 4 PENDING 0:0
177027 simcore shas ucb-summi+ 4 PENDING 0:0
177028 simcore shas ucb-summi+ 4 PENDING 0:0
177029 simcore shas ucb-summi+ 4 PENDING 0:0
177030 simcore shas ucb-summi+ 4 PENDING 0:0
177031 simcore shas ucb-summi+ 4 PENDING 0:0
177032 simcore shas ucb-summi+ 4 PENDING 0:0
177033 simcore shas ucb-summi+ 4 PENDING 0:0
177086 16dVMSk2k+ shas ucb-gener+ 24 COMPLETED 0:0
177086.batch batch ucb-gener+ 24 COMPLETED 0:0
177086.exte+ extern ucb-gener+ 24 COMPLETED 0:0
177086.0 uNS ucb-gener+ 24 COMPLETED 0:0
177087 16dVMSk2k+ shas ucb-gener+ 24 TIMEOUT 1:0
177087.batch batch ucb-gener+ 24 CANCELLED 0:15
177087.exte+ extern ucb-gener+ 24 COMPLETED 0:0
177087.0 uNS ucb-gener+ 24 CANCELLED 0:15
177088_[1] summary_s+ shas csu-gener+ 1 PENDING 0:0
177091 16RBEVMuk+ shas ucb-gener+ 24 COMPLETED 0:0
177091.batch batch ucb-gener+ 24 COMPLETED 0:0
177091.exte+ extern ucb-gener+ 24 COMPLETED 0:0
177091.0 uNS ucb-gener+ 24 COMPLETED 0:0
177092_[1] summary_s+ shas csu-gener+ 1 PENDING 0:0
177093 16RBEVMuk+ shas ucb-gener+ 24 TIMEOUT 1:0
177093.batch batch ucb-gener+ 24 CANCELLED 0:15
177093.exte+ extern ucb-gener+ 24 COMPLETED 0:0
177093.0 uNS ucb-gener+ 24 CANCELLED 0:15
177100 gs-riser-+ shas ucb-gener+ 9 PENDING 0:0
177101_[1-1+ sim_M2_m_+ shas csu-gener+ 1 PENDING 0:0
177102_[1-1+ sim_M2_m_+ shas csu-gener+ 1 PENDING 0:0
177140 JO4_cT4_k+ shas ucb-gener+ 24 COMPLETED 0:0
177140.batch batch ucb-gener+ 24 COMPLETED 0:0
177140.exte+ extern ucb-gener+ 24 COMPLETED 0:0
177140.0 pmi_proxy ucb-gener+ 1 COMPLETED 0:0
177144 f10l24t24+ shas ucb16_sum+ 36 PENDING 0:0
177148 16SIG2 shas ucb-gener+ 24 PENDING 0:0
177149 16VRE2 shas ucb-gener+ 24 PENDING 0:0
177151 32VRE2 shas ucb-gener+ 72 PENDING 0:0
177152 32WAL2 shas ucb-gener+ 72 PENDING 0:0
177158 FeAl2O4_R+ shas ucb-gener+ 96 COMPLETED 0:0
177158.batch batch ucb-gener+ 24 COMPLETED 0:0
177158.exte+ extern ucb-gener+ 96 COMPLETED 0:0
177164 f12l32t32+ shas ucb16_sum+ 22 PENDING 0:0
177166 raxml shas ucb-gener+ 24 PENDING 0:0
177167 raxml_boo+ smem ucb-gener+ 96 PENDING 0:0
177169 raxml_boo+ smem ucb-gener+ 96 PENDING 0:0
177170 raxml shas ucb-gener+ 24 PENDING 0:0
177172 1024.super shas ucb-gener+ 48 PENDING 0:0
177174 f12l32t32+ shas ucb-gener+ 528 TIMEOUT 1:0
177174.batch batch ucb-gener+ 24 CANCELLED 0:15
177174.exte+ extern ucb-gener+ 528 COMPLETED 0:0
177175 f12l32t32+ shas ucb-gener+ 528 COMPLETED 0:0
177175.batch batch ucb-gener+ 24 COMPLETED 0:0
177175.exte+ extern ucb-gener+ 528 COMPLETED 0:0
177176 1024.super shas ucb-gener+ 48 PENDING 0:0
177177 1024.super shas ucb-gener+ 48 PENDING 0:0
177178 16dVMSk1k+ shas ucb-gener+ 24 COMPLETED 0:0
177178.batch batch ucb-gener+ 24 COMPLETED 0:0
177178.exte+ extern ucb-gener+ 24 COMPLETED 0:0
177178.0 uNS ucb-gener+ 24 COMPLETED 0:0
177180 16RBEVMuk+ shas ucb-gener+ 24 COMPLETED 0:0
177180.batch batch ucb-gener+ 24 COMPLETED 0:0
177180.exte+ extern ucb-gener+ 24 COMPLETED 0:0
177180.0 uNS ucb-gener+ 24 COMPLETED 0:0
177181 16RBEVMuk+ shas ucb-gener+ 24 COMPLETED 0:0
177181.batch batch ucb-gener+ 24 COMPLETED 0:0
177181.exte+ extern ucb-gener+ 24 COMPLETED 0:0
177181.0 uNS ucb-gener+ 24 COMPLETED 0:0
177183 16aVMSk2k+ shas ucb-gener+ 24 COMPLETED 0:0
177183.batch batch ucb-gener+ 24 COMPLETED 0:0
177183.exte+ extern ucb-gener+ 24 COMPLETED 0:0
177183.0 uNS ucb-gener+ 24 COMPLETED 0:0
177188 16aVMSk2k+ shas ucb-gener+ 24 COMPLETED 0:0
177188.batch batch ucb-gener+ 24 COMPLETED 0:0
177188.exte+ extern ucb-gener+ 24 COMPLETED 0:0
177188.0 uNS ucb-gener+ 24 COMPLETED 0:0
177190 f10l16t16+ shas ucb-gener+ 264 COMPLETED 0:0
177190.batch batch ucb-gener+ 24 COMPLETED 0:0
177190.exte+ extern ucb-gener+ 264 COMPLETED 0:0
177191 f10l24t24+ shas ucb-gener+ 864 TIMEOUT 1:0
177191.batch batch ucb-gener+ 24 CANCELLED 0:15
177191.exte+ extern ucb-gener+ 864 COMPLETED 0:0
177192 S20S13G shas ucb-gener+ 216 COMPLETED 0:0
177192.batch batch ucb-gener+ 24 COMPLETED 0:0
177192.exte+ extern ucb-gener+ 216 COMPLETED 0:0
177192.0 pmi_proxy ucb-gener+ 9 COMPLETED 0:0
177193 f12l32t32+ shas ucb-gener+ 528 COMPLETED 0:0
177193.batch batch ucb-gener+ 24 COMPLETED 0:0
177193.exte+ extern ucb-gener+ 528 COMPLETED 0:0
177194 f12l32t32+ shas ucb16_sum+ 22 PENDING 0:0
177195 f10l16t16+ shas ucb-gener+ 264 COMPLETED 0:0
177195.batch batch ucb-gener+ 24 COMPLETED 0:0
177195.exte+ extern ucb-gener+ 264 COMPLETED 0:0
177196 S20S13H shas ucb-gener+ 216 COMPLETED 0:0
177196.batch batch ucb-gener+ 24 COMPLETED 0:0
177196.exte+ extern ucb-gener+ 216 COMPLETED 0:0
177196.0 pmi_proxy ucb-gener+ 9 COMPLETED 0:0
177197 f12l32t32+ shas ucb-gener+ 528 COMPLETED 0:0
177197.batch batch ucb-gener+ 24 COMPLETED 0:0
177197.exte+ extern ucb-gener+ 528 COMPLETED 0:0
177198 f10l16t16+ shas ucb-gener+ 264 COMPLETED 0:0
177198.batch batch ucb-gener+ 24 COMPLETED 0:0
177198.exte+ extern ucb-gener+ 264 COMPLETED 0:0
177199 S20S13I shas ucb-gener+ 216 COMPLETED 0:0
177199.batch batch ucb-gener+ 24 COMPLETED 0:0
177199.exte+ extern ucb-gener+ 216 COMPLETED 0:0
177199.0 pmi_proxy ucb-gener+ 9 COMPLETED 0:0
177200 f12l32t32+ shas ucb-gener+ 528 COMPLETED 0:0
177200.batch batch ucb-gener+ 24 COMPLETED 0:0
177200.exte+ extern ucb-gener+ 528 COMPLETED 0:0
177201 f10l16t16+ shas ucb-gener+ 264 COMPLETED 0:0
177201.batch batch ucb-gener+ 24 COMPLETED 0:0
177201.exte+ extern ucb-gener+ 264 COMPLETED 0:0
177202 S20S13J shas ucb-gener+ 216 PENDING 0:0
177203 HDDP shas ucb-gener+ 24 COMPLETED 0:0
177203.batch batch ucb-gener+ 24 COMPLETED 0:0
177203.exte+ extern ucb-gener+ 24 COMPLETED 0:0
177204 f12l32t32+ shas ucb-gener+ 22 PENDING 0:0
177205 f10l16t16+ shas ucb-gener+ 11 CANCELLED+ 0:0
177207 16RBEVMuk+ shas ucb-gener+ 24 PENDING 0:0
177209 _interact+ shas ucb-gener+ 1 COMPLETED 0:0
177210 sbatch-ex+ shas ucb-gener+ 1 PENDING 0:0
177211 _interact+ shas ucb-gener+ 1 CANCELLED+ 0:0
177212 sinteract+ shas ucb-testi+ 1 CANCELLED+ 0:0
177213 16RBEVMuk+ shas ucb-gener+ 24 PENDING 0:0
177214 16RBEVMuk+ shas ucb-gener+ 24 PENDING 0:0
177215 16RBEVMuk+ shas ucb-gener+ 24 PENDING 0:0
177216 16dVMSk2k+ shas ucb-gener+ 24 PENDING 0:0
177217 16dVMSk2k+ shas ucb-gener+ 24 PENDING 0:0
177218 TC32DNS2 shas ucb-gener+ 48 PENDING 0:0
177219 TC32LES shas ucb-gener+ 48 PENDING 0:0
177221 sinteract+ shas ucb-testi+ 1 CANCELLED+ 0:0
177222 OPA-netwo+ shas ucb-testi+ 4 PENDING 0:0
177223 32DNSk1 shas ucb-gener+ 48 PENDING 0:0
177224 32DNSk2 shas ucb-gener+ 48 PENDING 0:0
177225 32LESk1 shas ucb-gener+ 48 PENDING 0:0
177226 32aVMSk1k+ shas ucb-gener+ 48 PENDING 0:0
177227 32aVMSk1k+ shas ucb-gener+ 48 PENDING 0:0
177228 32aVMSk2k+ shas ucb-gener+ 48 PENDING 0:0
177229 32aVMSk2k+ shas ucb-gener+ 48 PENDING 0:0
177230 32dVMSk1k+ shas ucb-gener+ 48 PENDING 0:0
177231 32dVMSk1k+ shas ucb-gener+ 48 PENDING 0:0
177232 32dVMSk2k+ shas ucb-gener+ 48 PENDING 0:0
177233 32dVMSk2k+ shas ucb-gener+ 48 PENDING 0:0
177234 32RBEVMsk+ shas ucb-gener+ 48 PENDING 0:0
177235 32RBEVMsk+ shas ucb-gener+ 48 PENDING 0:0
177236 32RBEVMsk+ shas ucb-gener+ 48 PENDING 0:0
177237 32RBEVMsk+ shas ucb-gener+ 48 PENDING 0:0
177238 32RBEVMuk+ shas ucb-gener+ 48 PENDING 0:0
177239 32RBEVMuk+ shas ucb-gener+ 48 PENDING 0:0
177240 32RBEVMuk+ shas ucb-gener+ 48 PENDING 0:0
177241 32RBEVMuk+ shas ucb-gener+ 48 PENDING 0:0
177243 Pyrad smem ucb-gener+ 1 CANCELLED+ 0:0
177244 Pyrad smem ucb-gener+ 1 CANCELLED+ 0:0
177245 Pyrad smem ucb-gener+ 1 CANCELLED+ 0:0
177246 Pyrad smem ucb-gener+ 1 CANCELLED+ 0:0
177247 Pyrad smem ucb-gener+ 1 PENDING 0:0
177248 Gcut smem ucb-gener+ 1 PENDING 0:0
177249 Gcut smem ucb-gener+ 1 PENDING 0:0
177250 Gcut smem ucb-gener+ 1 PENDING 0:0
177251 Gcut smem ucb-gener+ 1 PENDING 0:0
177252 Gcut smem ucb-gener+ 1 PENDING 0:0
177253 Gcut smem ucb-gener+ 1 PENDING 0:0
177254 Gcut smem ucb-gener+ 1 PENDING 0:0
177255 Gcut smem ucb-gener+ 1 PENDING 0:0
177256 Gcut smem ucb-gener+ 1 PENDING 0:0
177257 Gcut smem ucb-gener+ 1 PENDING 0:0
177258 Gcut smem ucb-gener+ 1 PENDING 0:0
177259 Gcut smem ucb-gener+ 1 PENDING 0:0
177260 Gcut smem ucb-gener+ 1 PENDING 0:0
177261 Gcut smem ucb-gener+ 1 PENDING 0:0
177262 Gcut smem ucb-gener+ 1 PENDING 0:0
177263 Gcut smem ucb-gener+ 1 PENDING 0:0
177264 Gcut smem ucb-gener+ 1 PENDING 0:0
177265 Gcut smem ucb-gener+ 1 PENDING 0:0
177266 Gcut smem ucb-gener+ 1 PENDING 0:0
177267 Gcut smem ucb-gener+ 1 PENDING 0:0
177268 Gcut smem ucb-gener+ 1 PENDING 0:0
177269 Gcut smem ucb-gener+ 1 PENDING 0:0
177270 Gcut smem ucb-gener+ 1 PENDING 0:0
177271 Gcut smem ucb-gener+ 1 PENDING 0:0
177272 Gcut smem ucb-gener+ 1 PENDING 0:0
177273 Gcut smem ucb-gener+ 1 PENDING 0:0
177274 Gcut smem ucb-gener+ 1 PENDING 0:0
177275 Gcut smem ucb-gener+ 1 PENDING 0:0
177276 Gcut smem ucb-gener+ 1 PENDING 0:0
177277 Gcut smem ucb-gener+ 1 PENDING 0:0
177278 Gcut smem ucb-gener+ 1 PENDING 0:0
177279 Gcut smem ucb-gener+ 1 PENDING 0:0
177280 Gcut smem ucb-gener+ 1 PENDING 0:0
177281 Gcut smem ucb-gener+ 1 PENDING 0:0
177282 Gcut smem ucb-gener+ 1 PENDING 0:0
177283 Gcut smem ucb-gener+ 1 PENDING 0:0
177284 Gcut smem ucb-gener+ 1 PENDING 0:0
177285 Gcut smem ucb-gener+ 1 PENDING 0:0
177286 Gcut smem ucb-gener+ 1 PENDING 0:0
177287 Gcut smem ucb-gener+ 1 PENDING 0:0
177288 Gcut smem ucb-gener+ 1 PENDING 0:0
177289 Gcut smem ucb-gener+ 1 PENDING 0:0
177290 Gcut smem ucb-gener+ 1 PENDING 0:0
177291 Gcut smem ucb-gener+ 1 PENDING 0:0
177292 Gcut smem ucb-gener+ 1 PENDING 0:0
sacct: debug: slurmdbd: Sent fini msg

I have just restarted slurmdbd with DebugLevel set to debug2 and will be uploading the log file in another 15 minutes.
(In reply to Tim Wickberg from comment #17)
> Has anything else changed on the configuration recently?

We made two changes to slurm.conf today:

# Setup prolog and epilog scripts
-Epilog=/etc/slurm/epilog
-TaskEpilog=/etc/slurm/taskepilog
-TaskProlog=/etc/slurm/taskprolog
+Epilog=/etc/slurm/epilog.d/*
+TaskEpilog=/etc/slurm/taskepilog.d/*
+TaskProlog=/etc/slurm/taskprolog.d/*

-PartitionName=shas Nodes=shas[01-05][01-60],shas06[05-64],shas08[01-60],shas09[01-28] Default=YES DefMemPerCPU=5250 MaxMemPerCPU=5250 DefaultTime=4:00:00 maxtime=1-00:00:00 QOS=shas AllowQos=long,debug,condo,normal TRESBillingWeights="CPU=1.0"
+PartitionName=shas Nodes=shas[01-05][01-60],shas06[05-64],shas08[01-60],shas09[01-28] Default=YES DefMemPerCPU=4944 MaxMemPerCPU=4944 DefaultTime=4:00:00 maxtime=1-00:00:00 QOS=shas AllowQos=long,debug,condo,normal TRESBillingWeights="CPU=1.0"
-PartitionName=sgpu Nodes=sgpu[01-05][01-02],sgpu0801 DefMemPerCPU=5250 MaxMemPerCPU=5250 DefaultTime=4:00:00 maxtime=1-00:00:00 QOS=sgpu AllowQos=long,debug,condo,normal TRESBillingWeights="CPU=2.5" Shared=Exclusive
+PartitionName=sgpu Nodes=sgpu[01-05][01-02],sgpu0801 DefMemPerCPU=4944 MaxMemPerCPU=4944 DefaultTime=4:00:00 maxtime=1-00:00:00 QOS=sgpu AllowQos=long,debug,condo,normal TRESBillingWeights="CPU=2.5" Shared=Exclusive
-PartitionName=sknl Nodes=sknl07[01-20] DefMemPerCPU=5250 MaxMemPerCPU=5250 DefaultTime=4:00:00 maxtime=1-00:00:00 QOS=sknl AllowQos=long,debug,normal TRESBillingWeights="CPU=1.0" Shared=Exclusive
+PartitionName=sknl Nodes=sknl07[01-20] DefMemPerCPU=4944 MaxMemPerCPU=4944 DefaultTime=4:00:00 maxtime=1-00:00:00 QOS=sknl AllowQos=long,debug,normal TRESBillingWeights="CPU=1.0" Shared=Exclusive

> I'm guessing you're in a maintenance window at the moment as you'd mentioned
> having no jobs running?

That's correct. John's already posted other responses.
Created attachment 4880 [details] slurmdb1-debug2.log

Added the slurmdbd log output with DebugLevel set to debug2.
The "We have more time than is possible" error should not be causing any immediate issues for the system, and is the only "error" level message that's showing now. Are you still seeing warnings in the slurmctld.log file after that restart? I don't see any immediate issues in slurmdbd.log aside from a large number of node down updates (bug 3365 may be related to that, although that does not have a fix identified yet).
Created attachment 4881 [details] slurm5 controller daemon log after slurmdb restart

Attached is the slurm5 controller log from after we restarted slurmdbd with DebugLevel set to debug2. Slurm6 (the backup controller) produced no additional log output after the slurmdbd restart.
I see a spike and a slurmdbd communication stall when the nodes are brought back online, and then it looks like things have settled down. If you use sacct, I imagine you're seeing normal accounting statistics at this point?

I suspect this is a duplicate of bug 3365: something within slurmctld generates a flood of node state messages if it is started with all nodes in the cluster offline, which then overwhelms slurmdbd and gets things stuck. Bringing nodes online seems to mitigate that behavior and lets things settle back down to normal, which is where you appear to be now.

If so, do you mind downgrading the severity level some, and we can work through this further during business hours tomorrow?
We have been able to start the partitions again, and jobs seem to be running without issue now. I have downgraded the severity accordingly.
Just as an update: we saw this same behaviour again during our latest planned maintenance.
John, I will note that we fixed a memory leak in commit 6c7b9ba161efe. This is in 17.02.7, but it should be easy to backport to 16.05. Have you seen this happen lately? Also, do you have a plan to migrate to 17.02?
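If you want the fix before a full upgrade, a sketch of backporting it onto a 16.05 source tree (assuming the commit applies cleanly to your exact version; resolve conflicts if not):

git clone https://github.com/SchedMD/slurm.git
cd slurm
git checkout slurm-16.05          # assumed branch name for your running release
git cherry-pick 6c7b9ba161efe     # the memory-leak fix noted above
# then rebuild and redeploy slurmdbd/slurmctld as you normally would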
Anything else needed on this?
We've continued to experience this issue every time we've had to take the cluster down (which, unrelatedly, has happened a lot recently, but should be behind us now). We want to upgrade to the latest 17; but we've got another cluster on the same slurmdbd that's running 14, so we have to upgrade that one first.
No problem Jonathon, thanks for the heads up. Upgrading that 14.[03|11] system would definitely be a priority. I would expect this to keep happening until you upgrade or patch. As noted, the latest 17.02.7 has the patch that should fix your issue. Would you be OK with closing this and reopening if you still have issues after upgrading? I am fine either way.
If it is your assertion that this issue is fixed in a later release, then closing the case is probably fine.

As an aside, can you provide a link to some information about the process of upgrading from 14.x to 17.x through 15/16.x? Namely, if our slurmdbd is currently 16.05, can we upgrade the 14.x slurmctlds to 16.05, and then do the 14.x slurmds in batches? Would we be fools to try to do this while keeping the cluster up? Any chance we could get some guidance on such an upgrade in general? (I'm happy to move that conversation out of this case, though.)
(In reply to Jonathon Anderson from comment #31)
> If it is your assertion that this issue is fixed in a later release, then
> closing the case is probably fine.
>
> As an aside, can you provide a link to some information about the process of
> upgrading from 14.x to 17.x through 15/16.x? Namely, if our slurmdbd is
> currently 16.05, can we upgrade the 14.x slurmctlds to 16.05, and then do
> the 14.x slurmds in batches? Would we be fools to try to do this while
> keeping the cluster up? Any chance we could get some guidance on such an
> upgrade in general? (I'm happy to move that conversation out of this case,
> though.)

Yes, the upgrade docs are found here: https://slurm.schedmd.com/quickstart_admin.html#upgrade

Note that 14.03 would need to upgrade to 15.08 first, whereas 14.11 can go straight to 16.05. Since both 15.08 and 16.05 can talk to a 17.02 slurmdbd, that shouldn't really be an issue. Your idea of having the slurmctld run 16.05 while the slurmds are still 14.11 (assuming 14.11 is the version you have), updating them in batches, should work fine. People do hot upgrades all the time. The main issue is order: always make sure the slurmdbd is the highest version, followed by the slurmctld, then the slurmds. Note that a 14.11 slurmstepd can still talk to a 16.05 slurmd, so running jobs should persist.

I would try this out on a test system to raise your confidence. Once you get the hang of it, you should be able to take Slurm down (with running jobs), upgrade everything, and restart the daemons, resulting in the new Slurm running everywhere with no job loss.

If you would like more help on the matter, I agree a new case would be a better idea, just so we don't muddy the waters on this bug. Hopefully the webpage will help make it easier; following those directions should provide a very safe way to update.
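A sketch of that order using this site's host names; the package names and pdsh usage here are assumptions to adapt to your environment, but the invariant is what matters: slurmdbd first, then slurmctld, then slurmd in batches:

# 1. the accounting daemon first (a newer slurmdbd can serve older daemons)
ssh slurmdb1 'systemctl stop slurmdbd && yum upgrade slurm\* && systemctl start slurmdbd'
# 2. then the controllers (slurm5, then the backup slurm6)
ssh slurm5 'systemctl stop slurmctld && yum upgrade slurm\* && systemctl start slurmctld'
# 3. then the compute nodes, in batches; running slurmstepds keep their jobs alive
pdsh -w shas01[01-60] 'yum upgrade slurm\* && systemctl restart slurmd'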
As noted, this issue should be fixed in 17.02.7. Please reopen if anything further is needed on this topic.