Ticket 3960

Summary: slurmdbd: agent queue is full
Product: Slurm    Reporter: john.blaas
Component: slurmdbd    Assignee: Danny Auble <da>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
Priority: ---    CC: john.blaas, jonathon.anderson
Version: 16.05.10
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=6623
Site: University of Colorado
Machine Name: Summit
Version Fixed: 17.02.7 17.11.0-pre2
Attachments: slurmdbd.log
slurm5-ctld.log
slurm6-ctld.log
slurmdb1-debug2.log
slurm5 controller daemon log after slurmdb restart

Description john.blaas 2017-07-05 18:54:52 MDT
Created attachment 4877 [details]
slurmdbd.log

We had a maintenance today, and as we were about to start bringing our compute nodes back online we started seeing errors from slurmdbd to this effect:

Jul 05 15:29:13 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more allocated time than is possible (36178096 > 34560000) for cluster summit(9600) from 2017-02-17T14:00:00 - 2017-02-17T15:00:00 tres 1
Jul 05 15:29:13 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more time than is possible (34560000+691200+0)(35251200) > 34560000 for cluster summit(9600) from 2017-02-17T14:00:00 - 2017-02-17T15:00:00 tres 1
Jul 05 15:29:14 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more allocated time than is possible (36905760 > 34560000) for cluster summit(9600) from 2017-02-17T15:00:00 - 2017-02-17T16:00:00 tres 1
Jul 05 15:29:14 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more time than is possible (34560000+691200+0)(35251200) > 34560000 for cluster summit(9600) from 2017-02-17T15:00:00 - 2017-02-17T16:00:00 tres 1
Jul 05 15:29:16 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more time than is possible (34118928+725352+0)(34844280) > 34560000 for cluster summit(9600) from 2017-02-17T16:00:00 - 2017-02-17T17:00:00 tres 1
Jul 05 15:29:19 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more time than is possible (34070472+777600+0)(34848072) > 34560000 for cluster summit(9600) from 2017-02-17T19:00:00 - 2017-02-17T20:00:00 tres 1
Jul 05 15:29:21 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more time than is possible (33897744+777600+0)(34675344) > 34560000 for cluster summit(9600) from 2017-02-17T20:00:00 - 2017-02-17T21:00:00 tres 1
Jul 05 15:29:24 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more time than is possible (34011932+777600+0)(34789532) > 34560000 for cluster summit(9600) from 2017-02-17T23:00:00 - 2017-02-18T00:00:00 tres 1
Jul 05 15:29:27 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more time than is possible (33890880+777600+0)(34668480) > 34560000 for cluster summit(9600) from 2017-02-18T01:00:00 - 2017-02-18T02:00:00 tres 1
Jul 05 15:29:28 slurmdb1.rc.int.colorado.edu slurmdbd[1888]: error: We have more time than is possible (33987552+777600+0)(34765152) > 34560000 for cluster summit(9600) from 2017-02-18T02:00:00 - 2017-02-18T03:00:00 tres 1
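For context, the 34560000 ceiling in these messages appears to be the cluster's CPU-seconds per hourly rollup window: summit is registered with 9600 CPUs (the number in parentheses after the cluster name), and 9600 CPUs over one hour gives exactly that figure. A quick sanity check, assuming that interpretation:

```shell
# Hedged sanity check: 34560000 looks like summit's CPU count (9600)
# multiplied by the 3600 seconds in one hourly rollup window.
cpus=9600
window=3600
echo $(( cpus * window ))
```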

We also noticed repeated messages from the slurm controller daemon nodes about the slurmdbd having its agent queue filled.

We checked for runaway jobs with "sacctmgr show runaway" and cleaned them up.

However, even after restarting the slurmdbd and slurm controller daemons we are still seeing these errors, and jobs will not start.

The database server has plenty of free space, and we have already checked for stuck database processes.

I have attached the log output for the slurmctlds and slurmdbd.

Any timely advice would be appreciated.
Comment 1 john.blaas 2017-07-05 18:56:49 MDT
Created attachment 4878 [details]
slurm5-ctld.log
Comment 2 john.blaas 2017-07-05 18:57:36 MDT
Created attachment 4879 [details]
slurm6-ctld.log
Comment 3 Tim Wickberg 2017-07-05 19:26:44 MDT
Regardless of what you do, can you please set aside a copy of the dbd.messages file in the StateSaveLocation on your systems?

If you're able to attach them, we'd like to see what caused this. I believe this is likely a recurrence of the issues seen in bug 3891 and bug 3939, and led to the 17.02.6 release today.

You can either:

a) Wait for us to review the file and confirm the issue, and see if we can work around the bad message causing this problem.

or 

b) Stop slurmctld on the systems, move dbd.messages out of the way, restart slurmctld, and check that accounting data is flowing correctly again. The only downside is that you'll lose the records cached there; the upside is that new job records will start flowing. In my experience, continuing to lose data is likely to be more harmful than holding out for a chance to restore whatever is cached in dbd.messages.
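Option (b) can be sketched as below; the dbd.messages path is an assumption (check StateSaveLocation in your slurm.conf), and the timestamped rename preserves the copy that was asked for earlier:

```shell
# Sketch of option (b): set dbd.messages aside under a timestamped name
# so a copy is preserved for later review. Paths are assumptions.
set_aside() {
    # move "$1" to "$1.<timestamp>", leaving the original name free
    mv "$1" "$1.$(date +%Y%m%dT%H%M%S)"
}
# Run as root, roughly:
#   systemctl stop slurmctld
#   set_aside /var/spool/slurmctld/dbd.messages   # assumed StateSaveLocation
#   systemctl start slurmctld
```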
Comment 5 Jonathon Anderson 2017-07-05 19:31:59 MDT
Worth noting that these log messages are appearing on a completely idle system (with jobs queued, but not running).
Comment 6 Jonathon Anderson 2017-07-05 19:33:27 MDT
Any chance, if we move dbd.messages out of the way, that we could process it manually later to recover what's cached there?
Comment 7 Tim Wickberg 2017-07-05 19:34:39 MDT
Did this happen during a maintenance window?

Is it possible a firewall was changed that is now blocking communication between the slurmctld host and slurmdbd?

The output from 'sacctmgr list cluster' would help us understand what's going on, as well as additional slurmctld logs with SlurmctldDebug=debug?
Comment 8 Tim Wickberg 2017-07-05 19:35:54 MDT
(In reply to Jonathon Anderson from comment #6)
> Any chance, if we move dbd.messages out of the way, that we could process it
> manually later to recover what's cached there?

No, that's generally not possible.
Comment 10 Jonathon Anderson 2017-07-05 19:37:25 MDT
Both slurm5 (slurmctld) and slurmdb1 (slurmdbd) have empty iptables with default policies ACCEPT.

~jonathon
Comment 11 Tim Wickberg 2017-07-05 19:39:27 MDT
Actually, on further review the bug I've referenced here does not apply. Please disregard comment 3, and leave dbd.messages in place.

If you could, I'd like to see additional logs from a clean restart of slurmctld, including a newly submitted job. There should be some indication of the communication issues from that.

Can you also verify that you can run "sacctmgr show clusters" from the slurmctld hosts, and attach that output?
Comment 13 Jonathon Anderson 2017-07-05 19:40:18 MDT
[root@slurm5 ~]# sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
    blanca   10.225.160.46         7820  7168         1                                                                                           blanca           
    summit   10.225.160.83         6820  7680         1                                                                                           normal
Comment 14 john.blaas 2017-07-05 19:41:44 MDT
# sacctmgr list cluster
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
    blanca   10.225.160.46         7820  7168         1                                                                                           blanca           
    summit   10.225.160.83         6820  7680         1                                                                                           normal
Comment 17 Tim Wickberg 2017-07-05 19:47:31 MDT
That's a good sign, but leaves several open questions.

Has anything else changed on the configuration recently? I'm guessing you're in a maintenance window at the moment as you'd mentioned having no jobs running?

Can you increase DebugLevel in slurmdbd.conf, restart it, and attach the new logs?

Output from 'sacct -vvv' may also shed some light.
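For reference, raising slurmdbd verbosity is a small slurmdbd.conf change followed by a restart; a sketch, with an assumed log path (debug2 matches the slurmdb1-debug2.log attachment above):

```
# slurmdbd.conf: raise log verbosity, then restart slurmdbd
DebugLevel=debug2                       # quiet < fatal < error < info < verbose < debug < debug2
LogFile=/var/log/slurm/slurmdbd.log     # assumed path
```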
Comment 18 john.blaas 2017-07-05 19:55:47 MDT
# sacct -vvv
sacct: Jobs eligible from Wed Jul 05 00:00:00 2017 - Now
sacct: debug:  Options selected:
	opt_completion=0
	opt_dup=0
	opt_field_list=(null)
	opt_help=0
	opt_allocs=0
sacct: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
sacct: debug:  auth plugin for Munge (http://code.google.com/p/munge/) loaded
sacct: debug:  slurmdbd: Sent DbdInit msg
sacct: debug2: Clusters requested:	summit
sacct: debug2: Userids requested:	all
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
141909       NEB_Cu_Al+       shas ucb-gener+        480    TIMEOUT      1:0 
141909.batch      batch            ucb-gener+         24  CANCELLED     0:15 
141909.exte+     extern            ucb-gener+        480  COMPLETED      0:0 
141912       NEB_Co_Al+       shas ucb-gener+        480    PENDING      0:0 
141916       NEB_Fe_Al+       shas ucb-gener+        480    PENDING      0:0 
141924            Cr2O3       shas ucb-gener+         48    PENDING      0:0 
142001       f10l20t20+       shas ucb-gener+         11    PENDING      0:0 
142314         W_H1.log       shas ucb-gener+        192    PENDING      0:0 
142315         W_H2.log       shas ucb-gener+        192    PENDING      0:0 
143342       N-1000-ipo       shas ucb-gener+         18    PENDING      0:0 
143343       N-1000-ipo       shas ucb-gener+         18    PENDING      0:0 
143400           test35       shas ucb-gener+          1    PENDING      0:0 
143447         seed_sim       shas ucb-gener+         12    PENDING      0:0 
144974         exp_gen7       shas ucb-gener+          1    PENDING      0:0 
145232         mpi_equi       shas ucb-gener+         96    PENDING      0:0 
145233           test37       shas ucb-gener+         96    PENDING      0:0 
145234           test38       shas ucb-gener+         96    PENDING      0:0 
145255       Tio2_ads_+       shas ucb-gener+         72    PENDING      0:0 
145256       Tio2_deh2+       shas ucb-gener+         72    PENDING      0:0 
145372       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
145373       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
145477       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
145478       4-thio_28+       shas ucb-gener+        120    PENDING      0:0 
145484       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
145503       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
145504       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
145510       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
145514       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
145519       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
145521       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
145539       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
145549       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
145551       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
145552       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
145669       raxml_boo+       smem ucb-gener+         96    PENDING      0:0 
145687       gs-riser-+       shas ucb-gener+         24    PENDING      0:0 
146366       15_15_-50+       shas ucb-gener+        384    PENDING      0:0 
146373       19_0_-50_+       shas ucb-gener+        384    PENDING      0:0 
146374       7_7_-50_1+       shas ucb-gener+        384    PENDING      0:0 
146378       8_4_-50_1+       shas ucb-gener+        384    PENDING      0:0 
147322         init2DNS       shas ucb-gener+         96  COMPLETED      0:0 
147322.batch      batch            ucb-gener+         23  COMPLETED      0:0 
147322.exte+     extern            ucb-gener+         96  COMPLETED      0:0 
147322.0            uNS            ucb-gener+         96  COMPLETED      0:0 
147323         init2LES       shas ucb-gener+         96    PENDING      0:0 
147693            raxml       shas ucb-gener+         35    PENDING      0:0 
147785       002_x008_+       shas ucb-gener+        384    PENDING      0:0 
147843         Audef_OO       shas ucb-gener+         96    PENDING      0:0 
147844       Audef_OTia       shas ucb-gener+         96    PENDING      0:0 
147845       Audef_OTic       shas ucb-gener+         96    PENDING      0:0 
147846       Audef_OTi+       shas ucb-gener+         96    PENDING      0:0 
147847       Audef_OTi+       shas ucb-gener+         96    PENDING      0:0 
147848       Cupv_OTiOb       shas ucb-gener+         96    PENDING      0:0 
147850       Cupv_OTiOa       shas ucb-gener+         96    PENDING      0:0 
151878            raxml       shas ucb-gener+         24    PENDING      0:0 
157962       PLCpep7_p+       shas ucb-gener+          1    PENDING      0:0 
160675       cr2_12e_dz       shas ucb-summi+        160    PENDING      0:0 
160710         TC32DNS2       shas ucb-gener+         48    TIMEOUT      1:0 
160710.batch      batch            ucb-gener+          4  CANCELLED     0:15 
160710.exte+     extern            ucb-gener+         48  COMPLETED      0:0 
160710.0            uNS            ucb-gener+         48  CANCELLED     0:15 
160711          TC32DNS       shas ucb-gener+         48  COMPLETED      0:0 
160711.batch      batch            ucb-gener+         16  COMPLETED      0:0 
160711.exte+     extern            ucb-gener+         48  COMPLETED      0:0 
160711.0            uNS            ucb-gener+         48  COMPLETED      0:0 
160712          TC32LES       shas ucb-gener+         48  COMPLETED      0:0 
160712.batch      batch            ucb-gener+         18  COMPLETED      0:0 
160712.exte+     extern            ucb-gener+         48  COMPLETED      0:0 
160712.0            uNS            ucb-gener+         48  COMPLETED      0:0 
160713         TC32LES2       shas ucb-gener+         48    PENDING      0:0 
166639       raxml_boo+       smem ucb-gener+         96    PENDING      0:0 
168351            S2Cpp       shas ucb-gener+         24    PENDING      0:0 
176001           no_stk       shas ucb-gener+        512 CANCELLED+      0:0 
176006            La030       shas ucb-gener+        512 CANCELLED+      0:0 
176007           no_stk       shas ucb-gener+        256 CANCELLED+      0:0 
176008            La020       shas ucb-gener+        256    PENDING      0:0 
176009            La030       shas ucb-gener+        256 CANCELLED+      0:0 
176013            La040       shas ucb-gener+        256 CANCELLED+      0:0 
176065        ipyrad-s3       shas ucb-gener+         48    PENDING      0:0 
176070       SH2B1-Exp+       shas ucb-gener+          1    PENDING      0:0 
176071       SH2B1-Exp+       shas ucb-gener+          1    PENDING      0:0 
176072       SH2B1-Exp+       shas ucb-gener+          1    PENDING      0:0 
176074         angsdCer       smem ucb-summi+          1    PENDING      0:0 
176106          BiTeSe1       shas ucb-gener+        288    PENDING      0:0 
176107          BiTeSe2       shas ucb-gener+        288    PENDING      0:0 
176108          BiTeSe3       shas ucb-gener+        288    PENDING      0:0 
176110          BiTeSe4       shas ucb-gener+        288    PENDING      0:0 
176111          BiTeSe5       shas ucb-gener+        288    PENDING      0:0 
176112       tab_blast+       smem ucb-gener+          1    PENDING      0:0 
176113       text_blas+       smem ucb-gener+          1    PENDING      0:0 
176155       tab_pref_+       smem ucb-gener+          1    PENDING      0:0 
176158       text_pref+       smem ucb-gener+          1    PENDING      0:0 
176318       n96_03_lo+       shas ucb-summi+         12    PENDING      0:0 
176319           n64_04       shas ucb-summi+         12    PENDING      0:0 
176321       WV_MOAT_c+       shas ucb-gener+         24    PENDING      0:0 
176322       WV_MOAT_c+       shas ucb-gener+         24    PENDING      0:0 
176323       WV_MOAT_c+       shas ucb-gener+         24    PENDING      0:0 
176343         Ek5Ra2e7       shas ucb-gener+        144  COMPLETED      0:0 
176343.batch      batch            ucb-gener+         24  COMPLETED      0:0 
176343.exte+     extern            ucb-gener+        144  COMPLETED      0:0 
176344         Ek5Ra5e7       shas ucb-gener+        144  COMPLETED      0:0 
176344.batch      batch            ucb-gener+         24  COMPLETED      0:0 
176344.exte+     extern            ucb-gener+        144  COMPLETED      0:0 
176345         Ek5Ra1e8       shas ucb-gener+        144    PENDING      0:0 
176766       f12l32t32+       shas ucb16_sum+        528  COMPLETED      0:0 
176766.batch      batch            ucb16_sum+         24  COMPLETED      0:0 
176766.exte+     extern            ucb16_sum+        528  COMPLETED      0:0 
176768       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
176771       gs-riser-+       shas ucb-gener+         24    PENDING      0:0 
176780           E5R8XD       shas ucb-gener+        384    TIMEOUT      1:0 
176780.batch      batch            ucb-gener+         24  CANCELLED     0:15 
176780.exte+     extern            ucb-gener+        384  COMPLETED      0:0 
176780.0      pmi_proxy            ucb-gener+         16  COMPLETED      0:0 
176781          E5R10XD       shas ucb-gener+        384    TIMEOUT      1:0 
176781.batch      batch            ucb-gener+         24  CANCELLED     0:15 
176781.exte+     extern            ucb-gener+        384  COMPLETED      0:0 
176781.0      pmi_proxy            ucb-gener+         16  COMPLETED      0:0 
176806            RAxML       shas ucb-gener+         20    PENDING      0:0 
176856       f12l32t32+       shas ucb16_sum+         22    PENDING      0:0 
176925       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
176926       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
176927       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
176930            Pyrad       smem ucb-gener+          1    PENDING      0:0 
176997               nR       shas ucb-gener+         24    PENDING      0:0 
176999       gs-bo_vs_+       shas ucb-gener+         24    PENDING      0:0 
177000       ionizatio+       shas csu-testi+       1536    PENDING      0:0 
177002          mpivasp       shas csu-gener+         96    PENDING      0:0 
177007       SPW_24.txt       shas ucb-gener+        120    PENDING      0:0 
177011       Cleaved_1+       shas ucb-gener+         24    PENDING      0:0 
177012       Water_002+       shas ucb-gener+         24    PENDING      0:0 
177013        N-1000-O3       shas ucb-gener+         18    PENDING      0:0 
177014        N-1000-O3       shas ucb-gener+         18    PENDING      0:0 
177015        N-1000-O3       shas ucb-gener+         18    PENDING      0:0 
177016       N-1000-su+       shas ucb-gener+         18    PENDING      0:0 
177017       N-1000-su+       shas ucb-gener+         18    PENDING      0:0 
177018       N-1000-su+       shas ucb-gener+         18    PENDING      0:0 
177019          simcore       shas ucb-summi+          4    PENDING      0:0 
177020          simcore       shas ucb-summi+          4    PENDING      0:0 
177021          simcore       shas ucb-summi+          4    PENDING      0:0 
177022          simcore       shas ucb-summi+          4    PENDING      0:0 
177023          simcore       shas ucb-summi+          4    PENDING      0:0 
177024          simcore       shas ucb-summi+          4    PENDING      0:0 
177025          simcore       shas ucb-summi+          4    PENDING      0:0 
177026          simcore       shas ucb-summi+          4    PENDING      0:0 
177027          simcore       shas ucb-summi+          4    PENDING      0:0 
177028          simcore       shas ucb-summi+          4    PENDING      0:0 
177029          simcore       shas ucb-summi+          4    PENDING      0:0 
177030          simcore       shas ucb-summi+          4    PENDING      0:0 
177031          simcore       shas ucb-summi+          4    PENDING      0:0 
177032          simcore       shas ucb-summi+          4    PENDING      0:0 
177033          simcore       shas ucb-summi+          4    PENDING      0:0 
177086       16dVMSk2k+       shas ucb-gener+         24  COMPLETED      0:0 
177086.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177086.exte+     extern            ucb-gener+         24  COMPLETED      0:0 
177086.0            uNS            ucb-gener+         24  COMPLETED      0:0 
177087       16dVMSk2k+       shas ucb-gener+         24    TIMEOUT      1:0 
177087.batch      batch            ucb-gener+         24  CANCELLED     0:15 
177087.exte+     extern            ucb-gener+         24  COMPLETED      0:0 
177087.0            uNS            ucb-gener+         24  CANCELLED     0:15 
177088_[1]   summary_s+       shas csu-gener+          1    PENDING      0:0 
177091       16RBEVMuk+       shas ucb-gener+         24  COMPLETED      0:0 
177091.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177091.exte+     extern            ucb-gener+         24  COMPLETED      0:0 
177091.0            uNS            ucb-gener+         24  COMPLETED      0:0 
177092_[1]   summary_s+       shas csu-gener+          1    PENDING      0:0 
177093       16RBEVMuk+       shas ucb-gener+         24    TIMEOUT      1:0 
177093.batch      batch            ucb-gener+         24  CANCELLED     0:15 
177093.exte+     extern            ucb-gener+         24  COMPLETED      0:0 
177093.0            uNS            ucb-gener+         24  CANCELLED     0:15 
177100       gs-riser-+       shas ucb-gener+          9    PENDING      0:0 
177101_[1-1+ sim_M2_m_+       shas csu-gener+          1    PENDING      0:0 
177102_[1-1+ sim_M2_m_+       shas csu-gener+          1    PENDING      0:0 
177140       JO4_cT4_k+       shas ucb-gener+         24  COMPLETED      0:0 
177140.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177140.exte+     extern            ucb-gener+         24  COMPLETED      0:0 
177140.0      pmi_proxy            ucb-gener+          1  COMPLETED      0:0 
177144       f10l24t24+       shas ucb16_sum+         36    PENDING      0:0 
177148           16SIG2       shas ucb-gener+         24    PENDING      0:0 
177149           16VRE2       shas ucb-gener+         24    PENDING      0:0 
177151           32VRE2       shas ucb-gener+         72    PENDING      0:0 
177152           32WAL2       shas ucb-gener+         72    PENDING      0:0 
177158       FeAl2O4_R+       shas ucb-gener+         96  COMPLETED      0:0 
177158.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177158.exte+     extern            ucb-gener+         96  COMPLETED      0:0 
177164       f12l32t32+       shas ucb16_sum+         22    PENDING      0:0 
177166            raxml       shas ucb-gener+         24    PENDING      0:0 
177167       raxml_boo+       smem ucb-gener+         96    PENDING      0:0 
177169       raxml_boo+       smem ucb-gener+         96    PENDING      0:0 
177170            raxml       shas ucb-gener+         24    PENDING      0:0 
177172       1024.super       shas ucb-gener+         48    PENDING      0:0 
177174       f12l32t32+       shas ucb-gener+        528    TIMEOUT      1:0 
177174.batch      batch            ucb-gener+         24  CANCELLED     0:15 
177174.exte+     extern            ucb-gener+        528  COMPLETED      0:0 
177175       f12l32t32+       shas ucb-gener+        528  COMPLETED      0:0 
177175.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177175.exte+     extern            ucb-gener+        528  COMPLETED      0:0 
177176       1024.super       shas ucb-gener+         48    PENDING      0:0 
177177       1024.super       shas ucb-gener+         48    PENDING      0:0 
177178       16dVMSk1k+       shas ucb-gener+         24  COMPLETED      0:0 
177178.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177178.exte+     extern            ucb-gener+         24  COMPLETED      0:0 
177178.0            uNS            ucb-gener+         24  COMPLETED      0:0 
177180       16RBEVMuk+       shas ucb-gener+         24  COMPLETED      0:0 
177180.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177180.exte+     extern            ucb-gener+         24  COMPLETED      0:0 
177180.0            uNS            ucb-gener+         24  COMPLETED      0:0 
177181       16RBEVMuk+       shas ucb-gener+         24  COMPLETED      0:0 
177181.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177181.exte+     extern            ucb-gener+         24  COMPLETED      0:0 
177181.0            uNS            ucb-gener+         24  COMPLETED      0:0 
177183       16aVMSk2k+       shas ucb-gener+         24  COMPLETED      0:0 
177183.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177183.exte+     extern            ucb-gener+         24  COMPLETED      0:0 
177183.0            uNS            ucb-gener+         24  COMPLETED      0:0 
177188       16aVMSk2k+       shas ucb-gener+         24  COMPLETED      0:0 
177188.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177188.exte+     extern            ucb-gener+         24  COMPLETED      0:0 
177188.0            uNS            ucb-gener+         24  COMPLETED      0:0 
177190       f10l16t16+       shas ucb-gener+        264  COMPLETED      0:0 
177190.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177190.exte+     extern            ucb-gener+        264  COMPLETED      0:0 
177191       f10l24t24+       shas ucb-gener+        864    TIMEOUT      1:0 
177191.batch      batch            ucb-gener+         24  CANCELLED     0:15 
177191.exte+     extern            ucb-gener+        864  COMPLETED      0:0 
177192          S20S13G       shas ucb-gener+        216  COMPLETED      0:0 
177192.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177192.exte+     extern            ucb-gener+        216  COMPLETED      0:0 
177192.0      pmi_proxy            ucb-gener+          9  COMPLETED      0:0 
177193       f12l32t32+       shas ucb-gener+        528  COMPLETED      0:0 
177193.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177193.exte+     extern            ucb-gener+        528  COMPLETED      0:0 
177194       f12l32t32+       shas ucb16_sum+         22    PENDING      0:0 
177195       f10l16t16+       shas ucb-gener+        264  COMPLETED      0:0 
177195.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177195.exte+     extern            ucb-gener+        264  COMPLETED      0:0 
177196          S20S13H       shas ucb-gener+        216  COMPLETED      0:0 
177196.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177196.exte+     extern            ucb-gener+        216  COMPLETED      0:0 
177196.0      pmi_proxy            ucb-gener+          9  COMPLETED      0:0 
177197       f12l32t32+       shas ucb-gener+        528  COMPLETED      0:0 
177197.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177197.exte+     extern            ucb-gener+        528  COMPLETED      0:0 
177198       f10l16t16+       shas ucb-gener+        264  COMPLETED      0:0 
177198.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177198.exte+     extern            ucb-gener+        264  COMPLETED      0:0 
177199          S20S13I       shas ucb-gener+        216  COMPLETED      0:0 
177199.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177199.exte+     extern            ucb-gener+        216  COMPLETED      0:0 
177199.0      pmi_proxy            ucb-gener+          9  COMPLETED      0:0 
177200       f12l32t32+       shas ucb-gener+        528  COMPLETED      0:0 
177200.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177200.exte+     extern            ucb-gener+        528  COMPLETED      0:0 
177201       f10l16t16+       shas ucb-gener+        264  COMPLETED      0:0 
177201.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177201.exte+     extern            ucb-gener+        264  COMPLETED      0:0 
177202          S20S13J       shas ucb-gener+        216    PENDING      0:0 
177203             HDDP       shas ucb-gener+         24  COMPLETED      0:0 
177203.batch      batch            ucb-gener+         24  COMPLETED      0:0 
177203.exte+     extern            ucb-gener+         24  COMPLETED      0:0 
177204       f12l32t32+       shas ucb-gener+         22    PENDING      0:0 
177205       f10l16t16+       shas ucb-gener+         11 CANCELLED+      0:0 
177207       16RBEVMuk+       shas ucb-gener+         24    PENDING      0:0 
177209       _interact+       shas ucb-gener+          1  COMPLETED      0:0 
177210       sbatch-ex+       shas ucb-gener+          1    PENDING      0:0 
177211       _interact+       shas ucb-gener+          1 CANCELLED+      0:0 
177212       sinteract+       shas ucb-testi+          1 CANCELLED+      0:0 
177213       16RBEVMuk+       shas ucb-gener+         24    PENDING      0:0 
177214       16RBEVMuk+       shas ucb-gener+         24    PENDING      0:0 
177215       16RBEVMuk+       shas ucb-gener+         24    PENDING      0:0 
177216       16dVMSk2k+       shas ucb-gener+         24    PENDING      0:0 
177217       16dVMSk2k+       shas ucb-gener+         24    PENDING      0:0 
177218         TC32DNS2       shas ucb-gener+         48    PENDING      0:0 
177219          TC32LES       shas ucb-gener+         48    PENDING      0:0 
177221       sinteract+       shas ucb-testi+          1 CANCELLED+      0:0 
177222       OPA-netwo+       shas ucb-testi+          4    PENDING      0:0 
177223          32DNSk1       shas ucb-gener+         48    PENDING      0:0 
177224          32DNSk2       shas ucb-gener+         48    PENDING      0:0 
177225          32LESk1       shas ucb-gener+         48    PENDING      0:0 
177226       32aVMSk1k+       shas ucb-gener+         48    PENDING      0:0 
177227       32aVMSk1k+       shas ucb-gener+         48    PENDING      0:0 
177228       32aVMSk2k+       shas ucb-gener+         48    PENDING      0:0 
177229       32aVMSk2k+       shas ucb-gener+         48    PENDING      0:0 
177230       32dVMSk1k+       shas ucb-gener+         48    PENDING      0:0 
177231       32dVMSk1k+       shas ucb-gener+         48    PENDING      0:0 
177232       32dVMSk2k+       shas ucb-gener+         48    PENDING      0:0 
177233       32dVMSk2k+       shas ucb-gener+         48    PENDING      0:0 
177234       32RBEVMsk+       shas ucb-gener+         48    PENDING      0:0 
177235       32RBEVMsk+       shas ucb-gener+         48    PENDING      0:0 
177236       32RBEVMsk+       shas ucb-gener+         48    PENDING      0:0 
177237       32RBEVMsk+       shas ucb-gener+         48    PENDING      0:0 
177238       32RBEVMuk+       shas ucb-gener+         48    PENDING      0:0 
177239       32RBEVMuk+       shas ucb-gener+         48    PENDING      0:0 
177240       32RBEVMuk+       shas ucb-gener+         48    PENDING      0:0 
177241       32RBEVMuk+       shas ucb-gener+         48    PENDING      0:0 
177243            Pyrad       smem ucb-gener+          1 CANCELLED+      0:0 
177244            Pyrad       smem ucb-gener+          1 CANCELLED+      0:0 
177245            Pyrad       smem ucb-gener+          1 CANCELLED+      0:0 
177246            Pyrad       smem ucb-gener+          1 CANCELLED+      0:0 
177247            Pyrad       smem ucb-gener+          1    PENDING      0:0 
177248             Gcut       smem ucb-gener+          1    PENDING      0:0 
177249             Gcut       smem ucb-gener+          1    PENDING      0:0 
177250             Gcut       smem ucb-gener+          1    PENDING      0:0 
177251             Gcut       smem ucb-gener+          1    PENDING      0:0 
177252             Gcut       smem ucb-gener+          1    PENDING      0:0 
177253             Gcut       smem ucb-gener+          1    PENDING      0:0 
177254             Gcut       smem ucb-gener+          1    PENDING      0:0 
177255             Gcut       smem ucb-gener+          1    PENDING      0:0 
177256             Gcut       smem ucb-gener+          1    PENDING      0:0 
177257             Gcut       smem ucb-gener+          1    PENDING      0:0 
177258             Gcut       smem ucb-gener+          1    PENDING      0:0 
177259             Gcut       smem ucb-gener+          1    PENDING      0:0 
177260             Gcut       smem ucb-gener+          1    PENDING      0:0 
177261             Gcut       smem ucb-gener+          1    PENDING      0:0 
177262             Gcut       smem ucb-gener+          1    PENDING      0:0 
177263             Gcut       smem ucb-gener+          1    PENDING      0:0 
177264             Gcut       smem ucb-gener+          1    PENDING      0:0 
177265             Gcut       smem ucb-gener+          1    PENDING      0:0 
177266             Gcut       smem ucb-gener+          1    PENDING      0:0 
177267             Gcut       smem ucb-gener+          1    PENDING      0:0 
177268             Gcut       smem ucb-gener+          1    PENDING      0:0 
177269             Gcut       smem ucb-gener+          1    PENDING      0:0 
177270             Gcut       smem ucb-gener+          1    PENDING      0:0 
177271             Gcut       smem ucb-gener+          1    PENDING      0:0 
177272             Gcut       smem ucb-gener+          1    PENDING      0:0 
177273             Gcut       smem ucb-gener+          1    PENDING      0:0 
177274             Gcut       smem ucb-gener+          1    PENDING      0:0 
177275             Gcut       smem ucb-gener+          1    PENDING      0:0 
177276             Gcut       smem ucb-gener+          1    PENDING      0:0 
177277             Gcut       smem ucb-gener+          1    PENDING      0:0 
177278             Gcut       smem ucb-gener+          1    PENDING      0:0 
177279             Gcut       smem ucb-gener+          1    PENDING      0:0 
177280             Gcut       smem ucb-gener+          1    PENDING      0:0 
177281             Gcut       smem ucb-gener+          1    PENDING      0:0 
177282             Gcut       smem ucb-gener+          1    PENDING      0:0 
177283             Gcut       smem ucb-gener+          1    PENDING      0:0 
177284             Gcut       smem ucb-gener+          1    PENDING      0:0 
177285             Gcut       smem ucb-gener+          1    PENDING      0:0 
177286             Gcut       smem ucb-gener+          1    PENDING      0:0 
177287             Gcut       smem ucb-gener+          1    PENDING      0:0 
177288             Gcut       smem ucb-gener+          1    PENDING      0:0 
177289             Gcut       smem ucb-gener+          1    PENDING      0:0 
177290             Gcut       smem ucb-gener+          1    PENDING      0:0 
177291             Gcut       smem ucb-gener+          1    PENDING      0:0 
177292             Gcut       smem ucb-gener+          1    PENDING      0:0 
sacct: debug:  slurmdbd: Sent fini msg

I have just restarted slurmdbd with DebugLevel set to debug2 and will be uploading the log file in another 15 minutes.
Comment 19 Jonathon Anderson 2017-07-05 19:57:18 MDT
(In reply to Tim Wickberg from comment #17)
> Has anything else changed on the configuration recently?

We made two changes to slurm.conf today.


 # Setup prolog and epilog scripts
-Epilog=/etc/slurm/epilog
-TaskEpilog=/etc/slurm/taskepilog
-TaskProlog=/etc/slurm/taskprolog
+Epilog=/etc/slurm/epilog.d/*
+TaskEpilog=/etc/slurm/taskepilog.d/*
+TaskProlog=/etc/slurm/taskprolog.d/*


-PartitionName=shas Nodes=shas[01-05][01-60],shas06[05-64],shas08[01-60],shas09[01-28] Default=YES DefMemPerCPU=5250 MaxMemPerCPU=5250 DefaultTime=4:00:00 maxtime=1-00:00:00 QOS=shas AllowQos=long,debug,condo,normal TRESBillingWeights="CPU=1.0"
+PartitionName=shas Nodes=shas[01-05][01-60],shas06[05-64],shas08[01-60],shas09[01-28] Default=YES DefMemPerCPU=4944 MaxMemPerCPU=4944 DefaultTime=4:00:00 maxtime=1-00:00:00 QOS=shas AllowQos=long,debug,condo,normal TRESBillingWeights="CPU=1.0"
 
-PartitionName=sgpu Nodes=sgpu[01-05][01-02],sgpu0801 DefMemPerCPU=5250 MaxMemPerCPU=5250 DefaultTime=4:00:00 maxtime=1-00:00:00 QOS=sgpu AllowQos=long,debug,condo,normal TRESBillingWeights="CPU=2.5" Shared=Exclusive
+PartitionName=sgpu Nodes=sgpu[01-05][01-02],sgpu0801 DefMemPerCPU=4944 MaxMemPerCPU=4944 DefaultTime=4:00:00 maxtime=1-00:00:00 QOS=sgpu AllowQos=long,debug,condo,normal TRESBillingWeights="CPU=2.5" Shared=Exclusive
 
-PartitionName=sknl Nodes=sknl07[01-20] DefMemPerCPU=5250 MaxMemPerCPU=5250 DefaultTime=4:00:00 maxtime=1-00:00:00 QOS=sknl AllowQos=long,debug,normal TRESBillingWeights="CPU=1.0" Shared=Exclusive 
+PartitionName=sknl Nodes=sknl07[01-20] DefMemPerCPU=4944 MaxMemPerCPU=4944 DefaultTime=4:00:00 maxtime=1-00:00:00 QOS=sknl AllowQos=long,debug,normal TRESBillingWeights="CPU=1.0" Shared=Exclusive



> I'm guessing you're
> in a maintenance window at the moment as you'd mentioned having no jobs
> running?

That's correct.


John's already posted other responses.
Comment 20 john.blaas 2017-07-05 20:09:47 MDT
Created attachment 4880 [details]
slurmdb1-debug2.log

Added the slurmdbd log output with DebugLevel set to debug2
Comment 22 Tim Wickberg 2017-07-05 20:31:38 MDT
The "We have more time than is possible" error should not be causing any immediate issues for the system, and is the only "error" level message that's showing now.

Are you still seeing warnings in the slurmctld.log file after that restart? I don't see any immediate issues in slurmdbd.log aside from a large number of node down updates (bug 3365 may be related to that, although that does not have a fix identified yet).
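For context on where those numbers come from, here is a sketch (not the actual slurmdbd rollup code) of the consistency check behind the "We have more time than is possible" message: the hourly rollup treats cluster CPU count × seconds in the period as a hard ceiling, so for summit(9600) the ceiling is 9600 × 3600 = 34,560,000 CPU-seconds, and the error fires when recorded allocated plus down/planned time exceeds it. The function names below are made up for illustration.

```python
# Illustrative sketch of the check behind the logged error; the real
# implementation lives in slurmdbd's hourly rollup, not in this code.

def rollup_ceiling(cpu_count: int, period_seconds: int = 3600) -> int:
    """Maximum possible CPU-seconds in one rollup period."""
    return cpu_count * period_seconds

def over_committed(alloc: int, down: int, planned: int, cpu_count: int) -> bool:
    """True when recorded time exceeds what the hardware could deliver."""
    return alloc + down + planned > rollup_ceiling(cpu_count)

# Figures from the log: cluster summit(9600) has 9600 CPUs, and the
# 2017-02-17 14:00-15:00 hour recorded 34,560,000 allocated plus
# 691,200 down CPU-seconds -- more than one hour of the whole cluster.
assert rollup_ceiling(9600) == 34_560_000
assert over_committed(34_560_000, 691_200, 0, 9600)
```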
Comment 23 john.blaas 2017-07-05 20:41:25 MDT
Created attachment 4881 [details]
slurm5 controller daemon log after slurmdb restart

Attached is the slurm5 controller log after we had restarted slurmdb with DebugLevel set to debug2.

Slurm6 (the backup controller) had produced no additional log information after the slurmdb restart.
Comment 24 Tim Wickberg 2017-07-05 21:06:39 MDT
I see a spike / slurmdbd communication stall when the nodes are brought back online, and then it looks like things have settled down now.

If you use sacct, I imagine you're seeing normal accounting statistics at this point?

I suspect this is a duplicate issue to bug 3365 - there's something within slurmctld that generates a flood of node state messages if started up with all nodes in the cluster offline, which then overwhelms slurmdbd and gets things stuck.

Bringing nodes online seems to mitigate that behavior, and get things to settle back down to normal - which is where you appear to be now?

If so, do you mind downgrading the severity level some, and we can further work through this during business hours tomorrow?
Comment 25 john.blaas 2017-07-05 22:12:13 MDT
We have been able to start the partitions again and jobs seem to be running without issue now.

I have downgraded the severity in light of jobs being able to run now.
Comment 26 john.blaas 2017-07-25 12:06:33 MDT
Just as an update, we saw this same behaviour again during our latest planned maintenance.
Comment 27 Danny Auble 2017-08-16 16:49:31 MDT
John, I will make note we fixed a memory leak in commit 6c7b9ba161efe.  This is in 17.02.7, but should be easily backported to 16.05.

Have you seen this happen lately?  Also do you have a plan to migrate to 17.02?
Comment 28 Danny Auble 2017-08-22 15:35:23 MDT
Anything else needed on this?
Comment 29 Jonathon Anderson 2017-08-23 10:14:56 MDT
We've continued to experience this issue every time we've had to take the cluster down (which, unrelatedly, has happened a lot recently, but should be behind us now). We want to upgrade to the latest 17; but we've got another cluster on the same slurmdbd that's running 14, so we have to upgrade that one first.
Comment 30 Danny Auble 2017-08-23 12:16:11 MDT
No problem Jonathon, thanks for the heads up.  Upgrading that 14.[03|11] system would definitely be a priority. I would expect this to keep happening until you upgrade or patch.  As noted, the latest 17.02.7 has this patch in it, which should fix your issue.

Would you be ok to close this and reopen if you still have issue after upgrading?  I am fine either way.
Comment 31 Jonathon Anderson 2017-08-25 10:51:25 MDT
If it is your assertion that this issue is fixed in a later release, then closing the case is probably fine.

As an aside, can you provide a link to some information about the process of upgrading from 14.x to 17.x through 15/16.x? Namely, if our slurmdbd is currently 16.05, can we upgrade the 14.x slurmctld's to 16.05, and then do the 14.x slurmd's in batches? Would we be fools to try to do this while keeping the cluster up? Any chance we could get some guidance on such an upgrade in general? (I'm happy to move that conversation out of this case, though.)
Comment 32 Danny Auble 2017-08-25 11:11:28 MDT
(In reply to Jonathon Anderson from comment #31)
> If it is your assertion that this issue is fixed in a later release, than
> closing the case is probably fine.
> 
> As an aside, can you provide a link to some information about the process of
> upgrading from 14.x to 17.x through 15/16.x? Namely, if our slurmdbd is
> currently 16.05, can we upgrade the 14.x slurmctld's to 16.05, and then do
> the 14.x slurmd's in batches? Would we be fools to try to do this while
> keeping the cluster up? Any chance we could get some guidance on such an
> upgrade in general? (I'm happy to move that conversation out of this case,
> though.)

Yes, the upgrade docs are found here https://slurm.schedmd.com/quickstart_admin.html#upgrade

Note 14.03 would need to upgrade to 15.08 first, whereas 14.11 could go to 16.05.  Since both 15.08 and 16.05 can talk to a 17.02 slurmdbd, that shouldn't really be an issue.

Your idea of having the slurmctld running 16.05 and the slurmd's still be 14.11 (assuming 14.11 is the version you have) and update them in batches should work fine.

People do hot upgrades all the time.  The main issue is order: always make sure the slurmdbd is the highest version, followed by the slurmctld, then the slurmds.  Note a 14.11 slurmstepd can still talk to a 16.05 slurmd, so running jobs should persist.

I would try this out on a test system to raise your confidence.  Once you get the hang of it, you should be able to down Slurm (with running jobs), upgrade everything, and restart the daemons, resulting in the new Slurm running everywhere with no job loss.

If you would like more help on the matter, I agree a new case would be a better idea, just so we don't muddy the waters on this bug.  Hopefully the webpage will help make it easier.  Following those directions closely should provide a very safe way to update.
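The ordering rule described above (slurmdbd newest, then slurmctld, then the slurmds) can be expressed as a simple check. This is an illustrative sketch only, not a SchedMD tool; the function name and version tuples are made up for the example.

```python
# Hypothetical helper encoding the documented rolling-upgrade order:
# slurmdbd version >= slurmctld version >= slurmd version.

def upgrade_order_ok(dbd: tuple, ctld: tuple, slurmd: tuple) -> bool:
    """Versions as tuples, e.g. (16, 5) for 16.05; newest daemons first."""
    return dbd >= ctld >= slurmd

# Mid-upgrade state like the one discussed in this ticket: a 17.02
# slurmdbd, a 16.05 slurmctld, and 14.11 slurmds updated in batches.
assert upgrade_order_ok((17, 2), (16, 5), (14, 11))

# Upgrading the slurmds ahead of the controller violates the order:
assert not upgrade_order_ok((16, 5), (16, 5), (17, 2))
```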
Comment 33 Danny Auble 2017-08-25 11:13:32 MDT
As noted, this issue should be fixed in 17.02.7.  Please reopen if anything further is needed on this topic.