Description
Paul Edmon
2020-05-01 18:33:35 MDT
Finally got an sdiag to return. Here is what it looks like:
[root@holy-slurm02 ~]# sdiag
*******************************************************
sdiag output at Fri May 01 21:00:08 2020 (1588381208)
Data since Fri May 01 20:56:54 2020 (1588381014)
*******************************************************
Server thread count: 256
Agent queue size: 0
Agent count: 0
DBD Agent queue size: 4
Jobs submitted: 0
Jobs started: 0
Jobs completed: 46
Jobs canceled: 0
Jobs failed: 0
Job states ts: Fri May 01 20:59:50 2020 (1588381190)
Jobs pending: 41684
Jobs running: 3038
Main schedule statistics (microseconds):
Last cycle: 30326
Max cycle: 42371
Total cycles: 5
Mean cycle: 32196
Mean depth cycle: 0
Cycles per minute: 1
Last queue length: 0
Backfilling stats
Total backfilled jobs (since last slurm start): 0
Total backfilled jobs (since last stats cycle start): 0
Total backfilled heterogeneous job components: 0
Total cycles: 0
Last cycle when: Wed Dec 31 19:00:00 1969 (0)
Last cycle: 0
Max cycle: 0
Last depth cycle: 0
Last depth cycle (try sched): 0
Last queue length: 0
Latency for 1000 calls to gettimeofday(): 28 microseconds
Remote Procedure Call statistics by message type
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2119 ave_time:1403473 total_time:2973960568
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:333 ave_time:65274240 total_time:21736322001
REQUEST_STEP_COMPLETE ( 5016) count:294 ave_time:15850743 total_time:4660118512
MESSAGE_EPILOG_COMPLETE ( 6012) count:237 ave_time:6835661 total_time:1620051867
REQUEST_PARTITION_INFO ( 2009) count:154 ave_time:12923 total_time:1990195
REQUEST_JOB_INFO_SINGLE ( 2021) count:151 ave_time:3479016 total_time:525331456
REQUEST_NODE_INFO ( 2007) count:145 ave_time:118578 total_time:17193917
REQUEST_FED_INFO ( 2049) count:118 ave_time:180 total_time:21257
REQUEST_JOB_INFO ( 2003) count:104 ave_time:38759825 total_time:4031021849
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:12 ave_time:49075701 total_time:588908420
REQUEST_JOB_USER_INFO ( 2039) count:5 ave_time:13642526 total_time:68212631
REQUEST_SHARE_INFO ( 2022) count:4 ave_time:40979 total_time:163918
REQUEST_RESOURCE_ALLOCATION ( 4001) count:4 ave_time:20502351 total_time:82009404
REQUEST_UPDATE_JOB ( 3001) count:4 ave_time:4157839 total_time:16631356
REQUEST_STATS_INFO ( 2035) count:3 ave_time:3824 total_time:11473
REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:2 ave_time:12488591 total_time:24977183
REQUEST_CANCEL_JOB_STEP ( 5005) count:1 ave_time:79234 total_time:79234
Remote Procedure Call statistics by user
root ( 0) count:3283 ave_time:10681802 total_time:35068357100
sgprasath ( 21136) count:53 ave_time:652570 total_time:34586236
dingma ( 40306) count:47 ave_time:367572 total_time:17275926
kholton ( 20682) count:22 ave_time:15233 total_time:335131
simonx ( 12821) count:20 ave_time:865928 total_time:17318569
mengzhang ( 12484) count:20 ave_time:847308 total_time:16946176
acoran ( 22029) count:19 ave_time:910703 total_time:17303371
gtorri ( 57089) count:19 ave_time:929695 total_time:17664212
ballen ( 5899723) count:15 ave_time:2306049 total_time:34590743
jingnandu ( 20922) count:11 ave_time:1541796 total_time:16959758
mpolansky ( 13716) count:10 ave_time:4227 total_time:42276
u_219006_g_65249( 6000076) count:10 ave_time:1712869 total_time:17128692
jseeley ( 14063) count:10 ave_time:3362528 total_time:33625287
bostdiek ( 20817) count:10 ave_time:3388593 total_time:33885933
jaehwan ( 555445) count:9 ave_time:1866485 total_time:16798366
nthor ( 20593) count:9 ave_time:1913447 total_time:17221023
sychen ( 13448) count:9 ave_time:1879294 total_time:16913648
ncolombi ( 21080) count:9 ave_time:3924203 total_time:35317829
agomezlievano ( 12518) count:8 ave_time:30274 total_time:242198
jyang12 ( 12394) count:8 ave_time:2121857 total_time:16974858
zmorris ( 558739) count:8 ave_time:2107202 total_time:16857619
fseixas ( 14101) count:7 ave_time:2421317 total_time:16949222
atrevino ( 10871) count:7 ave_time:21673 total_time:151717
ajcukierman ( 12585) count:7 ave_time:2433861 total_time:17037030
ctedijanto ( 21988) count:7 ave_time:2931 total_time:20517
chsu ( 20796) count:7 ave_time:2412941 total_time:16890587
rmattheis ( 13299) count:6 ave_time:2889130 total_time:17334785
lsepulvedaduran ( 10976) count:6 ave_time:6096 total_time:36578
dkakkar ( 20839) count:5 ave_time:19894847 total_time:99474237
tfranzmeyer ( 20760) count:4 ave_time:56098417 total_time:224393671
pacargile ( 556471) count:4 ave_time:4157839 total_time:16631356
daldarondo ( 13304) count:3 ave_time:26914508 total_time:80743525
lshen ( 58176) count:3 ave_time:38485790 total_time:115457370
pryke ( 40838) count:3 ave_time:1947 total_time:5843
qxz125 ( 556345) count:2 ave_time:49872903 total_time:99745806
atsang ( 14422) count:2 ave_time:42334 total_time:84668
bmatejek ( 5900266) count:2 ave_time:37281327 total_time:74562655
cclapp ( 57329) count:2 ave_time:19433481 total_time:38866963
fpinter ( 21371) count:1 ave_time:25116489 total_time:25116489
ruzhang ( 558295) count:1 ave_time:16089529 total_time:16089529
e88bu2009 ( 21880) count:1 ave_time:24971749 total_time:24971749
jang ( 12650) count:1 ave_time:16095993 total_time:16095993
Pending RPC statistics
No pending RPCs
Created attachment 14083 [details]
Current slurm.conf
Please increase your logging level:

> scontrol setdebug debug3
> scontrol setdebugflags +agent

Please upload new logs after 15 min and revert:

> scontrol setdebug info
> scontrol setdebugflags -agent

Thanks.

I've managed to stabilize things, and right now I'm testing out individual partitions and jobs to see what is causing the storm. I will enable the debug settings when I have a moment.

My suspicion is that a user (or more likely several users) submitted thousands of jobs to multiple partitions that all failed immediately, causing the completion storm. Is there a way for Slurm to automatically put users in a penalty box if they have job failures? (i.e. all of their jobs in the pending queue are held until the admin releases them?)

-Paul Edmon-

(In reply to Paul Edmon from comment #4)
> My suspicion is that a user (or more likely several users) submitted thousands
> of jobs to multiple partitions that all failed immediately, causing the
> completion storm. Is there a way for Slurm to automatically put users
> in a penalty box if they have job failures? (i.e. all of their jobs in the
> pending queue are held until the admin releases them?)

The job submit throttle plugin is the closest to this request I'm aware of:
https://github.com/SchedMD/slurm/blob/master/src/plugins/job_submit/throttle/job_submit_throttle.c

Yeah, this feature would need to apply when the job ran, not at submission time. It would be handy, as it would prevent completion storms like this one: if a user submits thousands of jobs, or a job array of a thousand tasks, and the first fails, odds are all or most of them will. So it would be great to be able to put a hold on the rest of that user's jobs.

I've put the additional debug flags on and I will let them run for another 15 min before turning them off. I will then send you the log. I only have a few partitions running out of the 140 that we have configured, as I'm trying to see if there are any problematic jobs. Already found one user.

-Paul Edmon-
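For context on the plugin linked above: enabling it is a slurm.conf change. A minimal sketch (the limit value here is arbitrary, and jobs_per_user_per_hour is the SchedulerParameters option the throttle plugin reads; it would need to be merged with the site's existing SchedulerParameters):

# Hypothetical slurm.conf excerpt: load the throttle job_submit plugin
# and cap each user's submission rate (the 100/hour value is arbitrary).
JobSubmitPlugins=throttle
SchedulerParameters=jobs_per_user_per_hour=100

Note this throttles at submission time only, which is exactly the limitation discussed above: it would not hold a user's already-pending jobs after run-time failures.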
Created attachment 14084 [details]
slurmctld log
The period with higher debugging is from 21:30-21:45
Some additional forensic work. I've tracked down 4 users thus far whose jobs exited basically immediately, either due to failure, OOM, or because the job simply ran instantly and completed. So I think what we ended up with was a perfect completion storm: a ton of jobs continually thrashing the scheduler, as the scheduler would throw them in, they would exit immediately, and the resulting room would let the scheduler throw in more, which would also exit.

(In reply to Paul Edmon from comment #8)
> So I think what we ended up with
> was a perfect completion storm of a ton of jobs just continually
> thrashing the scheduler as the scheduler would throw them in and then
> they would exit immediately, creating room for the scheduler to throw
> more in and then exit.

Is the cluster currently working?

In fits and starts. I'm canceling the culprits' jobs, which is wedging the scheduler due to the completion storm of the cancellation... I should have used a bash loop instead of a bulk scancel. If I can clear the backlog of RPCs, things seem to work as expected. It's just a matter of getting off the mountain. That said, I don't have it working at full capacity yet, where everything is back to where it was prior to the RPC storm.

-Paul Edmon-

Please also take a core of slurmctld:

> pgrep slurmctld | xargs -i gcore {}

I assume there is only 1 slurmctld running, but this will grab any. Please then provide the following from gdb loaded with the core:

> set pagination off
> set print pretty on
> t a a bt full

Please save the core for later; we may ask for more info from gdb. Please also attach the whole slurmctld log during the incident.
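For the backtrace step, the same gdb commands can also be run non-interactively against the saved core. A sketch (the slurmctld binary path and core file name are assumptions; gcore writes core.<pid> by default, and "t a a bt full" is short for "thread apply all bt full"):

# Hypothetical one-shot capture of a full backtrace from the core.
gdb -batch \
    -ex 'set pagination off' \
    -ex 'set print pretty on' \
    -ex 'thread apply all bt full' \
    /usr/sbin/slurmctld core.12345 > slurmctld-bt.txt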
(In reply to Paul Edmon from comment #10)
> If I can clear the backlog of RPCs, things seem to work as expected.
> It's just a matter of getting off the mountain. That said, I don't
> have it working at full capacity yet, where everything is back to
> where it was prior to the RPC storm.

Please add these to SchedulerParameters in slurm.conf:

> bf_max_job_start=100 #(update instead of add)
> bf_max_job_user=50
> bf_max_job_user_part=10
> sched_max_job_start=100

This should stop the scheduler from launching so many jobs, to give it a chance to recover from the rush of job completes. Please upload new slurmctld logs and sdiag after the change.
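Since SchedulerParameters is a single comma-separated slurm.conf option, the values above would be folded into the existing line rather than added as separate lines. A sketch (<existing options> stands for whatever is already configured at the site):

# Hypothetical merged SchedulerParameters line.
SchedulerParameters=<existing options>,bf_max_job_start=100,bf_max_job_user=50,bf_max_job_user_part=10,sched_max_job_start=100

An "scontrol reconfigure" should then pick up the change without a full slurmctld restart.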
Created attachment 14085 [details]
backtrace for core dump of slurmctld
Added those updates. I will get you the sdiag once things settle down.

-Paul Edmon-

Created attachment 14086 [details]
full day slurmctld log
Incident starts around 12:30
Here is the current sdiag; mind you, I'm in an RPC storm right now due to my bulk canceling of a user's jobs:

[root@holy-slurm02 ~]# sdiag
*******************************************************
sdiag output at Fri May 01 22:40:24 2020 (1588387224)
Data since Fri May 01 22:35:11 2020 (1588386911)
*******************************************************
Server thread count: 256
Agent queue size: 14
Agent count: 0
DBD Agent queue size: 2
Jobs submitted: 0
Jobs started: 0
Jobs completed: 54
Jobs canceled: 0
Jobs failed: 0
Job states ts: Fri May 01 22:39:55 2020 (1588387195)
Jobs pending: 28879
Jobs running: 3779
Main schedule statistics (microseconds):
Last cycle: 19297
Max cycle: 32428
Total cycles: 4
Mean cycle: 25048
Mean depth cycle: 0
Cycles per minute: 0
Last queue length: 0
Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
Total backfilled jobs (since last slurm start): 0
Total backfilled jobs (since last stats cycle start): 0
Total backfilled heterogeneous job components: 0
Total cycles: 0
Last cycle when: Wed Dec 31 19:00:00 1969 (0)
Last cycle: 0
Max cycle: 0
Last depth cycle: 0
Last depth cycle (try sched): 0
Last queue length: 0
Latency for 1000 calls to gettimeofday(): 29 microseconds
Remote Procedure Call statistics by message type
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:3052 ave_time:1136570 total_time:3468814243
REQUEST_STEP_COMPLETE ( 5016) count:586 ave_time:110719367 total_time:64881549280
REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:273 ave_time:1181923 total_time:322665212
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:174 ave_time:15460826 total_time:2690183831
REQUEST_NODE_INFO ( 2007) count:119 ave_time:130587 total_time:15539894
REQUEST_JOB_STEP_CREATE ( 5001) count:116 ave_time:7865802 total_time:912433049
REQUEST_FED_INFO ( 2049) count:111 ave_time:494 total_time:54878
REQUEST_JOB_INFO_SINGLE ( 2021) count:99 ave_time:1159243 total_time:114765057
MESSAGE_EPILOG_COMPLETE ( 6012) count:76 ave_time:2672367 total_time:203099956
REQUEST_PARTITION_INFO ( 2009) count:47 ave_time:7680 total_time:360978
REQUEST_JOB_INFO ( 2003) count:31 ave_time:27205848 total_time:843381313
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:10 ave_time:15663028 total_time:156630284
REQUEST_SHARE_INFO ( 2022) count:3 ave_time:38724 total_time:116174
REQUEST_STATS_INFO ( 2035) count:3 ave_time:253 total_time:761
REQUEST_JOB_USER_INFO ( 2039) count:2 ave_time:1895804 total_time:3791609
REQUEST_RESOURCE_ALLOCATION ( 4001) count:1 ave_time:296898 total_time:296898
REQUEST_KILL_JOB ( 5032) count:1 ave_time:27150565 total_time:27150565
Remote Procedure Call statistics by user
root ( 0) count:4056 ave_time:17769687 total_time:72073854002
schlafly ( 35075) count:389 ave_time:3175059 total_time:1235098261
kholton ( 20682) count:19 ave_time:392498 total_time:7457479
acoran ( 22029) count:17 ave_time:6340 total_time:107794
sgprasath ( 21136) count:17 ave_time:1649 total_time:28033
dingma ( 40306) count:16 ave_time:371161 total_time:5938588
dkakkar ( 20839) count:15 ave_time:606991 total_time:9104868
simonx ( 12821) count:15 ave_time:513206 total_time:7698100
gtorri ( 57089) count:14 ave_time:266562 total_time:3731880
ballen ( 5899723) count:13 ave_time:525850 total_time:6836052
mengzhang ( 12484) count:13 ave_time:40928 total_time:532074
jingnandu ( 20922) count:11 ave_time:633657 total_time:6970235
nthor ( 20593) count:10 ave_time:120426 total_time:1204269
ajcukierman ( 12585) count:9 ave_time:495968 total_time:4463715
jfroberg ( 13751) count:9 ave_time:649581 total_time:5846232
jaehwan ( 555445) count:9 ave_time:152185 total_time:1369671
jyang12 ( 12394) count:9 ave_time:725685 total_time:6531172
mpolansky ( 13716) count:8 ave_time:556519 total_time:4452155
chsu ( 20796) count:7 ave_time:797738 total_time:5584171
sychen ( 13448) count:7 ave_time:649582 total_time:4547080
bostdiek ( 20817) count:7 ave_time:473564 total_time:3314949
atrevino ( 10871) count:6 ave_time:198376 total_time:1190259
rmattheis ( 13299) count:6 ave_time:506129 total_time:3036775
jseeley ( 14063) count:6 ave_time:19659 total_time:117956
tfranzmeyer ( 20760) count:3 ave_time:17658674 total_time:52976022
bmatejek ( 5900266) count:2 ave_time:24987538 total_time:49975076
jeastman ( 5899955) count:2 ave_time:28925574 total_time:57851148
daldarondo ( 13304) count:2 ave_time:12527300 total_time:25054600
jtaylor ( 558629) count:1 ave_time:331058 total_time:331058
benmergui ( 99690) count:1 ave_time:296898 total_time:296898
jianxiangqiu ( 22008) count:1 ave_time:27123857 total_time:27123857
e88bu2046 ( 21907) count:1 ave_time:353497 total_time:353497
jpegues ( 21791) count:1 ave_time:27150565 total_time:27150565
ehand ( 14058) count:1 ave_time:670431 total_time:670431
pkrastev ( 56139) count:1 ave_time:35060 total_time:35060
xlu123 ( 10076) count:0 ave_time:0 total_time:0
ghzhang17 ( 13095) count:0 ave_time:0 total_time:0
rweinberger ( 13263) count:0 ave_time:0 total_time:0
Pending RPC statistics
REQUEST_TERMINATE_JOB ( 6011) count:14
Pending RPCs
1: REQUEST_TERMINATE_JOB holy7c08102
2: REQUEST_TERMINATE_JOB holy7c06405
3: REQUEST_TERMINATE_JOB holy7c06405
4: REQUEST_TERMINATE_JOB holy7c06405
5: REQUEST_TERMINATE_JOB shakgpu41
6: REQUEST_TERMINATE_JOB shakgpu41
7: REQUEST_TERMINATE_JOB seasmicro23
8: REQUEST_TERMINATE_JOB seasmicro23
9: REQUEST_TERMINATE_JOB holy2c01316
10: REQUEST_TERMINATE_JOB holy2c01316
11: REQUEST_TERMINATE_JOB holy2c01316
12: REQUEST_TERMINATE_JOB holy2c01316
13: REQUEST_TERMINATE_JOB holy2c01316
14: REQUEST_TERMINATE_JOB holy2c01316

-Paul Edmon-
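On the bulk-cancel point: one way to avoid this kind of RPC storm is to cancel jobs one at a time with a pause between calls, rather than in a single bulk scancel. A rough sketch (the username and sleep interval are placeholders):

# Cancel a user's pending jobs one by one so the controller is not
# hit with a single huge burst of terminate RPCs.
for j in $(squeue -h -u baduser -t PENDING -o %A); do
    scancel "$j"
    sleep 1
done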
When I restart it looks like this:

May 1 22:43:00 holy-slurm02 slurmctld[44855]: _sync_nodes_to_comp_job: JobId=54737409_14(54737434) in completing state
May 1 22:43:00 holy-slurm02 slurmctld[44855]: select/cons_tres: rm_job_res: plugin still initializing
May 1 22:43:00 holy-slurm02 slurmctld[44855]: _sync_nodes_to_comp_job: JobId=54737409_15(54737435) in completing state
May 1 22:43:00 holy-slurm02 slurmctld[44855]: select/cons_tres: rm_job_res: plugin still initializing
May 1 22:43:00 holy-slurm02 slurmctld[44855]: _sync_nodes_to_comp_job: JobId=54737409_16(54737436) in completing state
May 1 22:43:00 holy-slurm02 slurmctld[44855]: select/cons_tres: rm_job_res: plugin still initializing
May 1 22:43:00 holy-slurm02 slurmctld[44855]: _sync_nodes_to_comp_job: JobId=54737409_17(54737437) in completing state
May 1 22:43:00 holy-slurm02 slurmctld[44855]: select/cons_tres: rm_job_res: plugin still initializing
May 1 22:43:00 holy-slurm02 slurmctld[44855]: _sync_nodes_to_comp_job: JobId=54737409_18(54737438) in completing state
May 1 22:43:00 holy-slurm02 slurmctld[44855]: select/cons_tres: rm_job_res: plugin still initializing
May 1 22:43:00 holy-slurm02 slurmctld[44855]: _sync_nodes_to_comp_job: completing 321 jobs
May 1 22:43:13 holy-slurm02 slurmctld[44855]: Warning: Note very large processing time from load_part_uid_allow_list: usec=13362758 began=22:43:00.167
May 1 22:43:13 holy-slurm02 slurmctld[44855]: Recovered state of reservation doshi-velez
May 1 22:43:13 holy-slurm02 slurmctld[44855]: Recovered state of reservation doshi-velez-gpu
May 1 22:43:13 holy-slurm02 slurmctld[44855]: Recovered state of 2 reservations
May 1 22:43:13 holy-slurm02 slurmctld[44855]: State of 0 triggers recovered
May 1 22:43:13 holy-slurm02 slurmctld[44855]: _preserve_plugins: backup_controller not specified
May 1 22:43:13 holy-slurm02 slurmctld[44855]: select/cons_tres: select_p_reconfigure
May 1 22:43:13 holy-slurm02 slurmctld[44855]: select/cons_tres: select_p_node_init
May 1 22:43:13 holy-slurm02 slurmctld[44855]: select/cons_tres: preparing for 145 partitions
May 1 22:43:13 holy-slurm02 slurmctld[44855]: Warning: Note very large processing time from read_slurm_conf: usec=36807998 began=22:42:36.828
May 1 22:43:13 holy-slurm02 slurmctld[44855]: Running as primary controller
May 1 22:43:13 holy-slurm02 slurmctld[44855]: Registering slurmctld at port 6820 with slurmdbd.
May 1 22:43:13 holy-slurm02 slurmctld[44855]: retry_list retry_list_size:321 msg_type=REQUEST_TERMINATE_JOB,REQUEST_TERMINATE_JOB,REQUEST_TERMINATE_JOB,REQUEST_TERMINATE_JOB,REQUEST_TERMINATE_JOB
May 1 22:43:13 holy-slurm02 slurmctld[44855]: mcs: MCSParameters = ondemand,ondemandselect,privatedata. ondemand set.
May 1 22:43:13 holy-slurm02 slurmctld[44855]: step_partial_comp: JobId=54480668 StepID=110 invalid
May 1 22:43:13 holy-slurm02 slurmctld[44855]: _job_complete: JobId=54733400 WEXITSTATUS 0
May 1 22:43:13 holy-slurm02 slurmctld[44855]: _job_complete: JobId=54733400 done
May 1 22:43:13 holy-slurm02 slurmctld[44855]: step_partial_comp: JobId=54708002_348(54736510) StepID=4294967295 invalid
May 1 22:43:13 holy-slurm02 slurmctld[44855]: _job_complete: JobId=54586861 WEXITSTATUS 0
May 1 22:43:13 holy-slurm02 slurmctld[44855]: _job_complete: JobId=54586861 done
May 1 22:43:13 holy-slurm02 slurmctld[44855]: cleanup_completing: JobId=54586418 completion process took 1757 seconds
May 1 22:43:13 holy-slurm02 slurmctld[44855]: step_partial_comp: JobId=54733398 StepID=4294967295 invalid
May 1 22:43:13 holy-slurm02 slurmctld[44855]: _job_complete: JobId=54586592 WEXITSTATUS 0
May 1 22:43:13 holy-slurm02 slurmctld[44855]: _job_complete: JobId=54586592 done
May 1 22:43:14 holy-slurm02 slurmctld[44855]: server_thread_count over limit (256), waiting
May 1 22:43:14 holy-slurm02 slurmctld[44855]: step_partial_comp: JobId=54480668 StepID=326 invalid
May 1 22:43:14 holy-slurm02 slurmctld[44855]: error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=54552845, this should never happen
May 1 22:43:14 holy-slurm02 slurmctld[44855]: step_partial_comp: JobId=54648048_25(54729445) StepID=4294967295 invalid

-Paul Edmon-

Please upload a new log in 10 minutes. I want to see if the agent queue will clear up. Please also upload slurmd log from holy7c08102.

Created attachment 14087 [details]
Log for slurmd on holy7c08102
Created attachment 14088 [details]
slurmctld log for 2200
Looks like things have settled down. I'm going to try reopening single partitions now to see if I see any other problematic jobs:

[root@holy-slurm02 ~]# sdiag
*******************************************************
sdiag output at Fri May 01 23:07:35 2020 (1588388855)
Data since Fri May 01 23:05:34 2020 (1588388734)
*******************************************************
Server thread count: 16
Agent queue size: 0
Agent count: 1
DBD Agent queue size: 0
Jobs submitted: 0
Jobs started: 0
Jobs completed: 44
Jobs canceled: 91
Jobs failed: 0
Job states ts: Fri May 01 23:07:22 2020 (1588388842)
Jobs pending: 28879
Jobs running: 3416
Main schedule statistics (microseconds):
Last cycle: 24080
Max cycle: 29775
Total cycles: 5
Mean cycle: 24481
Mean depth cycle: 0
Cycles per minute: 2
Last queue length: 0
Backfilling stats
Total backfilled jobs (since last slurm start): 0
Total backfilled jobs (since last stats cycle start): 0
Total backfilled heterogeneous job components: 0
Total cycles: 0
Last cycle when: Wed Dec 31 19:00:00 1969 (0)
Last cycle: 0
Max cycle: 0
Last depth cycle: 0
Last depth cycle (try sched): 0
Last queue length: 0
Latency for 1000 calls to gettimeofday(): 28 microseconds
Remote Procedure Call statistics by message type
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2151 ave_time:452820 total_time:974017819
REQUEST_PARTITION_INFO ( 2009) count:480 ave_time:13369 total_time:6417400
REQUEST_JOB_INFO_SINGLE ( 2021) count:474 ave_time:213231 total_time:101071951
REQUEST_JOB_STEP_CREATE ( 5001) count:471 ave_time:7833530 total_time:3689592669
REQUEST_STEP_COMPLETE ( 5016) count:465 ave_time:1292113 total_time:600832713
REQUEST_FED_INFO ( 2049) count:444 ave_time:259 total_time:115060
MESSAGE_EPILOG_COMPLETE ( 6012) count:392 ave_time:7299999 total_time:2861599919
REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:247 ave_time:74224 total_time:18333474
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:140 ave_time:16196175 total_time:2267464639
REQUEST_KILL_JOB ( 5032) count:91 ave_time:183127 total_time:16664643
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:47 ave_time:5436944 total_time:255536411
REQUEST_STATS_INFO ( 2035) count:17 ave_time:346 total_time:5890
REQUEST_JOB_USER_INFO ( 2039) count:9 ave_time:141924 total_time:1277320
REQUEST_NODE_INFO ( 2007) count:7 ave_time:157950 total_time:1105653
REQUEST_JOB_INFO ( 2003) count:4 ave_time:18466464 total_time:73865858
REQUEST_SHARE_INFO ( 2022) count:3 ave_time:41067 total_time:123203
REQUEST_RESOURCE_ALLOCATION ( 4001) count:1 ave_time:20086781 total_time:20086781
Remote Procedure Call statistics by user
root ( 0) count:3293 ave_time:2063940 total_time:6796556091
schlafly ( 35075) count:718 ave_time:5164242 total_time:3707926143
dingma ( 40306) count:208 ave_time:7542 total_time:1568761
sgprasath ( 21136) count:195 ave_time:4926 total_time:960580
kholton ( 20682) count:89 ave_time:9910 total_time:881997
mengzhang ( 12484) count:88 ave_time:18747 total_time:1649758
ballen ( 5899723) count:87 ave_time:5180 total_time:450718
gtorri ( 57089) count:87 ave_time:9662 total_time:840594
simonx ( 12821) count:84 ave_time:10278 total_time:863372
acoran ( 22029) count:74 ave_time:18433 total_time:1364108
jingnandu ( 20922) count:46 ave_time:11661 total_time:536433
nthor ( 20593) count:41 ave_time:35060 total_time:1437465
jaehwan ( 555445) count:41 ave_time:15745 total_time:645555
sychen ( 13448) count:41 ave_time:5565 total_time:228190
rmattheis ( 13299) count:40 ave_time:34110 total_time:1364406
bostdiek ( 20817) count:40 ave_time:5441 total_time:217679
jyang12 ( 12394) count:40 ave_time:5885 total_time:235401
chsu ( 20796) count:40 ave_time:3056 total_time:122276
atrevino ( 10871) count:38 ave_time:19668 total_time:747412
ajcukierman ( 12585) count:38 ave_time:5325 total_time:202356
mpolansky ( 13716) count:35 ave_time:5377 total_time:188205
tfranzmeyer ( 20760) count:29 ave_time:5591371 total_time:162149786
bmatejek ( 5900266) count:16 ave_time:4812152 total_time:76994438
jseeley ( 14063) count:14 ave_time:288302 total_time:4036233
daldarondo ( 13304) count:10 ave_time:8897164 total_time:88971640
dchan ( 558943) count:6 ave_time:79832 total_time:478992
jfroberg ( 13751) count:2 ave_time:6923 total_time:13846
ruzhang ( 558295) count:1 ave_time:20086781 total_time:20086781
yaowang ( 11810) count:1 ave_time:16115066 total_time:16115066
froy ( 12706) count:1 ave_time:277121 total_time:277121
jpegues ( 21791) count:0 ave_time:0 total_time:0
Pending RPC statistics
No pending RPCs

-Paul Edmon-

Here is with 3 partitions open:

[root@holy-slurm02 ~]# sdiag
*******************************************************
sdiag output at Fri May 01 23:18:31 2020 (1588389511)
Data since Fri May 01 23:05:34 2020 (1588388734)
*******************************************************
Server thread count: 248
Agent queue size: 0
Agent count: 8
DBD Agent queue size: 4
Jobs submitted: 108
Jobs started: 418
Jobs completed: 155
Jobs canceled: 91
Jobs failed: 0
Job states ts: Fri May 01 23:18:06 2020 (1588389486)
Jobs pending: 28578
Jobs running: 3709
Main schedule statistics (microseconds):
Last cycle: 215507
Max cycle: 300186
Total cycles: 26
Mean cycle: 111743
Mean depth cycle: 12
Cycles per minute: 2
Last queue length: 3404
Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
Total backfilled jobs (since last slurm start): 274
Total backfilled jobs (since last stats cycle start): 274
Total backfilled heterogeneous job components: 0
Total cycles: 7
Last cycle when: Fri May 01 23:18:04 2020 (1588389484)
Last cycle: 27759586
Max cycle: 88216689
Mean cycle: 32782808
Last depth cycle: 2629
Last depth cycle (try sched): 52
Depth Mean: 2471
Depth Mean (try depth): 67
Last queue length: 3422
Queue length mean: 3437
Latency for 1000 calls to gettimeofday(): 28 microseconds
Remote Procedure Call statistics by message type
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2187 ave_time:453748 total_time:992347916
REQUEST_JOB_STEP_CREATE ( 5001) count:2147 ave_time:25123331 total_time:53939793028
REQUEST_PARTITION_INFO ( 2009) count:1870 ave_time:100690 total_time:188290996
REQUEST_FED_INFO ( 2049) count:1712 ave_time:273 total_time:468316
REQUEST_JOB_INFO_SINGLE ( 2021) count:1707 ave_time:514885 total_time:878909349
REQUEST_STEP_COMPLETE ( 5016) count:597 ave_time:1992293 total_time:1189399345
MESSAGE_EPILOG_COMPLETE ( 6012) count:516 ave_time:7081250 total_time:3653925371
REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:471 ave_time:103458 total_time:48729023
REQUEST_COMPLETE_PROLOG ( 6018) count:462 ave_time:6962299 total_time:3216582553
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:253 ave_time:15511429 total_time:3924391777
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:239 ave_time:9470308 total_time:2263403822
REQUEST_KILL_JOB ( 5032) count:91 ave_time:183127 total_time:16664643
REQUEST_NODE_INFO_SINGLE ( 2040) count:87 ave_time:120367 total_time:10471935
REQUEST_NODE_INFO ( 2007) count:48 ave_time:173608 total_time:8333203
REQUEST_STATS_INFO ( 2035) count:38 ave_time:362 total_time:13777
REQUEST_JOB_USER_INFO ( 2039) count:37 ave_time:528677 total_time:19561062
REQUEST_JOB_INFO ( 2003) count:20 ave_time:18827043 total_time:376540868
REQUEST_SHARE_INFO ( 2022) count:14 ave_time:45118 total_time:631662
REQUEST_UPDATE_NODE ( 3002) count:8 ave_time:2739537 total_time:21916302
REQUEST_RESOURCE_ALLOCATION ( 4001) count:4 ave_time:16192802 total_time:64771211
REQUEST_UPDATE_PARTITION ( 3005) count:3 ave_time:22761824 total_time:68285474
REQUEST_JOB_READY ( 4019) count:2 ave_time:269 total_time:539
REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:1 ave_time:939 total_time:939
Remote Procedure Call statistics by user
root ( 0) count:4538 ave_time:2999670 total_time:13612506105
schlafly ( 35075) count:2575 ave_time:20858703 total_time:53711161254
dingma ( 40306) count:577 ave_time:51809 total_time:29893957
sgprasath ( 21136) count:561 ave_time:6678 total_time:3746765
ballen ( 5899723) count:349 ave_time:163438 total_time:57039982
kholton ( 20682) count:346 ave_time:75262 total_time:26040928
mengzhang ( 12484) count:346 ave_time:77776 total_time:26910505
simonx ( 12821) count:336 ave_time:18391 total_time:6179416
acoran ( 22029) count:334 ave_time:84926 total_time:28365423
gtorri ( 57089) count:327 ave_time:72709 total_time:23775844
jingnandu ( 20922) count:183 ave_time:111679 total_time:20437382
jyang12 ( 12394) count:175 ave_time:131350 total_time:22986272
nthor ( 20593) count:174 ave_time:172635 total_time:30038528
bostdiek ( 20817) count:174 ave_time:130671 total_time:22736889
sychen ( 13448) count:172 ave_time:162721 total_time:27988049
chsu ( 20796) count:172 ave_time:141293 total_time:24302532
jaehwan ( 555445) count:170 ave_time:129712 total_time:22051120
atrevino ( 10871) count:166 ave_time:25389 total_time:4214732
ajcukierman ( 12585) count:160 ave_time:192365 total_time:30778496
mpolansky ( 13716) count:151 ave_time:145428 total_time:21959765
rmattheis ( 13299) count:106 ave_time:70802 total_time:7505094
tfranzmeyer ( 20760) count:95 ave_time:11302072 total_time:1073696895
lshen ( 58176) count:57 ave_time:1756328 total_time:100110742
kbystrom ( 20619) count:54 ave_time:622443 total_time:33611923
bmatejek ( 5900266) count:51 ave_time:7631683 total_time:389215875
daldarondo ( 13304) count:48 ave_time:9188387 total_time:441042588
stacchella ( 11809) count:24 ave_time:6230649 total_time:149535597
mlovell ( 5899810) count:16 ave_time:7964527 total_time:127432435
jseeley ( 14063) count:14 ave_time:288302 total_time:4036233
zeyuanhu ( 11515) count:11 ave_time:10381427 total_time:114195704
dchan ( 558943) count:6 ave_time:79832 total_time:478992
yaowang ( 11810) count:5 ave_time:14677543 total_time:73387715
rhong ( 21011) count:5 ave_time:18807801 total_time:94039006
namluu ( 20825) count:4 ave_time:31107978 total_time:124431912
bursley ( 59719) count:4 ave_time:10590207 total_time:42360829
dkakkar ( 20839) count:4 ave_time:3827 total_time:15311
gloewinger ( 13020) count:3 ave_time:549612 total_time:1648836
mprasad ( 20409) count:3 ave_time:26905909 total_time:80717729
jdu ( 13910) count:3 ave_time:24400632 total_time:73201898
jeastman ( 5899955) count:2 ave_time:20745138 total_time:41490276
jfroberg ( 13751) count:2 ave_time:6923 total_time:13846
jpegues ( 21791) count:2 ave_time:21390403 total_time:42780807
francesco ( 11040) count:2 ave_time:7817828 total_time:15635656
jang ( 12650) count:1 ave_time:278634 total_time:278634
froy ( 12706) count:1 ave_time:277121 total_time:277121
bwilder ( 21982) count:1 ave_time:7069726 total_time:7069726
ghzhang17 ( 13095) count:1 ave_time:28174604 total_time:28174604
ruzhang ( 558295) count:1 ave_time:20086781 total_time:20086781
e88bu2024 ( 21897) count:1 ave_time:24097034 total_time:24097034
rjsingh ( 12725) count:1 ave_time:19749368 total_time:19749368
Pending RPC statistics
REQUEST_BATCH_JOB_LAUNCH ( 4005) count:7
REQUEST_LAUNCH_PROLOG ( 6017) count:6
REQUEST_TERMINATE_JOB ( 6011) count:1
Pending RPCs
1: REQUEST_BATCH_JOB_LAUNCH shakgpu05
2: REQUEST_LAUNCH_PROLOG shakgpu05
3: REQUEST_BATCH_JOB_LAUNCH shakgpu05
4: REQUEST_LAUNCH_PROLOG shakgpu04
5: REQUEST_BATCH_JOB_LAUNCH shakgpu04
6: REQUEST_LAUNCH_PROLOG shakgpu04
7: REQUEST_BATCH_JOB_LAUNCH shakgpu04
8: REQUEST_LAUNCH_PROLOG shakgpu02
9: REQUEST_BATCH_JOB_LAUNCH shakgpu02
10: REQUEST_LAUNCH_PROLOG shakgpu02
11: REQUEST_BATCH_JOB_LAUNCH shakgpu02
12: REQUEST_LAUNCH_PROLOG shakgpu01
13: REQUEST_BATCH_JOB_LAUNCH shakgpu01
14: REQUEST_TERMINATE_JOB holy7c26501

I will attach the log.

-Paul Edmon-

Created attachment 14089 [details]
Slurmctld log for 2300-2320
Current state:

[root@holy-slurm02 log]# sdiag
*******************************************************
sdiag output at Fri May 01 23:37:21 2020 (1588390641)
Data since Fri May 01 23:05:34 2020 (1588388734)
*******************************************************
Server thread count: 248
Agent queue size: 0
Agent count: 2
DBD Agent queue size: 40
Jobs submitted: 210
Jobs started: 1220
Jobs completed: 716
Jobs canceled: 187
Jobs failed: 0
Job states ts: Fri May 01 23:35:57 2020 (1588390557)
Jobs pending: 27869
Jobs running: 3871
Main schedule statistics (microseconds):
Last cycle: 233916
Max cycle: 317117
Total cycles: 74
Mean cycle: 183082
Mean depth cycle: 19
Cycles per minute: 2
Last queue length: 3239
Backfilling stats
Total backfilled jobs (since last slurm start): 511
Total backfilled jobs (since last stats cycle start): 511
Total backfilled heterogeneous job components: 0
Total cycles: 20
Last cycle when: Fri May 01 23:35:55 2020 (1588390555)
Last cycle: 85258338
Max cycle: 88216689
Mean cycle: 42888560
Last depth cycle: 2459
Last depth cycle (try sched): 50
Depth Mean: 2443
Depth Mean (try depth): 63
Last queue length: 3250
Queue length mean: 3191
Latency for 1000 calls to gettimeofday(): 28 microseconds
Remote Procedure Call statistics by message type
REQUEST_JOB_STEP_CREATE ( 5001) count:5710 ave_time:36180328 total_time:206589677163
REQUEST_PARTITION_INFO ( 2009) count:4311 ave_time:121717 total_time:524722651
REQUEST_FED_INFO ( 2049) count:3196 ave_time:275 total_time:880948
REQUEST_JOB_INFO_SINGLE ( 2021) count:3128 ave_time:712903 total_time:2229963569
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2190 ave_time:453169 total_time:992441442
REQUEST_STEP_COMPLETE ( 5016) count:1300 ave_time:4698500 total_time:6108050692
REQUEST_COMPLETE_PROLOG ( 6018) count:1277 ave_time:6000919 total_time:7663174133
MESSAGE_EPILOG_COMPLETE ( 6012) count:1110 ave_time:9394388 total_time:10427770876
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:923 ave_time:12989108 total_time:11988947335
REQUEST_NODE_INFO_SINGLE ( 2040) count:910 ave_time:126811 total_time:115398566
REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:511 ave_time:204723 total_time:104613771
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:378 ave_time:9990699 total_time:3776484365
REQUEST_NODE_INFO ( 2007) count:234 ave_time:539748 total_time:126301155
REQUEST_KILL_JOB ( 5032) count:214 ave_time:2356655 total_time:504324334
REQUEST_JOB_USER_INFO ( 2039) count:84 ave_time:842773 total_time:70792985
REQUEST_STATS_INFO ( 2035) count:58 ave_time:356 total_time:20691
REQUEST_CANCEL_JOB_STEP ( 5005) count:46 ave_time:35663429 total_time:1640517757
REQUEST_UPDATE_NODE ( 3002) count:36 ave_time:2802281 total_time:100882124
REQUEST_JOB_INFO ( 2003) count:36 ave_time:24105303 total_time:867790934
REQUEST_SHARE_INFO ( 2022) count:32 ave_time:48537 total_time:1553184
REQUEST_JOB_READY ( 4019) count:12 ave_time:211823 total_time:2541885
REQUEST_RESOURCE_ALLOCATION ( 4001) count:9 ave_time:16984702 total_time:152862321
REQUEST_UPDATE_PARTITION ( 3005) count:4 ave_time:17085514 total_time:68342058
REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:3 ave_time:8397524 total_time:25192574
REQUEST_JOB_REQUEUE ( 5023) count:2 ave_time:361 total_time:723
ACCOUNTING_UPDATE_MSG (10001) count:1 ave_time:131584 total_time:131584
Remote Procedure Call statistics by user
root ( 0) count:9692 ave_time:4066232 total_time:39409929887
schlafly ( 35075) count:6218 ave_time:33460498 total_time:208057381535
dingma ( 40306) count:982 ave_time:80883 total_time:79427183
sgprasath ( 21136) count:967 ave_time:35252 total_time:34089243
ballen ( 5899723) count:667 ave_time:289552 total_time:193131196
kholton ( 20682) count:631 ave_time:97914 total_time:61784082
mengzhang ( 12484) count:629 ave_time:101925 total_time:64111115
acoran ( 22029) count:622 ave_time:95397 total_time:59337531
simonx ( 12821) count:621 ave_time:52437 total_time:32563915
gtorri ( 57089) count:603 ave_time:88341 total_time:53269775
jingnandu ( 20922) count:346 ave_time:197415 total_time:68305857
jyang12 ( 12394) count:332 ave_time:136911 total_time:45454685
bostdiek ( 20817) count:332 ave_time:136451 total_time:45301798
chsu ( 20796) count:332 ave_time:203221 total_time:67469662
sychen ( 13448) count:331 ave_time:152377 total_time:50437078
nthor ( 20593) count:329 ave_time:160890 total_time:52932920
atrevino ( 10871) count:315 ave_time:126401 total_time:39816558
ajcukierman ( 12585) count:309 ave_time:190847 total_time:58971795
mpolansky ( 13716) count:286 ave_time:154179 total_time:44095373
jaehwan ( 555445) count:271 ave_time:299567 total_time:81182770
dkakkar ( 20839) count:162 ave_time:136002 total_time:22032362
tfranzmeyer ( 20760) count:151 ave_time:9757618 total_time:1473400360
bmatejek ( 5900266) count:108 ave_time:9725661 total_time:1050371420
rmattheis ( 13299) count:106 ave_time:70802 total_time:7505094
daldarondo ( 13304) count:98 ave_time:11673132 total_time:1143966977
lshen ( 58176) count:57 ave_time:1756328 total_time:100110742
kbystrom ( 20619) count:54 ave_time:622443 total_time:33611923
stacchella ( 11809) count:24 ave_time:6230649 total_time:149535597
zeyuanhu ( 11515) count:19 ave_time:11172843 total_time:212284028
mlovell ( 5899810) count:16 ave_time:7964527 total_time:127432435
jeastman ( 5899955) count:15 ave_time:9365269 total_time:140479039
jseeley ( 14063) count:14 ave_time:288302 total_time:4036233
gloewinger ( 13020) count:7 ave_time:14513580 total_time:101595064
dchan ( 558943) count:7 ave_time:437885 total_time:3065195
benmergui ( 99690) count:7 ave_time:343504 total_time:2404531
yaowang ( 11810) count:5 ave_time:14677543 total_time:73387715
rhong ( 21011) count:5 ave_time:18807801 total_time:94039006
sgaynor ( 5900197) count:5 ave_time:136195 total_time:680978
jpegues ( 21791) count:4 ave_time:25368430 total_time:101473721
namluu ( 20825) count:4 ave_time:31107978 total_time:124431912
bursley ( 59719) count:4 ave_time:10590207 total_time:42360829
mprasad ( 20409) count:3 ave_time:26905909 total_time:80717729
jdu ( 13910) count:3 ave_time:24400632 total_time:73201898
kmattioli ( 558161) count:3 ave_time:25504835 total_time:76514505
francesco ( 11040) count:2 ave_time:7817828 total_time:15635656
jgortmaker ( 21075) count:2 ave_time:589 total_time:1179
jfroberg ( 13751) count:2 ave_time:6923 total_time:13846
jang ( 12650) count:2 ave_time:8283866 total_time:16567733
e88bu2024 ( 21897) count:2 ave_time:12079052 total_time:24158104
rjsingh ( 12725) count:1 ave_time:19749368 total_time:19749368
froy ( 12706) count:1 ave_time:277121 total_time:277121
slurm ( 57812) count:1 ave_time:131584 total_time:131584
cbischoff ( 40921) count:1 ave_time:22678 total_time:22678
ghzhang17 ( 13095) count:1 ave_time:28174604 total_time:28174604
isvetlizky ( 13998) count:1 ave_time:42834247 total_time:42834247
bwilder ( 21982) count:1 ave_time:7069726 total_time:7069726
ruzhang ( 558295) count:1 ave_time:20086781 total_time:20086781
xlu123 ( 10076) count:1 ave_time:71023942 total_time:71023942
Pending RPC statistics
No pending RPCs

-Paul Edmon-

(In reply to Paul Edmon from comment #25)
> Server thread count: 248
> Agent queue size: 0
> Agent count: 2
> DBD Agent queue size: 40

Has the thread count stayed below 256? Is the DBD agent queue size still rising?

(In reply to Paul Edmon from comment #23)
> Here is with 3 partitions open:

I take it normal jobs are now running again?

(In reply to Paul Edmon from comment #24)
> Created attachment 14089 [details]
> Slurmctld log for 2300-2320

These are not expected:

> May 1 23:03:53 holy-slurm02 slurmctld[26507]: error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid authentication credential
> May 1 23:04:11 holy-slurm02 slurmctld[26507]: error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid authentication credential
> May 1 23:04:11 holy-slurm02 slurmctld[26507]: error: slurm_unpack_received_msg: REQUEST_JOB_STEP_CREATE has authentication error: Invalid authentication credential
> May 1 23:04:11 holy-slurm02 slurmctld[26507]: error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid authentication credential

Can you please make sure the munge key on all of your nodes is synced? munged might need to be restarted on at least one node.

The munge key is the same on all nodes; we have not changed it in years. I did send out a global restart for munge. From past experience, those munge errors can occur when the RPC load builds up and slurmctld starts reading through old RPCs with old creds, which then appear time-shifted.

The thread count is going back and forth between 256 and lower. It keeps stalling. The DBD agent queue fluctuates up and down. Here's the current snapshot:

[root@holy-slurm02 log]# sdiag
*******************************************************
sdiag output at Fri May 01 23:58:53 2020 (1588391933)
Data since Fri May 01 23:55:43 2020 (1588391743)
*******************************************************
Server thread count: 119
Agent queue size: 0
Agent count: 0
DBD Agent queue size: 0
Jobs submitted: 12
Jobs started: 307
Jobs completed: 44
Jobs canceled: 0
Jobs failed: 0
Job states ts: Fri May 01 23:57:31 2020 (1588391851)
Jobs pending: 27737
Jobs running: 3736
Main schedule statistics (microseconds):
Last cycle: 714794
Max cycle: 777899
Total cycles: 5
Mean cycle: 465795
Mean depth cycle: 142
Cycles per minute: 1
Last queue length: 4284
Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
Total backfilled jobs (since last slurm start): 57
Total backfilled jobs (since last stats cycle start): 57
Total backfilled heterogeneous job components: 0
Total cycles: 1
Last cycle when: Fri May 01 23:57:31 2020 (1588391851)
Last cycle: 62435039
Max cycle: 62435039
Mean cycle: 62435039
Last depth cycle: 873
Last depth cycle (try sched): 112
Depth Mean: 3704
Depth Mean (try depth): 97
Last queue length: 4308
Queue length mean: 8681
Latency for 1000 calls to gettimeofday(): 28 microseconds
Remote Procedure Call statistics by message type
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2149 ave_time:6418642 total_time:13793662474
REQUEST_PARTITION_INFO ( 2009) count:513 ave_time:13428 total_time:6889074
REQUEST_JOB_INFO_SINGLE ( 2021) count:509 ave_time:508450 total_time:258801068
REQUEST_FED_INFO ( 2049) count:504 ave_time:260 total_time:131165
REQUEST_COMPLETE_PROLOG ( 6018) count:243 ave_time:643353 total_time:156334898
MESSAGE_EPILOG_COMPLETE ( 6012) count:121 ave_time:51655963 total_time:6250371603
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:106 ave_time:52728963 total_time:5589270179
REQUEST_STEP_COMPLETE ( 5016) count:98 ave_time:40404250 total_time:3959616513
REQUEST_STATS_INFO ( 2035) count:17 ave_time:363 total_time:6173
REQUEST_JOB_USER_INFO ( 2039) count:16 ave_time:20884 total_time:334151
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:12 ave_time:30059163 total_time:360709964
REQUEST_JOB_STEP_CREATE ( 5001) count:7 ave_time:4066294 total_time:28464060
REQUEST_NODE_INFO ( 2007) count:6 ave_time:96975 total_time:581855
REQUEST_SHARE_INFO ( 2022) count:6 ave_time:42626 total_time:255761
REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:5 ave_time:2383283 total_time:11916417
REQUEST_JOB_READY ( 4019) count:4 ave_time:177 total_time:709
REQUEST_RESOURCE_ALLOCATION ( 4001) count:3 ave_time:42577422 total_time:127732266
REQUEST_JOB_INFO ( 2003) count:2 ave_time:49763000 total_time:99526000
REQUEST_JOB_ALLOCATION_INFO ( 4014) count:1 ave_time:325 total_time:325
REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:0 ave_time:0 total_time:0
Remote Procedure Call statistics by user
root ( 0) count:2784 ave_time:10721983 total_time:29850001226
dingma ( 40306) count:170 ave_time:49644 total_time:8439544
sgprasath ( 21136) count:161 ave_time:28255 total_time:4549095
mengzhang ( 12484) count:119 ave_time:41288 total_time:4913306
kholton ( 20682) count:119 ave_time:5409 total_time:643712
ballen ( 5899723) count:118 ave_time:33120 total_time:3908259
simonx ( 12821) count:113 ave_time:20460 total_time:2312002
gtorri ( 57089) count:106 ave_time:117094 total_time:12412022
sychen ( 13448) count:59 ave_time:7907 total_time:466557
jingnandu ( 20922) count:59 ave_time:64345 total_time:3796386
bostdiek ( 20817) count:59 ave_time:11734 total_time:692327
chsu ( 20796) count:59 ave_time:60402 total_time:3563761
nthor ( 20593) count:59 ave_time:8925 total_time:526607
jyang12 ( 12394) count:59 ave_time:57415 total_time:3387487
jaehwan ( 555445) count:59 ave_time:9494 total_time:560203
atrevino ( 10871) count:56 ave_time:74506 total_time:4172341
ajcukierman ( 12585) count:56 ave_time:46444 total_time:2600904
mpolansky ( 13716) count:47 ave_time:6205 total_time:291652
acoran ( 22029) count:23 ave_time:2176 total_time:50060
sgaynor ( 5900197) count:10 ave_time:12773453 total_time:127734533
daldarondo ( 13304) count:9 ave_time:24144511 total_time:217300606
bmatejek ( 5900266) count:8 ave_time:22954769 total_time:183638155
mkelp ( 13432) count:4 ave_time:882833 total_time:3531334
rweinberger ( 13263) count:2 ave_time:14020383 total_time:28040767
cbaker ( 41029) count:2 ave_time:56949119 total_time:113898239
jianxiangqiu ( 22008) count:1 ave_time:15805403 total_time:15805403
isvetlizky ( 13998) count:1 ave_time:47368167 total_time:47368167
Pending RPC statistics
No pending RPCs

What seems to happen is that it stalls on something, builds up RPCs until it hits 256, becomes unresponsive, then unwedges and releases a bunch of stuff at once.

It's midnight here, so here is my plan. There is no danger to the jobs, and if I leave the scheduler running overnight it will just be intermittently available to users, and it will still schedule jobs when it unwedges. I've put all our partitions as UP with the settings you recommended. We can let it run overnight and see if it magically fixes itself by processing through whatever jobs are in it.

That said, I don't know that it will. I suspect there is something else that broke which is causing the stalls, but I cannot figure it out. Any ideas would be appreciated. Happy to dial up the log verbosity to get more data. I will send you the latest logs I have before I go to bed.

-Paul Edmon-

Created attachment 14090 [details]
slurmctld log for 2300-0012
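On the munge-key question above, one quick way to verify the key really is identical everywhere is to compare checksums across nodes. A sketch (assumes root ssh access; the nodes.txt host list is a placeholder):

# Any node printing a different checksum has a stale munge key,
# and its munged should be restarted after the key is re-synced.
for h in $(cat nodes.txt); do
    printf '%s: ' "$h"
    ssh "$h" md5sum /etc/munge/munge.key
done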
(In reply to Paul Edmon from comment #28)
> Server thread count: 119
> Agent queue size: 0
> Agent count: 0
> DBD Agent queue size: 0

This looks a lot better, but still too many threads.

> Pending RPC statistics
> No pending RPCs

Looks like we at least cleared one of the log jams here.

> What seems to happen is that it stalls on something, builds up RPCs
> until it hits 256, becomes unresponsive, then unwedges and releases a
> bunch of stuff at once.

It might be required to lower this field in SchedulerParameters in slurm.conf:

> sched_max_job_start=100

Looks like most jobs are getting started by the main scheduler, and this would slow that down.

> It's midnight here, so here is my plan. There is no danger to the jobs,
> and if I leave the scheduler running overnight it will just be
> intermittently available to users, and it will still schedule jobs when
> it unwedges.

I'm going to lower this ticket to SEV2 per your response.

> That said, I don't know that it will. I suspect there is something else
> that broke which is causing the stalls, but I cannot figure it out. Any
> ideas would be appreciated. Happy to dial up the log verbosity to get
> more data. I will send you the latest logs I have before I go to bed.

Please keep them coming. Please also take another gcore per comment #11.

Created attachment 14091 [details]
core dump at 0018 on 05/02
Created attachment 14092 [details]
Current slurm.conf as of the morning of May 2nd
Created attachment 14093 [details]
Core dump at 0840 on 05/02
Created attachment 14094 [details]
slurmctld log for 05/02 0000-0830
Good morning. The night did not solve the problem, so whatever is wedging the scheduler is still wedging it. I'm digging back in to see if I can figure out what is happening. I've dropped in fresh logs; let me know if you need anything else or want me to make any changes.

Some additional probing this morning shows that when I enable a partition, the RPC count just starts stacking up, even for a partition that is relatively idle. This also concerns me:

May 2 10:06:40 holy-slurm02 slurmctld[35237]: Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=278722116 began=10:02:01.375
May 2 10:06:41 holy-slurm02 slurmctld[35237]: Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=279759753 began=10:02:01.974
May 2 10:06:42 holy-slurm02 slurmctld[35237]: Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=280193002 began=10:02:01.934
May 2 10:06:44 holy-slurm02 slurmctld[35237]: Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=282985658 began=10:02:01.470
May 2 10:06:48 holy-slurm02 slurmctld[35237]: Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=278315968 began=10:02:09.687
May 2 10:06:48 holy-slurm02 slurmctld[35237]: Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=286450800 began=10:02:01.683
May 2 10:06:48 holy-slurm02 slurmctld[35237]: Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=286808882 began=10:02:01.389
May 2 10:06:52 holy-slurm02 slurmctld[35237]: Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=291082914 began=10:02:01.198

I'm not really sure why the job dump would take so long. We are also seeing users poll fairly frequently for job info, but that should be pretty matter-of-course for the scheduler; nothing has changed with that so far as I can tell.

Created attachment 14095 [details]
Core dump at 1030 on 5/2
This is a core dump taken in the middle of a stall. The scheduler sits there unresponsive for a while and then somehow unwedges and continues.
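For anyone reproducing this kind of capture: a core like the ones attached here can be taken from the live daemon without killing it. A minimal sketch, with illustrative output paths (gcore ships with gdb):

# snapshot the running slurmctld
gcore -o /tmp/slurmctld.core $(pgrep -o slurmctld)
# pull all thread backtraces out of the snapshot for quick inspection
gdb -batch -ex 'thread apply all bt' /usr/sbin/slurmctld /tmp/slurmctld.core.* > /tmp/slurmctld-bt.txt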
Created attachment 14096 [details]
slurmctld coredump 1033 05/02
A few minutes later it hit an even larger stall, with the slurmctld load sitting in the 4000s.
Managed to get the scheduler settled down again, but all the partitions are set to DOWN. Every time I try to reopen it stalls. An strace shows:
[pid 18451] connect(1045, {sa_family=AF_LOCAL, sun_path="/var/run/munge/munge.socket.2"}, 110) = 0
[pid 18451] writev(1045, [{"\0`mK\4\2\0\0\0\0\24", 11}, {"\1\1\1\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0", 20}], 2) = 31
[pid 18451] read(1045, "\0`mK\4\3\0\0\0\0\206", 11) = 11
[pid 18451] read(1045, "\0\0\0\0\0\200MUNGE:AwQDAAA+z2cVavs9bkf6"..., 134) = 134
[pid 18451] close(1045) = 0
[pid 18451] rt_sigaction(SIGALRM, {SIG_DFL, [ALRM], SA_RESTORER, 0x7f9996bb7630}, {SIG_DFL, [ALRM], SA_RESTORER, 0x7f9996bb7630}, 8) = 0
[pid 18451] rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER, 0x7f9996bb7630}, {SIG_IGN, [PIPE], SA_RESTORER, 0x7f9996bb7630}, 8) = 0
[pid 18451] fcntl(1044, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
[pid 18451] fcntl(1044, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
[pid 18451] fcntl(1044, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 18451] poll([{fd=1044, events=POLLOUT}], 1, 100000) = 1 ([{fd=1044, revents=POLLOUT}])
[pid 18451] recvfrom(1044, 0x7f98ce80d8d0, 1, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18451] sendto(1044, "\0\1\357`", 4, 0, NULL, 0) = 4
[pid 18451] fcntl(1044, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 18451] fcntl(1044, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
[pid 18451] fcntl(1044, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
[pid 18451] fcntl(1044, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 18451] poll([{fd=1044, events=POLLOUT}], 1, 100000) = 1 ([{fd=1044, revents=POLLOUT}])
[pid 18451] recvfrom(1044, 0x7f98ce80d8d0, 1, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18451] sendto(1044, "\"\0\0\0\0\0\7\332\0\1\356\302\0\0\0\0\0\0\0\0\0\0\0\0\0e\0\0\0\200MU"..., 126816, 0, NULL, 0) = 83984
[pid 18451] poll([{fd=1044, events=POLLOUT}], 1, 100000) = 1 ([{fd=1044, revents=POLLOUT}])
[pid 18451] recvfrom(1044, 0x7f98ce80d8d0, 1, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18451] sendto(1044, "b,davis_lab,hepl,hoekstra_lab,ir"..., 42832, 0, NULL, 0) = 42832
[pid 18451] fcntl(1044, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 18451] rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER, 0x7f9996bb7630}, {SIG_IGN, [PIPE], SA_RESTORER, 0x7f9996bb7630}, 8) = 0
[pid 18451] close(1044) = 0
[pid 18451] madvise(0x7f98ce70e000, 1028096, MADV_DONTNEED) = 0
[pid 18451] exit(0) = ?
[pid 18451] +++ exited with 0 +++
[pid 3015] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 3014] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2892] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 3015] futex(0x6ef800, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 3014] futex(0x6efa00, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 3015] <... futex resumed> ) = 0
[pid 3014] <... futex resumed> ) = 0
[pid 2892] futex(0x6ef1e0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 3015] futex(0x6ef844, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 2207, {1588432667, 0}, ffffffff <unfinished ...>
[pid 3014] futex(0x6efa44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 2207, {1588432667, 0}, ffffffff <unfinished ...>
[pid 2892] <... futex resumed> ) = 0
I'm not sure what those timeouts and unavailable resources are related to, but they are likely the source of the deadlocks.
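For the record, a trace like the one above can be captured with a plain strace attach; a sketch, with an illustrative output path (-f follows all of slurmctld's threads, -tt adds timestamps):

strace -f -tt -o /tmp/slurmctld.strace -p $(pgrep slurmctld | head -1)

The EAGAIN returns on the non-blocking sockets are routine; the futex ETIMEDOUT lines are timed waits, as explained in the next comment.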
(In reply to Paul Edmon from comment #40)
> Managed to get the scheduler settled down again, but all the partitions are
> set to DOWN. Every time I try to reopen it stalls. An strace shows:
Returning to SEV1 per your response.
> I'm not sure what those timeouts and
Slurm uses mutexes with timeouts as internal timers.
> unavailable resources are related to,
> but they are likely the source of the deadlocks.
Please run these:
> sudo lsof -p $(pgrep slurmctld|head -1)
> ls -la /proc/$(pgrep slurmctld|head -1)/fd

(In reply to Paul Edmon from comment #40)
> [pid 18451] connect(1045, {sa_family=AF_LOCAL,
> sun_path="/var/run/munge/munge.socket.2"}, 110) = 0
Can you please verify how many threads munge is currently running with?
> munged --num-threads=10

(In reply to Paul Edmon from comment #43)
Please update your UPDATE_SQUEUE_PASSWORD.
> /usr/sbin/munged --num-threads 10
Can you increase that to 15?

Created attachment 14097 [details]
lsof
lsof -p $(pgrep slurmctld|head -1)
Created attachment 14098 [details]
fd
ls -la /proc/$(pgrep slurmctld|head -1)/fd
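A quick way to summarize what these two listings show is to count slurmctld's TCP connections by state; a throwaway sketch, assuming the stock lsof output layout where the state is the last field:

lsof -a -p $(pgrep slurmctld|head -1) -i TCP | awk 'NR>1 {print $NF}' | sort | uniq -c | sort -rn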
Blegh, forgot to sanitize that. Sorry.

-Paul Edmon-

Do you want that at 15 everywhere, or just on the master?

-Paul Edmon-

(In reply to Paul Edmon from comment #45)
> Created attachment 14097 [details]
> slurmctld 24649 slurm 1044u IPv4 13313937 0t0 TCP holy-slurm02.rc.fas.harvard.edu:7298->holy7c16404.rc.fas.harvard.edu:59408 (CLOSE_WAIT)
(In reply to Paul Edmon from comment #40)
> [pid 18451] recvfrom(1044, 0x7f98ce80d8d0, 1, 0, NULL, NULL) = -1 EAGAIN
> (Resource temporarily unavailable)
This is a connection to holy7c16404 that is waiting for more incoming data. I doubt it is the cause of the problem. Do you see any errors in dmesg about dropped packets or otherwise?

(In reply to Paul Edmon from comment #47)
> Blegh, forgot to sanitize that. Sorry.
I tagged the post private, but this ticket is public, so someone else could have seen it.

(In reply to Paul Edmon from comment #48)
> Do you want that at 15 everywhere, or just on the master?
Just the slurmctld node should be sufficient.

Thanks. I've upped munge to 15 on the master.
I did notice this:

[root@holy-slurm02 ~]# ifconfig
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
        inet 10.31.20.151 netmask 255.255.255.128 broadcast 10.31.20.255
        inet6 fe80::1618:77ff:fe2f:5dfd prefixlen 64 scopeid 0x20<link>
        ether 14:18:77:2f:5d:fd txqueuelen 1000 (Ethernet)
        RX packets 24301328 bytes 3654469329 (3.4 GiB)
        RX errors 0 dropped 2 overruns 0 frame 0
        TX packets 47629163 bytes 51083307242 (47.5 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 172.17.0.1 netmask 255.255.0.0 broadcast 0.0.0.0
        inet6 fe80::42:53ff:fe8d:acec prefixlen 64 scopeid 0x20<link>
        ether 02:42:53:8d:ac:ec txqueuelen 0 (Ethernet)
        RX packets 2506775 bytes 3221673745 (3.0 GiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 2610484 bytes 4951298627 (4.6 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

em1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
        ether 14:18:77:2f:5d:fd txqueuelen 1000 (Ethernet)
        RX packets 14200613 bytes 1559840151 (1.4 GiB)
        RX errors 0 dropped 1 overruns 0 frame 0
        TX packets 26682796 bytes 30077933325 (28.0 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 18

em3: flags=6147<UP,BROADCAST,SLAVE,MULTICAST> mtu 1500
        ether 14:18:77:2f:5d:fd txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 19

em4: flags=6147<UP,BROADCAST,SLAVE,MULTICAST> mtu 1500
        ether 14:18:77:2f:5d:fd txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 16

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 460506 bytes 3147757537 (2.9 GiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 460506 bytes 3147757537 (2.9 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

vetha1f8bc7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet6 fe80::50e7:82ff:fe35:54be prefixlen 64 scopeid 0x20<link>
        ether 52:e7:82:35:54:be txqueuelen 0 (Ethernet)
        RX packets 2506775 bytes 3256768595 (3.0 GiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 2610492 bytes 4951299283 (4.6 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Nothing in dmesg about dropped packets, though it has at least dropped 2. I will ping our network engineer. On the holy7c16404 side I didn't see any dropped packets in dmesg or in ifconfig.

-Paul Edmon-
(In reply to Paul Edmon from comment #37)
> I'm not really sure why the job dump would take so long.
The core dump shows 158 threads in _slurm_rpc_dump_jobs() at once. Generally, _slurm_rpc_dump_jobs() is not slow, but it does have to grab a few read locks, which is a blocking event.
> We are also seeing
> users ping fairly frequently for job info, but that should be pretty matter
> of course for the scheduler.
It doesn't look like a user is flooding the scheduler with dump requests. root, on the other hand, has 230 pending RPCs. Are there any scripts run as root that would call squeue or `scontrol show jobs`?
Please attach /usr/local/bin/slurm_epilog

(In reply to Paul Edmon from comment #51)
> Nothing in dmesg about dropped packets though but it has at least
> dropped 2. I will ping our network engineer.
That is an exceptionally low drop count; I doubt this is a problem.

Created attachment 14099 [details]
slurm_epilog
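One plausible way to reproduce the 158-thread count quoted above from any of the attached cores (the core path is illustrative):

# count threads currently inside _slurm_rpc_dump_jobs
gdb -batch -ex 'thread apply all bt' /usr/sbin/slurmctld /path/to/core 2>/dev/null | grep -c _slurm_rpc_dump_jobs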
Yeah, but I'm still going to have him look in case it is symptomatic of some larger underlying problem, as it would explain a lot. We can operate, though, as if networking isn't the problem.

As for root running squeue: we try to avoid it. We have scripts that ping sacct for data, and our epilog looks at the local cgroups but doesn't touch squeue. We do have a squeue buffer we built to protect against direct invocation of squeue (which was that docker container you saw earlier).

-Paul Edmon-

Just an update: even with the munge thread change it still stalls and then eventually throws munge errors. I don't think the issue is munge per se, but rather that something is stalling/slowing down slurmctld's responses to RPCs. It's either blocking on a specific RPC or it's just slow in general. At least that's my thinking.

-Paul Edmon-

Finally got it unwedged. Here is the current sdiag:

[root@holy-slurm02 ~]# sdiag
*******************************************************
sdiag output at Sat May 02 13:04:46 2020 (1588439086)
Data since Sat May 02 13:04:13 2020 (1588439053)
*******************************************************
Server thread count: 32
Agent queue size: 0
Agent count: 0
DBD Agent queue size: 0
Jobs submitted: 5
Jobs started: 0
Jobs completed: 39
Jobs canceled: 0
Jobs failed: 0
Job states ts: Sat May 02 13:04:13 2020 (1588439053)
Jobs pending: 26207
Jobs running: 3219
Main schedule statistics (microseconds):
Last cycle: 17985
Max cycle: 22985
Total cycles: 4
Mean cycle: 17120
Mean depth cycle: 0
Last queue length: 0
Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
Total backfilled jobs (since last slurm start): 0
Total backfilled jobs (since last stats cycle start): 0
Total backfilled heterogeneous job components: 0
Total cycles: 0
Last cycle when: Wed Dec 31 19:00:00 1969 (0)
Last cycle: 0
Max cycle: 0
Last depth cycle: 0
Last depth cycle (try sched): 0
Last queue length: 0
Latency for 1000 calls to gettimeofday(): 27 microseconds
Remote Procedure Call statistics by message type
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2138 ave_time:63224 total_time:135175011
MESSAGE_EPILOG_COMPLETE ( 6012) count:374 ave_time:5359051 total_time:2004285165
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:262 ave_time:42058785 total_time:11019401869
REQUEST_STEP_COMPLETE ( 5016) count:226 ave_time:5937922 total_time:1341970415
REQUEST_PARTITION_INFO ( 2009) count:177 ave_time:15757 total_time:2788990
REQUEST_JOB_INFO_SINGLE ( 2021) count:176 ave_time:323436 total_time:56924842
REQUEST_FED_INFO ( 2049) count:154 ave_time:736 total_time:113366
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:7 ave_time:16344643 total_time:114412502
REQUEST_STATS_INFO ( 2035) count:7 ave_time:364 total_time:2553
REQUEST_JOB_USER_INFO ( 2039) count:5 ave_time:7130 total_time:35653
REQUEST_RESOURCE_ALLOCATION ( 4001) count:2 ave_time:28680703 total_time:57361407
REQUEST_NODE_INFO ( 2007) count:2 ave_time:26570 total_time:53140
REQUEST_SHARE_INFO ( 2022) count:2 ave_time:58590 total_time:117181
REQUEST_JOB_INFO ( 2003) count:1 ave_time:57082435 total_time:57082435
REQUEST_KILL_JOB ( 5032) count:0 ave_time:0 total_time:0
Remote Procedure Call statistics by user
root ( 0) count:3025 ave_time:4812612 total_time:14558151642
gchaudhary ( 20991) count:111 ave_time:7369 total_time:818010
sgprasath ( 21136) count:92 ave_time:6468 total_time:595129
dingma ( 40306) count:92 ave_time:5403 total_time:497083
mengzhang ( 12484) count:38 ave_time:6429 total_time:244302
kholton ( 20682) count:37 ave_time:5656 total_time:209295
ballen ( 5899723) count:32 ave_time:5508 total_time:176273
eprotzer ( 21969) count:16 ave_time:4557 total_time:72916
agomezlievano ( 12518) count:14 ave_time:6645 total_time:93041
mpolansky ( 13716) count:14 ave_time:5197 total_time:72761
zmorris ( 558739) count:14 ave_time:5157 total_time:72210
ajcukierman ( 12585) count:14 ave_time:5183 total_time:72572
jyang12 ( 12394) count:14 ave_time:6321 total_time:88499
jtaylor ( 558629) count:4 ave_time:2715 total_time:10862
simonx ( 12821) count:4 ave_time:7099 total_time:28397
bmatejek ( 5900266) count:4 ave_time:14268212 total_time:57072848
daldarondo ( 13304) count:3 ave_time:18915876 total_time:56747628
swilson ( 556463) count:2 ave_time:28510706 total_time:57021412
stat115u20025 ( 21481) count:1 ave_time:57108257 total_time:57108257
yitongqi ( 21637) count:1 ave_time:318242 total_time:318242
vknutson ( 558621) count:1 ave_time:253150 total_time:253150
dongyuyang ( 54022) count:0 ave_time:0 total_time:0
Pending RPC statistics
No pending RPCs

All the partitions are set to DOWN. I'm going to suspend all the jobs that are running so we can rule out any job traffic.

-Paul Edmon-
Please attach your /etc/sssd/sssd.conf. There are 82 connections in the dump waiting on a reply from /var/lib/sss/mc/initgroups.

(In reply to Paul Edmon from comment #57)
> Just an update, even with the munge thread change it still stalls and
> then eventually throws munge errors.
Understood.
> rather something is stalling/slowing down the slurmctld
> responding to RPC's
Currently looking for that right now.

Created attachment 14100 [details]
sssd config
(In reply to Paul Edmon from comment #61)
> Created attachment 14100 [details]
> sssd config
Please also call:
> systemctl status sssd
> time getent passwd >/dev/null
> time getent group >/dev/null
> time getent hosts >/dev/null

[root@holy-slurm02 sssd]# systemctl status sssd
● sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2020-05-01 16:52:08 EDT; 20h ago
 Main PID: 3295 (sssd)
    Tasks: 5
   Memory: 95.0M
   CGroup: /system.slice/sssd.service
           ├─3295 /usr/sbin/sssd -i --logger=files
           ├─3305 /usr/libexec/sssd/sssd_be --domain rc.domain --uid 0 --gid 0 --logger=files
           ├─3306 /usr/libexec/sssd/sssd_be --domain academic.domain --uid 0 --gid 0 --logger=files
           ├─3338 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
           └─3339 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files

May 01 16:52:08 holy-slurm02.rc.fas.harvard.edu sssd[be[academic.domain]][3306]: Starting up
May 01 16:52:08 holy-slurm02.rc.fas.harvard.edu sssd[be[rc.domain]][3305]: Starting up
May 01 16:52:08 holy-slurm02.rc.fas.harvard.edu sssd[pam][3339]: Starting up
May 01 16:52:08 holy-slurm02.rc.fas.harvard.edu sssd[nss][3338]: Starting up
May 01 16:52:08 holy-slurm02.rc.fas.harvard.edu systemd[1]: Started System Security Services Daemon.
May 01 16:52:39 holy-slurm02.rc.fas.harvard.edu sssd[be[academic.domain]][3306]: Backend is offline
May 01 16:52:39 holy-slurm02.rc.fas.harvard.edu sssd[be[rc.domain]][3305]: Backend is offline
May 01 16:52:44 holy-slurm02.rc.fas.harvard.edu sssd[nss][3338]: Enumeration requested but not enabled
May 01 16:52:46 holy-slurm02.rc.fas.harvard.edu sssd[be[academic.domain]][3306]: Backend is online
May 01 16:52:46 holy-slurm02.rc.fas.harvard.edu sssd[be[rc.domain]][3305]: Backend is online

[root@holy-slurm02 sssd]# time getent passwd > /dev/null
real 0m0.003s
user 0m0.002s
sys 0m0.002s
[root@holy-slurm02 sssd]# time getent group > /dev/null
real 0m0.004s
user 0m0.000s
sys 0m0.003s
[root@holy-slurm02 sssd]# time getent hosts > /dev/null
real 0m0.003s
user 0m0.003s
sys 0m0.000s

Looking at the backtrace and the logs, dumping the jobs is taking ~20s (_slurm_rpc_dump_jobs). Is it possible to decrease the number of jobs in the system?

(In reply to Paul Edmon from comment #58)
> Jobs pending: 26207
> Jobs running: 3219
Current settings:
> MaxJobCount=300000
> MinJobAge=600
Can we change this to:
> MaxJobCount=30000
> MinJobAge=60
This shouldn't kill any submitted jobs, but it should cause slurmctld to purge its internal list of completed jobs. It will very likely stop responding to RPCs for a while after this change while it writes all the jobs out.

Yes, I will make those changes. The MinJobAge=600 was for an HTCondor setup we don't run anymore, so that can be dropped as low as we want.
As for MaxJobCount, we will need to go back up to 300000 when we are in production, as we have roughly 100,000 cores and are serial-job dominated. For now, while we figure this out, I will reduce it to 30000.

-Paul Edmon-

With both those changes, the scheduler still stalls periodically. Here is the latest sdiag:

[root@holy-slurm02 slurm]# sdiag
*******************************************************
sdiag output at Sat May 02 13:50:48 2020 (1588441848)
Data since Sat May 02 13:46:26 2020 (1588441586)
*******************************************************
Server thread count: 105
Agent queue size: 0
Agent count: 0
DBD Agent queue size: 0
Jobs submitted: 43
Jobs started: 0
Jobs completed: 30
Jobs canceled: 1
Jobs failed: 0
Job states ts: Sat May 02 13:47:34 2020 (1588441654)
Jobs pending: 26300
Jobs running: 1822
Main schedule statistics (microseconds):
Last cycle: 19484
Max cycle: 36271
Total cycles: 8
Mean cycle: 21381
Mean depth cycle: 0
Cycles per minute: 2
Last queue length: 0
Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
Total backfilled jobs (since last slurm start): 0
Total backfilled jobs (since last stats cycle start): 0
Total backfilled heterogeneous job components: 0
Total cycles: 0
Last cycle when: Wed Dec 31 19:00:00 1969 (0)
Last cycle: 0
Max cycle: 0
Last depth cycle: 0
Last depth cycle (try sched): 0
Last queue length: 0
Latency for 1000 calls to gettimeofday(): 27 microseconds
Remote Procedure Call statistics by message type
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2138 ave_time:2040193 total_time:4361934122
REQUEST_PARTITION_INFO ( 2009) count:1595 ave_time:10651 total_time:16988863
REQUEST_FED_INFO ( 2049) count:1286 ave_time:259 total_time:333690
REQUEST_JOB_INFO_SINGLE ( 2021) count:1278 ave_time:268734 total_time:343442068
REQUEST_SUSPEND ( 5014) count:288 ave_time:1196416 total_time:344567982
REQUEST_STATS_INFO ( 2035) count:225 ave_time:342 total_time:77062
MESSAGE_EPILOG_COMPLETE ( 6012) count:61 ave_time:71827286 total_time:4381464453
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:55 ave_time:32937763 total_time:1811577013
REQUEST_STEP_COMPLETE ( 5016) count:53 ave_time:21116692 total_time:1119184683
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:50 ave_time:18395768 total_time:919788427
REQUEST_JOB_USER_INFO ( 2039) count:22 ave_time:13417 total_time:295181
REQUEST_NODE_INFO ( 2007) count:10 ave_time:36381 total_time:363817
REQUEST_SHARE_INFO ( 2022) count:8 ave_time:46348 total_time:370785
REQUEST_JOB_INFO ( 2003) count:5 ave_time:57311490 total_time:286557454
REQUEST_KILL_JOB ( 5032) count:3 ave_time:86745980 total_time:260237941
REQUEST_RESOURCE_ALLOCATION ( 4001) count:2 ave_time:51794471 total_time:103588943
REQUEST_JOB_ALLOCATION_INFO ( 4014) count:2 ave_time:219 total_time:438
REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:0 ave_time:0 total_time:0
Remote Procedure Call statistics by user
root ( 0) count:2899 ave_time:4245000 total_time:12306257331
gchaudhary ( 20991) count:503 ave_time:4886 total_time:2457915
sgprasath ( 21136) count:374 ave_time:4902 total_time:1833554
vsemenova ( 13255) count:368 ave_time:4946 total_time:1820475
dingma ( 40306) count:368 ave_time:4918 total_time:1809926
wji ( 13315) count:310 ave_time:86234 total_time:26732845
jtaylor ( 558629) count:267 ave_time:2582 total_time:689548
mengzhang ( 12484) count:257 ave_time:5084 total_time:1306588
kholton ( 20682) count:257 ave_time:4895 total_time:1258220
ballen ( 5899723) count:256 ave_time:4995 total_time:1278894
u_219006_g_65249( 6000076) count:198 ave_time:1316492 total_time:260665532
yitongqi ( 21637) count:134 ave_time:5176 total_time:693632
jyang12 ( 12394) count:128 ave_time:5000 total_time:640118
eprotzer ( 21969) count:128 ave_time:4915 total_time:629243
agomezlievano ( 12518) count:128 ave_time:5087 total_time:651155
zmorris ( 558739) count:128 ave_time:4966 total_time:635672
ajcukierman ( 12585) count:119 ave_time:5085 total_time:605226
mpolansky ( 13716) count:113 ave_time:5015 total_time:566761
dkakkar ( 20839) count:78 ave_time:5098 total_time:397652
yisoe ( 11692) count:37 ave_time:1035729 total_time:38321999
daldarondo ( 13304) count:13 ave_time:26376497 total_time:342894470
bmatejek ( 5900266) count:9 ave_time:37364346 total_time:336279114
dhur ( 20296) count:3 ave_time:86531605 total_time:259594815
svc-cmdbquery ( 13174) count:2 ave_time:18689 total_time:37378
dongyuyang ( 54022) count:1 ave_time:161173996 total_time:161173996
stat115u20072 ( 21528) count:1 ave_time:15424280 total_time:15424280
stat115u20076 ( 21532) count:1 ave_time:186116347 total_time:186116347
yuntian ( 11729) count:1 ave_time:236 total_time:236
Pending RPC statistics
No pending RPCs

Created attachment 14102 [details]
core dump at 1354 on 05/02

This was taken after the recommendations of comment #66 were implemented, during a stall.

Just an update: I'm back in the 256-thread stall zone and trying to get off of it. When I do a restart it just stalls immediately and goes to 256 threads. The slurmctld load goes to 4800%, while the load on the box goes up to around 120. It usually then starts to unwedge a bit and the load drops, but the stall at very high load lasts about 5 minutes. Even after the load drops it still sits at 256 threads.

Created attachment 14103 [details]
slurmctld log for 05/02 for 0000-1530
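A quick way to see which RPC handlers dominate the warnings in a log like this one; the pattern is taken from the messages quoted earlier, and the log path is illustrative:

grep -o 'very large processing time from [_a-z]*' /var/log/slurm/slurmctld.log | sort | uniq -c | sort -rn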
Managed to suspend all the jobs. However the stall persists. So its not any specific job. Here is the latest sdiag:
[root@holy-slurm02 ~]# sdiag
*******************************************************
sdiag output at Sat May 02 15:28:51 2020 (1588447731)
Data since Sat May 02 15:06:01 2020 (1588446361)
*******************************************************
Server thread count: 20
Agent queue size: 0
Agent count: 0
DBD Agent queue size: 0
Jobs submitted: 142
Jobs started: 0
Jobs completed: 60
Jobs canceled: 9
Jobs failed: 0
Job states ts: Sat May 02 15:24:19 2020 (1588447459)
Jobs pending: 27048
Jobs running: 0
Main schedule statistics (microseconds):
Last cycle: 19239
Max cycle: 156645
Total cycles: 45
Mean cycle: 35135
Mean depth cycle: 0
Cycles per minute: 2
Last queue length: 0
Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
Total backfilled jobs (since last slurm start): 0
Total backfilled jobs (since last stats cycle start): 0
Total backfilled heterogeneous job components: 0
Total cycles: 0
Last cycle when: Wed Dec 31 19:00:00 1969 (0)
Last cycle: 0
Max cycle: 0
Last depth cycle: 0
Last depth cycle (try sched): 0
Last queue length: 0
Latency for 1000 calls to gettimeofday(): 28 microseconds
Remote Procedure Call statistics by message type
REQUEST_PARTITION_INFO ( 2009) count:3658 ave_time:456506 total_time:1669899695
REQUEST_FED_INFO ( 2049) count:2739 ave_time:1596 total_time:4373005
REQUEST_JOB_INFO_SINGLE ( 2021) count:2691 ave_time:677461 total_time:1823050189
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2177 ave_time:3626470 total_time:7894826940
REQUEST_SUSPEND ( 5014) count:1120 ave_time:786604 total_time:880996772
REQUEST_NODE_INFO_SINGLE ( 2040) count:818 ave_time:3222 total_time:2636219
REQUEST_STATS_INFO ( 2035) count:580 ave_time:1584 total_time:918945
MESSAGE_EPILOG_COMPLETE ( 6012) count:174 ave_time:2138754 total_time:372143307
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:160 ave_time:51374209 total_time:8219873469
REQUEST_STEP_COMPLETE ( 5016) count:159 ave_time:1284302 total_time:204204069
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:151 ave_time:14300013 total_time:2159302070
REQUEST_NODE_INFO ( 2007) count:127 ave_time:1639406 total_time:208204616
REQUEST_UPDATE_NODE ( 3002) count:95 ave_time:43641858 total_time:4145976587
REQUEST_JOB_USER_INFO ( 2039) count:67 ave_time:565256 total_time:37872199
REQUEST_KILL_JOB ( 5032) count:10 ave_time:32622486 total_time:326224869
REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:9 ave_time:78333787 total_time:705004088
REQUEST_RESOURCE_ALLOCATION ( 4001) count:8 ave_time:62078130 total_time:496625046
REQUEST_JOB_INFO ( 2003) count:8 ave_time:58218129 total_time:465745038
REQUEST_JOB_ALLOCATION_INFO ( 4014) count:2 ave_time:1333 total_time:2667
REQUEST_SHARE_INFO ( 2022) count:2 ave_time:36794 total_time:73588
Remote Procedure Call statistics by user
root ( 0) count:6507 ave_time:3643257 total_time:23706676488
gchaudhary ( 20991) count:1453 ave_time:35487 total_time:51563785
dingma ( 40306) count:757 ave_time:20715 total_time:15681613
ballen ( 5899723) count:698 ave_time:135215 total_time:94380140
kholton ( 20682) count:682 ave_time:66109 total_time:45086354
mengzhang ( 12484) count:678 ave_time:68578 total_time:46495968
sgprasath ( 21136) count:591 ave_time:84162 total_time:49739818
yitongqi ( 21637) count:359 ave_time:135091 total_time:48497780
jyang12 ( 12394) count:341 ave_time:127801 total_time:43580388
mpolansky ( 13716) count:341 ave_time:129903 total_time:44296944
fernandes ( 556465) count:335 ave_time:145685 total_time:48804673
eprotzer ( 21969) count:333 ave_time:270713 total_time:90147616
ajcukierman ( 12585) count:320 ave_time:138905 total_time:44449666
barkley ( 558696) count:315 ave_time:370040 total_time:116562790
dkakkar ( 20839) count:300 ave_time:262018 total_time:78605437
u_159900_g_65249( 6000078) count:247 ave_time:563578 total_time:139203854
jcornelison ( 5899776) count:151 ave_time:233926 total_time:35322939
jwalker ( 20775) count:124 ave_time:3709167 total_time:459936759
ejones ( 20412) count:50 ave_time:626788 total_time:31339445
agomezlievano ( 12518) count:40 ave_time:31172 total_time:1246882
daldarondo ( 13304) count:39 ave_time:28450317 total_time:1109562391
bmatejek ( 5900266) count:34 ave_time:27072757 total_time:920473750
mengdong ( 14125) count:25 ave_time:37434697 total_time:935867437
cechavarria ( 558213) count:12 ave_time:65728708 total_time:788744500
jtaylor ( 558629) count:11 ave_time:2477174 total_time:27248914
sstanley ( 21855) count:2 ave_time:90418654 total_time:180837309
sgaynor ( 5900197) count:1 ave_time:37601466 total_time:37601466
e88bu2004 ( 21874) count:1 ave_time:36053763 total_time:36053763
bgoulet ( 10657) count:1 ave_time:124476679 total_time:124476679
benmergui ( 99690) count:1 ave_time:47096860 total_time:47096860
junjiema2 ( 21868) count:1 ave_time:53584760 total_time:53584760
zzhang01 ( 58426) count:1 ave_time:26583108 total_time:26583108
qxz125 ( 556345) count:1 ave_time:112538440 total_time:112538440
atsang ( 14422) count:1 ave_time:25591074 total_time:25591074
tstgermaine ( 558600) count:1 ave_time:51370 total_time:51370
cbischoff ( 40921) count:1 ave_time:22218 total_time:22218
Pending RPC statistics
No pending RPCs
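As an aside, the per-user table in output like this can be ranked by total RPC time with a throwaway one-liner (assuming the sdiag layout shown above):

sdiag | awk '/statistics by user/{f=1;next} /Pending RPC/{f=0} f && /total_time/ {split($NF,t,":"); print t[2], $1}' | sort -rn | head

root dominating the others by orders of magnitude is consistent with the root-owned RPC backlog noted earlier in the ticket.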
Hi,

Could you send me the output of these gdb commands, run on the last backtrace?

t 119
2 f
p *job_ptr

I suspect that jobs submitted by this user generate this problem.

Dominik

[root@holy-slurm02 bug-8978]# gdb /usr/sbin/slurmctld core.41849
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New LWP 41851] [New LWP 41852] [New LWP 41854] [New LWP 41856] [New LWP 41918] [New LWP 41919] [New LWP 41920] [New LWP 41921] [New LWP 41922] [New LWP 41923] [New LWP 41925] [New LWP 1429] [New LWP 1442] [New LWP 1657] [New LWP 1662] [New LWP 1669] [New LWP 1755] [New LWP 1904] [New LWP 1933] [New LWP 1937] [New LWP 2083] [New LWP 2366] [New LWP 2451] [New LWP 2452] [New LWP 2453] [New LWP 2454] [New LWP 2458] [New LWP 2460] [New LWP 2461] [New LWP 2494] [New LWP 2498] [New LWP 2500] [New LWP 2506] [New LWP 2511] [New LWP 2516] [New LWP 2517] [New LWP 2518] [New LWP 2519] [New LWP 2521] [New LWP 2524] [New LWP 2525] [New LWP 2526] [New LWP 2528] [New LWP 2529] [New LWP 2531] [New LWP 2533] [New LWP 2536] [New LWP 2537] [New LWP 2538] [New LWP 2539] [New LWP 2540] [New LWP 2544] [New LWP 2548] [New LWP 2549] [New LWP 2550] [New LWP 2553] [New LWP 2555] [New LWP 2558] [New LWP 2561] [New LWP 2566] [New LWP 2569] [New LWP 2570] [New LWP 2571] [New LWP 2576] [New LWP 2579] [New LWP 2582] [New LWP 2587] [New LWP 2591] [New LWP 2595] [New LWP 2598] [New LWP 2600] [New LWP 2601] [New LWP 2602] [New LWP 2603] [New LWP 2605] [New LWP 2611] [New LWP 2615] [New LWP 2619] [New LWP 2621] [New LWP 2623] [New LWP 2628] [New LWP 2635] [New LWP 2637] [New LWP 2639] [New LWP 2644] [New LWP 2648] [New LWP 2649] [New LWP 2650] [New LWP 2652] [New LWP 2653] [New LWP 2654] [New LWP 2655] [New LWP 2656] [New LWP 2667] [New LWP 2690] [New LWP 2783] [New LWP 2931] [New LWP 2933] [New LWP 3069] [New LWP 3081] [New LWP 3113] [New LWP 3206] [New LWP 3215] [New LWP 3356] [New LWP 3361] [New LWP 3521] [New LWP 3541] [New LWP 3559] [New LWP 3620] [New LWP 3646] [New LWP 3662] [New LWP 3765] [New LWP 3787] [New LWP 3902] [New LWP 3961] [New LWP 3962] [New LWP 4020] [New LWP 4069] [New LWP 4073] [New LWP 4089] [New LWP 4110] [New LWP 4189] [New LWP 4201] [New LWP 4255] [New LWP 4263] [New LWP 4270] [New LWP 4276] [New LWP 4394] [New LWP 4489] [New LWP 4492] [New LWP 4494] [New LWP 4506] [New LWP 41849]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
#0 0x00007fc7f8d61de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-19.05.5-1fasrc01.el7.x86_64
(gdb) t 119
[Switching to thread 119 (Thread 0x7fc7d8943700 (LWP 4073))]
#0 0x00007fc7f92f62b3 in _makespace (str=str@entry=0x7fc7d89428c0, str_len=str_len@entry=-1, needed=needed@entry=1) at xstring.c:110
110 xstring.c: No such file or directory.
(gdb) 2 f
Undefined command: "2". Try "help".
(gdb) p *job_ptr
No symbol "job_ptr" in current context.
(gdb) quit

I think this is the command you wanted:

[root@holy-slurm02 bug-8978]# gdb /usr/sbin/slurmctld core.41849
[... same startup output as above ...]
Core was generated by `/usr/sbin/slurmctld'.
#0 0x00007fc7f8d61de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-19.05.5-1fasrc01.el7.x86_64
(gdb) t 119
[Switching to thread 119 (Thread 0x7fc7d8943700 (LWP 4073))]
#0 0x00007fc7f92f62b3 in _makespace (str=str@entry=0x7fc7d89428c0, str_len=str_len@entry=-1, needed=needed@entry=1) at xstring.c:110
110 xstring.c: No such file or directory.
(gdb) f 2
#2 0x000000000044037b in _pack_default_job_details (job_ptr=job_ptr@entry=0x67aeff0, buffer=buffer@entry=0x7fc5d4000c90, protocol_version=protocol_version@entry=8704) at job_mgr.c:10624
10624 job_mgr.c: No such file or directory.
(gdb) p *job_ptr
$1 = {magic = 4038539564, account = 0x67af630 "olveczky_lab", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x67af600 "boslogin01", alloc_resp_port = 0, alloc_sid = 241239, array_job_id = 54727756, array_task_id = 1594, array_recs = 0x0, assoc_id = 8941, assoc_ptr = 0x2079990, batch_features = 0x0, batch_flag = 2, batch_host = 0x67af6a0 "holystat12", billable_tres = 2.0869140625, bit_flags = 444596224, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 0, cpus_per_tres = 0x0, cr_enabled = 0, db_flags = 4, db_index = 765607413, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x67af400, direct_set_prio = 0, end_time = 0, end_time_exp = 0, epilog_running = false, exit_code = 1, fed_details = 0x0, front_end_ptr = 0x0, gres_list = 0x0, gres_alloc = 0x67af660 "", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x67af680 "", gres_used = 0x0, group_id = 40013, job_id = 54739752, job_next = 0x0, job_array_next_j = 0x674e1e0, job_array_next_t = 0x0, job_resrcs = 0x67af700, job_state = 0, kill_on_node_fail = 1, last_sched_eval = 1588436032, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x67aefc0}, mail_type = 0, mail_user = 0x0, mem_per_tres = 0x0, mcs_label = 0x0, name = 0x67aee70 "compute_stac", network = 0x0, next_step_id = 0, nodes = 0x0, node_addr = 0x0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 0, node_cnt_wag = 1, nodes_completing = 0x0, origin_cluster = 0x0, other_port = 0, pack_details = 0x0, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x67aedf0 "serial_requeue,olveczky,shared", part_ptr_list = 0x61fa050, part_nodes_missing = false, part_ptr = 0x2732f70, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 2517554, priority_array = 0x7fc7c8006980, prio_factors = 0x67aef20, profile = 0, qos_id = 1, qos_ptr = 0x1f3b720, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 1, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x67af6d0, site_factor = 2147483648, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8704, start_time = 0, state_desc = 0x0, state_reason = 6, state_reason_prev = 6, state_reason_prev_db = 6, step_list = 0x61fa0a0, suspend_time = 0, system_comment = 0x0, time_last_active = 1588441391, time_limit = 300, time_min = 0, tot_sus_time = 0, total_cpus = 0, total_nodes = 0, tres_bind = 0x0, tres_freq = 0x0, tres_per_job = 0x0, tres_per_node = 0x0, tres_per_socket = 0x0, tres_per_task = 0x0, tres_req_cnt = 0x680fcc0, tres_req_str = 0x680fb90 "1=1,2=13000,4=1,5=4", tres_fmt_req_str = 0x680fbc0 "cpu=1,mem=13000M,node=1,billing=4", tres_alloc_cnt = 0x0, tres_alloc_str = 0x0, tres_fmt_alloc_str = 0x0, user_id = 13304, user_name = 0x67af5d0 "daldarondo", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
(gdb)

Created attachment 14104 [details]
scontrol show job 54739752
This looks all sorts of bad. What do you want me to do? Cancel?
Sorry for the typo. Could you send me also detail_ptr?

t 119
2 f
p *detail_ptr

(gdb) t 119
[Switching to thread 119 (Thread 0x7fc7d8943700 (LWP 4073))]
#0 0x00007fc7f92f62b3 in _makespace (str=str@entry=0x7fc7d89428c0,
str_len=str_len@entry=-1, needed=needed@entry=1) at xstring.c:110
110 xstring.c: No such file or directory.
(gdb) f 2
#2 0x000000000044037b in _pack_default_job_details
(job_ptr=job_ptr@entry=0x67aeff0, buffer=buffer@entry=0x7fc5d4000c90,
protocol_version=protocol_version@entry=8704) at job_mgr.c:10624
10624 job_mgr.c: No such file or directory.
(gdb) p *detail_ptr
$1 = {magic = 0, acctg_freq = 0x0, accrue_time = 1588402586, argc = 7806, argv = 0x67afb60, begin_time = 1588402586, ckpt_dir = 0x67afae0 "/n/holylfs02/LABS/olveczky_lab/Diego/code/dm/stac", cluster_features = 0x0, contiguous = 0, core_spec = 65534, cpu_bind = 0x0, cpu_bind_type = 65534, cpu_freq_min = 4294967294, cpu_freq_max = 4294967294, cpu_freq_gov = 4294967294, cpus_per_task = 1, orig_cpus_per_task = 1, depend_list = 0x0, dependency = 0x0, orig_dependency = 0x0, env_cnt = 4, env_sup = 0x680fa70, exc_node_bitmap = 0x0, exc_nodes = 0x0, expanding_jobid = 0, extra = 0x0, feature_list = 0x0, features = 0x0, max_cpus = 4294967294, orig_max_cpus = 4294967294, max_nodes = 1, mc_ptr = 0x67afb30, mem_bind = 0x0, mem_bind_type = 0, min_cpus = 1, orig_min_cpus = 1, min_gres_cpu = 0, min_nodes = 1, nice = 2147483648, ntasks_per_node = 0, num_tasks = 1, open_mode = 0 '\000', overcommit = 0 '\000', plane_size = 0, pn_min_cpus = 1, orig_pn_min_cpus = 1, pn_min_memory = 13000, orig_pn_min_memory = 13000, pn_min_tmp_disk = 0, prolog_running = 0 '\000', reserved_resources = 0, req_node_bitmap = 0x0, preempt_start_time = 0, req_nodes = 0x0, requeue = 1, restart_dir = 0x0, share_res = 254 '\376', std_err = 0x67af980 "/n/holylfs02/LABS/olveczky_lab/Diego/code/dm/stac/logs/Job.compute_stac.%N.%j.err", std_in = 0x67af9f0 "/dev/null", std_out = 0x67afa20 "/n/holylfs02/LABS/olveczky_lab/Diego/code/dm/stac/logs/Job.compute_stac.%N.%j.out", submit_time = 1588402465, task_dist = 8192, usable_nodes = 0, whole_node = 0 '\000', work_dir = 0x67afa90 "/n/holylfs02/LABS/olveczky_lab/Diego/code/dm/stac", x11 = 0, x11_magic_cookie = 0x0, x11_target = 0x0, x11_target_port = 0}
(gdb)
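The argc = 7806 above is the smoking gun. A rough way to hunt for other jobs with pathologically large records, using scontrol's one-line output mode as a proxy for record size; this walks every queued job, so use it sparingly on an already-wedged controller, and the 100 KB threshold is arbitrary:

# flag jobs whose full job record is suspiciously large
for id in $(squeue -h -o %A); do
    n=$(scontrol -o show job "$id" | wc -c)
    [ "$n" -gt 100000 ] && echo "$id $n"
done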
How many jobs of this type are in the queue now? It looks like those jobs have crazy sizes of argv:

argc = 7806, argv = 0x67afb60

Slurmctld packs this in a suboptimal way, and I suspect this can be the root of the problem.

The user in question:

[root@holy-slurm02 bug-8978]# /usr/bin/squeue -u daldarondo
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
54727756_[99-105,1 serial_re compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_278 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_279 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_272 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_282 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_269 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_270 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_275 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_277 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_280 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_283 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_281 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_268 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_285 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_276 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_273 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_271 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_284 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_274 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_313 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)

I can cancel all those and have a conversation with him. Obviously that submission line is crazy; I have never seen one like it in our environment. Let me know if you want me to axe those jobs, as I want to make sure you are done looking. Once they clear I can try restarting the scheduler. I'm sure it will perform much better once these are purged. Please confirm I can cancel, as I will wait for your go-ahead.

-Paul Edmon-

Yes, please cancel those jobs. If this is the source of the problem, we have enough data to analyze.

Dominik

Yes, that did it. We are back at full capacity. Thank you so much for your assistance. I would have never found those jobs unless you dug into that. Those were just bizarre.

Can you add a limit to the size of argv and that line to prevent this from happening again? I don't really have a good way of stopping this sort of abuse, and if it can grind the scheduler to a halt, it should be prevented by the scheduler.

Thank you again!
-Paul Edmon-

I'm glad to hear that. For sure, this can be blocked by a submit plugin.

We will also try to make the packing code smarter. We will discuss this internally on Monday, and I will let you know what we decide to do with it.

Can we drop the severity of this bug now?

Yes, this is resolved.

-Paul Edmon-

(In reply to Paul Edmon from comment #88)
> Yes, this is resolved.
Dropping to sev4 per response.

Hi,

This commit should prevent a similar problem in the future. It will be included in 20.02.3 and above:
https://github.com/SchedMD/slurm/commit/f17cf91ccc56ccf87
Let me know if you have any additional questions about this or if we can close this ticket.

Dominik

Thanks. Appreciate the fix. You can close this.

-Paul Edmon-

This looks similar to https://bugs.schedmd.com/show_bug.cgi?id=8246. At least the fix in https://github.com/SchedMD/slurm/commit/f17cf91ccc56ccf87 looks like it should fix #8246 as well (except for the poor performance with lots of mmaps of small files, though my patch for #8246 didn't address that either).

*** Ticket 9217 has been marked as a duplicate of this ticket. ***

*** Ticket 10178 has been marked as a duplicate of this ticket. ***
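Dominik's note that this "can be blocked by submit plugin" refers to a job_submit plugin; as a site-side stopgap in the same spirit, a hypothetical sbatch wrapper could refuse absurd argument counts before they ever reach slurmctld. Everything here (path, threshold) is made up for illustration:

#!/bin/sh
# hypothetical site wrapper installed ahead of the real sbatch in PATH
if [ "$#" -gt 1024 ]; then
    echo "sbatch wrapper: refusing submission with $# arguments" >&2
    exit 1
fi
exec /usr/bin/sbatch "$@"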