Ticket 7532

Summary: slurmctld crash on pthread_create error
Product: Slurm    Reporter: Paul Edmon <pedmon>
Component: slurmctld    Assignee: Marcin Stolarek <cinek>
Status: RESOLVED FIXED
Severity: 2 - High Impact
CC: ben, cinek, etomeo, jbooth, justinriley, marshall, whowell
Version: 19.05.1
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=7360
https://bugs.schedmd.com/show_bug.cgi?id=7571
Site: Harvard University
Version Fixed: 19.05.3
Attachments: Current slurm.conf
Current topology.conf
dmesg for slurm master
thread apply all bt
slurmctld log from crash at 20190809 ~8:30AM
7532_gdb.py
7532_gdb.py
7532_gdb.py output
sdiag
Output of 7532_gdb.py with a core dump generated from live scheduler.
7532_gdb.py (v3)
Updated gdb_py run with the v3 script.
7532_gdb.py (v4)
7532_gdb.py (v5)
Contains info about crash at 08-14-19 at 2:30am
Contains info about crash at 08-14-19 at 13:26
Contains info about crash at 08-14-19 at 19:03
MemFree_Slurmctld.png
ServerMemory.png
Data for period of 08/14 21:22 to 08/15 10:58
Contains info about crash at 08-15-19 at 12:41
Slurmctld prolog
slurmdbd
slurm_epilog
Contains info about crash at 08-16-19 at 16:51

Description Paul Edmon 2019-08-07 05:52:51 MDT
We recently upgraded our cluster to 19.05.1 and turned on topology.  Overnight the scheduler crashed with the error below.  It did not leave a core file.

Aug  7 01:37:34 holy-slurm02 slurmctld[23420]: Warning: Note very large processing time from _slurm_rpc_complete_batch_script: usec=11056468 began=01:37:23.427
Aug  7 01:37:34 holy-slurm02 slurmctld[23420]: Warning: Note very large processing time from _slurm_rpc_step_complete: usec=3655661 began=01:37:30.828
Aug  7 01:37:34 holy-slurm02 slurmctld[23420]: Warning: Note very large processing time from _slurm_rpc_epilog_complete: usec=23748179 began=01:37:10.736
Aug  7 01:37:34 holy-slurm02 slurmctld[23420]: fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
Aug  7 01:37:34 holy-slurm02 slurmctld[23420]: _job_complete: JobId=17902001_7425(17906128) WEXITSTATUS 1
Aug  7 01:37:34 holy-slurm02 slurmctld[23420]: _job_complete: JobId=17902001_7425(17906128) done
Aug  7 01:37:34 holy-slurm02 slurmctld[23420]: Warning: Note very large processing time from _slurm_rpc_complete_batch_script: usec=5702613 began=01:37:28.785
Aug  7 01:37:35 holy-slurm02 slurmdbd[61969]: error: persistent connection experienced an error
Aug  7 01:37:35 holy-slurm02 systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE
Aug  7 01:37:35 holy-slurm02 docker[138961]: 2019-08-07 05:37:35,150 squeue command /usr/bin/squeue --format='%B:::%C:::%D:::%E:::%F:::%K:::%L:::%M:::%N:::%O:::%P:::%Q:::%R:::%S:::%T:::%U:::%V:::%W:::%Z:::%a:::%b:::%c:::%d:::%e:::%f:::%g:::%h:::%i:::%j:::%k:::%l:::%m:::%n:::%o:::%q:::%r:::%u:::%v:::%y' -h has failed.
Aug  7 01:37:35 holy-slurm02 docker[138961]: slurm_load_jobs error: Zero Bytes were transmitted or received
Aug  7 01:37:37 holy-slurm02 systemd[1]: Unit slurmctld.service entered failed state.
Aug  7 01:37:37 holy-slurm02 systemd[1]: slurmctld.service failed.

We need a fix for this ASAP, as the scheduler also hung when we tried to restart it.  It took a full restart of slurmdbd and slurmctld to get it going again.

I will attach our topology and slurm confs in a moment.

-Paul Edmon-
Comment 1 Paul Edmon 2019-08-07 06:00:24 MDT
Created attachment 11135 [details]
Current slurm.conf
Comment 2 Paul Edmon 2019-08-07 06:00:44 MDT
Created attachment 11136 [details]
Current topology.conf
Comment 3 Marcin Stolarek 2019-08-07 08:30:03 MDT
Paul, 

Could you please execute dmesg command and share the results? You can attach to the bug report as a file instead of copy&paste into the message since the output may be more than a few lines.

cheers,
Marcin
Comment 4 Paul Edmon 2019-08-07 08:36:38 MDT
Created attachment 11139 [details]
dmesg for slurm master

Looks to be full of NFS errors.  We export our slurm.conf from holy-slurm02 as an NFS mount that all our hosts use, so those errors are due to the number of connections exceeding the number of NFS threads we have, which is 256.
Comment 5 Marcin Stolarek 2019-08-07 08:48:09 MDT
Treating this together with bug 7528, I'd recommend starting with:
-) increasing the number of NFS threads
-) increasing the system limit on the number of open files
(a rough sketch of both changes follows below)
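
A rough sketch of both changes on a CentOS 7 host that serves the NFS export and runs slurmctld (the values below are examples, not tuned recommendations; the 50-ulimit.conf drop-in name matches the one that appears later in this ticket):

# /etc/sysconfig/nfs - number of NFS server threads
RPCNFSDCOUNT=256

# /etc/systemd/system/slurmctld.service.d/50-ulimit.conf - raise slurmctld's open-file limit
[Service]
LimitNOFILE=65536

# pick up both changes
systemctl daemon-reload
systemctl restart nfs-server slurmctld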

cheers,
Marcin
Comment 6 Paul Edmon 2019-08-07 09:21:33 MDT
We doubled the open file limit and the number of NFS threads.

[root@holy-slurm02 ~]# cat /proc/30434/limits
Limit                     Soft Limit   Hard Limit   Units
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            unlimited    unlimited    bytes
Max core file size        unlimited    unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             1030065      1030065      processes
Max open files            16384        16384        files
Max locked memory         65536        65536        bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       1030065      1030065      signals
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us

-Paul Edmon-

Comment 7 Jason Booth 2019-08-07 11:05:49 MDT
Hi Paul - I am just checking up on your status with regards to the cluster and the changes you made. Are things behaving better now that you have doubled the limit on open files and NFS?
Comment 8 Justin Riley 2019-08-07 15:15:03 MDT
We just had another slurmctld crash but this time simply restarting the slurmctld process seems to have worked (up for ~30 min anyway). This was the error on console after the crash before restarting the service:

slurmctld[30434]: email msg to <redacted>: Slurm Job_id=17848691 Name=k-113.1 Failed, Run time 00:31:32, PREEMPTED, ExitCode 0
slurmctld[30434]: preempted JobId=17848691 has been requeued to reclaim resources for JobId=17955815
slurmctld[30434]: job_signal: 10 of running JobId=17848692 successful 0x1
slurmctld[30434]: email msg to <redacted>: Slurm Job_id=17848692 Name=k-114.1 Failed, Run time 00:31:32, PREEMPTED, ExitCode 0
slurmctld[30434]: preempted JobId=17848692 has been requeued to reclaim resources for JobId=17955815
slurmctld[30434]: fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
slurmctld[30434]: fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable
systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE
systemd[1]: Unit slurmctld.service entered failed state.
systemd[1]: slurmctld.service failed.
Comment 9 Jason Booth 2019-08-07 15:26:23 MDT
Can you tell us a bit more about this system? What OS is this, and is Slurm running in a virtual environment? 

This issue may be related to 7360. 

In bug #7360 comment #32, we mention a diagnostic patch. I will let Marcin decide whether you should apply this as well. For now, I will relate this issue to 7360.
Comment 10 Justin Riley 2019-08-07 17:17:46 MDT
System is running CentOS 7.6.1810 on physical hardware (24 cores + HT and 256GB RAM)
Comment 11 Marcin Stolarek 2019-08-08 04:05:48 MDT
If you can apply the diagnostic patch from bug #7360 comment #32, it will provide us more insight in case of another failure. I recommend that.

In the first message I see that there was a squeue command executed under docker. Are you running the slurm daemons in containers as well? If yes, could you please execute `ulimit -a` in the daemons' containers and share the results with us? The command should be something like:
>docker exec -it CONTAINERID /bin/bash -c "ulimit -a"

Will you kindly share the output of the following commands:
>cat /proc/sys/kernel/threads-max 
>cat /proc/sys/kernel/pid_max 
>ps auxH | wc -l 
>dmesg -T 


cheers,
Marcin
Comment 12 Marcin Stolarek 2019-08-08 04:39:06 MDT
Additionally I'd like to ask for the output of sdiag.

cheers,
Marcin
Comment 13 Justin Riley 2019-08-08 08:23:48 MDT
Created attachment 11157 [details]
output requested by marcin (sdiag, max threads, max pids, etc)

We're not running slurm in docker containers at the moment (although would be nice to get there eventually) - the container you see in the logs is a side car that caches squeue output in a db.

I've attached the output you requested. I will work on applying the diagnostic patch for the next crash (we had another one this morning). Please note we've had these NFS warnings for a while without issue. We also doubled the number of NFS threads yesterday from 64 to 128 and haven't noticed any difference.
Comment 14 Marcin Stolarek 2019-08-08 09:16:16 MDT
Thanks for the output. Is it possible to get your slurmctld log? If you consider information like user names private you may want to anonymize it.

cheers,
Marcin
Comment 15 Justin Riley 2019-08-08 09:36:21 MDT
Would you mind deleting the attachment I posted (#11157) so I can re-upload it? The attachment has usernames. I will work on getting you a slurmctld log. Do you want the logs for today, or just around the crash? We have lots of slurmctld logs... (gigabytes)
Comment 16 Marcin Stolarek 2019-08-08 10:09:17 MDT
Justin, 

I switched it to be visible only to SchedMD employees - is that OK for you?

cheers,
Marcin
Comment 17 Justin Riley 2019-08-09 07:59:33 MDT
Yes that's fine thank you! I applied the patch yesterday and this morning the slurmctld daemon crashed again so we now have a 55G core dump. What would you like to see next?
Comment 18 Marcin Stolarek 2019-08-09 08:07:51 MDT
Justin, 

Can you try to execute the following commands:
gdb /path/to/slurmctld /path/to/core
(gdb) thread apply all bt


cheers,
Marcin
Comment 19 Justin Riley 2019-08-09 08:21:55 MDT
Created attachment 11178 [details]
thread apply all bt

Here's the output requested. I think I did the gdb logging to file correctly but let me know if this looks incomplete. Thanks.
Comment 20 Marcin Stolarek 2019-08-09 08:35:53 MDT
Thanks, is it possible to get slurmctld logs as well?

cheers,
Marcin
Comment 21 Justin Riley 2019-08-09 10:07:58 MDT
Created attachment 11181 [details]
slurmctld log from crash at 20190809 ~8:30AM

This is the log from midnight last night up until this morning's crash at ~8:30AM.
Comment 22 Marcin Stolarek 2019-08-09 12:32:57 MDT
Justin,

Could you please enable Route debugflag by:
>scontrol setdebugflag +Route

cheers,
Marcin
Comment 23 Justin Riley 2019-08-09 13:20:10 MDT
We might wait to do that until Monday given that SLURM has been very sluggish since the upgrade and we're afraid it might become even more unstable over the weekend with debug mode enabled...
Comment 24 Marcin Stolarek 2019-08-12 07:55:15 MDT
Justin, 

Looking at your topology.conf and the backtrace, I've found that a majority of the threads are coming from communication with nodes connected to the switch called holyhdr. Could you please check whether there are issues with these connections? (I don't have any evidence of it, but the coincidence looks suspicious.)

BTW, you have quite a big MessageTimeout=100 configured. What was the reason for that? With a lower MessageTimeout I'd expect some of those threads to time out, preventing the failure. I don't mean it's the root cause, and we're continuing the investigation.

Could you please load the core dump one more time and execute the attached python script with:
gdb ./slurmctld ./corefile
(gdb) source 7532_gdb.py

and share the results with us? I'll attach 7532_gdb.py to the case in a minute.

cheers,
Marcin
Comment 25 Marcin Stolarek 2019-08-12 07:57:05 MDT
Created attachment 11190 [details]
7532_gdb.py

python script extracting some data from the core dump
Comment 26 Marcin Stolarek 2019-08-12 08:11:04 MDT
Created attachment 11191 [details]
7532_gdb.py

small fix in python gdb script
Comment 27 Paul Edmon 2019-08-12 08:46:52 MDT
I've turned on the Route debug flag. I noticed that it isn't in the slurm.conf option list - is that new or undocumented? Will it survive restarts of the scheduler?

As for topology, the nodes on holyhdr you are seeing are the new cluster that we are working to stand up. We preemptively added those to the conf so that they would come in as they were built. The slowness and crashes occurred before we added those on Wednesday (pretty much from when we did the slurm upgrade). We typically have various nodes down on the cluster at any given time, so the scheduler getting hung up on down nodes is definitely not optimal; it should be resilient enough to withstand large portions of the cluster going out at any given time. Anyways, if those nodes are causing noise that is making it hard to debug, let me know and I can drop them from the conf until we have them built. We can also try turning topology off completely and see if that helps stability.

Justin is working on getting the other info you asked for and should have that back soon.

-Paul Edmon-

Comment 28 Justin Riley 2019-08-12 08:51:50 MDT
Created attachment 11193 [details]
7532_gdb.py output

Hi Marcin,

Here's the output you requested from the core dump file using the latest version of the Python script you sent. Let us know if you need anything else. Thanks!

~Justin
Comment 29 Marcin Stolarek 2019-08-12 08:53:21 MDT
Paul, 

Thanks for the prompt reply. Yes, I agree that down nodes shouldn't be an issue; however, if it's possible to remove them just for debugging purposes, please do that.

The Route debug flag set from the command line will not survive a restart (as with all flags set that way). However, if we get the slurmctld log from the crash we may not need it anymore. You're correct that it's not documented; the documentation patch is already on the way to release.
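
To make the flag persistent across controller restarts, a minimal sketch combining the slurm.conf setting with the runtime command:

# slurm.conf - survives slurmctld restarts
DebugFlags=Route

# toggle it on the running controller without a restart:
scontrol setdebugflag +Route
scontrol setdebugflag -Route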

cheers,
Marcin
Comment 30 Paul Edmon 2019-08-12 08:57:00 MDT
Cool. I will add that to the slurm.conf for now just to make sure it is 
there, and I will drop the new nodes from the conf until they are ready.

-Paul Edmon-

Comment 31 Paul Edmon 2019-08-12 10:28:39 MDT
Okay, those nodes are out of the slurm.conf and topology.

-Paul Edmon-

Comment 32 Marcin Stolarek 2019-08-13 08:26:21 MDT
Paul, 
Justin,

Did the removal of the down nodes improve Slurm responsiveness? Did you notice another fatal error?

I've tried to further understand the issue. While I agree that down nodes are normally not a problem, in your case the situation is special: the number of added nodes is high - 670 nodes. Slurmctld triggers REQUEST_NODE_REGISTRATION_STATUS for those every SlurmdTimeout/3 seconds and awaits a reply for MessageTimeout seconds, and in your case both are equal to 100. Most often those requests are forwarded by slurmds; however, when the forwarding node is down we trigger parallel queries to the non-responding nodes, which results in a lot of threads in your case.

Specifically to cover the case of nodes "in preparation" you can set the "FUTURE" state for those nodes in slurm.conf:

>FUTURE    Indicates the node is defined for future use and need not exist when the Slurm daemons are started. These nodes can be made available for use simply by updating the node state using the scontrol command rather than restarting the slurmctld daemon. After these nodes are made available, change their State in the slurm.conf file. Until these nodes are made available, they will not be seen using any Slurm commands, nor will any attempt be made to contact them.
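
A minimal sketch of how the not-yet-built nodes could be declared in slurm.conf (node names and hardware values here are placeholders, not taken from the actual configuration):

# slurm.conf - nodes defined for future use; slurmctld will not try to contact them
NodeName=holynew[001-670] CPUs=48 RealMemory=190000 State=FUTURE
# When a node is physically ready, bring it in with scontrol update (per the
# FUTURE description above) and then change its State in slurm.conf.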

I'm still looking into the case, concentrating now on identifying the resource causing the "Resource temporarily unavailable" error. Could you please help me by answering the questions below:
What's the time between slurmctld crashes? Do you suspect any coincidence with other tools/time of day/user activity?
Can you share slurmctld.log with the Route debug flag enabled?
What is the content of /proc/SLURMCTLD_PID/limits?
What's the result of:
#ps auxH | grep slurmctld | grep -v grep | wc -l 
#ls /proc/SLURMCTLD_PID/fdinfo/ | wc -l 
#cat /proc/meminfo
#cat /proc/SLURMCTLD_PID/statm
Are you able to monitor those over time? 

cheers,
Marcin
Comment 33 Paul Edmon 2019-08-13 08:49:44 MDT
With the removal of those nodes Slurm hasn't become more responsive; it's still sluggish to respond to scontrol. We have a diamond script that regularly polls for node state. On the previous generation (18.08.x) we never had any issues and things were very responsive. However, we are now seeing scontrol processes stacking up from our diamond collectors:

[root@holy-slurm02 ~]# ps aux | grep scontrol
root       1949  0.0  0.0  53652  5916 ?  S  03:06  0:00 scontrol -o show node
root       9738  0.0  0.0  53560  5928 ?  S  Aug12  0:00 scontrol -o show node
root      10580  0.0  0.0  53648  5912 ?  S  03:08  0:00 scontrol -o show node
root      12742  0.0  0.0  53648  5908 ?  S  09:40  0:00 scontrol -o show node
root      13226  0.0  0.0  53560  5928 ?  S  Aug12  0:00 scontrol -o show node
root      34973  0.0  0.0  53560  5936 ?  S  Aug12  0:00 scontrol -o show node
root      38236  0.0  0.0  53652  5928 ?  S  01:01  0:00 scontrol -o show node
root      40510  0.0  0.0  53648  5904 ?  S  03:15  0:00 scontrol -o show node
root      44737  0.0  0.0  53648  5912 ?  S  03:16  0:00 scontrol -o show node
root      45990  0.0  0.0  53652  5912 ?  S  01:03  0:00 scontrol -o show node
root      49529  0.0  0.0  53652  5920 ?  S  01:04  0:00 scontrol -o show node
root      56725  0.0  0.0  53648  5916 ?  S  03:19  0:00 scontrol -o show node
root      58075  0.0  0.0  53652  5928 ?  S  01:48  0:00 scontrol -o show node

[root@holy-slurm02 ~]# ps aux | grep scontrol | wc -l
44

Diamond polls about every 60 seconds, so scontrol isn't returning within that window. This is from the master itself, which should be the fastest place to query. We have users from across the cluster that also poll the master, so if diamond can't keep up everyone else will be even slower.

We haven't seen a crash since yesterday but I have deliberately 
restarted the master periodically to ward off crashes.  From previous 
experience crashes happen after about 12 hours.  If you wish to see 
another crash I can let the scheduler just run with out restarting but 
it may be a half a day before it does and it may crash at an inopportune 
time (i.e. in the middle of the night).

As it stands I don't have a good idea as to the instigation of the crash. If I were to hazard a guess, given that I'm seeing the build-up of scontrols here, it may be a build-up or a burst of requests that is causing the issue. Given that the slowness appears to be in the responsiveness to scontrol, that may be a hint as to the cause. That said, that is all speculation, and our historic thread count (we have graphs of sdiag that we keep) indicates no spikes in agent queue size or thread count before it crashes. So it may either be sudden, or a slow burn and accumulation of cruft that then causes the crash. My hunch is the latter, given that we never saw this behavior in 18.08, which was far more responsive. The significant changes we made were the upgrade to 19.05 and turning on topology. So it is either changes made by the 19.05 upgrade or something in topology that is not agreeing with our setup. We have not attempted to just turn off topology and see if that improves matters. That may give a hint if we did.

As mentioned previously, our slurmctld logs are massive. Do you have a time window you want to look at? I can get you what we have of that window. As for the other things:

[root@holy-slurm02 diamond]# cat /proc/75722/limits
Limit                     Soft Limit   Hard Limit   Units
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            unlimited    unlimited    bytes
Max core file size        unlimited    unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             1030065      1030065      processes
Max open files            16384        16384        files
Max locked memory         65536        65536        bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       1030065      1030065      signals
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us

[root@holy-slurm02 diamond]# ps auxH | grep slurmctld | grep -v grep | wc -l
56

[root@holy-slurm02 diamond]# ls /proc/75722/fdinfo/ | wc -l
1049

[root@holy-slurm02 diamond]# cat /proc/meminfo
MemTotal:       263721992 kB
MemFree:        10807008 kB
MemAvailable:   201588824 kB
Buffers:            3404 kB
Cached:         189931848 kB
SwapCached:       380228 kB
Active:         138173532 kB
Inactive:       107306896 kB
Active(anon):   53674020 kB
Inactive(anon):  4863440 kB
Active(file):   84499512 kB
Inactive(file): 102443456 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       4194300 kB
SwapFree:        1324152 kB
Dirty:               104 kB
Writeback:             0 kB
AnonPages:      55165232 kB
Mapped:           105256 kB
Shmem:           2992272 kB
Slab:            5099216 kB
SReclaimable:    4722928 kB
SUnreclaim:       376288 kB
KernelStack:       21184 kB
PageTables:       179824 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    136055296 kB
Committed_AS:   107473392 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      814764 kB
VmallocChunk:   34224633408 kB
HardwareCorrupted:     0 kB
AnonHugePages:   8017920 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      551632 kB
DirectMap2M:    14950400 kB
DirectMap1G:    254803968 kB

[root@holy-slurm02 diamond]# cat /proc/75722/statm
6852587 2804962 1437 237 0 6823036 0

We don't currently have any monitoring set up for those over time.  As 
noted we do dump out the stats from sdiag and track those over time.  We 
could try setting up monitoring for load and ram usage over time but 
more detailed data would be more tricky and take a bit of work to get done.

In summary, if I were to hazard a guess as to where the issue lies, there is likely some hang-up in the communications stack in 19.05 with topology compared to 18.08 without. What I don't know is whether it is purely topology or not, but something in those two changes is the culprit. I also don't know whether the crash is due to the slowness in the communication stack or not; they may in fact be independent problems. My hunch is that they aren't, but it's hard to say.

-Paul Edmon-

Comment 34 Paul Edmon 2019-08-13 09:23:17 MDT
We just did an experiment turning off topology and select/cons_tres (reverting to select/cons_res) and saw no change in responsiveness.  So the slowness, at least, is due to the 19.05 upgrade in general and not topology.  That doesn't rule out topology causing the crash, but the slow response to scontrol and other queries is a general feature of 19.05 that wasn't in 18.08.  In addition, we ran 18.08 with the previous settings with no issue.  So something happened in 19.05 that is bogging things down.
Comment 35 Marcin Stolarek 2019-08-13 09:51:04 MDT
Paul, 

Thanks for the test with the reverted configuration. I didn't expect route/topology or select/cons_tres to be the reason, but now it's confirmed.

Instead of periodic slurmctld restarts you can use the systemd "Restart" feature, simply adding to the slurmctld.service definition:
>Restart=on-failure
followed by a `systemctl daemon-reload`. This should restart slurmctld automatically when it fails.
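
A minimal sketch of that as a drop-in unit (the file name is arbitrary; this is separate from the existing 50-ulimit.conf drop-in):

# /etc/systemd/system/slurmctld.service.d/restart.conf
[Service]
Restart=on-failure
RestartSec=10

# then:
systemctl daemon-reload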

cheers,
Marcin
Comment 36 Paul Edmon 2019-08-13 09:54:03 MDT
We have puppet set to auto restart slurmctld if it finds it failed.  
Though sometimes when it fails it just wedges on restart and we have to 
restart the whole stack (slurmdbd and mariadb) as it appears that some 
of the slurmdbd queries get hung up and you need to clean house completely.

Still we will look at the auto restart option too.  I will leave the 
scheduler alone to fail again so we can get a fresh core and a new log roll.

-Paul Edmon-

Comment 37 Paul Edmon 2019-08-13 09:58:38 MDT
I was wondering what your best guess is right now as to the pthread fault. As established, it isn't just our instance, since other people have filed tickets about this and it appears to be specific to the 19.05 series. I'm happy to put any other instrumentation into our scheduler needed to trace this back; I was just curious as to the current theory and/or guess.
Comment 38 Marcin Stolarek 2019-08-13 11:04:05 MDT
I have two additional questions:
Can you share a fresh sdiag with us?
Can you get a core dump of your running slurmctld (using gcore[1]) and apply 7532_gdb.py to it as previously (comment 24)?

From the previous one, it's quite uncommon that you have so many REQUEST_BUILD_INFO RPCs. Do you have any automation periodically issuing something like scontrol show config or sprio? If yes, can you disable it or reduce its frequency? (Just to test.)

We're not sure whether the other issues are related. In your case, I'd like to understand why you see so many RPCs being processed slowly. It doesn't look like the issue (or it's low probability) from the previous sdiag, but as a blind guess can you change your SchedulerParameters, setting bf_interval=60 and adding bf_yield_sleep=800000? (See the sketch below.)
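
As a sketch, the corresponding slurm.conf change (keep whatever other SchedulerParameters options are already set; the line below only illustrates the two additions):

SchedulerParameters=bf_interval=60,bf_yield_sleep=800000

# apply without a full restart
scontrol reconfigure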

cheers,
Marcin
[1] http://man7.org/linux/man-pages/man1/gcore.1.html
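
A minimal sketch of capturing a core from the running daemon and feeding it to the diagnostic script (paths are placeholders; gcore ships with gdb):

gcore -o /tmp/slurmctld_core $(pidof slurmctld)   # writes /tmp/slurmctld_core.<pid>
gdb /usr/sbin/slurmctld /tmp/slurmctld_core.<pid>
(gdb) source 7532_gdb.py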
Comment 39 Paul Edmon 2019-08-13 11:24:11 MDT
Created attachment 11212 [details]
sdiag
Comment 40 Paul Edmon 2019-08-13 11:25:08 MDT
Created attachment 11213 [details]
Output of 7532_gdb.py with a core dump generated from live scheduler.
Comment 41 Paul Edmon 2019-08-13 11:30:02 MDT
I just put in the outputs of sdiag and 7532_gdb.py on a current core 
dump from the live scheduler.

As I noted in my previous email we have diamond scripts polling node 
state and sdiag every 60 seconds for information.  You can see the 
scripts here:

https://github.com/fasrc/slurm-diamond-collector

I'm not keen on turning these off as they are important for giving us insight into how the scheduler is doing.  Plus they have worked fine without causing issues with slurm for many years.  If we need to cut down on noise so we can debug we can do so, but we can't leave them off for long periods as we need the stats.

I will add in the bf parameters you recommend and let you know if I see 
any difference.

-Paul Edmon-

Comment 42 Paul Edmon 2019-08-13 11:55:26 MDT
Looks like that change did not improve responsiveness to scontrol or 
other queries.  It's still slow.  I'm reverting back to our old config.

We should focus back on the pthread issue as that is really the main 
problem.  Having the scheduler crash periodically is clearly a major 
problem.  The slowness is annoying but still not as urgent as the 
pthread issue.  I think we have gotten off track looking at the slowness 
unless we really think that solving the slowness will solve the crashing 
issue.  I'm happy to have both solved but the crash is more urgent.

-Paul Edmon-

Comment 43 Marcin Stolarek 2019-08-13 12:45:26 MDT
Thanks for the gdb output. I'll attach a new version of the gdb script; as always, please execute it and share the results.

We're not sure if the slowness is related to the out-of-resources issue you faced, but either way it's definitely a regression.

cheers,
Marcin
Comment 44 Marcin Stolarek 2019-08-13 12:46:38 MDT
Created attachment 11214 [details]
7532_gdb.py (v3)
Comment 45 Paul Edmon 2019-08-13 14:37:14 MDT
Created attachment 11216 [details]
Updated gdb_py run with the v3 script.
Comment 46 Paul Edmon 2019-08-13 14:37:51 MDT
Added a new version of the log generated from the v3 version of the script.
Comment 47 Marcin Stolarek 2019-08-14 07:38:35 MDT
Did you notice any failure? If yes, please use the attached python script as previously.

If it didn't happen, can you get a new core dump using gcore, perhaps repeat it after a few seconds, and run gdb against all of the core dumps?

I'll attach updated python script in a minute.

cheers,
Marcin
Comment 48 Marcin Stolarek 2019-08-14 07:40:15 MDT
Created attachment 11226 [details]
7532_gdb.py (v4)
Comment 49 Marcin Stolarek 2019-08-14 07:43:40 MDT
Created attachment 11227 [details]
7532_gdb.py (v5)
Comment 50 Paul Edmon 2019-08-14 08:09:01 MDT
Yes, we had 2 failures overnight.  I will run the script as well as give you a slice of the logs leading up to the crash.

-Paul Edmon-

Comment 51 Paul Edmon 2019-08-14 08:22:04 MDT
Created attachment 11228 [details]
Contains info about crash at 08-14-19 at 2:30am

7532_gdb_py_v5-081419T0230.log contains the log from the v5 gdb script on the core from the 2:30am crash.

messages-081419T02.txt contains a slice of the logs from 2 am when the crash occurred.

Let me know if you want any other info.
Comment 52 Jason Booth 2019-08-14 11:45:12 MDT
Paul - would you also send us the output of:

systemctl status slurmctld

We are curious to know what the tasks limit is set to by systemd:

Tasks: [value] (limit: [value])
Comment 53 Paul Edmon 2019-08-14 12:13:34 MDT
[root@holy-slurm02 ~]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmctld.service.d
           └─50-ulimit.conf
   Active: active (running) since Wed 2019-08-14 13:31:04 EDT; 42min ago
  Process: 161777 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 161779 (slurmctld)
    Tasks: 122
   Memory: 8.8G
   CGroup: /system.slice/slurmctld.service
           ├─106776 [slurmctld]
           ├─106787 n/a
           └─161779 /usr/sbin/slurmctld

Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza04 tree_width 0
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza01 tree_width 0
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza04 tree_width 0
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza01 tree_width 0
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: _slurm_rpc_epilog_complete: node_name = holy2a06205, JobId=18535615
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza01 tree_width 0
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza04 tree_width 0
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza04 tree_width 0
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza04 tree_width 0
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza04 tree_width 0

Comment 55 Jason Booth 2019-08-14 13:11:32 MDT
*** Ticket 7571 has been marked as a duplicate of this ticket. ***
Comment 56 Marcin Stolarek 2019-08-14 13:22:30 MDT
Paul, 

You mentioned that you had two failures last night. Could you please attach the logs and backtrace for the 2nd failure? I'd like to compare them, since the previous one is very different (probably due to the down nodes making a lot of "noise").

Can you create a simple cron job, executed every minute, with a script like:
>echo =====
>date
>echo threads
>ps auxH | grep slurmctld | grep -v grep | wc -l 
>echo fdinfo
>ls /proc/SLURMCTLD_PID/fdinfo/ | wc -l 
>echo meminfo
>cat /proc/meminfo
>echo statm
>cat /proc/SLURMCTLD_PID/statm

appending the output to the file? This may help us to figure out which resource is exhausted.

cheers,
Marcin
Comment 58 Paul Edmon 2019-08-14 13:33:15 MDT
Sadly I blew away the other one from last night, as we were running short on space and had a bunch of cores lying around. Fortunately(?) we had a crash this afternoon.  I will get you the info on that one.

-Paul Edmon-

Comment 59 Paul Edmon 2019-08-14 13:38:55 MDT
Created attachment 11235 [details]
Contains info about crash at 08-14-19 at 13:26

7532_gdb_py_v5-081419T1326.log contains the log from the v5 gdb script on the core from the 13:26 crash.

messages-081419T13.txt contains a slice of the logs from 1 pm when the crash occurred.
Comment 60 Paul Edmon 2019-08-14 13:39:31 MDT
I will throw together a script to put in cron to pull that info.
Comment 61 Paul Edmon 2019-08-14 13:54:09 MDT
Okay, I have that running now.  At the next crash I will include that data.

-Paul Edmon-

Comment 62 Paul Edmon 2019-08-14 19:16:08 MDT
Created attachment 11244 [details]
Contains info about crash at 08-14-19 at 19:03

7532_gdb_py_v5-081419T1903.log contains the log from the v5 gdb script on the core from the 19:03 crash.

messages-081419T18-19.txt contains a slice of the logs from 6-7 pm when the crash occurred.

data-161779.log contains the info requested about memory usage, threads, etc. at 1 minute intervals leading up to and after the crash.
Comment 63 Paul Edmon 2019-08-14 19:17:03 MDT
The latest crash report contains the additional info that was requested to be monitored regarding memory and thread usage.  We are continuing to log that data so we will have it for future crashes as well.
Comment 64 Marcin Stolarek 2019-08-15 04:24:46 MDT
Paul, 

The data you shared is in fact very interesting. It shows that just before the crash MemFree drops rapidly (attached MemFree_Slurmctld.png); however, the slurmctld VmSize looks stable. I analyzed the MemInfo changes carefully and noticed that during the MemFree drop there is increased Writeback activity, which means that page cache is being written back to a storage device. This may indicate that the OS granted memory to a process "in advance" and then started to make space by reducing caches (attached ServerMemory.png).

Please make sure that the monitoring script performs a dynamic slurmctld PID lookup - otherwise you'll have to update the script after every restart.

You can use a snippet like:
>SLURMCTLDPID=$(ps aux | grep slurmctld$ | grep -v grep  | awk '{print $2}')
>if [ "x$SLURMCTLDPID" == "x" ]
>then
>  exit 1
>fi

Could you please add top memory users lines to the cron script:
>echo "Top VmSize"
>ps aux --sort=vsz | head -20

cheers,
Marcin
Comment 65 Marcin Stolarek 2019-08-15 04:25:53 MDT
Created attachment 11251 [details]
MemFree_Slurmctld.png
Comment 66 Marcin Stolarek 2019-08-15 04:29:39 MDT
Created attachment 11252 [details]
ServerMemory.png
Comment 67 Paul Edmon 2019-08-15 08:28:02 MDT
Justin has reworked the hacky script I threw together to do that and we 
are adding the VmSize monitoring.  At the next crash we will send you 
what we have.

-Paul Edmon-

Comment 68 Paul Edmon 2019-08-15 09:04:15 MDT
Created attachment 11254 [details]
Data for period of 08/14 21:22 to 08/15 10:58

Doesn't contain the vsz but does track the slurmctld pid.  Known crash at 02:03.  We've restarted the script now with vsz tracking and will send results along with next crash.
Comment 70 Paul Edmon 2019-08-15 11:00:35 MDT
Created attachment 11257 [details]
Contains info about crash at 08-15-19 at 12:41

7532_gdb_py_v5-081519T1241.log contains the log from the v5 gdb script on the core from the 12:41 crash.

messages-081519T12.txt contains a slice of the logs from 12 pm when the crash occurred.

data3-081519T1251.log contains the info requested about memory usage, vsz, threads, etc. at 1 minute intervals leading up to and after the crash.
Comment 71 Paul Edmon 2019-08-15 11:01:05 MDT
Just added new crash info from latest crash, contains additional info requested regarding vsz.
Comment 72 Marcin Stolarek 2019-08-16 01:47:32 MDT
Paul, 

My apologies - I made a mistake in the ps command used for VmSize. As written it sorts in ascending order, so the top entries show the least memory-consuming processes. Please change it to:
>ps aux --sort=-vsz | head -20
The '-' sign makes it sort the way we need.

Please use a simpler and more reliable construct to get a pid of slurmctld:
>SLURMCTLDPID=$(pidof slurmctld)
The one I suggested earlier was over-engineered.

cheers,
Marcin
Comment 73 Paul Edmon 2019-08-16 08:13:53 MDT
I've updated the data script with that modification.

-Paul Edmon-

Comment 74 Marcin Stolarek 2019-08-16 08:14:55 MDT
Could you please also add a line for RSS? The memory monitoring commands should finally look like:
>ps aux --sort=-vsz | head -20
>ps aux --sort=-rss | head -20
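
Putting the pieces from comments 56, 64, and 72 together, the per-minute monitoring script might look roughly like this (a sketch; the script path, log path, and cron entry are placeholders, not the site's actual script):

#!/bin/bash
# /usr/local/sbin/slurmctld_monitor.sh
# run every minute, e.g. from /etc/cron.d:
#   * * * * * root /usr/local/sbin/slurmctld_monitor.sh >> /var/log/slurmctld_monitor.log
SLURMCTLDPID=$(pidof slurmctld)
if [ -z "$SLURMCTLDPID" ]; then
    exit 1
fi
echo =====
date
echo threads
ps auxH | grep slurmctld | grep -v grep | wc -l
echo fdinfo
ls /proc/$SLURMCTLDPID/fdinfo/ | wc -l
echo meminfo
cat /proc/meminfo
echo statm
cat /proc/$SLURMCTLDPID/statm
echo "Top VmSize"
ps aux --sort=-vsz | head -20
echo "Top RSS"
ps aux --sort=-rss | head -20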

cheers,
Marcin
Comment 75 Paul Edmon 2019-08-16 08:17:29 MDT
Yup, I've added that as well.  We will send in the log when we have 
another crash.

-Paul Edmon-

Comment 76 Marcin Stolarek 2019-08-16 08:41:52 MDT
Thanks Paul. Can you also share your slurmctld prolog with us?

cheers,
Marcin
Comment 77 Paul Edmon 2019-08-16 08:52:42 MDT
Created attachment 11263 [details]
Slurmctld prolog
Comment 78 Marcin Stolarek 2019-08-16 09:59:19 MDT
Paul, 

Could you please attach your slurmdbd.conf and Epilog script?

cheers,
Marcin
Comment 79 Paul Edmon 2019-08-16 10:09:20 MDT
Created attachment 11268 [details]
slurmdbd
Comment 80 Paul Edmon 2019-08-16 10:09:41 MDT
Created attachment 11269 [details]
slurm_epilog
Comment 81 Marcin Stolarek 2019-08-16 13:05:00 MDT
Did you notice any failure or was the situation more stable today?

cheers,
Marcin
Comment 82 Paul Edmon 2019-08-16 14:07:40 MDT
We've had a couple of failures, but that was before we put in the new metrics that we are polling for.  It's been about 5 hours or so, and we seem to be failing every 6, so I'm expecting to see one soon.  When it happens I will throw the logs in as I have been.

-Paul Edmon-

Comment 84 Jason Booth 2019-08-16 14:29:59 MDT
Paul - We are wondering if you might be able to disable PrologSlurmctld and run for a while to rule out a possible cause that we have been discussing. There was a change between 18.08 and 19.05 where we switched from detached threads to using track_scripts. Doing this will help us narrow down the cause.
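
One way to do that, assuming PrologSlurmctld is set in slurm.conf (the script path below is a placeholder for the site's actual prolog):

# slurm.conf - comment out the controller prolog for the test
#PrologSlurmctld=/usr/local/sbin/slurmctld_prolog.sh

# restart the controller so the change takes effect
systemctl restart slurmctld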
Comment 85 Paul Edmon 2019-08-16 15:41:03 MDT
Created attachment 11271 [details]
Contains info about crash at 08-16-19 at 16:51

7532_gdb_py_v5-081619T1651.log contains the log from the v5 gdb script on the core from the 16:51 crash.

messages-081619T16.txt contains a slice of the logs from 4 pm when the crash occurred.

data-081619T1732.log contains the info requested about memory usage, vsz, threads, etc. at 1 minute intervals leading up to and after the crash
Comment 86 Paul Edmon 2019-08-16 15:41:25 MDT
Sure, will do.  I've also uploaded the last crash data.

-Paul Edmon-

Comment 87 Paul Edmon 2019-08-17 07:28:53 MDT
It looks like this change stabilized the scheduler.  We had no crashes 
overnight and it had been typically crashing every 6 hours or so.

I'm going to keep an eye on it over the weekend but this looks promising.

-Paul Edmon-

Comment 88 Marcin Stolarek 2019-08-18 11:37:24 MDT
Paul, 

Thanks for the good news. Does the lack of news since yesterday mean that you didn't have any slurmctld crashes with PrologSlurmctld disabled?

cheers,
Marcin
Comment 89 Paul Edmon 2019-08-18 11:42:39 MDT
It's been rock solid since we made the change.  No failures.  I think we 
found the problem.

-Paul Edmon-

Comment 92 Jason Booth 2019-08-19 13:47:59 MDT
Hi Paul,
 
We have committed a patch for this issue into 19.05. 

https://github.com/SchedMD/slurm/commit/a04eea2e03af4181b518e28e775b414ffd18e46f

We are fairly certain this resolves the issue; however, would you be willing to test it out by applying this patch and re-enabling PrologSlurmctld?
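
For reference, one way to pull that commit down and apply it to a 19.05 source tree before rebuilding (a sketch; adjust to however the site builds and deploys Slurm):

curl -LO https://github.com/SchedMD/slurm/commit/a04eea2e03af4181b518e28e775b414ffd18e46f.patch
cd slurm-19.05.1
patch -p1 < ../a04eea2e03af4181b518e28e775b414ffd18e46f.patch
# rebuild and reinstall slurmctld, then restart it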
Comment 93 Paul Edmon 2019-08-19 13:53:12 MDT
Sure. I can try this out tomorrow.

Also, as a request: the reason we ended up with that slurmctld prolog script is so that we can store copies of users' job submission scripts for longer than the run of the job, so we can do post-fact reconstruction of what they did. It really isn't ideal to do it this way. Our preference would be an option for Slurm to automatically ship the script to the slurmdbd as an additional field for the job. If the size ends up being too much for the database, perhaps a size limit for the script and some way of archiving the scripts would be feasible, as slurmdbd already supports archiving certain things on different time scales. I know this option would be of great service to us and the community at large, as users are typically not savvy enough to send us the scripts they used when they have problems, or perhaps don't even remember what they submitted.

Thanks.

-Paul Edmon-

Comment 94 Paul Edmon 2019-08-20 09:02:30 MDT
Okay, this patch is live in our system and our prolog is running.  I 
will let you know if I see any issues.

-Paul Edmon-

Comment 95 Paul Edmon 2019-08-21 08:09:05 MDT
I ran it with that patch for a day with ctld prolog on and it was 
stable.  Looks like that did the trick.

-Paul Edmon-

Comment 96 Marcin Stolarek 2019-08-21 08:47:58 MDT
Paul, 

As I understand it, you didn't notice any failures after applying the patch. The fix is already merged into the 19.05 branch and will be available in 19.05.3. You should remove the debug flags we introduced and re-enable the route/topology plugin - I think it makes sense in your setup.

For the new nodes, we recommend using the FUTURE state as mentioned in comment 32.

If you're still experiencing other issues (the slowness) or would like to request a feature (storing job scripts in the accounting database), the best way would be to address those in separate reports.

I'm closing this ticket now as fixed. If you notice similar instability, please re-open.

cheers,
Marcin
Comment 97 Marcin Stolarek 2019-08-28 08:54:27 MDT
*** Ticket 7528 has been marked as a duplicate of this ticket. ***