Description
Paul Edmon
2019-08-07 05:52:51 MDT
Created attachment 11135 [details]
Current slurm.conf
Created attachment 11136 [details]
Current topology.conf
Paul,

Could you please execute the dmesg command and share the results? You can attach the output to the bug report as a file instead of pasting it into the message, since it may be more than a few lines.

cheers,
Marcin

Created attachment 11139 [details]
dmesg for slurm master
Looks to be full of NFS errors. We export our slurm.conf from holy-slurm02 as an NFS mount that all our hosts use, so that error is due to the number of connections exceeding the number of NFS threads we have, which is 256.
Treating this together with bug 7528, I'd recommend starting with:
-) increase the number of NFS threads
-) increase the system limits for the number of open files

cheers,
Marcin

We doubled the limit on open files and NFS.

[root@holy-slurm02 ~]# cat /proc/30434/limits
Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            unlimited     unlimited     bytes
Max core file size        unlimited     unlimited     bytes
Max resident set          unlimited     unlimited     bytes
Max processes             1030065       1030065       processes
Max open files            16384         16384         files
Max locked memory         65536         65536         bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       1030065       1030065       signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us

-Paul Edmon-

Hi Paul - I am just checking up on your status with regards to the cluster and the changes you made. Are things behaving better now that you have doubled the limit on open files and NFS?

We just had another slurmctld crash, but this time simply restarting the slurmctld process seems to have worked (up for ~30 min anyway). This was the error on the console after the crash, before restarting the service:

slurmctld[30434]: email msg to <redacted>: Slurm Job_id=17848691 Name=k-113.1 Failed, Run time 00:31:32, PREEMPTED, ExitCode 0
slurmctld[30434]: preempted JobId=17848691 has been requeued to reclaim resources for JobId=17955815
slurmctld[30434]: job_signal: 10 of running JobId=17848692 successful 0x1
slurmctld[30434]: email msg to <redacted>: Slurm Job_id=17848692 Name=k-114.1 Failed, Run time 00:31:32, PREEMPTED, ExitCode 0
slurmctld[30434]: preempted JobId=17848692 has been requeued to reclaim resources for JobId=17955815
slurmctld[30434]: fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
slurmctld[30434]: fatal: _slurmctld_rpc_mgr: pthread_create error Resource temporarily unavailable
systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE
systemd[1]: Unit slurmctld.service entered failed state.
systemd[1]: slurmctld.service failed.

Can you tell us a bit more about this system? What OS is this, and is Slurm running in a virtual environment? This issue may be related to 7360. In bug #7360 comment #32, we mention a diagnostic patch. I will let Marcin decide whether you should apply this as well. For now, I will relate this issue with 7360.

The system is running CentOS 7.6.1810 on physical hardware (24 cores + HT and 256GB RAM).

If you can apply the diagnostic patch from bug #7360 comment #32, it will provide us more insight in case of another failure. I recommend that.

In the first message I see that there was a squeue command executed under docker?
If yes, could you please execute `ulimit -a` in the daemons' containers and share the results with us? The command should be something like:
>docker exec -it CONTAINERID /bin/bash -c "ulimit -a"

Would you kindly share the output of the following commands:
>cat /proc/sys/kernel/threads-max
>cat /proc/sys/kernel/pid_max
>ps auxH | wc -l
>dmesg -T

cheers,
Marcin

Additionally, I'd like to ask for the output of sdiag.

cheers,
Marcin

Created attachment 11157 [details]
output requested by marcin (sdiag, max threads, max pids, etc)
We're not running slurm in docker containers at the moment (although it would be nice to get there eventually) - the container you see in the logs is a sidecar that caches squeue output in a DB.
I've attached the output you requested. I will work on applying the diagnostic patch for the next crash (we had another one this morning). Please note we've had these NFS warnings for a while without issue. We also doubled the number of NFS threads yesterday from 64 to 128 and haven't noticed any difference.
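For reference, a minimal sketch of how the open-files limit for slurmctld can be raised on a systemd host with a unit drop-in; the file name and value below are illustrative, not the exact change made on this system:

# /etc/systemd/system/slurmctld.service.d/50-ulimit.conf  (illustrative path)
[Service]
LimitNOFILE=16384

# reload systemd and restart the daemon so the new limit takes effect
systemctl daemon-reload
systemctl restart slurmctld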
Thanks for the output. Is it possible to get your slurmctld log? If you consider information like user names private, you may want to anonymize it.

cheers,
Marcin

Would you mind deleting the attachment I posted (#11157) so I can reupload? The attachment has usernames. I will work on getting you a slurmctld log. Do you want the logs for today, or just around the crash, or...? We have lots of slurmctld logs (gigabytes).

Justin, I switched it to be visible only to SchedMD employees, is that OK for you?

cheers,
Marcin

Yes, that's fine, thank you!

I applied the patch yesterday and this morning the slurmctld daemon crashed again, so we now have a 55G core dump. What would you like to see next?

Justin,

Can you try executing the following commands:
gdb /path/to/slurmctld /path/to/core
(gdb) thread apply all bt

cheers,
Marcin

Created attachment 11178 [details]
thread apply all bt
Here's the output requested. I think I did the gdb logging to file correctly but let me know if this looks incomplete. Thanks.
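For reference, a sketch of one way such a backtrace can be captured to a file non-interactively (paths are illustrative):

# batch mode: run the command against the core and write everything to a file
gdb -batch -ex "thread apply all bt" /usr/sbin/slurmctld /var/tmp/core.slurmctld > bt_all.txt 2>&1

# or interactively, using gdb's logging facility
gdb /usr/sbin/slurmctld /var/tmp/core.slurmctld
(gdb) set logging file bt_all.txt
(gdb) set logging on
(gdb) thread apply all bt
(gdb) set logging off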
Thanks, is it possible to get the slurmctld logs as well?

cheers,
Marcin

Created attachment 11181 [details]
slurmctld log from crash at 20190809 ~8:30AM
This is the log from midnight last night up until this morning's crash at ~8:30AM.
Justin,
Could you please enable Route debugflag by:
>scontrol setdebugflag +Route
cheers,
Marcin
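For reference, the same flag can also be set persistently so it survives slurmctld restarts, as comes up later in the thread; a sketch:

# at runtime (takes effect immediately, lost on restart):
scontrol setdebugflag +Route

# in slurm.conf (read at startup, so it survives restarts):
DebugFlags=Route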
We might wait to do that until Monday, given that SLURM has been very sluggish since the upgrade and we're afraid it might become even more unstable over the weekend with debug mode enabled...

Justin,

Looking at your topology.conf and the backtrace, I've found that the majority of threads are coming from communication with nodes connected to the switch called holyhdr. Could you please check if there are some issues with these connections? (I don't have any evidence of it, but the coincidence looks suspicious.)

BTW, you have quite a large MessageTimeout=100 configured. What was the reason for that? In this case, with a lower MessageTimeout I'd expect some of those threads to time out, preventing the failure. I don't mean it's the root cause; we're continuing the investigation.

Could you please load the coredump one more time and execute the attached python script by:
gdb ./slurmctld ./corefile
(gdb) source 7532_gdb.py
and share the results with us? I'll attach 7532_gdb.py to the case in a minute.

cheers,
Marcin

Created attachment 11190 [details]
7532_gdb.py
python script extracting some data from the core dump
Created attachment 11191 [details]
7532_gdb.py
small fix in python gdb script
I've turned on the Route debug flag. I noticed that it isn't in the slurm.conf option list - is that new or undocumented? Will it survive restarts of the scheduler?

As for topology, the nodes on holyhdr you are seeing are the new cluster that we are working to stand up. We preemptively added those to the conf so that they would come in as they were built. The slowness and crashes occurred before we added those on Wednesday (pretty much from when we did the slurm upgrade). We typically have various nodes down on the cluster at any given time, so the scheduler getting hung up on down nodes is definitely not optimal. It should be resilient enough to withstand large portions of the cluster going out at any given time. Anyway, if those nodes are causing noise that is making it hard to debug, let me know and I can drop them from the conf until we have them built. We can also try turning topology off completely and see if that helps stability.

Justin is working on getting the other info you asked for and should have that back soon.

-Paul Edmon-

Created attachment 11193 [details]
7532_gdb.py output
Hi Marcin,
Here's the output you requested from the core dump file using the latest version of the Python script you sent. Let us know if you need anything else. Thanks!
~Justin
Paul,

Thanks for the prompt reply. Yes - I agree that down nodes shouldn't be an issue; however, if it's possible to remove them just for debugging purposes, please do that.

The Route debug flag set from the command line will not survive a restart (as with all debug flags set that way). However, if we get the slurmctld log from the crash we may not need it anymore. You're correct that it's not documented; the patch for the documentation is already on the way to release.

cheers,
Marcin

Cool. I will add that to the slurm.conf for now just to make sure it is there, and I will drop the new nodes from the conf until they are ready.

-Paul Edmon-

Okay, those nodes are out of the slurm.conf and topology.

-Paul Edmon-

Paul,
Justin,
Did the removal of the down nodes improve Slurm responsiveness? Did you notice another fatal error?
I've tried to further understand the issue, and while I agree that down nodes are normally not an issue, in your case the situation is special. The number of added nodes is high - 670 nodes. Slurmctld triggers REQUEST_NODE_REGISTRATION_STATUS for those every SlurmdTimeout/3 seconds and awaits a reply for MessageTimeout seconds. In your case both are equal to 100. Most often those requests are forwarded by slurmds; however, when the forwarding node is down we trigger parallel queries to the non-responding nodes, which results in a lot of threads in your case.
Specifically, to cover the case of nodes "in preparation" you can set the "FUTURE" state for those nodes in slurm.conf:
>FUTURE  Indicates the node is defined for future use and need not exist when the Slurm daemons are started. These nodes can be made available for use simply by updating the node state using the scontrol command rather than restarting the slurmctld daemon. After these nodes are made available, change their State in the slurm.conf file. Until these nodes are made available, they will not be seen using any Slurm commands nor will any attempt be made to contact them.
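A minimal sketch of what that could look like in slurm.conf; the node names and hardware values here are hypothetical placeholders, not taken from the actual site configuration:

# nodes that are not built yet: defined, but not contacted until activated
NodeName=holyhdr[001-010] CPUs=48 RealMemory=190000 State=FUTURE

# later, once a node is physically ready, bring it in without restarting slurmctld,
# e.g. (exact target state per the slurm.conf/scontrol docs for your version):
scontrol update NodeName=holyhdr001 State=RESUME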
I'm still looking into the case, concentrating now on identifying the resource causing the "Resource temporarily unavailable" error. Could you please help me by answering the questions below:
What's the time between slurmctld crashes? Do you suspect any coincidence with other tools, time of day, or user activity?
Can you share the slurmctld.log with the Route debug flag enabled?
What is the content of /proc/SLURMCTLD_PID/limits?
What's the result of:
#ps auxH | grep slurmctld | grep -v grep | wc -l
#ls /proc/SLURMCTLD_PID/fdinfo/ | wc -l
#cat /proc/meminfo
#cat /proc/SLURMCTLD_PID/statm
Are you able to monitor those over time?
cheers,
Marcin
With the removal of those nodes Slurm hasn't been more responsive; it's still sluggish to respond to scontrol. We have a diamond script that regularly polls for node state. In the previous generation (18.08.x) we never had any issues and things were very responsive. However, we are now seeing scontrols stacking up from our diamond collectors:

[root@holy-slurm02 ~]# ps aux | grep scontrol
root      1949  0.0  0.0  53652  5916 ?  S  03:06  0:00 scontrol -o show node
root      9738  0.0  0.0  53560  5928 ?  S  Aug12  0:00 scontrol -o show node
root     10580  0.0  0.0  53648  5912 ?  S  03:08  0:00 scontrol -o show node
root     12742  0.0  0.0  53648  5908 ?  S  09:40  0:00 scontrol -o show node
root     13226  0.0  0.0  53560  5928 ?  S  Aug12  0:00 scontrol -o show node
root     34973  0.0  0.0  53560  5936 ?  S  Aug12  0:00 scontrol -o show node
root     38236  0.0  0.0  53652  5928 ?  S  01:01  0:00 scontrol -o show node
root     40510  0.0  0.0  53648  5904 ?  S  03:15  0:00 scontrol -o show node
root     44737  0.0  0.0  53648  5912 ?  S  03:16  0:00 scontrol -o show node
root     45990  0.0  0.0  53652  5912 ?  S  01:03  0:00 scontrol -o show node
root     49529  0.0  0.0  53652  5920 ?  S  01:04  0:00 scontrol -o show node
root     56725  0.0  0.0  53648  5916 ?  S  03:19  0:00 scontrol -o show node
root     58075  0.0  0.0  53652  5928 ?  S  01:48  0:00 scontrol -o show node
[root@holy-slurm02 ~]# ps aux | grep scontrol | wc -l
44

Diamond polls about every 60 seconds, so scontrol isn't returning within that window. This is from the master itself, which should be the fastest place to query. We have users from across the cluster that also poll the master, so if diamond can't keep up everyone else will be even slower.

We haven't seen a crash since yesterday, but I have deliberately restarted the master periodically to ward off crashes. From previous experience crashes happen after about 12 hours. If you wish to see another crash I can let the scheduler just run without restarting, but it may be half a day before it does and it may crash at an inopportune time (i.e. in the middle of the night).

As it stands I don't have a good idea as to the trigger of the crash. If I were to hazard a guess, given that I'm seeing the build-up of scontrols here, it may be a build-up of requests or a burst of requests that is causing the issue. Given that the slowness appears to be in the responsiveness to scontrol, that may be a hint as to the cause. That said, that is all speculation, and our historic thread count (we have graphs of sdiag that we keep) indicates no spikes in agent queue size or thread count before it crashes. So it may either be sudden, or a slow burn and accumulation of cruft that then causes the crash. My hunch is the latter, given that we never saw this behavior in 18.08, which was far more responsive. The significant changes we made were the upgrade to 19.05 and turning on topology. So it is either changes made by the 19.05 upgrade or something in topology that is not agreeing with our setup. We have not attempted to just turn off topology and see if that improves matters. That may give a hint if we did.

As mentioned previously, our slurmctld logs are massive. Do you have a time window you want to look at? I can get you what we have of that window.
As for the other things:

[root@holy-slurm02 diamond]# cat /proc/75722/limits
Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            unlimited     unlimited     bytes
Max core file size        unlimited     unlimited     bytes
Max resident set          unlimited     unlimited     bytes
Max processes             1030065       1030065       processes
Max open files            16384         16384         files
Max locked memory         65536         65536         bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       1030065       1030065       signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us
[root@holy-slurm02 diamond]# ps auxH | grep slurmctld | grep -v grep | wc -l
56
[root@holy-slurm02 diamond]# ls /proc/75722/fdinfo/ | wc -l
1049
[root@holy-slurm02 diamond]# cat /proc/meminfo
MemTotal:       263721992 kB
MemFree:        10807008 kB
MemAvailable:   201588824 kB
Buffers:        3404 kB
Cached:         189931848 kB
SwapCached:     380228 kB
Active:         138173532 kB
Inactive:       107306896 kB
Active(anon):   53674020 kB
Inactive(anon): 4863440 kB
Active(file):   84499512 kB
Inactive(file): 102443456 kB
Unevictable:    0 kB
Mlocked:        0 kB
SwapTotal:      4194300 kB
SwapFree:       1324152 kB
Dirty:          104 kB
Writeback:      0 kB
AnonPages:      55165232 kB
Mapped:         105256 kB
Shmem:          2992272 kB
Slab:           5099216 kB
SReclaimable:   4722928 kB
SUnreclaim:     376288 kB
KernelStack:    21184 kB
PageTables:     179824 kB
NFS_Unstable:   0 kB
Bounce:         0 kB
WritebackTmp:   0 kB
CommitLimit:    136055296 kB
Committed_AS:   107473392 kB
VmallocTotal:   34359738367 kB
VmallocUsed:    814764 kB
VmallocChunk:   34224633408 kB
HardwareCorrupted: 0 kB
AnonHugePages:  8017920 kB
CmaTotal:       0 kB
CmaFree:        0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize:   2048 kB
DirectMap4k:    551632 kB
DirectMap2M:    14950400 kB
DirectMap1G:    254803968 kB
[root@holy-slurm02 diamond]# cat /proc/75722/statm
6852587 2804962 1437 237 0 6823036 0

We don't currently have any monitoring set up for those over time. As noted, we do dump out the stats from sdiag and track those over time. We could try setting up monitoring for load and RAM usage over time, but more detailed data would be trickier and take a bit of work to get done.

In summary, if I were to guess where the issue lies, there is likely some hang-up in the communications stack in 19.05 with topology compared to 18.08 without. Now what I don't know is if it is purely with topology or not, but something in those two changes is the culprit. What I also don't know is if the crash that is occurring is due to the slowness in the communication stack or not; they may in fact be independent problems. My hunch is that they aren't, but it's hard to say.

-Paul Edmon-

We just did an experiment with turning off topology and select/cons_tres (reverting it to select/cons_res) and saw no change in responsiveness. So the slowness at least is due to the 19.05 upgrade in general and not topology. That doesn't rule out topology causing the crash, but the slowness of response for scontrol and other queries is a general feature of 19.05 that wasn't in 18.08. In addition, we ran 18.08 on the previous settings with no issue. So something happened in 19.05 that is bogging things down.

Paul,
Thanks for the test with the reverted configuration. I didn't expect that route/topology or select/cons_tres was the reason, but it's confirmed now.
Instead of periodically restarting slurmctld you can use the systemd "Restart" feature, simply adding to the slurmctld.service definition:
>Restart=on-failure
followed by a systemctl daemon-reload command. This should restart slurmctld when it fails.
cheers,
Marcin
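A sketch of that change done as a systemd drop-in, so the packaged unit file stays untouched (the drop-in path is illustrative):

# /etc/systemd/system/slurmctld.service.d/restart.conf
[Service]
Restart=on-failure
RestartSec=5

# then reload systemd so the override takes effect
systemctl daemon-reload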
We have puppet set to auto-restart slurmctld if it finds it failed. Though sometimes when it fails it just wedges on restart, and we have to restart the whole stack (slurmdbd and mariadb), as it appears that some of the slurmdbd queries get hung up and you need to clean house completely. Still, we will look at the auto-restart option too. I will leave the scheduler alone to fail again so we can get a fresh core and a new log roll.

-Paul Edmon-

I was wondering what your best guess right now is as to the pthread fault? As established, it isn't just our instance, as other people have filed tickets about this and it appears to be specific to the 19.05 series. I'm happy to put any other instrumentation into our scheduler needed to trace this back; I was just curious as to the current theory and/or guess.

I have two additional questions:
Can you share a fresh sdiag with us?
Can you get a coredump of your running slurmctld (using gcore [1]) and apply 7532_gdb.py to it as previously (comment 24)?

From the previous one, it's quite uncommon that you have so many REQUEST_BUILD_INFO RPCs. Do you have any automation periodically issuing something like scontrol show config or sprio? If yes, can you disable it or reduce the frequency? (Just to test.)

We're not sure whether the other issues are related. In your case, I'd like to understand why you see so many RPCs being processed slowly. It doesn't look like the issue (or it's low probability) from the previous sdiag, but as a blind guess can you change your SchedulerParameters, setting bf_interval=60 and adding bf_yield_sleep=800000?

cheers,
Marcin
[1] http://man7.org/linux/man-pages/man1/gcore.1.html

Created attachment 11212 [details]
sdiag
Created attachment 11213 [details]
Output of 7532_gdb.py with a core dump generated from live scheduler.
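For reference, a sketch of how such a core of the running slurmctld can be captured with gcore and fed to the script, per the request above (paths are illustrative):

# dump a core of the live daemon (gcore attaches briefly); writes /var/tmp/slurmctld.core.<pid>
gcore -o /var/tmp/slurmctld.core $(pidof slurmctld)

# then run the analysis script against it
gdb /usr/sbin/slurmctld /var/tmp/slurmctld.core.$(pidof slurmctld)
(gdb) source 7532_gdb.py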
I just put in the outputs of sdiag and 7532_gdb.py on a current core dump from the live scheduler. As I noted in my previous email, we have diamond scripts polling node state and sdiag every 60 seconds for information. You can see the scripts here: https://github.com/fasrc/slurm-diamond-collector

I'm not keen on turning these off, as they are important to us getting insight into how the scheduler is doing. Plus they have worked fine without causing issues with slurm for many years. If we need to cut down on noise so we can debug we can do so, but we can't leave them off for long periods as we need the stats.

I will add in the bf parameters you recommend and let you know if I see any difference.

-Paul Edmon-

Looks like that change did not improve responsiveness to scontrol or other queries. It's still slow. I'm reverting back to our old config.

We should focus back on the pthread issue, as that is really the main problem. Having the scheduler crash periodically is clearly a major problem. The slowness is annoying but still not as urgent as the pthread issue. I think we have gotten off track looking at the slowness, unless we really think that solving the slowness will solve the crashing issue. I'm happy to have both solved, but the crash is more urgent.

-Paul Edmon-

Thanks for the gdb output. I'll attach a new version of the script for gdb. As always, please execute it and share the results. We're not sure if the slowness is related to the out-of-resources issue you faced, but it's definitely a regression.

cheers,
Marcin

Created attachment 11214 [details]
7532_gdb.py (v3)
Created attachment 11216 [details]
Updated gdb_py run with the v3 script.
Added a new version of the log generated with the v3 version of the script.

Did you notice any failure? If yes, please use the attached python script as before.

If it didn't happen, can you get a new coredump using gcore, maybe repeat it after a few seconds, and run gdb against all of the coredumps?

I'll attach the updated python script in a minute.

cheers,
Marcin

Created attachment 11226 [details]
7532_gdb.py (v4)
Created attachment 11227 [details]
7532_gdb.py (v5)
Yes, we had 2 failures overnight. I will run the script as well as get you a slice of the logs leading up to the crash.

-Paul Edmon-

Created attachment 11228 [details]
Contains info about crash at 08-14-19 at 2:30am
7532_gdb_py_v5-081419T0230.log contains the log from the v5 gdb script on the core from the 2:30am crash.
messages-081419T02.txt contains a slice of the logs from 2 am when the crash occurred.
Let me know if you want any other info.
Paul - would you also send us the output of:

systemctl status slurmctld

We are curious to know what the tasks limit is set to by systemd.

Tasks: [value] (limit: [value])

[root@holy-slurm02 ~]# systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/slurmctld.service.d
           └─50-ulimit.conf
   Active: active (running) since Wed 2019-08-14 13:31:04 EDT; 42min ago
  Process: 161777 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 161779 (slurmctld)
    Tasks: 122
   Memory: 8.8G
   CGroup: /system.slice/slurmctld.service
           ├─106776 [slurmctld]
           ├─106787 n/a
           └─161779 /usr/sbin/slurmctld

Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza04 tree_width 0
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza01 tree_width 0
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza04 tree_width 0
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza01 tree_width 0
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: _slurm_rpc_epilog_complete: node_name = holy2a06205, JobId=18535615
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza01 tree_width 0
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza04 tree_width 0
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza04 tree_width 0
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza04 tree_width 0
Aug 14 14:13:12 holy-slurm02.rc.fas.harvard.edu slurmctld[161779]: ROUTE: split_hostlist: hl=holydsouza04 tree_width 0

*** Ticket 7571 has been marked as a duplicate of this ticket. ***
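If the systemd tasks accounting turned out to be the exhausted resource behind the pthread_create failures - which has not been established at this point in the thread - a hedged sketch of raising it would be a drop-in like this (path and value are illustrative):

# /etc/systemd/system/slurmctld.service.d/tasksmax.conf
[Service]
TasksMax=infinity

# apply the override
systemctl daemon-reload
systemctl restart slurmctld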
Paul,

You mentioned that you had two failures last night. Could you please attach the logs and backtrace for the 2nd failure? I'd like to compare them, since the previous one is very different (probably due to down nodes making a lot of "noise").
Can you create a simple cron job with a script like the one below, executed every minute:
>echo =====
>date
>echo threads
>ps auxH | grep slurmctld | grep -v grep | wc -l
>echo fdinfo
>ls /proc/SLURMCTLD_PID/fdinfo/ | wc -l
>echo meminfo
>cat /proc/meminfo
>echo statm
>cat /proc/SLURMCTLD_PID/statm
appending the output to a file? This may help us figure out which resource is exhausted.
cheers,
Marcin
Sadly I blew away the other one from last night, as we were running short on space and we had a bunch of cores lying around. Fortunately(?) we had a crash this afternoon. I will get you the info on that one.

-Paul Edmon-

Created attachment 11235 [details]
Contains info about crash at 08-14-19 at 13:26
7532_gdb_py_v5-081419T1326.log contains the log from the v5 gdb script on the core from the 13:26 crash.
messages-081419T13.txt contains a slice of the logs from 1 pm when the crash occurred.
I will throw together a script to put in cron to pull that info.

Okay, I have that running now. At the next crash I will include that data.

-Paul Edmon-

Created attachment 11244 [details]
Contains info about crash at 08-14-19 at 19:03
7532_gdb_py_v5-081419T1903.log contains the log from the v5 gdb script on the core from the 19:03 crash.
messages-081419T18-19.txt contains a slice of the logs from 6-7 pm when the crash occurred.
data-161779.log contains the info requested about memory usage, threads, etc. at 1 minute intervals leading up to and after the crash.
The latest crash report contains the additional info that was requested to be monitored regarding memory and thread usage. We are continuing to log that data, so we will have it for future crashes as well.

Paul,

The data you shared is in fact very interesting. It shows that just before the crash, MemFree drops rapidly (attached MemFree_Slurmctld.png); however, the slurmctld VmSize looks stable. I analyzed the MemInfo changes carefully, noticing that during the MemFree drop there is increased WriteBack activity, which means that the PageCache is being written back to a storage device. This may indicate that the OS granted memory to a process "in advance" and started to make space by reducing caches (attached ServerMemory.png).

Please make sure that the monitoring script we're using performs a dynamic slurmctld PID lookup - otherwise you'll have to update the script after every restart. You can use a snippet like:
>SLURMCTLDPID=$(ps aux | grep slurmctld$ | grep -v grep | awk '{print $2}')
>if [ "x$SLURMCTLDPID" == "x" ]
>then
>  exit 1
>fi

Could you please add top memory user lines to the cron script:
>echo "Top VmSize"
>ps aux --sort=vsz | head -20

cheers,
Marcin

Created attachment 11251 [details]
MemFree_Slurmctld.png
Created attachment 11252 [details]
ServerMemory.png
Justin has reworked the hacky script I threw together to do that, and we are adding the VmSize monitoring. At the next crash we will send you what we have.

-Paul Edmon-

Created attachment 11254 [details]
Data for period of 08/14 21:22 to 08/15 10:58
Doesn't contain the vsz but does track the slurmctld pid. Known crash at 02:03. We've restarted the script now with vsz tracking and will send results along with next crash.
Created attachment 11257 [details]
Contains info about crash at 08-15-19 at 12:41
7532_gdb_py_v5-081519T1241.log contains the log from the v5 gdb script on the core from the 12:41 crash.
messages-081519T12.txt contains a slice of the logs from 12 pm when the crash occurred.
data3-081519T1251.log contains the info requested about memory usage, vsz, threads, etc. at 1 minute intervals leading up to and after the crash.
Just added new crash info from the latest crash; it contains the additional info requested regarding vsz.

Paul,

My apologies. I made a mistake in the ps command used for vsz. It sorts in ascending order, so the top entries show the least memory-consuming processes. Please change it to:
>ps aux --sort=-vsz | head -20
The '-' sign will make it really work as we need.

Please use a simpler and more reliable construct to get the pid of slurmctld:
>SLURMCTLDPID=$(pidof slurmctld)
The one I suggested earlier was over-engineered.

cheers,
Marcin

I've updated the data script with that modification.

-Paul Edmon-

Could you please also add a line for RSS? Finally, the memory monitoring should look like:
>ps aux --sort=-vsz | head -20
>ps aux --sort=-rss | head -20
cheers,
Marcin
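Putting the pieces requested above together, a sketch of what the complete per-minute monitoring script might look like (the output path is illustrative):

#!/bin/bash
# run from cron every minute; appends one snapshot per run
OUT=/var/log/slurmctld-monitor.log
SLURMCTLDPID=$(pidof slurmctld)
[ -z "$SLURMCTLDPID" ] && exit 1
{
  echo =====
  date
  echo threads
  ps auxH | grep slurmctld | grep -v grep | wc -l
  echo fdinfo
  ls /proc/$SLURMCTLDPID/fdinfo/ | wc -l
  echo meminfo
  cat /proc/meminfo
  echo statm
  cat /proc/$SLURMCTLDPID/statm
  echo "Top VmSize"
  ps aux --sort=-vsz | head -20
  echo "Top RSS"
  ps aux --sort=-rss | head -20
} >> "$OUT"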
Yup, I've added that as well. We will send in the log when we have another crash.

-Paul Edmon-

Thanks Paul,

Can you share with us your slurmctld prolog?

cheers,
Marcin

Created attachment 11263 [details]
Slurmctld prolog
Paul,

Could you please attach your slurmdbd.conf and Epilog script?

cheers,
Marcin

Created attachment 11268 [details]
slurmdbd
Created attachment 11269 [details]
slurm_epilog
Did you notice any failure or was the situation more stable today?

cheers,
Marcin

We've had a couple of failures, but that was before we put in the new metrics that we are polling for. It's been about 5 hours or so, and we seem to be failing every 6, so I'm expecting to see one soon. When it happens I will throw the logs in as I have been.

-Paul Edmon-

Paul - We are wondering if you might be able to disable PrologSlurmctld and run for a while to rule out a possible cause that we have been discussing. There was a change between 18.08 and 19.05 where we changed from detached threads to using track_scripts. Doing this will help us narrow down a possible cause.

Created attachment 11271 [details]
Contains info about crash at 08-16-19 at 16:51
7532_gdb_py_v5-081619T1651.log contains the log from the v5 gdb script on the core from the 16:51 crash.
messages-081619T16.txt contains a slice of the logs from 4 pm when the crash occurred.
data-081619T1732.log contains the info requested about memory usage, vsz, threads, etc. at 1 minute intervals leading up to and after the crash
Sure. Will do. I've also uploaded the last crash data.

-Paul Edmon-

It looks like this change stabilized the scheduler. We had no crashes overnight, and it had been typically crashing every 6 hours or so. I'm going to keep an eye on it over the weekend, but this looks promising.

-Paul Edmon-

Paul,

Thanks for the good news. Does no news since yesterday mean that you didn't have any crashes of slurmctld with PrologSlurmctld disabled?

cheers,
Marcin

It's been rock solid since we made the change. No failures. I think we found the problem.

-Paul Edmon-

Hi Paul,

We have committed a patch for this issue into 19.05.

https://github.com/SchedMD/slurm/commit/a04eea2e03af4181b518e28e775b414ffd18e46f

We are fairly certain this resolves the issue; however, would you be willing to test this out by applying the patch and re-enabling PrologSlurmctld?

Sure. I can try this out tomorrow.

Also, as a request: the reason we ended up with that slurmctld prolog script is so that we can store copies of users' job submission scripts for longer than the run of the job, so we can do after-the-fact reconstruction of what they did. It really isn't ideal to do it this way. Our preference would be an option for slurm to automatically ship the script to the slurmdbd as an additional field for the job. If the size ends up being too much for the database, perhaps some size limit for the script and some way of archiving the scripts would be feasible, as slurmdbd already supports archiving certain things on different time scales.
I know that this option would be of great service to us and the community at large, as users are typically not savvy enough to send us the scripts they used when they have problems, or perhaps don't even remember what they submitted. Thanks.

-Paul Edmon-

Okay, this patch is live on our system and our prolog is running. I will let you know if I see any issues.

-Paul Edmon-

I ran it with that patch for a day with the ctld prolog on and it was stable. Looks like that did the trick.

-Paul Edmon-

Paul,

As I understand it, you didn't notice any failures after applying the patch. The fix is already merged into the 19.05 branch and will be available in 19.05.3. I'll close the ticket now as resolved.

You should remove the debug flags we introduced and re-enable the route/topology plugin - I think it makes sense in your setup. For the new nodes we recommend use of the FUTURE state as mentioned in comment 32. If you're still experiencing other issues (slowness) or would like to request a feature (storing job scripts in the accounting database), the best way would be to address those in separate reports.

I'm closing this now as fixed. If you notice similar instability, please re-open.

cheers,
Marcin

*** Ticket 7528 has been marked as a duplicate of this ticket. ***