Description
Azat
2019-07-05 02:33:17 MDT
Hi Azat,

Will you attach slurmctld.log? Thanks. Can you also show the output of:

$ systemctl status slurmctld
$ ulimit -a
$ cat /proc/sys/kernel/threads-max

(In reply to Alejandro Sanchez from comment #1)
> Hi Azat,
>
> Will you attach slurmctld.log? thanks.

Hi Alex,

It is quite huge. But there are no new errors which we didn't have previously, and no specific errors preceding the failure of the daemon. There are tons of errors like

> [2019-07-05T11:36:36.396] error: select/cons_res: node gwdd041 memory is under-allocated (22400-41600) for JobId=709079
> [2019-07-05T11:36:36.396] error: select/cons_res: node gwdd069 memory is under-allocated (51200-64000) for JobId=709079
> [2019-07-05T11:36:36.396] error: select/cons_res: node gwdd071 memory is under-allocated (51200-64000) for JobId=709079

Are you looking for specific errors? Here is the output of the commands:

> gwdu104:4 15:38:46 ~ # systemctl status slurmctld
> ● slurmctld.service - Slurm controller daemon
>    Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
>    Active: active (running) since Fri 2019-07-05 09:53:14 CEST; 5h 45min ago
>   Process: 27433 ExecStart=/opt/slurm/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>  Main PID: 27435 (slurmctld)
>    CGroup: /system.slice/slurmctld.service
>            └─27435 /opt/slurm/sbin/slurmctld
>
> Jul 05 09:53:14 gwdu104 systemd[1]: Starting Slurm controller daemon...
> Jul 05 09:53:14 gwdu104 systemd[1]: PID file /var/run/slurmctld.pid not readable (yet?) after start.
> Jul 05 09:53:14 gwdu104 systemd[1]: Started Slurm controller daemon.

> gwdu104:4 15:38:57 ~ # ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 385661
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 65536
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) unlimited
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 385661
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited

> gwdu104:4 15:41:10 ~ # cat /proc/sys/kernel/threads-max
> 771323

Thank you!

(In reply to Azat from comment #3)
> (In reply to Alejandro Sanchez from comment #1)
> > Hi Azat,
> >
> > Will you attach slurmctld.log? thanks.
>
> Hi Alex,
>
> It is quite huge.

If your logs are quite large, remember you can logrotate them (there's an example in the slurm.conf man page at the bottom). Take into account this recent improvement to that example: https://github.com/SchedMD/slurm/commit/1e3548f387c6f6b

> But there are no new errors which we didn't have
> previously. And no specific errors preceding the failure of the daemon.
> There are tons of errors like
>
> > [2019-07-05T11:36:36.396] error: select/cons_res: node gwdd041 memory is under-allocated (22400-41600) for JobId=709079
> > [2019-07-05T11:36:36.396] error: select/cons_res: node gwdd069 memory is under-allocated (51200-64000) for JobId=709079
> > [2019-07-05T11:36:36.396] error: select/cons_res: node gwdd071 memory is under-allocated (51200-64000) for JobId=709079

There's a patch pending review in bug 6769 where we are tracking this issue.

> Are you looking for specific errors?

Not particularly, just to get a feel of what's going on.
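For reference, the logrotate approach mentioned a few paragraphs up might look roughly like the sketch below. This is not the exact slurm.conf man page example: the log path, rotation policy, and file name are assumptions, and the postrotate signal follows the convention of sending SIGUSR2 so slurmctld reopens its log file (the point of the commit linked above); check the man page for the recommended stanza.

# Hypothetical /etc/logrotate.d/slurmctld -- a minimal sketch, adapt paths to your site
cat > /etc/logrotate.d/slurmctld <<'EOF'
/var/log/slurm/slurmctld.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    postrotate
        pkill -x --signal SIGUSR2 slurmctld
    endscript
}
EOF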
> Here is the output of the commands:
> > gwdu104:4 15:38:46 ~ # systemctl status slurmctld

Sorry, I was more interested in:

$ systemctl show slurmctld.service | egrep "TasksMax|LimitNPROC"

(In reply to Alejandro Sanchez from comment #4)
> If your logs are quite large, remember you can logrotate them (there's an
> example in slurm.conf man page at the bottom). Take into account this recent

Thank you, we will look into that possibility.

> There's a patch pending review in bug 6769 where we are tracking this issue.

That would probably reduce the logs a lot :)

> Sorry I was more interested in:
>
> $ systemctl show slurmctld.service | egrep "TasksMax|LimitNPROC"

Here is the output:

> gwdu104:6 16:18:02 ~ # systemctl show slurmctld.service | egrep "TasksMax|LimitNPROC"
> LimitNPROC=385661

But once again, all the limits are the same and the workload did not change after the update. Thank you again for the help!

It is weird that TasksMax isn't in the output of the previous command. Can you try?

$ systemctl show -p TasksMax slurmctld

Thanks.

(In reply to Alejandro Sanchez from comment #6)
> It is weird that TasksMax isn't in the output of the previous command. Can
> you try?
>
> $ systemctl show -p TasksMax slurmctld

Yes, it is empty indeed.

> gwdu104:2 10:22:54 ~ # systemctl show -p TasksMax slurmctld
> gwdu104:2 10:25:04 ~ #

Can you verify if you have:

LimitNOFILE=65536
TasksMax=infinity

in your slurmctld.service, and if not, add those and restart it?

(In reply to Alejandro Sanchez from comment #8)
> Can you verify if you have:
>
> LimitNOFILE=65536
> TasksMax=infinity
>
> in your slurmctld.service and if not add those and restart it?

We are using the service file provided by the build process of Slurm. Its content is:

> [Unit]
> Description=Slurm controller daemon
> After=network.target munge.service
> ConditionPathExists=/opt/slurm/etc/slurm.conf
>
> [Service]
> Type=forking
> EnvironmentFile=-/etc/sysconfig/slurmctld
> ExecStart=/opt/slurm/slurm/19.05.0/install/sbin/slurmctld $SLURMCTLD_OPTIONS
> ExecReload=/bin/kill -HUP $MAINPID
> PIDFile=/var/run/slurmctld.pid
> LimitNOFILE=65536
>
> [Install]
> WantedBy=multi-user.target

We would not like to set TasksMax to infinity: if there is some problem with Slurm related to the number of processes it creates, it could crash the head nodes of the cluster it runs on, and we would like to avoid that. We already have a fairly high limit on the number of processes set in /proc/SLURMCTLDPID/limits:

> Max processes             385661

Should we first try setting a higher limit instead? For instance:

> TasksMax=600000

That's interesting. I'll provide you with more background: the TasksMax setting was added[1] to unit files in systemd version 227. That's why the Slurm configure script uses the auxdir/x_ac_systemd.m4 macro[2] to check whether the systemd version is >= 227 and, if so, substitute SYSTEMD_TASKSMAX_OPTION[3] with TasksMax=infinity[4] at build time.

What version of systemd do you have installed in your environment? If "infinity" isn't a good value for you, the one you suggested should be fine, although depending on your systemd version the setting may not be supported in unit files yet.
[1] https://github.com/systemd/systemd/blob/master/NEWS#L4082
[2] https://github.com/SchedMD/slurm/blob/slurm-19-05-0-1/auxdir/x_ac_systemd.m4#L31
[3] alex@polaris:~/slurm$ grep -i TasksMax source/etc/slurmctld.service.in
    @SYSTEMD_TASKSMAX_OPTION@
[4] alex@polaris:~/slurm$ grep -i tasks 19.05/build/etc/slurmctld.service
    TasksMax=infinity

(In reply to Alejandro Sanchez from comment #10)

Thank you for the exhaustive information, it clarifies a lot. The systemd version on our nodes is 219:

> gwdu104:4 11:28:47 ~ # systemctl --version
> systemd 219
> ...

So that means we don't need to set TasksMax, right? What else can we do in order to find the root cause of the issue?

Is there any chance to upgrade the systemd version? Otherwise, we'd need to investigate how systemd handled this prior to the availability of the TasksMax setting.

(In reply to Alejandro Sanchez from comment #12)
> Is there any chance to upgrade systemd version?

Unfortunately it is the latest version of systemd for the Scientific Linux we are running, so we cannot easily update it.

> Otherwise, we'd need to investigate how systemd handled this prior to
> TasksMax setting availability.

But it did work with the previous version of Slurm.

The fact that it did work for your site while you were running 18.08 isn't conclusive. The problem might also occur if other resource limits are being hit (like memory or file descriptors). For instance, we had a bug from another site running 18.08 where they also reported this problem. They noticed an increased number of mmaps in /proc/<slurmctld pid>/smaps. In their case, they had their own jobcomp plugin and it wasn't properly free_buf'ing some mmap'd memory. After fixing this, they stopped seeing the error. I don't think that's your case, since it's unlikely you run your own plugin, but it wouldn't hurt attaching the output of:

$ cat /proc/$(pgrep slurmctld)/smaps

and monitoring whether slurmctld memory grows abnormally.

In the other bug they also tried increasing:

/etc/security/limits.conf nofile to 1048576
/proc/sys/kernel/threads-max to 3091639
/proc/sys/kernel/pid_max to 4194303

Could you try all this and see if it helps? AFAIC, this has just happened once since upgrading so far, am I right?

Created attachment 10857 [details]
smaps
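A sketch of how the three increases suggested above could be applied, using the values from that comment. The sysctl.d file name is an example, the wildcard domain in limits.conf is an assumption (a slurm-user-specific entry works too), and slurmctld must be restarted for the nofile change to take effect.

# Raise the kernel-wide thread and pid limits at runtime
sysctl -w kernel.threads-max=3091639
sysctl -w kernel.pid_max=4194303

# Persist them across reboots (file name is an example)
cat > /etc/sysctl.d/90-slurmctld-limits.conf <<'EOF'
kernel.threads-max = 3091639
kernel.pid_max = 4194303
EOF

# Raise the open-files limit for login sessions; note that when slurmctld is
# started by systemd, the unit's LimitNOFILE= takes precedence over limits.conf
cat >> /etc/security/limits.conf <<'EOF'
*    soft    nofile    1048576
*    hard    nofile    1048576
EOF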
(In reply to Alejandro Sanchez from comment #14)
> The fact it did work for your site while you were running 18.08 isn't
> conclusive.

That's true, I just wanted to point out that the environment for both versions was the same, and with 18.08 we never got errors like that (we never hit the limits).

> but it wouldn't hurt attaching the output of:
>
> $ cat /proc/$(pgrep slurmctld)/smaps

Attached it.

> and monitoring if slurmctld memory grows abnormally.

Didn't notice anything suspicious for a couple of hours... it stays around 22.5G; maybe we should look at it for a bit longer.

> In the other bug they also tried increasing:
>
> /etc/security/limits.conf nofile to 1048576
> /proc/sys/kernel/threads-max to 3091639
> /proc/sys/kernel/pid_max to 4194303
>
> Could you try all this and see if it helps?

We can try to increase some limits, yes.

> AFAIC, this has just happened once since upgrading so far, am I right?

That's the problem, it happens periodically, every 2-5 days so far.

(In reply to Azat from comment #16)
> > and monitoring if slurmctld memory grows abnormally.
> Didn't notice anything suspicious for a couple of hours... it stays around
> 22.5G, maybe should look at it for a bit longer.

Just to clarify, this is the virtual memory... the resident set size doesn't change much and is ~2G.

Ok. Let's do the limits change and monitor the memory growth for a few days.

Hi, how are things going so far?

(In reply to Alejandro Sanchez from comment #19)
> Hi,
>
> how are things going so far?

Dear Alex,

I have now increased the limits and we will see how the daemon behaves. I also looked into our monitoring system, and the measurements of memory consumption (~2.5 GB) and process count (~2000) are in the normal range when the daemon crashes... so it is probably not because of that, but something else. We also had another crash a few days ago, so the crashes are consistent.

Have you noticed anything relevant on your monitoring?

When Slurm added[1] support for AIX systems back in ~2004, code was put in place to explicitly set the stack size attribute for all pthreads to 1024*1024 bytes (1M). AIX is no longer a supported platform and I'm wondering if this is a factor contributing to the issue reported here. Maybe reducing the stack size attribute for all pthreads can help:

diff --git a/src/common/macros.h b/src/common/macros.h
index aa8b48040e..d35ed10fb5 100644
--- a/src/common/macros.h
+++ b/src/common/macros.h
@@ -287,7 +287,7 @@
 		errno = err;					\
 		error("pthread_attr_setscope: %m");		\
 	}							\
-	err = pthread_attr_setstacksize(attr, 1024*1024);	\
+	err = pthread_attr_setstacksize(attr, 256*1024);	\
 	if (err) {						\
 		errno = err;					\
 		error("pthread_attr_setstacksize: %m");		\

This is kind of a blind guess, since I've not been able to get slurmctld to fatal due to EAGAIN when creating threads. Note also that changing the macro will impact thread creation in all Slurm components and daemons, not just slurmctld.

On the other hand, you reported the stack size limit as per ulimit:

stack size (kbytes, -s) unlimited

but there's also the corresponding systemd limit, LimitSTACK. Can you show me the output of

$ systemctl show slurmctld.service | grep LimitSTACK

and

$ grep -A 1 stack /proc/$(pgrep slurmctld)/smaps

[1] https://github.com/SchedMD/slurm/commit/55e62ab46

(In reply to Alejandro Sanchez from comment #22)

Hi Alex,

Unfortunately, increasing the limits didn't help. We have encountered 2 crashes since I changed the limits, which means the crashes occur more or less with the same frequency.
> When Slurm added[1] support for AIX systems back in ~2004, code was put in
> place to explicitly set the stack size attribute for all pthreads to
> 1024*1024 bytes (1M).

I would like to emphasize again that the environment where Slurm is running didn't change at all; we only updated the Slurm version. Whatever is causing the crashes now most likely was not causing crashes in the previous version.

> Can you show me the output of
>
> $ systemctl show slurmctld.service | grep LimitSTACK
>
> and
>
> $ grep -A 1 stack /proc/$(pgrep slurmctld)/smaps

Sure:

> gwdu105:8 14:15:39 ~ # systemctl show slurmctld.service | grep LimitSTACK
> LimitSTACK=18446744073709551615
> gwdu105:8 14:16:56 ~ # grep -A 1 stack /proc/$(pgrep slurmctld)/smaps
> 7fff61428000-7fff61528000 rw-p 00000000 00:00 0 [stack:371201]
> Size: 1024 kB
> --
> 7fff61d31000-7fff61e31000 rw-p 00000000 00:00 0 [stack:370716]
> Size: 1024 kB
> --
> 7ffff0b2b000-7ffff0c2b000 rw-p 00000000 00:00 0 [stack:349656]
> Size: 1024 kB
> --
> 7ffff0d2d000-7ffff0e2d000 rw-p 00000000 00:00 0 [stack:349654]
> Size: 1024 kB
> --
> 7ffff0e2e000-7ffff0f2e000 rw-p 00000000 00:00 0 [stack:349653]
> Size: 1024 kB
> --
> 7ffff0f2f000-7ffff102f000 rw-p 00000000 00:00 0 [stack:349652]
> Size: 1024 kB
> --
> 7ffff1434000-7ffff1534000 rw-p 00000000 00:00 0 [stack:349651]
> Size: 1024 kB
> --
> 7ffff1740000-7ffff1840000 rw-p 00000000 00:00 0 [stack:349526]
> Size: 1024 kB
> --
> 7ffff1841000-7ffff1941000 rw-p 00000000 00:00 0 [stack:349525]
> Size: 1024 kB
> --
> 7ffff1942000-7ffff1a42000 rw-p 00000000 00:00 0 [stack:349497]
> Size: 1024 kB
> --
> 7ffff24fa000-7ffff25fa000 rw-p 00000000 00:00 0 [stack:349303]
> Size: 1024 kB
> --
> 7ffff3416000-7ffff3516000 rw-p 00000000 00:00 0 [stack:329563]
> Size: 1024 kB
> --
> 7ffff3726000-7ffff3826000 rw-p 00000000 00:00 0 [stack:329562]
> Size: 1024 kB
> --
> 7ffff7edd000-7ffff7fdd000 rw-p 00000000 00:00 0 [stack:329560]
> Size: 1024 kB
> --
> 7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0 [stack]
> Size: 132 kB

Hi Azat,

Alex is out until the end of next week. I will be looking at your bug in the meantime.

While I am reading through your bug, I am missing the full output of:

cat /proc/27435/limits

Can you show me the full output?

Ninety percent of the time this error is due to some limit in the system. How many open files do you have in the system right now? And how many from the Slurm user?

You say your environment hasn't changed, but does that mean you haven't upgraded the kernel or any other package?

Another idea that we could try in the meantime, if the issue bothers you a lot, is to run slurmctld outside systemd, just to rule things out. You just need to stop slurmctld and start it from the console as SlurmUser.

I'll continue looking into your issue.

(In reply to Felip Moll from comment #24)

Hi Felip,

> I will be looking at your bug in the meantime.

Thank you!

> While I am reading through your bug, I miss the full output of:
> cat /proc/27435/limits
> Can you show me the full output?
Sure:

> gwdu105:10 14:15:29 ~ # cat /proc/$(pgrep slurmctld)/limits
> Limit                     Soft Limit           Hard Limit           Units
> Max cpu time              unlimited            unlimited            seconds
> Max file size             unlimited            unlimited            bytes
> Max data size             unlimited            unlimited            bytes
> Max stack size            unlimited            unlimited            bytes
> Max core file size        unlimited            unlimited            bytes
> Max resident set          unlimited            unlimited            bytes
> Max processes             385661               385661               processes
> Max open files            65536                65536                files
> Max locked memory         65536                65536                bytes
> Max address space         unlimited            unlimited            bytes
> Max file locks            unlimited            unlimited            locks
> Max pending signals       385661               385661               signals
> Max msgqueue size         819200               819200               bytes
> Max nice priority         0                    0
> Max realtime priority     0                    0
> Max realtime timeout      unlimited            unlimited            us

> This error happens 90% due to some limits in the system. How many open files
> do you have in the system right now? And how many from the Slurm user?

Here are the lsof line counts:

> gwdu105:10 14:14:54 ~ # lsof | wc -l
> 77380
> gwdu105:10 14:15:12 ~ # lsof | grep slurm | wc -l
> 5891
> gwdu105:10 14:27:01 ~ # lsof -p $(pgrep slurmctld) | wc -l
> 68

> You say your environment haven't changed, but, does it mean you haven't
> upgraded the kernel, or any other package?

Yes, it means we haven't updated the kernel, haven't installed/updated/removed any other packages on the system, and haven't changed any system configuration either. Since those are head nodes, we do that very selectively and on rare occasions.

> Another idea that we could try in the meantime and if it bothers a lot, is
> to run slurmctld outside systemd, just to discard things. You just need to
> stop slurmctld and start it from the console as SlurmUser.

The issue actually bothers us a lot, since the slurmctld daemons crash periodically and we have to restart them manually. We will try to start them without systemd the next time it crashes.

Is there a way to get more verbose output from the slurmctld daemon about why it failed with the pthread_create error? Frankly, it doesn't look like a limits issue. Or which limits could cause the error, given that the number of open files, processes, and CPU/memory usage all seem to be fine?

> I'll continue looking into your issue.

Thanks!

> Is there a way to get more verbose output from the slurmctld daemon why it
> failed with pthread_create error?

The error you're seeing is generated from macros.h:

#define slurm_thread_create(id, func, arg)				\
do {									\
	pthread_attr_t attr;						\
	int err;							\
	slurm_attr_init(&attr);						\
	err = pthread_create(id, &attr, func, arg);			\
	if (err) {							\
		errno = err;						\
		fatal("%s: pthread_create error %m", __func__);		\
	}								\
	slurm_attr_destroy(&attr);					\
} while (0)

As you can see, the function that generates the error is pthread_create(), which is external to Slurm. This means that the error doesn't come directly from Slurm but from the system, ** which means that there's some system configuration that prevents the creation of new threads, or some resource which is exhausted. **

To create a thread we need to be able to open new files. Moreover, each thread needs space in the stack to allocate its data structures, so open files and stack size are key. Then there are possible limits on the maximum number of tasks a pid can create; in newer systemds this is controlled by TasksMax, but in yours it already seems fine (per the last command you showed). The kernel can also globally limit the number of open files, maximum processes, or maximum threads in the system, but we don't seem to be hitting any of these limits either.

-- All this info is what we already know. --

We are missing something here; how often does this error reproduce?
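To make the enumeration above concrete, here is a minimal sketch of a one-shot snapshot of the knobs that usually gate pthread_create() for a running slurmctld; command choices mirror what is used elsewhere in this ticket, and the exact selection of keys is an assumption.

# One-shot snapshot of the limits relevant to thread creation for slurmctld
pid=$(pgrep -x slurmctld)

# Per-process limits (open files, processes, stack)
grep -E 'Max (open files|processes|stack size)' /proc/$pid/limits

# System-wide ceilings
cat /proc/sys/kernel/threads-max /proc/sys/kernel/pid_max /proc/sys/fs/file-max

# Current thread count and number of memory mappings of slurmctld
ps --no-headers -L -p "$pid" | wc -l
wc -l < /proc/$pid/maps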
It would be interesting to capture the number of open files and threads in the system, held by slurmctld and by the slurm user, at the exact moment the error is happening.

What is your exact kernel version? It's unlikely, but a bug existed in the kernel (https://bugzilla.kernel.org/show_bug.cgi?id=154011) which sporadically created this issue.

I'd like to know why this mismatch:

> gwdu105:10 14:15:12 ~ # lsof | grep slurm | wc -l
> 5891
> gwdu105:10 14:27:01 ~ # lsof -p $(pgrep slurmctld) | wc -l
> 68

What are these 5891-68 files opened?

I also would like to see your system logs at the exact time the error happened, to see if any other component is complaining.

There's also kernel.pid_max, which can affect the creation of new threads. What's your value (sysctl kernel.pid_max)? In fact, seeing a 'sysctl -a' would be useful.

You mentioned systemd 219-42, but what's your exact version (systemctl --version)? RHEL (and therefore SL) have backported the fix into 219-4: https://access.redhat.com/errata/RHBA-2017:2297

Finally, I'd like you to check:

]$ cat /proc/$(pgrep slurmctld)/cgroup
....
5:pids:/user.slice/user-1000.slice/session-2.scope
....

Then find the pids.max in the subdirectory, e.g.:

]$ find /sys/fs/cgroup/pids/user.slice/user-1000.slice -name pids.max -exec echo -n '{}: ' \; -exec cat '{}' \;

https://www.kernel.org/doc/Documentation/cgroup-v1/pids.txt

If it is too low, feel free to increase the value manually to test whether it crashes again.

If none of these limits applies, then your server must really be having some other problem, and the system logs should give us a hint. Another matter will then be to see why Slurm is trying to create so many threads (if it really does); sdiag may help there.

Sorry to insist on limits and for asking so many questions, but all of these are the main reasons why pthread_create() fails.

Thanks

(In reply to Felip Moll from comment #26)
> The error you're seeing is generated from macros.h:
>
> ...
> err = pthread_create(id, &attr, func, arg);
> if (err) {
> 	errno = err;
> 	fatal("%s: pthread_create error %m", __func__);
> ...
>
> As you can see, the function that generates the error is pthread_create()
> which is external to Slurm. This means that the error doesn't come directly
> from Slurm but from the system, ** which means that there's some system
> configuration that avoids the creation of new threads or some resource which
> is exhausted. **

That makes sense.

> We are missing something here, how often do this error reproduce? It would
> be interesting to just capture the nr of open files, threads in the system
> by the slurmctld and slurm user at the exact moment the error is happening.

We have monitoring running on the node with a 10 s interval. The total number of threads is always in an adequate range, 800-2000. Can you please tell me what command would be enough to track the number of open files of the slurm user? Would "lsof -p $(pgrep slurmctld) | wc -l" be sufficient? I can add it to the monitoring.

> What is your exact kernel version? It's unlikely but a bug existed in the
> kernel https://bugzilla.kernel.org/show_bug.cgi?id=154011 which sporadically
> created this issue.

We have an older kernel version:

> gwdu105:10 10:33:01 ~ # uname -r
> 3.10.0-514.26.2.el7.x86_64

> I'd like to know why this mismatch:
> > gwdu105:10 14:15:12 ~ # lsof | grep slurm | wc -l
> > 5891
> > gwdu105:10 14:27:01 ~ # lsof -p $(pgrep slurmctld) | wc -l
> > 68
>
> What are these 5891-68 files opened?
It is because "lsof | grep slurm" returns every fd once per thread, whereas the second command lists each fd only once, since the fds are shared with the main process (roughly, that means there are about 5891/68 threads across the slurm processes; a few other fds belong to other processes and merely contain "slurm").

> I also would like to see your system logs at the exact time when the error
> happened to see if any other component is complaining.

Yes, but there is nothing preceding the failure:

> Aug 02 01:30:22 gwdu105 conmand[5375]: Console [gwdu103] connected to <10.109.49.203>
> Aug 02 01:30:33 gwdu105 CROND[739662]: pam_ldap(crond:session): error opening connection to nslcd: No such file or directory
> Aug 02 01:31:31 gwdu105 slurmctld[329559]: fatal: _start_msg_tree_internal: pthread_create error Resource temporarily unavailable
> Aug 02 01:31:32 gwdu105 systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE

The crash happened at 01:31:31. The messages above it happen periodically and have nothing to do with slurmctld.

> There's even the kernel.pid_max which can affect the creation of new
> threads. What's your value (sysctl kernel.pid_max)? In fact seeing a 'sysctl
> -a' would be useful.

I will attach the output of it.

> You mentioned systemd 219-42, but, what's your exact version (systemctl
> --version)?
> RHEL (and therefore SL) have backported the fix into 219-4:
> https://access.redhat.com/errata/RHBA-2017:2297

It seems we have an older version:

> gwdu105:10 11:16:04 ~ # rpm -q systemd
> systemd-219-30.el7_3.3.x86_64

> Finally, I'd like you to check:
>
> ]$ cat /proc/$(pgrep slurmctld)/cgroup

We don't have any cgroups configured for slurmctld:

> gwdu105:10 11:32:14 ~ # cat /proc/$(pgrep slurmctld)/cgroup
> 11:pids:/
> 10:perf_event:/
> 9:cpuset:/
> 8:freezer:/
> 7:memory:/
> 6:hugetlb:/
> 5:devices:/
> 4:cpuacct,cpu:/
> 3:blkio:/
> 2:net_prio,net_cls:/
> 1:name=systemd:/system.slice/slurmctld.service

> If none of this limits applies then it must be that your server is really
> having some problem, then system logs should give us some hint.
> Another matter will be then to see why Slurm is trying to open many threads
> (if it really does), an sdiag then may help.

It doesn't seem that Slurm needs many threads to run, and the limits are high enough on our system.

> Sorry to insist on limits and for making so many questions, but all of these
> are the main reasons why pthread_create() fails.

Not a problem, this crash error still confuses me and I hope we can figure out its root cause :)

Created attachment 11100 [details]
sysctl -a output
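When scanning a full `sysctl -a` dump like the one attached, the handful of keys that can matter for thread creation could be pulled out with something along these lines. The file name is a placeholder, and including vm.max_map_count here is a judgement call: each new thread stack adds a memory mapping, so a process can also run out of mappings rather than pids or files.

# Extract the thread/pid/file/mapping related knobs from a saved `sysctl -a` dump
grep -E '^(kernel\.(threads-max|pid_max)|fs\.(file-max|nr_open)|vm\.max_map_count)' sysctl-a-output.txt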
Thank you. It seems nothing is wrong in your setup... I will think about other options to investigate. If you see anything strange on the server, it would be valuable to comment on it here. Remember that if it crashes again you can try to run it manually from the command line instead of from systemd, which would remove one component from the diagnosis.

> We have a monitoring running on the node with 10s interval. The total number of threads is always in the adequate range 800-2000.
> Can you please tell me what command would be enough to track the number of open files of the slurm user?
> "lsof -p $(pgrep slurmctld) | wc -l" will it be sufficient? I can add it to the monitoring.

I think showing the number of open files held by the slurmctld process is what we want, so your command should suffice.

P.S.: As you've seen, I made two typos/mistakes in that comment:

> > You mentioned systemd 219-42, but, what's your exact version (systemctl
> > --version)? RHEL (and therefore SL) have backported the fix into 219-4:

Should've been: 'you mentioned systemd 219', and 'backported the fix into 219-42'. You have 219-30, so you should not have TasksMax in systemd yet, which is 'good'.

Created attachment 11121 [details]
bug7360_diagnostics_v0.patch

Hi Azat,

I am attaching a small patch here that replaces the 'fatal' with a 'fatal_abort' in slurm_thread_create. With this patch I aim to get a bit more information about which threads were created and existed in the slurmctld instance when it fails again. Are you able to apply this patch and recompile Slurm? If so, when it fails again you'll get a coredump; open it with gdb, run the following commands, and attach the output here:

info threads
thread apply all bt full

Thanks

Created attachment 11122 [details]
bug7360_diagnostics_v1.patch

This patch version also covers failures during the creation of detached threads. Please apply this one if possible.

Created attachment 11138 [details]
slurmctld #fd monitoring

(In reply to Felip Moll from comment #29)
> Thank you.
> It seems nothing is wrong in all your setup.. I will think about other
> options to investigate. If you see any strange thing in the server it could
> be valuable to comment it here.

Thank you. We didn't notice anything suspicious about our servers recently.

> Remember that if it crashes again you can try to run it manually from
> command line instead of from systemd which would remove one component in the
> diagnose.

Yes, it crashed again and we started it without systemd... let's see what happens.

> > We have a monitoring running on the node with 10s interval. The total number of threads is always in the adequate range 800-2000.
> > Can you please tell me what command would be enough to track the number of open files of the slurm user?
> > "lsof -p $(pgrep slurmctld) | wc -l" will it be sufficient? I can add it to the monitoring.
>
> I think showing the nr of open files by slurmctld process is what we want,
> so your command should suffice.

We monitored the number of files for a couple of days before it crashed; here are some results (see the attachment):

- the average #fds is 70
- sometimes it goes up to almost 400
- before the crash, #fds is ~400 too

You can see in the attachment the monitoring data (every 10 sec) before the crash, which happened around 20:50.

> Are you able to apply this patch and recompile slurm?

Since we are currently trying to run slurmctld without systemd, we postponed patching Slurm. If it is not because of systemd, we will apply the patch and get a coredump.
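When that coredump does appear, the gdb session suggested with the diagnostics patch would look roughly like this; the core file path is a placeholder, and the binary path is taken from the systemctl output earlier in the ticket.

# Load the core file against the matching slurmctld binary (paths are examples)
gdb /opt/slurm/sbin/slurmctld /var/spool/slurmctld/core.329559

# Then, at the (gdb) prompt, collect the requested information and attach it:
#   (gdb) info threads
#   (gdb) thread apply all bt full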
> you can see in the attachment the monitoring data (every 10 sec) before the
> crash that happened around 20:50.

I find it interesting that before the crash the thread count was simply increasing.

> > Are you able to apply this patch and recompile slurm?

We have other customers with this issue, and the info from one of them seems interesting. It would be nice, if it happens again, to get the backtrace from you too to confirm some suspicions.

> Since we are trying currently to run slurmctld without systemd, we postponed
> patching Slurm. If it is not because of systemd, we will apply the patch and
> get a coredump.

Keep me posted if it fails again; I'd like to know about it as soon as it happens.

Two more questions:
1. I need your recent slurm.conf
2. I need a:

free -m

on your slurmctld host.

By the way, Alex is available again, so he may be responding to this bug from now on too.

Created attachment 11189 [details]
slurmctld #fd and #threads monitoring

(In reply to Felip Moll from comment #34)
> I see this intersting that before the crash it was just increasing threads.

Yes, you are right. I have attached another graph with the total number of threads in the system included on the right axis. As you can see, when the number of open files increases, the number of threads increases as well. (Ignore the high peaks in the number of threads; they are caused by other software running on the host and are completely normal.)

> We've other customers with this issue, and the info of one of them seems to
> be interesting. It would be nice, if it happens again, to get the backtrace
> from you too to confirm some suspicions.

Ok, we will apply the patch; we are still waiting for the next crash after having started the daemons without systemd.

> Two more questions:
> 1. I need your recent slurm.conf

You can find it in #7423, or I can attach it here too if needed.

> 2. I need a:
>
> free -m
>
> in your slurmctld host.

Here is the output:

> gwdu105:0 10:44:41 ~ # free -m
>               total        used        free      shared  buff/cache   available
> Mem:          96500       12758       10474        3304       73267       79041
> Swap:         15624        2589       13035

Hi. Any updates after starting ctld without systemd? Thanks.

(In reply to Alejandro Sanchez from comment #36)
> Hi. Any updates after starting ctld without systemd? thanks.

Hi Alex!

Yes, the daemon crashed without systemd too. Today we patched the sources and recompiled slurmctld. The patched version is now running without systemd, and after the next crash I will provide the output of gdb on the coredump.

Ok, thanks for your feedback. Could you show the output of:

$ cat /proc/$(pidof slurmctld)/limits

And, if possible, execute a script over time that monitors these metrics and appends each execution's output to a file?

#!/bin/bash
echo ctld_thread_count=$(ps --no-headers -p $(pidof slurmctld) -L | wc -l)
echo ctld_opened_files=$(lsof -p $(pidof slurmctld) | wc -l)
echo all_opened_files=$(lsof | wc -l)
echo ctld_maps=$(cat /proc/$(pidof slurmctld)/maps | wc -l)
echo file-nr=$(cat /proc/sys/fs/file-nr)
echo inode-nr=$(cat /proc/sys/fs/inode-nr)
echo nr_open=$(cat /proc/sys/fs/nr_open)
echo "== status =="
cat /proc/$(pidof slurmctld)/status
echo "== ipcs =="
ipcs -p $(pidof slurmctld)
echo "== free =="
free -m
echo "== vsz =="
ps aux --sort=-vsz | head -20
echo "== rsz =="
ps aux --sort=-rss | head -20

(In reply to Alejandro Sanchez from comment #39)
> Ok, thanks for your feedback.
> Could you show the output of:
>
> $ cat /proc/$(pidof slurmctld)/limits

Sure:

> gwdu105:0 13:55:30 ~ # cat /proc/$(pidof slurmctld)/limits
> Limit                     Soft Limit           Hard Limit           Units
> Max cpu time              unlimited            unlimited            seconds
> Max file size             unlimited            unlimited            bytes
> Max data size             unlimited            unlimited            bytes
> Max stack size            unlimited            unlimited            bytes
> Max core file size        unlimited            unlimited            bytes
> Max resident set          unlimited            unlimited            bytes
> Max processes             385661               385661               processes
> Max open files            265536               265536               files
> Max locked memory         unlimited            unlimited            bytes
> Max address space         unlimited            unlimited            bytes
> Max file locks            unlimited            unlimited            locks
> Max pending signals       385661               385661               signals
> Max msgqueue size         819200               819200               bytes
> Max nice priority         0                    0
> Max realtime priority     0                    0
> Max realtime timeout      unlimited            unlimited            us

> And if possible execute a script over time that will monitor these metrics
> and append executions output to a file?

Ok, we are now collecting the output of the script every 30 seconds. Once slurmctld crashes, I will attach the concatenated output.

Hi,

We've been able to reproduce this issue. As a temporary workaround, we suggest disabling the SlurmctldProlog and the EpilogSlurmctld to prevent more fatals from happening. In the meantime, we're studying a fix for this. We suspect a change in the 19.05 logic where a thread changed from detached to joinable and was never joined, leaking anonymous pages that would eventually leave no resources for subsequent pthread_create calls to succeed. But for now, we still need to verify this is the actual root cause of the problem. Thanks.

(In reply to Alejandro Sanchez from comment #41)
> Hi,
>
> We've been able to reproduce this issue. As a temporal workaround, we
> suggest disabling the SlurmctldProlog and the EpilogSlurmctld

The correct option name is PrologSlurmctld ...

Hi, this has been fixed in the following commit, which will be available in 19.05.3:

https://github.com/SchedMD/slurm/commit/a04eea2e03af418

I'd suggest upgrading to 19.05.2, which includes this other fix:

https://github.com/SchedMD/slurm/commit/d1863b963cb1bac

and applying a04eea2e03af418 on top until .3 is released. With that applied, you should be able to re-enable any [Prolog|Epilog]Slurmctld, and the created threads should be properly cleaned up and stop leaking anonymous pages. Please let us know how it goes so we can close this, or ask any further questions otherwise. Thanks for your collaboration.

Created attachment 11296 [details]
slurmctld debug
Hi Alex,
Thank you for the info and patches. We have now compiled Slurm 19.05.2 and applied the a04eea2e03af418 patch. We will let you know if it crashes; if not, the ticket can be closed after two weeks.
Meanwhile, the previous Slurm version with the core-dump patch crashed; the output from the script you provided (I have added dates to make navigation easier) and the gdb output are attached here. The crash happened on 20.08.2019 at 03:32:20 +/- 10 seconds.
PS: Please mark the attachment as private, since it contains the output of the ps tool, which might show usernames if we have run commands with usernames on the node.
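For anyone following the same upgrade path, applying the single referenced commit on top of an unpacked 19.05.2 source tree might look roughly like this; the directory layout and rebuild steps are assumptions, so adapt them to your packaging.

# Fetch the referenced commit from GitHub as a patch and apply it to the tree
cd slurm-19.05.2
curl -L https://github.com/SchedMD/slurm/commit/a04eea2e03af418.patch | patch -p1

# Rebuild and reinstall slurmctld afterwards, e.g.:
#   ./configure --prefix=/opt/slurm && make -j && make install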
Hi, ok, thanks for the feedback. I think there's no need to apply the fatal_abort patch nor to keep running the monitoring script anymore. I've marked the attachment as private (you can also do so from your side for future attachments if needed, just FYI). Let's wait a couple of weeks and see if the ctld gets stable with the fix. I'm also lowering the severity for now. Please let us know if anything else comes up. Thanks.

Hi. It's been a couple of weeks already with the patch applied. Have you seen any more problems, or can we close this out? Thanks.

(In reply to Alejandro Sanchez from comment #59)
> Hi. It's been a couple of weeks already with the patch applied. Have you
> seen any more problems or can we close this out? thanks.

Hi Alex,

Since we applied the patch, everything has been working smoothly. We can close this as resolved. Thank you once again for the help!

All right, thanks for your feedback and cooperation.

*** Ticket 7705 has been marked as a duplicate of this ticket. ***