| Summary: | Slurmd stops working - unable to create cgroup | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | GSK-ONYX-SLURM <slurm-support> |
| Component: | slurmd | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | alex |
| Version: | 17.02.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | GSK | Version Fixed: | ? |
| Attachments: | Control daemon log, Slurm daemon log, Slurm configuration, Sdiag information | | |
Description
GSK-ONYX-SLURM 2018-05-03 08:59:29 MDT

Hi, I believe this is a duplicate of bug 5082. There is a known kernel bug that causes this. What is your kernel version?

On bug 5082, I'm also investigating more deeply to see if there's something else that may cause the job cgroups to not get cleaned up. I don't think the logging is related at all - there have been several reports of this and none have mentioned logging. Can you upload the last part of the slurmd log file? Can you check to see if you filled up your filesystem?

[root@uk1salx00553 slurm]# cat /proc/version
Linux version 3.10.0-693.11.6.el7.x86_64 (mockbuild@x86-041.build.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Thu Dec 28 14:23:39 EST 2017
[root@uk1salx00553 slurm]#
[root@uk1salx00553 slurm]# uname -r
3.10.0-693.11.6.el7.x86_64
[root@uk1salx00553 slurm]#
[root@uk1salx00553 slurm]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)
[root@uk1salx00553 slurm]#

Alright, so that kernel version is relatively new - we know certain older kernel versions had this bug, but aren't sure if/when RedHat added the fixes in. Can you upload the relevant slurmd log file (the one that stopped logging) and check your filesystem space? Also, Alex said that when he did training at your site he suggested setting up logrotate. Have you done that? Maybe that's what you were seeing?

See attached files for the latest incident:

Control daemon log since last restart on Tue 2018-04-24 08:04:40 BST
Slurmd daemon log since last restart on Thu 2018-04-26 08:10:40 BST

See the slurmd log at:

[2018-05-03T13:44:19.018] [157205] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-03T16:51:59.325] error: accept: Bad address
[2018-05-03T17:27:03.990] error: Munge decode failed: Expired credential

17:27 BST is when the queue was resumed. I also have another example of this from another server as well.

Created attachment 6767 [details]
Control daemon log.
Created attachment 6768 [details]
Slurm daemon log.
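On the logrotate suggestion a couple of comments up, a minimal sketch of a rule for the slurmd log. The path /var/log/slurm/slurmd.log and the rotation schedule are assumptions; match them to SlurmdLogFile in slurm.conf.

```bash
# Hypothetical /etc/logrotate.d/slurm - keeps the slurmd log from growing unbounded.
cat > /etc/logrotate.d/slurm <<'EOF'
/var/log/slurm/slurmd.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
    # copytruncate lets slurmd keep writing to the same file descriptor, so no
    # restart or signal is needed (a few lines may be lost at rotation time).
    copytruncate
}
EOF
```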
I'm noticing a very large amount of authentication errors in the slurmd. For example:

[2018-05-03T17:27:04.139] error: slurm_receive_msg_and_forward: Protocol authentication error
[2018-05-03T17:27:04.139] ENCODED: Thu May 03 13:51:33 2018
[2018-05-03T17:27:04.139] DECODED: Thu May 03 17:27:04 2018
[2018-05-03T17:27:04.139] error: authentication: Expired credential

I also see this error a few times:

[2018-05-04T04:16:51.781] error: accept: Bad address

Can you check the following:
- Validate the munge credentials using munge tools on the controller and compute nodes.
- Are the clocks in sync?
- Have you updated Slurm recently?
- Have you changed your slurm.conf at all recently? If so, then what changed? Did you just issue a scontrol reconfigure or did you restart the various daemons after the change?

I think the cgroup error is unrelated. However, to assist with that problem, can you also upload the output of lscgroup on the afflicted compute node?

All those munge errors are produced when the slurmd daemon "wakes up" after it's hung and we've resumed it. I'm assuming they were in-transit comms that have to be discarded because they are no longer valid. Maybe this is where jobs were queued to uk1salx00553 but then requeued (we see the requeue action in the logs).

To answer your other questions:

Munge appears to be working correctly.

Yes, the clocks are in sync. We use ntp in our networks.

SLURM version is 17.02.7. That's what we installed originally. We haven't yet upgraded to 17.11.5 as we are still testing that in our test/dev environment.

I changed our slurm.conf today, adding an existing known node to an existing queue, and reconfigured. Any changes to slurm.conf were prior to the last full restarts.

I'm assuming you only want lscgroup for slurm. If you want a full lscgroup then let me know.

uk1salx00553 (The Lion): lscgroup | grep slurm
blkio:/system.slice/slurmd.service
cpu,cpuacct:/system.slice/slurmd.service
freezer:/slurm
freezer:/slurm/uid_62356
freezer:/slurm/uid_62356/job_111280
memory:/slurm
memory:/slurm/system
devices:/system.slice/slurmd.service
devices:/slurm
uk1salx00553 (The Lion):
uk1salx00553 (The Lion): squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
164819 uk_hpc Percolat ll289546 R 1-02:07:54 1 uk1salx00552
167446 uk_hpc Danirixi ll289546 R 29:14 1 uk1salx00552
uk1salx00553 (The Lion):

> All those munge errors are produced when the slurmd daemon "wakes up" after
> it's hung and we've resumed it. I'm assuming they were in-transit comms that
> have to be discarded because they are no longer valid. Maybe this is where
> jobs were queued to uk1salx00553 but then requeued (we see the requeue
> action in the logs).

I agree.

> Munge appears to be working correctly.
>
> Yes, the clocks are in sync. We use ntp in our networks.
>
> SLURM version is 17.02.7. That's what we installed originally. We haven't yet
> upgraded to 17.11.5 as we are still testing that in our test/dev environment.

Thanks - that all sounds like it's working correctly.

> I changed our slurm.conf today, adding an existing known node to an existing
> queue, and reconfigured. Any changes to slurm.conf were prior to the last
> full restarts.

Okay - the change to the partition is just fine with a reconfigure. I just wanted to make sure you hadn't changed a node definition and called reconfigure, since changing a node definition requires a full restart.

> I'm assuming you only want lscgroup for slurm. If you want a full lscgroup
> then let me know.

Yes, just for Slurm. I was looking to see if there were lots of leftover cgroups that hadn't gotten cleaned up properly, since that has happened for others with this problem. However, I only see one job cgroup that hasn't gotten cleaned up. Look at bug 5082 to keep track of progress on this cgroup bug - I've been working on it and have posted updates there.

I also noticed quite a few "Socket timed out" error messages, such as this:

[2018-05-04T04:09:53.977] error: slurm_receive_msgs: Socket timed out on send/recv operation

I'm wondering if there are other issues in addition to the cgroup bug. Have you noticed client commands being unresponsive? Can you also upload a slurm.conf and the output of sdiag?

Sorry for the delay. UK Bank Holiday. I will attach sdiag and slurm.conf.

I came in this morning to find SLURM queues down on both our main HPC servers, uk1salx00553 and uk1salx00552.

uk1salx00552 1 uk_hpc* down* 48 48:1:1 257526 2036 1 (null) Not responding
uk1salx00552 1 uk_columbus_tst down* 48 48:1:1 257526 2036 1 (null) Not responding
uk1salx00553 1 uk_hpc* down* 48 48:1:1 257526 2036 1 (null) Not responding
uk1salx00553 1 uk_columbus_tst down* 48 48:1:1 257526 2036 1 (null) Not responding

Looking at the end of the 552 slurmd log shows:

[2018-05-04T18:19:47.471] [167485] error: task/cgroup: unable to add task[pid=6681] to memory cg '(null)'
[2018-05-04T18:19:47.473] [167485] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.509] [167487] error: task/cgroup: unable to add task[pid=6691] to memory cg '(null)'
[2018-05-04T18:19:47.511] [167487] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.527] [167491] error: task/cgroup: unable to add task[pid=6700] to memory cg '(null)'
[2018-05-04T18:19:47.532] [167491] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.542] [167488] error: task/cgroup: unable to add task[pid=6705] to memory cg '(null)'
[2018-05-04T18:19:47.544] [167488] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.564] [167486] error: task/cgroup: unable to add task[pid=6711] to memory cg '(null)'
[2018-05-04T18:19:47.566] [167486] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.575] [167489] error: task/cgroup: unable to add task[pid=6713] to memory cg '(null)'
[2018-05-04T18:19:47.577] [167489] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-06T21:05:57.891] [164819] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 256

and the log on 553 is similar:

[2018-05-04T18:19:47.500] [167538] error: task/cgroup: unable to add task[pid=11468] to memory cg '(null)'
[2018-05-04T18:19:47.502] [167538] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.550] [167535] error: task/cgroup: unable to add task[pid=11471] to memory cg '(null)'
[2018-05-04T18:19:47.552] [167535] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.568] [167527] error: task/cgroup: unable to add task[pid=11474] to memory cg '(null)'
[2018-05-04T18:19:47.571] [167539] error: task/cgroup: unable to add task[pid=11481] to memory cg '(null)'
[2018-05-04T18:19:47.572] [167539] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.573] [167527] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.578] [167533] error: task/cgroup: unable to add task[pid=11484] to memory cg '(null)'
[2018-05-04T18:19:47.579] [167536] error: task/cgroup: unable to add task[pid=11485] to memory cg '(null)'
[2018-05-04T18:19:47.580] [167533] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.580] [167536] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.584] [167540] error: task/cgroup: unable to add task[pid=11488] to memory cg '(null)'
[2018-05-04T18:19:47.585] [167540] task_p_pre_launch: Using sched_affinity for tasks
[2018-05-04T18:19:47.600] [167537] error: task/cgroup: unable to add task[pid=11489] to memory cg '(null)'
[2018-05-04T18:19:47.601] [167537] task_p_pre_launch: Using sched_affinity for tasks

So we took the decision to turn over the slurmd logs as they were quite big and restart the slurmd service. Bad move. The slurmd service failed on both 552 and 553 with:

[2018-05-08T12:08:16.049] error: plugin_load_from_file: dlopen(/usr/local/slurm/lib64/slurm/proctrack_cgroup.so): /usr/local/slurm/lib64/slurm/proctrack_cgroup.so: cannot read file data: Cannot allocate memory
[2018-05-08T12:08:16.049] error: Couldn't load specified plugin name for proctrack/cgroup: Dlopen of plugin file failed
[2018-05-08T12:08:16.049] error: cannot create proctrack context for proctrack/cgroup
[2018-05-08T12:08:16.049] error: slurmd initialization failed

We eventually had to do a "cgclear" and were then able to get the slurmd service to restart. However we still now have:

[2018-05-08T15:17:57.999] error: xcgroup_instantiate: unable to create cgroup '/sys/fs/cgroup/memory/slurm' : No space left on device
[2018-05-08T15:17:57.999] error: system cgroup: unable to build slurm cgroup for ns memory: No space left on device
[2018-05-08T15:17:58.000] error: Resource spec: unable to initialize system memory cgroup
[2018-05-08T15:17:58.000] error: Resource spec: system cgroup memory limit disabled

occurring in the 553 log file.

Created attachment 6790 [details]
Slurm configuration
Slurm.conf
Created attachment 6791 [details]
Sdiag information
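As a reference for the credential and clock checks requested a few comments above, a minimal sketch, assuming MUNGE's standard CLI tools on both hosts and SSH access from the controller to the compute node (the hostname is one of the nodes in this ticket):

```bash
# Encode and decode a credential locally on the controller.
munge -n | unmunge

# Encode on the controller, decode on the compute node - catches key mismatches
# between hosts.
munge -n | ssh uk1salx00553 unmunge

# Quick clock comparison; large offsets produce "Expired credential" errors
# like the ones shown in the slurmd log above.
date; ssh uk1salx00553 date
ntpq -p   # assumes ntpd is in use; shows peer offsets
```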
I know this is a clunky workaround, but can you try rebooting the node? In every other case that has fixed the problem. The kernel bug known to cause this is basically running out of cgroup id's - after 65535, no more can be created, despite not actually having that many active cgroups. If you're experiencing this kernel bug, I suspect that cat /proc/cgroups will show a lot fewer than 64k cgroups, and that a node reboot will fix the issue.

It would be good to know if we're leaking 64k cgroups, so if you could get the output of cat /proc/cgroups as well as lscgroup (again) on an afflicted node, that would be good. cat /proc/cgroups shows the actual number of cgroups used.

I also have a patch pending review that prevents freezer cgroups from leaking, since we've seen that happen (see bug 5082). Since I personally haven't been able to reproduce the exact error you're seeing, I don't know if that patch will fix this "Unable to add task to memory cg"/"No space left on device" error, but when it's available I'll let you know.

Here's the cgroups info. We'll look at scheduling a reboot... as this is a production server that might take a while to schedule. There's some additional info below as well.

[root@uk1salx00553 slurm]# cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 0 1 1
cpu 0 1 1
cpuacct 0 1 1
memory 15 1 1
devices 17 3 1
freezer 16 2 1
net_cls 0 1 1
blkio 0 1 1
perf_event 0 1 1
hugetlb 0 1 1
pids 0 1 1
net_prio 0 1 1
[root@uk1salx00553 slurm]#
[root@uk1salx00553 slurm]# lscgroup | grep slurm
freezer:/slurm
devices:/slurm
[root@uk1salx00553 slurm]#

[root@uk1salx00552 slurm]# cat /proc/cgroups
#subsys_name hierarchy num_cgroups enabled
cpuset 0 1 1
cpu 0 1 1
cpuacct 0 1 1
memory 13 4 1
devices 15 2 1
freezer 14 2 1
net_cls 0 1 1
blkio 0 1 1
perf_event 0 1 1
hugetlb 0 1 1
pids 0 1 1
net_prio 0 1 1
[root@uk1salx00552 slurm]#
[root@uk1salx00552 slurm]# lscgroup | grep slurm
memory:/slurm
memory:/slurm/uid_0
memory:/slurm/system
freezer:/slurm
devices:/slurm
[root@uk1salx00552 slurm]#

Comparing these two servers, 552 and 553, on 552:

[root@uk1salx00552 slurm]# ls /sys/fs/cgroup/memory/slurm
cgroup.clone_children memory.kmem.max_usage_in_bytes memory.limit_in_bytes memory.numa_stat memory.use_hierarchy
cgroup.event_control memory.kmem.slabinfo memory.max_usage_in_bytes memory.oom_control notify_on_release
cgroup.procs memory.kmem.tcp.failcnt memory.memsw.failcnt memory.pressure_level system
memory.failcnt memory.kmem.tcp.limit_in_bytes memory.memsw.limit_in_bytes memory.soft_limit_in_bytes tasks
memory.force_empty memory.kmem.tcp.max_usage_in_bytes memory.memsw.max_usage_in_bytes memory.stat uid_0
memory.kmem.failcnt memory.kmem.tcp.usage_in_bytes memory.memsw.usage_in_bytes memory.swappiness
memory.kmem.limit_in_bytes memory.kmem.usage_in_bytes memory.move_charge_at_immigrate memory.usage_in_bytes
[root@uk1salx00552 slurm]#

but on 553,

[root@uk1salx00553 slurm]# ls /sys/fs/cgroup/memory/slurm
ls: cannot access /sys/fs/cgroup/memory/slurm: No such file or directory
[root@uk1salx00553 slurm]#

Also, when we restart slurmd there is a slurmstepd reported:

[root@uk1salx00553 slurm]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2018-05-08 17:12:04 BST; 14min ago
  Process: 5216 ExecStart=/usr/local/slurm/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 5218 (slurmd)
   CGroup: /system.slice/slurmd.service
           ├─ 5218 /usr/local/slurm/sbin/slurmd
           └─30720 slurmstepd: [157108]

May 08 17:12:03 uk1salx00553.corpnet2.com systemd[1]: Starting Slurm node daemon...
May 08 17:12:04 uk1salx00553.corpnet2.com systemd[1]: Started Slurm node daemon.
[root@uk1salx00553 slurm]#
[root@uk1salx00553 slurm]# ps -ef | grep slurm
root 5218 1 0 17:12 ? 00:00:00 /usr/local/slurm/sbin/slurmd
root 12715 4921 0 17:27 pts/7 00:00:00 grep slurm
root 30720 1 0 May03 ? 00:00:00 slurmstepd: [157108]
[root@uk1salx00553 slurm]#

but there are no queued jobs:

[root@uk1salx00553 slurm]# /usr/local/slurm/bin/squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[root@uk1salx00553 slurm]#

and jobid 157108 finished days ago...

[root@uk1salx00553 slurm]# /usr/local/slurm/bin/sacct -j 157108
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
153212_201 COL-H004M+ uk_columb+ 1 COMPLETED 0:0
153212_201.+ batch 1 COMPLETED 0:0
[root@uk1salx00553 slurm]# /usr/local/slurm/bin/sacct -j 157108 -o start,end,nodelist
Start End NodeList
------------------- ------------------- ---------------
2018-05-03T13:43:20 2018-05-03T13:44:15 uk1salx00553
2018-05-03T13:43:20 2018-05-03T13:44:15 uk1salx00553

Is that job 157108 stuck somehow and preventing the slurmd process from starting properly? From the original slurmd log it looks like job 157108 was the last successfully scheduled job before we started getting the initial cannot-create-cgroup message at around 13:44 on 3rd May. Should we try killing the slurmstepd process and restarting the daemon to see if that changes anything?

(In reply to GSK-EIS-SLURM from comment #13)
> Bad move. The slurmd service failed on both 552 and 553 with:

Did the slurmd generate a core file? If so, can you get a full backtrace from the slurmd?

thread apply all bt full

(In reply to GSK-EIS-SLURM from comment #17)
> Here's the cgroups info. We'll look at scheduling a reboot... as this is a
> production server that might take a while to schedule.

Yes, that's the problem with a reboot being a workaround.

> There's some additional info below as well.
>
> [root@uk1salx00553 slurm]# cat /proc/cgroups
> #subsys_name hierarchy num_cgroups enabled
> cpuset 0 1 1
> cpu 0 1 1
> cpuacct 0 1 1
> memory 15 1 1
> devices 17 3 1
> freezer 16 2 1
> net_cls 0 1 1
> blkio 0 1 1
> perf_event 0 1 1
> hugetlb 0 1 1
> pids 0 1 1
> net_prio 0 1 1
> [root@uk1salx00553 slurm]#
> [root@uk1salx00553 slurm]# lscgroup | grep slurm
> freezer:/slurm
> devices:/slurm
> [root@uk1salx00553 slurm]#
>
> [root@uk1salx00552 slurm]# cat /proc/cgroups
> #subsys_name hierarchy num_cgroups enabled
> cpuset 0 1 1
> cpu 0 1 1
> cpuacct 0 1 1
> memory 13 4 1
> devices 15 2 1
> freezer 14 2 1
> net_cls 0 1 1
> blkio 0 1 1
> perf_event 0 1 1
> hugetlb 0 1 1
> pids 0 1 1
> net_prio 0 1 1
> [root@uk1salx00552 slurm]#
> [root@uk1salx00552 slurm]# lscgroup | grep slurm
> memory:/slurm
> memory:/slurm/uid_0
> memory:/slurm/system
> freezer:/slurm
> devices:/slurm
> [root@uk1salx00552 slurm]#

That's not very many active cgroups - it should be able to create more cgroups. (Should being the operative word here.) Unfortunately, I guess this is after you used cgclear, so I don't know if Slurm leaked cgroups or not. If this happens on another node, get the output of cat /proc/cgroups on that one, too.
> Comparing these two servers, 552 and 553, on 552:
>
> [root@uk1salx00552 slurm]# ls /sys/fs/cgroup/memory/slurm
> cgroup.clone_children memory.kmem.max_usage_in_bytes memory.limit_in_bytes memory.numa_stat memory.use_hierarchy
> cgroup.event_control memory.kmem.slabinfo memory.max_usage_in_bytes memory.oom_control notify_on_release
> cgroup.procs memory.kmem.tcp.failcnt memory.memsw.failcnt memory.pressure_level system
> memory.failcnt memory.kmem.tcp.limit_in_bytes memory.memsw.limit_in_bytes memory.soft_limit_in_bytes tasks
> memory.force_empty memory.kmem.tcp.max_usage_in_bytes memory.memsw.max_usage_in_bytes memory.stat uid_0
> memory.kmem.failcnt memory.kmem.tcp.usage_in_bytes memory.memsw.usage_in_bytes memory.swappiness
> memory.kmem.limit_in_bytes memory.kmem.usage_in_bytes memory.move_charge_at_immigrate memory.usage_in_bytes
> [root@uk1salx00552 slurm]#
>
> but on 553,
>
> [root@uk1salx00553 slurm]# ls /sys/fs/cgroup/memory/slurm
> ls: cannot access /sys/fs/cgroup/memory/slurm: No such file or directory
> [root@uk1salx00553 slurm]#

Is that because you deleted the cgroups with cgclear? It should get re-created with a restart of the slurmd.

> Also, when we restart slurmd there is a slurmstepd reported
...
> Is that job 157108 stuck somehow and preventing the slurmd process from starting
> properly?

Deadlocked or hanging slurmstepds shouldn't prevent the slurmd from starting. Without seeing the slurmd log file of when that job completed, I can only guess what happened. The job clearly isn't stuck, since its state is COMPLETED. If the job was stuck, it would still be in the slurmctld's state, and you could view it with scontrol show job 157108.

Before you kill it, can you get a backtrace from it?

gdb attach <slurmstepd pid>
thread apply all bt

There is a known bug that causes a slurmstepd to deadlock. That is fixed in bug 5103 and will be in 17.11.6.

[root@uk1salx00553 ~]# /usr/local/slurm/bin/scontrol show job 157108
slurm_load_jobs error: Invalid job id specified
[root@uk1salx00553 ~]#
[root@uk1salx00553 ~]# gdb attach 30720
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
attach: No such file or directory.
Attaching to process 30720
Reading symbols from /usr/local/slurm/sbin/slurmstepd...done.
Reading symbols from /usr/lib64/libhwloc.so.5...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libhwloc.so.5
Reading symbols from /usr/lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libdl.so.2
Reading symbols from /usr/lib64/libpam.so.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpam.so.0
Reading symbols from /usr/lib64/libpam_misc.so.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libpam_misc.so.0
Reading symbols from /usr/lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libutil.so.1
Reading symbols from /usr/lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libgcc_s.so.1
Reading symbols from /usr/lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 30723]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /usr/lib64/libpthread.so.0
Reading symbols from /usr/lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libc.so.6
Reading symbols from /usr/lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libm.so.6
Reading symbols from /usr/lib64/libnuma.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnuma.so.1
Reading symbols from /usr/lib64/libltdl.so.7...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libltdl.so.7
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/libaudit.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libaudit.so.1
Reading symbols from /usr/lib64/libcap-ng.so.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcap-ng.so.0
Reading symbols from /usr/lib64/libnss_compat.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_compat.so.2
Reading symbols from /usr/lib64/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnsl.so.1
Reading symbols from /usr/lib64/libnss_nis.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_nis.so.2
Reading symbols from /usr/lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_files.so.2
Reading symbols from /usr/local/slurm/lib64/slurm/select_cons_res.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/select_cons_res.so
Reading symbols from /usr/local/slurm/lib64/slurm/auth_munge.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/auth_munge.so
Reading symbols from /usr/lib64/libmunge.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libmunge.so.2
Reading symbols from /usr/local/slurm/lib64/slurm/switch_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/switch_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/gres_gpu.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/gres_gpu.so
Reading symbols from /usr/local/slurm/lib64/slurm/core_spec_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/core_spec_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/task_cgroup.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/task_cgroup.so
Reading symbols from /usr/local/slurm/lib64/slurm/task_affinity.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/task_affinity.so
Reading symbols from /usr/local/slurm/lib64/slurm/proctrack_cgroup.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/proctrack_cgroup.so
Reading symbols from /usr/local/slurm/lib64/slurm/checkpoint_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/checkpoint_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/crypto_munge.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/crypto_munge.so
Reading symbols from /usr/local/slurm/lib64/slurm/job_container_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/job_container_none.so
Reading symbols from /usr/local/slurm/lib64/slurm/mpi_none.so...done.
Loaded symbols for /usr/local/slurm/lib64/slurm/mpi_none.so
Reading symbols from /usr/lib64/libnss_dns.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libnss_dns.so.2
Reading symbols from /usr/lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libresolv.so.2
0x00002b35ef5bdf57 in pthread_join () from /usr/lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install slurm-17.02.7-1.el7.x86_64
(gdb) thread apply all bt

Thread 2 (Thread 0x2b35f2567700 (LWP 30723)):
#0 0x00002b35ef8d6eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00002b35ef93860d in _L_lock_27 () from /usr/lib64/libc.so.6
#2 0x00002b35ef9385bd in arena_thread_freeres () from /usr/lib64/libc.so.6
#3 0x00002b35ef938662 in __libc_thread_freeres () from /usr/lib64/libc.so.6
#4 0x00002b35ef5bce38 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00002b35ef8c934d in clone () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x2b35ee74e5c0 (LWP 30720)):
#0 0x00002b35ef5bdf57 in pthread_join () from /usr/lib64/libpthread.so.0
#1 0x000000000042a787 in stepd_cleanup (msg=0xc209f0, job=0xc1fe20, cli=0xc1cab0, self=0x0, rc=0, only_mem=false) at slurmstepd.c:200
#2 0x000000000042a6ce in main (argc=1, argv=0x7fffc2271168) at slurmstepd.c:185
(gdb)

Great - so we know that job isn't stuck since it isn't in the slurmctld state. That slurmstepd backtrace looks exactly like the deadlock bug that is fixed in bug 5103. Feel free to kill -9 that stepd.

Both the uk1salx00552 and uk1salx00553 servers have been rebooted and SLURM has started normally. The missing /sys/fs/cgroup/memory/slurm cgroup for 553 now exists.

Scheduling a couple of test jobs was successful.

Is there any information we should collect at this point in time ready for comparison when the issue returns?

(In reply to GSK-EIS-SLURM from comment #22)
> Both the uk1salx00552 and uk1salx00553 servers have been rebooted and SLURM
> has started normally. The missing /sys/fs/cgroup/memory/slurm cgroup for
> 553 now exists.
>
> Scheduling a couple of test jobs was successful.
>
> Is there any information we should collect at this point in time ready for
> comparison when the issue returns?

I don't think we need anything right now. If you do hit this problem again, run cat /proc/cgroups and see how many cgroups there are - if there aren't 64k, you should still be able to create it. Also run lscgroup to see if Slurm has leaked any cgroups.

See my latest post on bug 5082 - none of the fixes for this "No space left on device" bug when trying to create memory cgroups have been backported to Linux 3.10 - they're all on Linux 4.<something>. One of the sites has opened a ticket with RedHat. I'd keep bugging RedHat to backport the fixes. If it's okay with you, can I close this as a duplicate of bug 5082?
Feel free to CC yourself on that ticket.

Also, I took a look at your slurm.conf and sdiag output. The sdiag output was only for a very short amount of time, so it didn't give me very much information. The little information that it did give showed the system working fine. If you find that your system is slow and need tuning recommendations, feel free to open a new ticket and upload the output of sdiag and your current slurm.conf on that one. One thing to watch out for is the "Socket timed out on send/recv" messages.

Yes, sure, please go ahead and close. I couldn't cc myself on that other ticket. It seemed to want me to complete the Version Fixed field, which it says is mandatory, and I didn't think it appropriate for me to be changing any fields other than adding a cc. So I didn't.

Closing as a duplicate of bug 5082. I've added the following email to the CC list on bug 5082: GSK-EIS-SLURM@gsk.com. Please feel free to comment on that bug.

*** This ticket has been marked as a duplicate of ticket 5082 ***
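Should the problem recur before a kernel fix is available, the data requested above can be captured in one pass. A minimal sketch; the output path and the slurmd log location are assumptions:

```bash
# Snapshot cgroup state on an afflicted node. A low num_cgroups count in
# /proc/cgroups alongside "No space left on device" errors points at the kernel
# cgroup-ID exhaustion bug; a count near 65535 would instead suggest leaked cgroups.
out=/tmp/cgroup-diag-$(hostname)-$(date +%Y%m%dT%H%M%S).txt
{
  uname -r
  cat /proc/cgroups
  lscgroup | grep slurm
  tail -n 200 /var/log/slurm/slurmd.log   # assumed path; see SlurmdLogFile in slurm.conf
} > "$out" 2>&1
echo "diagnostics written to $out"
```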