Summary: Nodes crash while running epilog
Product: Slurm
Component: slurmd
Version: 19.05.3
Hardware: Linux
OS: Linux
Status: RESOLVED TIMEDOUT
Severity: 3 - Medium Impact
Priority: ---
Reporter: Anthony DelSorbo <anthony.delsorbo>
Assignee: Marshall Garey <marshall>
CC: nathan.dauchy, shawn.needham
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8911, https://bugs.schedmd.com/show_bug.cgi?id=8851
Site: NOAA
NOAA Site: NESCC
Attachments:
- Juno Cgroup file
- Juno slurm.conf file
- Node Health Check File
- last week's slurmctld.log file
Description
Anthony DelSorbo
2020-04-09 10:17:34 MDT
Do you know if the node itself is crashing (OS and everything, so you have to reboot the node)? Or is slurmd crashing (so you just need to restart slurmd)? If slurmd is crashing, are there any core files generated? If so, can you run the following commands in gdb and upload the output:

* info threads
* bt full
* t a a bt

In addition, can you upload:

* the slurmd log file from a node that crashed (let me know the node name)
* the epilog script and any output from the epilog script on that node
* slurm.conf for the cluster
* a slurmctld log file covering the time that node crashed (I'll look for the node name in this log file)

Thanks

(In reply to Marshall Garey from comment #1)

Marshall,

Thanks for the quick reply. Although we've been experiencing this issue for a few days, I may have jumped the gun here a bit. It may just be a coincidence that the issue seems to coincide with the Slurm epilog phase. We're collecting more data on our end to rule out other potential culprits. Specifically, we're rolling back a Lustre client update we installed a few days before we started running these tests. I've downgraded this to priority 4 for now and will let you know within the next few days whether the issue recurs.

Best,

Tony.

Sounds good; I'll wait for your reply.

(In reply to Marshall Garey from comment #3)
> Sounds good; I'll wait for your reply.

Thanks for holding on. We've been running thousands of tests to try to pinpoint the root cause. Unfortunately, it's still not clear to us whether this is Slurm-related or not. The only thing I can tell you is that the issue's timing is always either during a job or just as the job is ending. We've also backed out recent image changes to determine whether there's any correlation. Additionally, we've re-engineered our images to force a crash dump at the next occurrence. I'm hopeful we'll be able to get some info from that. I'll keep you posted, possibly as early as tomorrow AM.

Thanks,

Tony.

(In reply to Marshall Garey from comment #5)
> Is this bug related to bug 8851?

Marshall,

No. Different behavior altogether. There are no kernel stack traces in that bug.

Tony.

Hi Marshall,

We were able to re-image our test environment to support kernel crash dumps. I wanted to get you some good data before I came back to you on this. We ran a few scenarios starting last Friday evening; by early morning we had collected about four dumps, and another three this morning. The dumps contain messages of the following format:

```
[233374.108758] CPU: 10 PID: 284713 Comm: kworker/10:6 Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.18.1.el7.x86_64 #1
[233374.234602] BUG: unable to handle kernel paging request at ffff97e04ef9ad00
[233374.367857] CPU: 15 PID: 0 Comm: swapper/15 Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.18.1.el7.x86_64 #1
[107720.490173] CPU: 36 PID: 0 Comm: swapper/36 Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.18.1.el7.x86_64 #1
[107720.631758] BUG: unable to handle kernel paging request at ffff9fb58d25ad00
[107720.765193] CPU: 29 PID: 0 Comm: swapper/29 Kdump: loaded Tainted: G W OE ------------ 3.10.0-1062.18.1.el7.x86_64 #1
```
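For reference, backtraces like the "crash> bt" output that appears later in this thread are typically pulled from these vmcores with the crash utility; a minimal sketch, assuming a stock kdump setup (the vmcore and debuginfo paths are examples only, not the site's actual paths):

```
# Open the dump with the matching kernel debuginfo (paths are hypothetical):
crash /usr/lib/debug/lib/modules/3.10.0-1062.18.1.el7.x86_64/vmlinux \
      /var/crash/<host>-<timestamp>/vmcore

# Useful commands at the crash> prompt:
#   bt      backtrace of the panicking task
#   bt -a   backtraces for every CPU
#   log     kernel ring buffer leading up to the crash
```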
Now, typically, we attribute these issues to a system problem - and we're not saying it isn't one. But the issues only occur after a considerable number of Slurm jobs have been processed over time. It's possible this is either resource contention or a race condition. We are leaning in this direction only because if we slow down the job rate by inserting a sleep delay in the script - before it submits the chained job - the problem does not surface. When we then decreased that delay from 60 seconds to 15 seconds, the issue resurfaced. In summary:

* first test: sleep delay = 0  ==> crashed nodes
* second test: sleep delay = 60 ==> ALL GOOD
* third test: sleep delay = 15 ==> crashed nodes

Additionally, under the low- or no-delay tests we also see some locked-up nodes that are taken offline by NHC with messages like:

Reason=NHC: Script timed out while executing "check_ps_service -u munge munged".

The particular message might differ, but it does put the node in a drained state:

```
[Anthony.DelSorbo@jfe01:6 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
juno*        up    8:00:00      1  drain j1c11
```

We're also seeing a considerable number of "Device or resource busy" messages such as these:

```
1398384 Apr 23 11:41:49 j1c09 slurmstepd[130252]: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_537/job_3170826 Device or resource busy
1398385 Apr 23 11:41:49 j1c09 slurmstepd[130252]: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_537): Device or resource busy
1398386 Apr 23 11:41:49 j1c09 slurmstepd[130252]: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_537 Device or resource busy
1398389 Apr 23 11:41:49 j1c09 slurmstepd[130252]: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_537/job_3170826): Device or resource busy
1398390 Apr 23 11:41:49 j1c09 slurmstepd[130252]: debug2: jobacct_gather_cgroup_memory_fini: failed to delete /sys/fs/cgroup/memory/slurm/uid_537/job_3170826 Device or resource busy
```

I'm sure these aren't helping matters.

In troubleshooting batch system issues, we've relied on logging info to the syslogs via the prolog and epilog. Three other observations we've made:

1. In the kernel ring buffer, where we mark the beginning and end of each job as shown below, we see normal job-start then job-finish messages, then a period of only job-start messages, until the very end - just before the crash - where we get only one more epilog message. So something happened in the epilog stage for each of those jobs.

```
10678 [237228.401773] prolog.root: Thu Apr 23 11:23:54 UTC 2020 Starting Job 3170565
10679 [237248.365507] epilog.root: Thu Apr 23 11:24:14 UTC 2020 Finished Job 3170565
10680 [237248.459147] prolog.root: Thu Apr 23 11:24:14 UTC 2020 Starting Job 3170570
10681 [237268.467639] epilog.root: Thu Apr 23 11:24:34 UTC 2020 Finished Job 3170570
10682 [237268.562598] prolog.root: Thu Apr 23 11:24:34 UTC 2020 Starting Job 3170575
10683 [237288.548849] prolog.root: Thu Apr 23 11:24:54 UTC 2020 Starting Job 3170580
10684 [237308.656116] prolog.root: Thu Apr 23 11:25:14 UTC 2020 Starting Job 3170585
10685 [237328.624536] prolog.root: Thu Apr 23 11:25:34 UTC 2020 Starting Job 3170590
10686 [237349.784324] prolog.root: Thu Apr 23 11:25:55 UTC 2020 Starting Job 3170595
-- more prolog.root-only messages --
...
10743 [238521.471919] prolog.root: Thu Apr 23 11:45:30 UTC 2020 Starting Job 3170881
10744 [238541.511324] prolog.root: Thu Apr 23 11:45:50 UTC 2020 Starting Job 3170886
10745 [238569.761344] prolog.root: Thu Apr 23 11:46:18 UTC 2020 Starting Job 3170893
10746 [238589.776329] epilog.root: Thu Apr 23 11:46:38 UTC 2020 Finished Job 3170893
10747 [238589.879354] prolog.root: Thu Apr 23 11:46:39 UTC 2020 Starting Job 3170898
10748 [238594.546849] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
10749 [238594.554742] BUG: unable to handle kernel paging request at ffff95280d3dad00
```
2. The second observation is that the buffer shows the next-to-last "Finished Job" at 11:24:34, for job 3170570. The crash occurs more than 20 minutes later. In between, more jobs continue to be processed; this can be seen in the syslogs as well, and Slurm continues to write to the logs until job 3170831.

3. But notice that the last job ID in the ring buffer is 3170898. The last observation is that the node stopped logging to the syslogs even though jobs continued to run.

So, is this potentially a race condition between cleanup of the previous job and the start of a new job? The "xcgroup ... Device or resource busy" messages are particularly troubling (to me), leading me to believe something is slowing down or getting in the way and, over time, causing resource contention.

I have crash dumps, node syslogs, and slurmctld.log files I can pass to you. What are you interested in seeing?

Thanks,

Tony.

Marshall,

The nodes that crashed today are still in a down state - on purpose, pending further investigation. The j1c11 node I spoke of earlier with the offline message is also acting quite strange at the moment. It may be of use to us in providing additional information. Our node health check script can be run manually as well as through Slurm. It is set to run every 5 minutes (HealthCheckInterval=300). If I run it manually, the node reports no issues and is put back online:

```
[root@j1c11 ~]# date; sinfo ; /apps/slurm/tools/bin/nhc-slurm
Thu Apr 23 19:52:56 UTC 2020
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
juno*        up    8:00:00      3  down* j1c[01,03,09]
juno*        up    8:00:00      1  drain j1c11
juno*        up    8:00:00     10   idle j1c[02,04-08,10,12-14]
...
[root@j1c11 ~]# date; sinfo
Thu Apr 23 19:53:07 UTC 2020
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
juno*        up    8:00:00      3  down* j1c[01,03,09]
juno*        up    8:00:00     11   idle j1c[02,04-08,10-14]
...
```

A few minutes later it is still online:

```
[root@j1c11 ~]# date; sinfo
Thu Apr 23 19:54:27 UTC 2020
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
juno*        up    8:00:00      3  down* j1c[01,03,09]
juno*        up    8:00:00     11   idle j1c[02,04-08,10-14]
...
```

Then, while watching the syslogs:

```
[root@j1c11 ~]# tail -f /var/log/messages
...
Apr 23 19:51:19 j1c11 nhc[92701]: Health check failed: Script timed out while executing "check_ps_service -u munge munged".
Apr 23 19:51:19 j1c11 slurmd[303176]: error: /apps/slurm/tools/bin/nhc-slurm: exited with status 0x8e00
Apr 23 19:56:20 j1c11 nhc[93218]: Health check failed: Script timed out while executing "check_ps_service -u munge munged".
Apr 23 19:56:20 j1c11 slurmd[303176]: error: /apps/slurm/tools/bin/nhc-slurm: exited with status 0x8e00
```

Notice that the node is taken offline again by Slurm. This is very repeatable. I will keep this node in its current state until we've had a chance to troubleshoot it.

Best,

Tony.
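For context, wiring an NHC wrapper into slurmd as described above is normally done with a pair of slurm.conf settings; a sketch using the interval and wrapper path quoted in this thread (not a verified copy of the site's configuration):

```
# slurm.conf (sketch): have slurmd run the NHC wrapper every 5 minutes
HealthCheckProgram=/apps/slurm/tools/bin/nhc-slurm
HealthCheckInterval=300
```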
Hi Tony,

Thanks for all the information. Can you upload:

- slurm.conf
- cgroup.conf
- the slurmctld log file
- the syslog file
- a backtrace from the crash dump, if you can get it
- the script used for the node health check

You mentioned that if you insert a delay in submitting the jobs, the compute nodes don't crash. I assume that means your test script has a loop to submit jobs to a specific node - is that correct? How often are jobs submitted on this cluster? Are job arrays used? Is it just the compute nodes (nodes with slurmd) that are crashing, or is the slurmctld node also crashing?

> 1398384 Apr 23 11:41:49 j1c09 slurmstepd[130252]: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_537/job_3170826 Device or resource busy

These messages are actually expected to an extent, which is why they're debug2 messages (not error messages). As job steps complete, they remove their PIDs from the appropriate cgroups and then try to rmdir the cgroup directories. But multiple steps may exist, so only the last one to complete will succeed in removing all of the job's cgroup directories.

> 1398385 Apr 23 11:41:49 j1c09 slurmstepd[130252]: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_537): Device or resource busy

And if there are multiple jobs by the same user on a node, then when one job completes it won't be able to clean up the UID cgroup directory because the other job is still using it.

That said, can you check whether one of the crashed nodes has cgroup directories from old jobs that haven't been cleaned up?

- Marshall

I'm adding bug 8851 to "See Also" (even though it's a different bug) since Felip is working with you on NHC in that bug, and you talked about NHC here.
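One way to answer that question after a crashed node has been rebooted is to look for leftover job or UID directories under the Slurm cgroup hierarchy; a minimal sketch, assuming the cgroup v1 layout shown in the debug2 messages above:

```
# List any Slurm UID/job cgroup directories that survived past their jobs
# (controller paths follow the ones quoted in the slurmstepd messages):
find /sys/fs/cgroup/cpuacct/slurm /sys/fs/cgroup/memory/slurm \
     -maxdepth 2 -type d \( -name 'uid_*' -o -name 'job_*' \) 2>/dev/null
```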
(In reply to Marshall Garey from comment #9)
> - a backtrace from the crash dump, if you can get it
> - the script used for node health check

Here's a bt from j1c01:

```
crash> bt
PID: 0      TASK: ffff97d5ada0a0e0  CPU: 15  COMMAND: "swapper/15"
 #0 [ffff97e04efc3b40] machine_kexec at ffffffffb4465b34
 #1 [ffff97e04efc3ba0] __crash_kexec at ffffffffb4522592
 #2 [ffff97e04efc3c70] crash_kexec at ffffffffb4522680
 #3 [ffff97e04efc3c88] oops_end at ffffffffb4b85798
 #4 [ffff97e04efc3cb0] no_context at ffffffffb4475bb4
 #5 [ffff97e04efc3d00] __bad_area_nosemaphore at ffffffffb4475e82
 #6 [ffff97e04efc3d50] bad_area_nosemaphore at ffffffffb4475fa4
 #7 [ffff97e04efc3d60] __do_page_fault at ffffffffb4b88750
 #8 [ffff97e04efc3dd0] do_page_fault at ffffffffb4b88975
 #9 [ffff97e04efc3e00] page_fault at ffffffffb4b84778
    [exception RIP: unknown or invalid address]
    RIP: ffff97e04ef9ad00  RSP: ffff97e04efc3eb8  RFLAGS: 00010292
    RAX: ffff97e04ef9ad00  RBX: ffffffffb5079b80  RCX: 00000001800f0008
    RDX: ffff97dff2bcbd60  RSI: fffff8cc30271600  RDI: ffff97dff2bcbd60
    RBP: ffff97e04efc3f10  R8:  ffff97e049c5eb40  R9:  00000001800f0008
    R10: 0000000049c5f301  R11: ffff97e049c5eb40  R12: 000000000000000a
    R13: 0000000000000002  R14: ffffffffb44e9af6  R15: ffff97e04efd6340
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff97e04efc3eb8] rcu_process_callbacks at ffffffffb455a4f8
#11 [ffff97e04efc3f18] __do_softirq at ffffffffb44a5435
#12 [ffff97e04efc3f88] call_softirq at ffffffffb4b9142c
#13 [ffff97e04efc3fa0] do_softirq at ffffffffb442f715
#14 [ffff97e04efc3fc0] irq_exit at ffffffffb44a57b5
#15 [ffff97e04efc3fd8] smp_apic_timer_interrupt at ffffffffb4b929d8
#16 [ffff97e04efc3ff0] apic_timer_interrupt at ffffffffb4b8eefa
--- <IRQ stack> ---
#17 [ffff97d5ada17db8] apic_timer_interrupt at ffffffffb4b8eefa
    [exception RIP: cpuidle_enter_state+87]
    RIP: ffffffffb49c1c27  RSP: ffff97d5ada17e60  RFLAGS: 00000202
    RAX: 0000d4c146f21767  RBX: ffff97d5ada17e38  RCX: 0000000000000018
    RDX: 0000000225c17d03  RSI: ffff97d5ada17fd8  RDI: 0000d4c146f21767
    RBP: ffff97d5ada17e88  R8:  00000000000003df  R9:  ffff97e04efd9c90
    R10: 7fffffffffffffff  R11: ffff97dff28a1200  R12: ffff97e04efdad00
    R13: ffff97dff63d90d8  R14: ffff97e04efdac80  R15: ffffffffb44e2b2c
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#18 [ffff97d5ada17e90] cpuidle_idle_call at ffffffffb49c1d7e
#19 [ffff97d5ada17ed0] arch_cpu_idle at ffffffffb4437c6e
#20 [ffff97d5ada17ee0] cpu_startup_entry at ffffffffb45017da
#21 [ffff97d5ada17f28] start_secondary at ffffffffb445a0c7
#22 [ffff97d5ada17f50] start_cpu at ffffffffb44000d5
```

> You mentioned that if you insert a delay in submitting the jobs that the
> compute nodes don't crash. I assume that means your test script has a loop
> to submit jobs to a specific node - is that correct? How often are jobs
> submitted on this cluster? Are job arrays used? Is it just the compute nodes
> (nodes with slurmd) that are crashing, or is the slurmctld node also
> crashing?

No, slurmctld remains functional. The tests are not submitted to a "specific node." They are simple hello-world scripts that, just before ending, submit a new copy of themselves to the queue for scheduling. There is only one user (other than me) on the cluster, an admin/application engineer who's helping me by submitting these jobs. The number of jobs submitted at one time is only 7, and there are only 14 compute nodes on this cluster. Since the jobs are 2-node parallel jobs, the nodes stay busy; and since there are no other jobs, if all the jobs are in a run state, only 7 jobs will be listed by squeue. The total number of jobs per day depends on the delay implemented in the job script. With no delay, we've had approximately 80,000+ jobs run through in a 24-hour period. With a 60-second delay, only about 9,100 jobs got processed. Then we implemented a 15-second delay and processed about 26,000 jobs in just over 25 hours, when we finally called it quits.

> And if there are multiple jobs by the same user on a node, then when one job
> completes it won't be able to clean up the UID cgroup directory because the
> other job is using it.

For parallel jobs, we give a single user exclusive access to the nodes. So, at most, there is one user on a node.

> That said, can you check if one of the crashed nodes has cgroup directories
> from old jobs that haven't been cleaned up?

Not sure what you mean. Crashed nodes = inaccessible nodes. I'd be glad to entertain any ideas you may have on how to go about doing this. The j1c11 node I spoke of before, the one in a peculiar state, has this:

```
[root@j1c11 ~]# ll /sys/fs/cgroup/cpuacct/slurm/
total 0
-rw-r--r-- 1 root root 0 Apr 17 13:12 cgroup.clone_children
--w--w--w- 1 root root 0 Apr 17 13:12 cgroup.event_control
-rw-r--r-- 1 root root 0 Apr 17 13:12 cgroup.procs
-r--r--r-- 1 root root 0 Apr 17 13:12 cpuacct.stat
-rw-r--r-- 1 root root 0 Apr 17 13:12 cpuacct.usage
-r--r--r-- 1 root root 0 Apr 17 13:12 cpuacct.usage_percpu
-rw-r--r-- 1 root root 0 Apr 17 13:12 cpu.cfs_period_us
-rw-r--r-- 1 root root 0 Apr 17 13:12 cpu.cfs_quota_us
-rw-r--r-- 1 root root 0 Apr 17 13:12 cpu.rt_period_us
-rw-r--r-- 1 root root 0 Apr 17 13:12 cpu.rt_runtime_us
-rw-r--r-- 1 root root 0 Apr 17 13:12 cpu.shares
-r--r--r-- 1 root root 0 Apr 17 13:12 cpu.stat
-rw-r--r-- 1 root root 0 Apr 17 13:12 notify_on_release
-rw-r--r-- 1 root root 0 Apr 17 13:12 tasks
```

I don't see any user-related data there. Will attach other files momentarily...

Tony.
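For reference, a minimal sketch of the kind of self-resubmitting test job described above; the job name, node count, delay value, and script path are hypothetical, not the site's actual test script:

```
#!/bin/bash
#SBATCH --job-name=chain-test
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --time=00:10:00

srun hostname          # trivial "hello world" parallel step

sleep 15               # the delay that was varied between tests: 0, 15, or 60 seconds

# Resubmit the same script just before exiting (path is hypothetical):
sbatch /path/to/chain-test.sh
```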
(In reply to Marshall Garey from comment #9)

Marshall - which

> - syslog file

do you want?

> - the script used for node health check

Do you want the nhc script itself? It's the same script as developed by Michael Jennings and recommended by SchedMD: https://github.com/mej/nhc. I'm going to assume you want the configuration file.

Created attachment 14002 [details]
Juno Cgroup file
Created attachment 14003 [details]
Juno slurm.conf file
Created attachment 14004 [details]
Node Health Check File
NHC wrapper content:
```
[slurm@bqs7 etc]$ cat /apps/slurm/tools/bin/nhc-slurm
#!/bin/bash -l
export MARK_OFFLINE=1
export DEBUG=1
/apps/slurm/tools/dev/nhc/nhc -v -D /apps/slurm/tools/dev/nhc HELPERDIR="/apps/slurm/tools/dev/nhc/helpers" NHC_RM="slurm" MARK_OFFLINE=1 #DEBUG=1
```
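For context, the check that keeps timing out on j1c11 is the kind of entry that lives in the NHC configuration file; a hypothetical excerpt (not the site's actual nhc.conf), using NHC's "<host match> || <check>" line format:

```
# nhc.conf excerpt (hypothetical): verify munged is running as user munge
 * || check_ps_service -u munge munged
```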
Created attachment 14005 [details]
last week's slurmctld.log file
.gz file inside
(In reply to Anthony DelSorbo from comment #18)
> Created attachment 14005 [details]
> last week's slurmctld.log file
>
> .gz file inside

Had a hard time submitting a .gz file. Does your system not accept them? I got some sort of a "POST" error when I tried to send it.

Tony

We limit the file size of uploads to 300 MB. xz usually compresses files much smaller than gz, so you can use xz to help make it smaller. However, a 300 MB file compressed with gz is still a lot of logs for us to comb through, so we prefer to keep the time period of the logs to the smallest relevant window - for this bug, logs from the single day (or part of a day) in which a crash happened.

About getting cgroups from crashed nodes - I meant that after you reboot a crashed node, you can look at the cgroups to see if there are any hanging around.

Thanks for all the logs. I'll let you know what I find.
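A minimal sketch of trimming a slurmctld log to a single day and compressing it with xz, along the lines suggested above (the file names and date are hypothetical, and the grep pattern assumes the default ISO-8601 timestamp prefix in slurmctld.log):

```
# Keep only the day of the crash, then compress with xz:
grep '^\[2020-04-23' slurmctld.log > slurmctld-2020-04-23.log
xz -9 slurmctld-2020-04-23.log      # produces slurmctld-2020-04-23.log.xz
```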
Marshall,

We actually had a user submit a help ticket on Friday for an issue with his job; his output file contained the following:

```
+ srun --ntasks 1 /scratch2/BMC/gsienkf/Philip.Pegion/fv3gfs/sorc/regrid_nemsio2d.fd/regrid_nemsio
slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen
srun: error: timeout waiting for task launch, started 0 of 1 tasks
srun: Job step 7722678.1 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen
slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen
srun: error: Timed out waiting for job step to complete
```

Looking at the syslogs, I found that shortly after his job ran, the node spewed out:

```
85758 May 15 14:08:05 h10c45 kernel: WARNING: CPU: 79 PID: 0 at kernel/rcutree.c:2221 rcu_process_callbacks+0x50c/0x570
85759 May 15 14:08:05 h10c45 kernel: Modules linked in: xfs libcrc32c vtsspp(OE) sep5(OE) socperf3(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) pax(OE) libcfs(OE) crc_t10dif nfsv3 crct10dif_generic nfs_acl iTCO_wdt iTCO_vendor_support rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect pcspkr sysimgblt fb_sys_fops drm joydev drm_panel_orientation_quirks lpc_ich mei_me i2c_i801 mei wmi mlx4_en(OE) ipmi_si ipmi_devintf ipmi_msghandler mlx4_ib(OE) acpi_power_meter acpi_pad mlx4_core(OE) pcc_cpufreq
85760 May 15 14:08:05 h10c45 kernel: binfmt_misc knem(OE) ip_tables rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) mlx5_core(OE) vfio_mdev(OE) vfio_iommu_type1 vfio mdev(OE) mlx_compat(OE) crct10dif_pclmul crct10dif_common ixgbe mlxfw(OE) crc32c_intel devlink ahci dca mdio libahci ptp libata pps_core nfit libnvdimm sunrpc
85761 May 15 14:08:05 h10c45 kernel: CPU: 79 PID: 0 Comm: swapper/79 Tainted: G OE ------------ 3.10.0-1062.18.1.el7.x86_64 #1
85762 May 15 14:08:05 h10c45 kernel: Hardware name: Intel Corporation S2600BPB/S2600BPB, BIOS SE5C620.86B.02.01.0008.C0001.031920191559 03/19/2019
85763 May 15 14:08:05 h10c45 kernel: Call Trace:
85764 May 15 14:08:05 h10c45 kernel: <IRQ> [<ffffffff8417b416>] dump_stack+0x19/0x1b
85765 May 15 14:08:05 h10c45 kernel: [<ffffffff83a9bab8>] __warn+0xd8/0x100
85766 May 15 14:08:05 h10c45 kernel: [<ffffffff83a9bbfd>] warn_slowpath_null+0x1d/0x20
85767 May 15 14:08:05 h10c45 kernel: [<ffffffff83b5a82c>] rcu_process_callbacks+0x50c/0x570
85768 May 15 14:08:05 h10c45 kernel: [<ffffffff83aa5435>] __do_softirq+0xf5/0x280
85769 May 15 14:08:05 h10c45 kernel: [<ffffffff8419142c>] call_softirq+0x1c/0x30
85770 May 15 14:08:05 h10c45 kernel: [<ffffffff83a2f715>] do_softirq+0x65/0xa0
85771 May 15 14:08:05 h10c45 kernel: [<ffffffff83aa57b5>] irq_exit+0x105/0x110
85772 May 15 14:08:05 h10c45 kernel: [<ffffffff841929d8>] smp_apic_timer_interrupt+0x48/0x60
85773 May 15 14:08:05 h10c45 kernel: [<ffffffff8418eefa>] apic_timer_interrupt+0x16a/0x170
85774 May 15 14:08:05 h10c45 kernel: <EOI> [<ffffffff83fc1c27>] ? cpuidle_enter_state+0x57/0xd0
85775 May 15 14:08:05 h10c45 kernel: [<ffffffff83fc1d7e>] cpuidle_idle_call+0xde/0x230
85776 May 15 14:08:05 h10c45 kernel: [<ffffffff83a37c6e>] arch_cpu_idle+0xe/0xc0
85777 May 15 14:08:05 h10c45 kernel: [<ffffffff83b017da>] cpu_startup_entry+0x14a/0x1e0
85778 May 15 14:08:05 h10c45 kernel: [<ffffffff83a5a0c7>] start_secondary+0x1f7/0x270
85779 May 15 14:08:05 h10c45 kernel: [<ffffffff83a000d5>] start_cpu+0x5/0x14
85780 May 15 14:08:05 h10c45 kernel: ---[ end trace 87d25452bc301feb ]---
85781 May 15 14:08:06 h10c45 sh: abrt-dump-oops: Found oopses: 1
85782 May 15 14:08:06 h10c45 sh: abrt-dump-oops: Creating problem directories
85783 May 15 14:08:06 h10c45 sh: abrt-dump-oops: Not going to make dump directories world readable because PrivateReports is on
85784 May 15 14:08:06 h10c45 abrt-server: Looking for kernel package
85785 May 15 14:08:07 h10c45 abrt-server: Kernel package kernel-3.10.0-1062.18.1.el7.x86_64 found
85786 May 15 14:08:07 h10c45 abrt-dump-oops: Reported 1 kernel oopses to Abrt
85787 May 15 14:08:08 h10c45 dbus[2280]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
85788 May 15 14:08:08 h10c45 dbus[2280]: [system] Successfully activated service 'org.freedesktop.problems'
85789 May 15 14:08:31 h10c45 slurmstepd[128820]: debug: jag_common_poll_data: Task 0 pid 128827 ave_freq = 2604638 mem size/max 0/0 vmem size/max 4468736/4468736, disk read size/max (2012/2012), disk write size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
```

The node finally crashed and burned just short of a half hour after that. Please let me know if there's anything else I can provide you.

Best,

Tony.

> slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should
> never happen

This is a known error that has been fixed in 20.02 in commit 4c030c03778 (it will be in 20.02.3). It's harmless, so you don't need to worry about it.

> srun: error: timeout waiting for task launch, started 0 of 1 tasks
> srun: Job step 7722678.1 aborted before step completely launched.
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete

This means the task that srun tried to launch never started. You can increase MessageTimeout to potentially help with this. Your slurm.conf has it commented out, so I'm guessing you have it at the default of 10 seconds, which can be quite short for some systems, especially under heavier workloads. I'd recommend setting it to 30 seconds. However, this error is more likely a symptom of the actual problem you've been experiencing.
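As a sketch, the change recommended here amounts to a one-line slurm.conf setting (30 seconds is the value suggested above; confirm it against your own configuration):

```
# slurm.conf: allow more time for RPCs such as task-launch messages
MessageTimeout=30
```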
Thanks for getting some syslogs [1]. They give some information, but can you get more syslogs from that node (or another node that has crashed)? That will hopefully give us more context about the crash.

Bug 8851 and bug 7839 are both really similar to this one. It looks like neither of those really got resolved. Some of the common threads between those bugs and this one are: Lustre, NHC on a shared filesystem, RealMemory, and a heavy workload. There are a couple of things Felip suggested in those bugs that I don't see were ever done:

- Remove NHC from the shared filesystem. We want to eliminate the possibility of issues with NHC, Lustre, and your heavy workload.
- Make sure RealMemory is lower than the node's actual physical memory by at least a few GiB [2]. In comment 21 you said that node h10c45 crashed. What is that node's actual physical memory, and what is RealMemory set to in slurm.conf? I don't see node h10c45 in the slurm.conf that you uploaded - is this node in a different cluster?

Can you make those changes?

There's one more change I'd like to propose that you might try before those other ones. It's more of a workaround: try setting SchedulerParameters=defer in slurm.conf. This is in response to what you said in comment 7:

> It's possible this is either a resource contention or race condition. We are leaning in this
> direction only because if we slow down the number of jobs by inserting a sleep delay in the
> script - before it submits the chained job - the problem does not surface.

SchedulerParameters=defer effectively stops the main scheduler from running more frequently than every batch_sched_interval seconds (3 by default). You can also try setting sched_min_interval in SchedulerParameters instead (I recommend 100000 microseconds, i.e. 100 ms), which limits how frequently the main scheduler can run. Whatever problem is happening seems to be happening on the compute nodes, not the slurmctld node, but by limiting how quickly you schedule jobs, perhaps we can alleviate some sort of race condition or resource contention with whatever is causing the problem.

[1] We'll do our best to eliminate Slurm as a possibility, and fix any issues with Slurm that we find. But we're not kernel experts, and diagnosing this problem is really hard.

[2] See bug 7839 comment 27 for why you should set RealMemory a few GB lower than the system's physical memory. This is something we recommend to every site. We document it in man slurm.conf, but it could be clearer: https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory "Lowering RealMemory with the goal of setting aside some amount for the OS and not available for job allocations will not work as intended if Memory is not set as a consumable resource in SelectTypeParameters. So one of the *_Memory options need to be enabled for that goal to be accomplished."

Hi, I'm just wondering if you've had a chance to test any of the suggested changes?

- Marshall

I'm closing this as timed out. Please reopen it if you're still experiencing this issue.
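For reference, the scheduler-throttling workaround suggested a few comments above would amount to a slurm.conf change along these lines (a sketch; the values are the ones mentioned in the comments, and only one of the two variants would normally be used):

```
# slurm.conf (sketch): throttle the main scheduler.
# Either defer scheduling to the periodic batch_sched_interval cycle:
SchedulerParameters=defer
# ...or enforce a minimum interval between scheduling cycles (microseconds):
#SchedulerParameters=sched_min_interval=100000
```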