Summary: Nodes crash while running epilog
Product: Slurm
Component: slurmd
Version: 19.05.3
Hardware: Linux
OS: Linux
Status: RESOLVED TIMEDOUT
Severity: 3 - Medium Impact
Priority: ---
Reporter: Anthony DelSorbo <anthony.delsorbo>
Assignee: Marshall Garey <marshall>
CC: nathan.dauchy, shawn.needham
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8911, https://bugs.schedmd.com/show_bug.cgi?id=8851
Site: NOAA
NOAA Site: NESCC
Attachments:
- Juno Cgroup file
- Juno slurm.conf file
- Node Health Check File
- last week's slurmctld.log file
Description
Anthony DelSorbo
2020-04-09 10:17:34 MDT
Do you know if the node itself is crashing (OS and everything, so you have to reboot the node)? Or is slurmd crashing (so you just need to restart slurmd)? If slurmd is crashing, are there any core files generated? If so, can you run the following commands in gdb and upload the output:

* info threads
* bt full
* t a a bt

In addition, can you upload:

* the slurmd log file from a node that crashed (let me know the node name)
* the epilog script and any output from the epilog script on that node
* slurm.conf for the cluster
* a slurmctld log file covering the time that node crashed (I'll look for the node name in this log file)

Thanks

(In reply to Marshall Garey from comment #1)

Marshall,

Thanks for the quick reply. Although we've been experiencing this issue for a few days, I may have jumped the gun here a bit. It may just be a coincidence that the issue seems to coincide with the Slurm epilog phase. We're collecting more data on our end to rule out other potential culprits. Specifically, we're rolling back a Lustre client update we installed a few days before we started running these tests. I've downgraded this to priority 4 for now and will let you know within the next few days whether the issue recurs.

Best,

Tony.

Sounds good; I'll wait for your reply.

(In reply to Marshall Garey from comment #3)
> Sounds good; I'll wait for your reply.

Thanks for holding on. We've been running thousands of tests to try to pinpoint the root cause. Unfortunately, it's still not clear to us whether this is Slurm-related or not. The only thing I can tell you is that the issue's timing is always either during a job or just as the job is ending. We've also backed out recent image changes to determine whether there's any correlation. Additionally, we've re-engineered our images to force a crash dump at the next occurrence. I'm hopeful we'll be able to get some info from that. I'll keep you posted, possibly as early as tomorrow AM.

Thanks,

Tony.

(In reply to Marshall Garey from comment #5)
> Is this bug related to bug 8851?

Marshall,

No. Different behavior altogether. There are no kernel stack traces in that bug.

Tony.

Hi Marshall,

We were able to re-image our test environment to support kernel crash dumps. I wanted to get you some good data before I came back to you on this. We ran a few scenarios starting last Friday evening; by early morning we had collected about four dumps, and another three this morning. The dumps contain messages of the following format:

```
[233374.108758] CPU: 10 PID: 284713 Comm: kworker/10:6 Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.18.1.el7.x86_64 #1
[233374.234602] BUG: unable to handle kernel paging request at ffff97e04ef9ad00
[233374.367857] CPU: 15 PID: 0 Comm: swapper/15 Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.18.1.el7.x86_64 #1
[107720.490173] CPU: 36 PID: 0 Comm: swapper/36 Kdump: loaded Tainted: G OE ------------ 3.10.0-1062.18.1.el7.x86_64 #1
[107720.631758] BUG: unable to handle kernel paging request at ffff9fb58d25ad00
[107720.765193] CPU: 29 PID: 0 Comm: swapper/29 Kdump: loaded Tainted: G W OE ------------ 3.10.0-1062.18.1.el7.x86_64 #1
```
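For reference, backtraces like the "crash> bt" output that appears later in this thread are typically pulled from these vmcores with the crash utility; a minimal sketch, assuming a stock kdump setup (the vmcore and debuginfo paths are examples only, not the site's actual paths):

```
# Open the dump with the matching kernel debuginfo (paths are hypothetical):
crash /usr/lib/debug/lib/modules/3.10.0-1062.18.1.el7.x86_64/vmlinux \
      /var/crash/<host>-<timestamp>/vmcore

# Useful commands at the crash> prompt:
#   bt      backtrace of the panicking task
#   bt -a   backtraces for every CPU
#   log     kernel ring buffer leading up to the crash
```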
Now, typically, we attribute these issues to a system problem - and we're not saying it isn't one. But the issues only occur after a considerable number of Slurm jobs have been processed over time. It's possible this is either resource contention or a race condition. We are leaning in this direction only because if we slow down the job rate by inserting a sleep delay in the script - before it submits the chained job - the problem does not surface. When we then decreased that delay from 60 seconds to 15 seconds, the issue resurfaced. In summary:

* first test: sleep delay = 0  ==> crashed nodes
* second test: sleep delay = 60 ==> ALL GOOD
* third test: sleep delay = 15 ==> crashed nodes

Additionally, under the low- or no-delay tests we also see some locked-up nodes that are taken offline by NHC with messages like:

Reason=NHC: Script timed out while executing "check_ps_service -u munge munged".

The particular message might differ, but it does put the node in a drained state:

```
[Anthony.DelSorbo@jfe01:6 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
juno*        up    8:00:00      1  drain j1c11
```

We're also seeing a considerable number of "Device or resource busy" messages such as these:

```
1398384 Apr 23 11:41:49 j1c09 slurmstepd[130252]: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_537/job_3170826 Device or resource busy
1398385 Apr 23 11:41:49 j1c09 slurmstepd[130252]: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_537): Device or resource busy
1398386 Apr 23 11:41:49 j1c09 slurmstepd[130252]: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_537 Device or resource busy
1398389 Apr 23 11:41:49 j1c09 slurmstepd[130252]: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/memory/slurm/uid_537/job_3170826): Device or resource busy
1398390 Apr 23 11:41:49 j1c09 slurmstepd[130252]: debug2: jobacct_gather_cgroup_memory_fini: failed to delete /sys/fs/cgroup/memory/slurm/uid_537/job_3170826 Device or resource busy
```

I'm sure these aren't helping matters.

In troubleshooting batch system issues, we've relied on logging info to the syslogs via the prolog and epilog. Three other observations we've made:

1. In the kernel ring buffer, where we mark the beginning and end of each job as shown below, we see normal job-start then job-finish messages, then a period of only job-start messages, until the very end - just before the crash - where we get only one more epilog message. So something happened in the epilog stage for each of those jobs.

```
10678 [237228.401773] prolog.root: Thu Apr 23 11:23:54 UTC 2020 Starting Job 3170565
10679 [237248.365507] epilog.root: Thu Apr 23 11:24:14 UTC 2020 Finished Job 3170565
10680 [237248.459147] prolog.root: Thu Apr 23 11:24:14 UTC 2020 Starting Job 3170570
10681 [237268.467639] epilog.root: Thu Apr 23 11:24:34 UTC 2020 Finished Job 3170570
10682 [237268.562598] prolog.root: Thu Apr 23 11:24:34 UTC 2020 Starting Job 3170575
10683 [237288.548849] prolog.root: Thu Apr 23 11:24:54 UTC 2020 Starting Job 3170580
10684 [237308.656116] prolog.root: Thu Apr 23 11:25:14 UTC 2020 Starting Job 3170585
10685 [237328.624536] prolog.root: Thu Apr 23 11:25:34 UTC 2020 Starting Job 3170590
10686 [237349.784324] prolog.root: Thu Apr 23 11:25:55 UTC 2020 Starting Job 3170595
-- more prolog.root-only messages --
...
10743 [238521.471919] prolog.root: Thu Apr 23 11:45:30 UTC 2020 Starting Job 3170881
10744 [238541.511324] prolog.root: Thu Apr 23 11:45:50 UTC 2020 Starting Job 3170886
10745 [238569.761344] prolog.root: Thu Apr 23 11:46:18 UTC 2020 Starting Job 3170893
10746 [238589.776329] epilog.root: Thu Apr 23 11:46:38 UTC 2020 Finished Job 3170893
10747 [238589.879354] prolog.root: Thu Apr 23 11:46:39 UTC 2020 Starting Job 3170898
10748 [238594.546849] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
10749 [238594.554742] BUG: unable to handle kernel paging request at ffff95280d3dad00
```
2. The second observation is that the buffer shows the next-to-last "Finished Job" at 11:24:34, for job 3170570. The crash occurs more than 20 minutes later. In between, more jobs continue to be processed; this can be seen in the syslogs as well, and Slurm continues to write to the logs until job 3170831.

3. But notice that the last job ID in the ring buffer is 3170898. The last observation is that the node stopped logging to the syslogs even though jobs continued to run.

So, is this potentially a race condition between cleanup of the previous job and the start of a new job? The "xcgroup ... Device or resource busy" messages are particularly troubling (to me), leading me to believe something is slowing down or getting in the way and, over time, causing resource contention.

I have crash dumps, node syslogs, and slurmctld.log files I can pass to you. What are you interested in seeing?

Thanks,

Tony.

Marshall,

The nodes that crashed today are still in a down state - on purpose, pending further investigation. The j1c11 node I spoke of earlier with the offline message is also acting quite strange at the moment. It may be of use to us in providing additional information. Our node health check script can be run manually as well as through Slurm. It is set to run every 5 minutes (HealthCheckInterval=300). If I run it manually, the node reports no issues and is put back online:

```
[root@j1c11 ~]# date; sinfo ; /apps/slurm/tools/bin/nhc-slurm
Thu Apr 23 19:52:56 UTC 2020
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
juno*        up    8:00:00      3  down* j1c[01,03,09]
juno*        up    8:00:00      1  drain j1c11
juno*        up    8:00:00     10   idle j1c[02,04-08,10,12-14]
...
[root@j1c11 ~]# date; sinfo
Thu Apr 23 19:53:07 UTC 2020
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
juno*        up    8:00:00      3  down* j1c[01,03,09]
juno*        up    8:00:00     11   idle j1c[02,04-08,10-14]
...
```

A few minutes later it is still online:

```
[root@j1c11 ~]# date; sinfo
Thu Apr 23 19:54:27 UTC 2020
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
juno*        up    8:00:00      3  down* j1c[01,03,09]
juno*        up    8:00:00     11   idle j1c[02,04-08,10-14]
...
```

Then, while watching the syslogs:

```
[root@j1c11 ~]# tail -f /var/log/messages
...
Apr 23 19:51:19 j1c11 nhc[92701]: Health check failed: Script timed out while executing "check_ps_service -u munge munged".
Apr 23 19:51:19 j1c11 slurmd[303176]: error: /apps/slurm/tools/bin/nhc-slurm: exited with status 0x8e00
Apr 23 19:56:20 j1c11 nhc[93218]: Health check failed: Script timed out while executing "check_ps_service -u munge munged".
Apr 23 19:56:20 j1c11 slurmd[303176]: error: /apps/slurm/tools/bin/nhc-slurm: exited with status 0x8e00
```

Notice that the node is taken offline again by Slurm. This is very repeatable. I will keep this node in its current state until we've had a chance to troubleshoot it.

Best,

Tony.
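For context, wiring an NHC wrapper into slurmd as described above is normally done with a pair of slurm.conf settings; a sketch using the interval and wrapper path quoted in this thread (not a verified copy of the site's configuration):

```
# slurm.conf (sketch): have slurmd run the NHC wrapper every 5 minutes
HealthCheckProgram=/apps/slurm/tools/bin/nhc-slurm
HealthCheckInterval=300
```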
Hi Tony,

Thanks for all the information. Can you upload:

- slurm.conf
- cgroup.conf
- the slurmctld log file
- the syslog file
- a backtrace from the crash dump, if you can get it
- the script used for the node health check

You mentioned that if you insert a delay in submitting the jobs, the compute nodes don't crash. I assume that means your test script has a loop to submit jobs to a specific node - is that correct? How often are jobs submitted on this cluster? Are job arrays used? Is it just the compute nodes (nodes with slurmd) that are crashing, or is the slurmctld node also crashing?

> 1398384 Apr 23 11:41:49 j1c09 slurmstepd[130252]: debug2: jobacct_gather_cgroup_cpuacct_fini: failed to delete /sys/fs/cgroup/cpuacct/slurm/uid_537/job_3170826 Device or resource busy

These messages are actually expected to an extent, which is why they're debug2 messages (not error messages). As job steps complete, they remove their PIDs from the appropriate cgroups and then try to rmdir the cgroup directories. But multiple steps may exist, so only the last one to complete will succeed in removing all of the job's cgroup directories.

> 1398385 Apr 23 11:41:49 j1c09 slurmstepd[130252]: debug2: xcgroup_delete: rmdir(/sys/fs/cgroup/cpuacct/slurm/uid_537): Device or resource busy

And if there are multiple jobs by the same user on a node, then when one job completes it won't be able to clean up the UID cgroup directory because the other job is still using it.

That said, can you check whether one of the crashed nodes has cgroup directories from old jobs that haven't been cleaned up?

- Marshall

I'm adding bug 8851 to "See Also" (even though it's a different bug) since Felip is working with you on NHC in that bug, and you talked about NHC here.
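One way to answer that question after a crashed node has been rebooted is to look for leftover job or UID directories under the Slurm cgroup hierarchy; a minimal sketch, assuming the cgroup v1 layout shown in the debug2 messages above:

```
# List any Slurm UID/job cgroup directories that survived past their jobs
# (controller paths follow the ones quoted in the slurmstepd messages):
find /sys/fs/cgroup/cpuacct/slurm /sys/fs/cgroup/memory/slurm \
     -maxdepth 2 -type d \( -name 'uid_*' -o -name 'job_*' \) 2>/dev/null
```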
(In reply to Marshall Garey from comment #9)
> - a backtrace from the crash dump, if you can get it
> - the script used for node health check

Here's a bt from j1c01:

```
crash> bt
PID: 0      TASK: ffff97d5ada0a0e0  CPU: 15  COMMAND: "swapper/15"
 #0 [ffff97e04efc3b40] machine_kexec at ffffffffb4465b34
 #1 [ffff97e04efc3ba0] __crash_kexec at ffffffffb4522592
 #2 [ffff97e04efc3c70] crash_kexec at ffffffffb4522680
 #3 [ffff97e04efc3c88] oops_end at ffffffffb4b85798
 #4 [ffff97e04efc3cb0] no_context at ffffffffb4475bb4
 #5 [ffff97e04efc3d00] __bad_area_nosemaphore at ffffffffb4475e82
 #6 [ffff97e04efc3d50] bad_area_nosemaphore at ffffffffb4475fa4
 #7 [ffff97e04efc3d60] __do_page_fault at ffffffffb4b88750
 #8 [ffff97e04efc3dd0] do_page_fault at ffffffffb4b88975
 #9 [ffff97e04efc3e00] page_fault at ffffffffb4b84778
    [exception RIP: unknown or invalid address]
    RIP: ffff97e04ef9ad00  RSP: ffff97e04efc3eb8  RFLAGS: 00010292
    RAX: ffff97e04ef9ad00  RBX: ffffffffb5079b80  RCX: 00000001800f0008
    RDX: ffff97dff2bcbd60  RSI: fffff8cc30271600  RDI: ffff97dff2bcbd60
    RBP: ffff97e04efc3f10  R8:  ffff97e049c5eb40  R9:  00000001800f0008
    R10: 0000000049c5f301  R11: ffff97e049c5eb40  R12: 000000000000000a
    R13: 0000000000000002  R14: ffffffffb44e9af6  R15: ffff97e04efd6340
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff97e04efc3eb8] rcu_process_callbacks at ffffffffb455a4f8
#11 [ffff97e04efc3f18] __do_softirq at ffffffffb44a5435
#12 [ffff97e04efc3f88] call_softirq at ffffffffb4b9142c
#13 [ffff97e04efc3fa0] do_softirq at ffffffffb442f715
#14 [ffff97e04efc3fc0] irq_exit at ffffffffb44a57b5
#15 [ffff97e04efc3fd8] smp_apic_timer_interrupt at ffffffffb4b929d8
#16 [ffff97e04efc3ff0] apic_timer_interrupt at ffffffffb4b8eefa
--- <IRQ stack> ---
#17 [ffff97d5ada17db8] apic_timer_interrupt at ffffffffb4b8eefa
    [exception RIP: cpuidle_enter_state+87]
    RIP: ffffffffb49c1c27  RSP: ffff97d5ada17e60  RFLAGS: 00000202
    RAX: 0000d4c146f21767  RBX: ffff97d5ada17e38  RCX: 0000000000000018
    RDX: 0000000225c17d03  RSI: ffff97d5ada17fd8  RDI: 0000d4c146f21767
    RBP: ffff97d5ada17e88  R8:  00000000000003df  R9:  ffff97e04efd9c90
    R10: 7fffffffffffffff  R11: ffff97dff28a1200  R12: ffff97e04efdad00
    R13: ffff97dff63d90d8  R14: ffff97e04efdac80  R15: ffffffffb44e2b2c
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#18 [ffff97d5ada17e90] cpuidle_idle_call at ffffffffb49c1d7e
#19 [ffff97d5ada17ed0] arch_cpu_idle at ffffffffb4437c6e
#20 [ffff97d5ada17ee0] cpu_startup_entry at ffffffffb45017da
#21 [ffff97d5ada17f28] start_secondary at ffffffffb445a0c7
#22 [ffff97d5ada17f50] start_cpu at ffffffffb44000d5
```

> You mentioned that if you insert a delay in submitting the jobs that the
> compute nodes don't crash. I assume that means your test script has a loop
> to submit jobs to a specific node - is that correct? How often are jobs
> submitted on this cluster? Are job arrays used? Is it just the compute nodes
> (nodes with slurmd) that are crashing, or is the slurmctld node also
> crashing?

No, slurmctld remains functional. The tests are not submitted to a "specific node." They are simple hello-world scripts that, just before ending, submit a new copy of themselves to the queue for scheduling. There is only one user (other than me) on the cluster, an admin/application engineer who's helping me by submitting these jobs. The number of jobs submitted at one time is only 7, and there are only 14 compute nodes on this cluster. Since the jobs are 2-node parallel jobs, the nodes stay busy; and since there are no other jobs, if all the jobs are in a run state, only 7 jobs will be listed by squeue. The total number of jobs per day depends on the delay implemented in the job script. With no delay, we've had approximately 80,000+ jobs run through in a 24-hour period. With a 60-second delay, only about 9,100 jobs got processed. Then we implemented a 15-second delay and processed about 26,000 jobs in just over 25 hours, when we finally called it quits.

> And if there are multiple jobs by the same user on a node, then when one job
> completes it won't be able to clean up the UID cgroup directory because the
> other job is using it.

For parallel jobs, we give a single user exclusive access to the nodes. So, at most, there is one user on a node.

> That said, can you check if one of the crashed nodes has cgroup directories
> from old jobs that haven't been cleaned up?

Not sure what you mean. Crashed nodes = inaccessible nodes. I'd be glad to entertain any ideas you may have on how to go about doing this. The j1c11 node I spoke of before, the one in a peculiar state, has this:

```
[root@j1c11 ~]# ll /sys/fs/cgroup/cpuacct/slurm/
total 0
-rw-r--r-- 1 root root 0 Apr 17 13:12 cgroup.clone_children
--w--w--w- 1 root root 0 Apr 17 13:12 cgroup.event_control
-rw-r--r-- 1 root root 0 Apr 17 13:12 cgroup.procs
-r--r--r-- 1 root root 0 Apr 17 13:12 cpuacct.stat
-rw-r--r-- 1 root root 0 Apr 17 13:12 cpuacct.usage
-r--r--r-- 1 root root 0 Apr 17 13:12 cpuacct.usage_percpu
-rw-r--r-- 1 root root 0 Apr 17 13:12 cpu.cfs_period_us
-rw-r--r-- 1 root root 0 Apr 17 13:12 cpu.cfs_quota_us
-rw-r--r-- 1 root root 0 Apr 17 13:12 cpu.rt_period_us
-rw-r--r-- 1 root root 0 Apr 17 13:12 cpu.rt_runtime_us
-rw-r--r-- 1 root root 0 Apr 17 13:12 cpu.shares
-r--r--r-- 1 root root 0 Apr 17 13:12 cpu.stat
-rw-r--r-- 1 root root 0 Apr 17 13:12 notify_on_release
-rw-r--r-- 1 root root 0 Apr 17 13:12 tasks
```

I don't see any user-related data there. Will attach other files momentarily...

Tony.
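For reference, a minimal sketch of the kind of self-resubmitting test job described above; the job name, node count, delay value, and script path are hypothetical, not the site's actual test script:

```
#!/bin/bash
#SBATCH --job-name=chain-test
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --time=00:10:00

srun hostname          # trivial "hello world" parallel step

sleep 15               # the delay that was varied between tests: 0, 15, or 60 seconds

# Resubmit the same script just before exiting (path is hypothetical):
sbatch /path/to/chain-test.sh
```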
(In reply to Marshall Garey from comment #9)

Marshall - which

> - syslog file

do you want?

> - the script used for node health check

Do you want the nhc script itself? It's the same script as developed by Michael Jennings and recommended by SchedMD: https://github.com/mej/nhc. I'm going to assume you want the configuration file.

Created attachment 14002 [details]
Juno Cgroup file
Created attachment 14003 [details]
Juno slurm.conf file
Created attachment 14004 [details]
Node Health Check File
NHC wrapper content:
```
[slurm@bqs7 etc]$ cat /apps/slurm/tools/bin/nhc-slurm
#!/bin/bash -l
export MARK_OFFLINE=1
export DEBUG=1
/apps/slurm/tools/dev/nhc/nhc -v -D /apps/slurm/tools/dev/nhc HELPERDIR="/apps/slurm/tools/dev/nhc/helpers" NHC_RM="slurm" MARK_OFFLINE=1 #DEBUG=1
```
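For context, the check that keeps timing out on j1c11 is the kind of entry that lives in the NHC configuration file; a hypothetical excerpt (not the site's actual nhc.conf), using NHC's "<host match> || <check>" line format:

```
# nhc.conf excerpt (hypothetical): verify munged is running as user munge
 * || check_ps_service -u munge munged
```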
Created attachment 14005 [details]
last week's slurmctld.log file
.gz file inside
(In reply to Anthony DelSorbo from comment #18)
> Created attachment 14005 [details]
> last week's slurmctld.log file
>
> .gz file inside

Had a hard time submitting a .gz file. Does your system not accept them? I got some sort of a "POST" error when I tried to send it.

Tony

We limit the file size of uploads to 300 MB. xz usually compresses files much smaller than gz, so you can use xz to help make it smaller. However, a 300 MB file compressed with gz is still a lot of logs for us to comb through, so we prefer to keep the time period of the logs to the smallest relevant window - for this bug, logs from the single day (or part of a day) in which a crash happened.

About getting cgroups from crashed nodes - I meant that after you reboot a crashed node, you can look at the cgroups to see if there are any hanging around.

Thanks for all the logs. I'll let you know what I find.
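A minimal sketch of trimming a slurmctld log to a single day and compressing it with xz, along the lines suggested above (the file names and date are hypothetical, and the grep pattern assumes the default ISO-8601 timestamp prefix in slurmctld.log):

```
# Keep only the day of the crash, then compress with xz:
grep '^\[2020-04-23' slurmctld.log > slurmctld-2020-04-23.log
xz -9 slurmctld-2020-04-23.log      # produces slurmctld-2020-04-23.log.xz
```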
Marshall,

We actually had a user submit a help ticket on Friday for an issue with his job; his output file contained the following:

```
+ srun --ntasks 1 /scratch2/BMC/gsienkf/Philip.Pegion/fv3gfs/sorc/regrid_nemsio2d.fd/regrid_nemsio
slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen
srun: error: timeout waiting for task launch, started 0 of 1 tasks
srun: Job step 7722678.1 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen
slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should never happen
srun: error: Timed out waiting for job step to complete
```

Looking at the syslogs, I found that shortly after his job ran, the node spewed out:

```
85758 May 15 14:08:05 h10c45 kernel: WARNING: CPU: 79 PID: 0 at kernel/rcutree.c:2221 rcu_process_callbacks+0x50c/0x570
85759 May 15 14:08:05 h10c45 kernel: Modules linked in: xfs libcrc32c vtsspp(OE) sep5(OE) socperf3(OE) mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) osc(OE) lov(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) pax(OE) libcfs(OE) crc_t10dif nfsv3 crct10dif_generic nfs_acl iTCO_wdt iTCO_vendor_support rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect pcspkr sysimgblt fb_sys_fops drm joydev drm_panel_orientation_quirks lpc_ich mei_me i2c_i801 mei wmi mlx4_en(OE) ipmi_si ipmi_devintf ipmi_msghandler mlx4_ib(OE) acpi_power_meter acpi_pad mlx4_core(OE) pcc_cpufreq
85760 May 15 14:08:05 h10c45 kernel: binfmt_misc knem(OE) ip_tables rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) mlx5_core(OE) vfio_mdev(OE) vfio_iommu_type1 vfio mdev(OE) mlx_compat(OE) crct10dif_pclmul crct10dif_common ixgbe mlxfw(OE) crc32c_intel devlink ahci dca mdio libahci ptp libata pps_core nfit libnvdimm sunrpc
85761 May 15 14:08:05 h10c45 kernel: CPU: 79 PID: 0 Comm: swapper/79 Tainted: G OE ------------ 3.10.0-1062.18.1.el7.x86_64 #1
85762 May 15 14:08:05 h10c45 kernel: Hardware name: Intel Corporation S2600BPB/S2600BPB, BIOS SE5C620.86B.02.01.0008.C0001.031920191559 03/19/2019
85763 May 15 14:08:05 h10c45 kernel: Call Trace:
85764 May 15 14:08:05 h10c45 kernel: <IRQ> [<ffffffff8417b416>] dump_stack+0x19/0x1b
85765 May 15 14:08:05 h10c45 kernel: [<ffffffff83a9bab8>] __warn+0xd8/0x100
85766 May 15 14:08:05 h10c45 kernel: [<ffffffff83a9bbfd>] warn_slowpath_null+0x1d/0x20
85767 May 15 14:08:05 h10c45 kernel: [<ffffffff83b5a82c>] rcu_process_callbacks+0x50c/0x570
85768 May 15 14:08:05 h10c45 kernel: [<ffffffff83aa5435>] __do_softirq+0xf5/0x280
85769 May 15 14:08:05 h10c45 kernel: [<ffffffff8419142c>] call_softirq+0x1c/0x30
85770 May 15 14:08:05 h10c45 kernel: [<ffffffff83a2f715>] do_softirq+0x65/0xa0
85771 May 15 14:08:05 h10c45 kernel: [<ffffffff83aa57b5>] irq_exit+0x105/0x110
85772 May 15 14:08:05 h10c45 kernel: [<ffffffff841929d8>] smp_apic_timer_interrupt+0x48/0x60
85773 May 15 14:08:05 h10c45 kernel: [<ffffffff8418eefa>] apic_timer_interrupt+0x16a/0x170
85774 May 15 14:08:05 h10c45 kernel: <EOI> [<ffffffff83fc1c27>] ? cpuidle_enter_state+0x57/0xd0
85775 May 15 14:08:05 h10c45 kernel: [<ffffffff83fc1d7e>] cpuidle_idle_call+0xde/0x230
85776 May 15 14:08:05 h10c45 kernel: [<ffffffff83a37c6e>] arch_cpu_idle+0xe/0xc0
85777 May 15 14:08:05 h10c45 kernel: [<ffffffff83b017da>] cpu_startup_entry+0x14a/0x1e0
85778 May 15 14:08:05 h10c45 kernel: [<ffffffff83a5a0c7>] start_secondary+0x1f7/0x270
85779 May 15 14:08:05 h10c45 kernel: [<ffffffff83a000d5>] start_cpu+0x5/0x14
85780 May 15 14:08:05 h10c45 kernel: ---[ end trace 87d25452bc301feb ]---
85781 May 15 14:08:06 h10c45 sh: abrt-dump-oops: Found oopses: 1
85782 May 15 14:08:06 h10c45 sh: abrt-dump-oops: Creating problem directories
85783 May 15 14:08:06 h10c45 sh: abrt-dump-oops: Not going to make dump directories world readable because PrivateReports is on
85784 May 15 14:08:06 h10c45 abrt-server: Looking for kernel package
85785 May 15 14:08:07 h10c45 abrt-server: Kernel package kernel-3.10.0-1062.18.1.el7.x86_64 found
85786 May 15 14:08:07 h10c45 abrt-dump-oops: Reported 1 kernel oopses to Abrt
85787 May 15 14:08:08 h10c45 dbus[2280]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
85788 May 15 14:08:08 h10c45 dbus[2280]: [system] Successfully activated service 'org.freedesktop.problems'
85789 May 15 14:08:31 h10c45 slurmstepd[128820]: debug: jag_common_poll_data: Task 0 pid 128827 ave_freq = 2604638 mem size/max 0/0 vmem size/max 4468736/4468736, disk read size/max (2012/2012), disk write size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
```

The node finally crashed and burned just short of a half hour after that. Please let me know if there's anything else I can provide you.

Best,

Tony.

> slurmstepd: error: _prec_extra: Could not find task_memory_cg, this should
> never happen

This is a known error that has been fixed in 20.02 in commit 4c030c03778 (it will be in 20.02.3). It's harmless, so you don't need to worry about it.

> srun: error: timeout waiting for task launch, started 0 of 1 tasks
> srun: Job step 7722678.1 aborted before step completely launched.
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete

This means the task that srun tried to launch never started. You can increase MessageTimeout to potentially help with this. Your slurm.conf has it commented out, so I'm guessing you have it at the default of 10 seconds, which can be quite short for some systems, especially under heavier workloads. I'd recommend setting it to 30 seconds. However, this error is more likely a symptom of the actual problem you've been experiencing.
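As a sketch, the change recommended here amounts to a one-line slurm.conf setting (30 seconds is the value suggested above; confirm it against your own configuration):

```
# slurm.conf: allow more time for RPCs such as task-launch messages
MessageTimeout=30
```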
Thanks for getting some syslogs [1]. They give some information, but can you get more syslogs from that node (or another node that has crashed)? That will hopefully give us more context about the crash.

Bug 8851 and bug 7839 are both really similar to this one. It looks like neither of those really got resolved. Some of the common threads between those bugs and this one are: Lustre, NHC on a shared filesystem, RealMemory, and a heavy workload. There are a couple of things Felip suggested in those bugs that I don't see were ever done:

- Remove NHC from the shared filesystem. We want to eliminate the possibility of issues with NHC, Lustre, and your heavy workload.
- Make sure RealMemory is lower than the node's actual physical memory by at least a few GiB [2]. In comment 21 you said that node h10c45 crashed. What is that node's actual physical memory, and what is RealMemory set to in slurm.conf? I don't see node h10c45 in the slurm.conf that you uploaded - is this node in a different cluster?

Can you make those changes?

There's one more change I'd like to propose that you might try before those other ones. It's more of a workaround: try setting SchedulerParameters=defer in slurm.conf. This is in response to what you said in comment 7:

> It's possible this is either a resource contention or race condition. We are leaning in this
> direction only because if we slow down the number of jobs by inserting a sleep delay in the
> script - before it submits the chained job - the problem does not surface.

SchedulerParameters=defer effectively stops the main scheduler from running more frequently than every batch_sched_interval seconds (3 by default). You can also try setting sched_min_interval in SchedulerParameters instead (I recommend 100000 microseconds, i.e. 100 ms), which limits how frequently the main scheduler can run. Whatever problem is happening seems to be happening on the compute nodes, not the slurmctld node, but by limiting how quickly you schedule jobs, perhaps we can alleviate some sort of race condition or resource contention with whatever is causing the problem.

[1] We'll do our best to eliminate Slurm as a possibility, and fix any issues with Slurm that we find. But we're not kernel experts, and diagnosing this problem is really hard.

[2] See bug 7839 comment 27 for why you should set RealMemory a few GB lower than the system's physical memory. This is something we recommend to every site. We document it in man slurm.conf, but it could be clearer: https://slurm.schedmd.com/slurm.conf.html#OPT_RealMemory "Lowering RealMemory with the goal of setting aside some amount for the OS and not available for job allocations will not work as intended if Memory is not set as a consumable resource in SelectTypeParameters. So one of the *_Memory options need to be enabled for that goal to be accomplished."

Hi, I'm just wondering if you've had a chance to test any of the suggested changes?

- Marshall

I'm closing this as timed out. Please reopen it if you're still experiencing this issue.
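For reference, the scheduler-throttling workaround suggested a few comments above would amount to a slurm.conf change along these lines (a sketch; the values are the ones mentioned in the comments, and only one of the two variants would normally be used):

```
# slurm.conf (sketch): throttle the main scheduler.
# Either defer scheduling to the periodic batch_sched_interval cycle:
SchedulerParameters=defer
# ...or enforce a minimum interval between scheduling cycles (microseconds):
#SchedulerParameters=sched_min_interval=100000
```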