| Summary: | 16.05.6 slurmctld crash | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | slurmctld | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | jfbotts |
| Version: | 16.05.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NERSC | Version Fixed: | 16.05.7 |
| Attachments: | slurmctld log leading up to the crash; slurmctld log leading up to the crash; kept uploading the wrong path, stupid Documents folder; slurmctld for today | | |
Description
Doug Jacobsen
2016-11-01 15:44:56 MDT
Created attachment 3667 [details]
slurmctld log leading up to the crash
Doug, this log is from 10/17? I believe, though, the problem may actually be related to Bug 3211. Commit 9c0a2f2bd9b9 might fix this. Could you send, from frame 4, the output of
p *ss[x].tmpjobs
Created attachment 3668 [details]
slurmctld log leading up to the crash
Oh shoot, must have pulled down the wrong log. I'll check on that. I can't read bug #3211.
(gdb) p *ss[x].tmpjobs
$1 = {core_bitmap = 0x0, core_bitmap_used = 0x0, cpu_array_cnt = 1, cpu_array_value = 0x0, cpu_array_reps = 0x0, cpus = 0x0, cpus_used = 0x0, cores_per_socket = 0x0, memory_allocated = 0x0, memory_used = 0x0, nhosts = 1, node_bitmap = 0x0, node_req = 0, nodes = 0x0, ncpus = 1, sock_core_rep_count = 0x0, sockets_per_node = 0x0, whole_node = 0 '\000'}
(gdb) print x
$2 = 435
(gdb)
Hum, still only till the 17th :). Last two lines of both files...
[2016-10-17T14:43:40.601] debug: sched: already tested 402 jobs, breaking out
[2016-10-17T14:43:40.614] debug: sched: Running job scheduler
I made 3211 possible to read now. Though that bug deals with gang scheduling, I believe it is related to this one as well. If you didn't have any jobs trying to complete on unresponsive nodes, then this might not fix the issue. Created attachment 3669 [details]
kept uploading the wrong path, stupid Documents folder
Third time's the charm! Could you send me the slurmd log from nid03799? I am sure you are aware there are a host of nodes running a different slurm.conf according to the slurmctld, so I will only mention that once. Hmmm. The job_ptr indicates it was on nid06138, which is still up:
next_step_id = 0,
nodes = 0x7f7b5c112b70 "nid06138",
node_addr = 0x7f7b5c1e4ad0,
nid02305:/var/tmp/slurm # sinfo --node=nid06138
PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T NODES STATE NODELIST
system up 1-infini 30:00 48 2:12:2 1 idle nid06138
debug* up 1-infini 30:00 0 0:0:0 0 n/a
regular up 1-infini 4-00:00:00 0 0:0:0 0 n/a
regularx up 1-infini 2-00:00:00 0 0:0:0 0 n/a
realtime up 1-4 12:00:00 48 2:12:2 1 idle nid06138
shared up 1 2-00:00:00 48 2:12:2 1 idle nid06138
special up 1-infini 10-00:00:0 48 2:12:2 1 idle nid06138
nid02305:/var/tmp/slurm #
boot:~ # ssh nid06138
Welcome to the initramfs
# /sbin/chroot /dsl uptime
15:15pm up 12 days 7:03, 0 users, load average: 0.30, 0.30, 0.30
#
Guess I don't know how to use sacctmgr to see events:
nid01605:~ # sacctmgr show event where nodes=nid06138
Cluster NodeName TimeStart TimeEnd State Reason User
---------- --------------- ------------------- ------------------- ------ ------------------------------ ----------
nid01605:~ # sacctmgr show event where nodes=nid06138 cluster=edison
Cluster NodeName TimeStart TimeEnd State Reason User
---------- --------------- ------------------- ------------------- ------ ------------------------------ ----------
nid01605:~ #
but:
mysql> select * from edison_event_table where node_name="nid06138" order by time_start limit 5;
+------------+------------+-----------+---------------+----------------------------+------------+-------+--------------+
| time_start | time_end | node_name | cluster_nodes | reason | reason_uid | state | tres |
+------------+------------+-----------+---------------+----------------------------+------------+-------+--------------+
| 1452239802 | 1452239956 | nid06138 | | Not responding | 0 | 34849 | 1=48,2=64523 |
| 1452239956 | 1452240176 | nid06138 | | Not responding | 0 | 2049 | 1=48,2=64523 |
| 1453482650 | 1453573226 | nid06138 | | Not responding | 0 | 2049 | 1=48,2=64523 |
| 1453573226 | 1453574719 | nid06138 | | Node unexpectedly rebooted | 0 | 1 | 1=48,2=64523 |
| 1453918943 | 1453919123 | nid06138 | | Not responding | 0 | 2049 | 1=48,2=64523 |
+------------+------------+-----------+---------------+----------------------------+------------+-------+--------------+
5 rows in set (0.00 sec)
mysql>
The last "event" on this node was completed (time_end non-zero).
The slurmctld log indicates it was actively starting jobs on that node immediately before the crash:
[2016-11-01T09:02:08.109] sched: Allocate JobID=2548645 NodeList=nid06138 #CPUs=2 Partition=shared
[2016-11-01T09:02:08.112] sched: Allocate JobID=2548646 NodeList=nid06138 #CPUs=2 Partition=shared
[2016-11-01T09:02:08.115] sched: Allocate JobID=2548647 NodeList=nid06138 #CPUs=2 Partition=shared
[2016-11-01T09:02:08.155] sched: Allocate JobID=2548648 NodeList=nid06138 #CPUs=2 Partition=shared
[2016-11-01T09:02:08.157] sched: Allocate JobID=2548649 NodeList=nid06138 #CPUs=2 Partition=shared
That user has lots of jobs, though not many super-short ones, but nid06138 was definitely active for them.
edique01:~ # sacct -u heikki --format=job,start,end,elapsed,exit,nodelist -X
JobID Start End Elapsed ExitCode NodeList
------------ ------------------- ------------------- ---------- -------- ---------------
2544790 2016-10-31T23:33:32 2016-11-01T01:33:00 01:59:28 0:0 nid06135
2544791 2016-10-31T23:33:32 2016-11-01T04:30:49 04:57:17 0:0 nid06132
...
2548729 2016-11-01T09:02:08 2016-11-01T09:22:22 00:20:14 0:0 nid06124
2548730 2016-11-01T09:02:08 2016-11-01T09:22:22 00:20:14 0:0 nid06124
2548731 2016-11-01T09:02:08 2016-11-01T09:22:22 00:20:14 0:0 nid06124
2548732 2016-11-01T09:02:08 2016-11-01T09:22:22 00:20:14 0:0 nid06124
2548733 2016-11-01T09:02:08 2016-11-01T09:22:22 00:20:14 0:0 nid06124
2548734 2016-11-01T09:02:08 2016-11-01T09:22:22 00:20:14 0:0 nid06124
2548735 2016-11-01T09:02:08 2016-11-01T09:22:22 00:20:14 0:0 nid06105
2548736 2016-11-01T09:02:08 2016-11-01T09:22:21 00:20:13 0:0 nid06105
2548737 2016-11-01T09:02:08 2016-11-01T09:22:22 00:20:14 0:0 nid06105
2548818 2016-11-01T09:54:21 2016-11-01T10:04:41 00:10:20 0:0 nid06143
2548820 2016-11-01T09:54:21 2016-11-01T10:04:15 00:09:54 0:0 nid06142
2548822 2016-11-01T09:54:21 2016-11-01T10:04:20 00:09:59 0:0 nid06142
2548823 2016-11-01T09:54:21 2016-11-01T10:05:57 00:11:36 0:0 nid06130
2548824 2016-11-01T09:54:21 2016-11-01T10:05:51 00:11:30 0:0 nid06130
2548825 2016-11-01T09:54:21 2016-11-01T10:05:20 00:10:59 0:0 nid06130
2548826 2016-11-01T09:54:21 2016-11-01T10:05:20 00:10:59 0:0 nid06130
2548827 2016-11-01T09:54:21 2016-11-01T10:04:49 00:10:28 0:0 nid06092
2548828 2016-11-01T09:54:21 2016-11-01T10:04:51 00:10:30 0:0 nid06092
2548829 2016-11-01T09:54:21 2016-11-01T10:04:46 00:10:25 0:0 nid06092
2548830 2016-11-01T09:54:21 2016-11-01T10:04:41 00:10:20 0:0 nid06092
2548831 2016-11-01T09:54:21 2016-11-01T10:04:41 00:10:20 0:0 nid06092
2548832 2016-11-01T09:54:21 2016-11-01T10:04:42 00:10:21 0:0 nid06092
...
A little bit of weird missing data here, from a few hours after the crash:
2551128 2016-11-01T11:49:32 2016-11-01T11:59:10 00:09:38 0:0 nid06116
2551129 2016-11-01T11:49:32 2016-11-01T11:59:23 00:09:51 0:0 nid06116
2551130 2016-11-01T11:49:32 2016-11-01T11:59:27 00:09:55 0:0 nid06116
2551131 2016-11-01T11:49:32 2016-11-01T11:59:26 00:09:54 0:0 nid06116
2551132 2016-11-01T11:49:32 2016-11-01T11:59:32 00:10:00 0:0 nid06116
2551133 2016-11-01T11:49:32 2016-11-01T11:59:17 00:09:45 0:0 nid06116
2551387 2016-11-01T12:07:40 2016-11-01T12:07:40 00:00:00 0:0 None assigned
2551388 2016-11-01T12:07:40 2016-11-01T12:07:40 00:00:00 0:0 None assigned
2551389 2016-11-01T12:07:40 2016-11-01T12:07:40 00:00:00 0:0 None assigned
2551390 2016-11-01T12:07:40 2016-11-01T12:07:40 00:00:00 0:0 None assigned
2551391 2016-11-01T12:07:40 2016-11-01T12:07:40 00:00:00 0:0 None assigned
2551392 2016-11-01T12:07:40 2016-11-01T12:07:40 00:00:00 0:0 None assigned
2551393 2016-11-01T12:07:40 2016-11-01T12:07:40 00:00:00 0:0 None assigned
2551394 2016-11-01T12:07:40 2016-11-01T12:07:40 00:00:00 0:0 None assigned
2551395 2016-11-01T12:07:40 2016-11-01T12:07:40 00:00:00 0:0 None assigned
2551396 2016-11-01T12:07:40 2016-11-01T12:07:40 00:00:00 0:0 None assigned
2551397 2016-11-01T12:07:40 2016-11-01T12:07:40 00:00:00 0:0 None assigned
2551660 2016-11-01T12:24:35 2016-11-01T12:33:38 00:09:03 0:0 nid06116
2551661 2016-11-01T12:24:35 2016-11-01T12:33:55 00:09:20 0:0 nid06142
2551662 2016-11-01T12:24:35 2016-11-01T12:35:27 00:10:52 0:0 nid06138
2551663 2016-11-01T12:24:35 2016-11-01T12:35:24 00:10:49 0:0 nid06138
...
...
and so on.
Ah, I needed a start filter on that sacctmgr show event. That node did crash immediately before.
Sadly, the slurmd logs are in memory, and the node is gone:
boot:~ # ssh nid03799
ssh: connect to host nid03799 port 22: Network is unreachable
boot:~ # sinfo --node=nid03799
PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T NODES STATE NODELIST
system up 1-infini 30:00 48 2:12:2 1 down* nid03799
debug* up 1-infini 30:00 48 2:12:2 1 down* nid03799
regular up 1-infini 4-00:00:00 48 2:12:2 1 down* nid03799
regularx up 1-infini 2-00:00:00 48 2:12:2 1 down* nid03799
realtime up 1-4 12:00:00 48 2:12:2 1 down* nid03799
shared up 1 2-00:00:00 0 0:0:0 0 n/a
special up 1-infini 10-00:00:0 48 2:12:2 1 down* nid03799
boot:~ # sacctmgr show event nodes=nid03799
sacctmgr: error: Problem talking to the database: Network is unreachable
boot:~ # ssh ctl1 sacctmgr show event nodes=nid03799
Cluster NodeName TimeStart TimeEnd State Reason User
---------- --------------- ------------------- ------------------- ------ ------------------------------ ----------
edison nid03799 2016-10-30T02:47:37 2016-11-01T09:01:18 DOWN* Not responding root(0)
edison nid03799 2016-11-01T09:29:04 Unknown DOWN* Not responding root(0)
boot:~ #
I've just implemented rotation and transfer of slurmd logs on Cori, but that only happens at job completion, which is unlikely to happen if the node crashes mid-job.
Regarding the slurm.conf mismatch: yes, this happens because we don't SIGHUP slurmd between jobs. That's now happening on Cori as part of log rotation; Edison will get it when we move to CLE6.
It looks like the node actually went down on 10/30 and hopefully a job wasn't stuck completing for that long!
The database, at least, believes there wasn't:
nid01605:~ # sacct --node=nid03799
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
nid01605:~ #
Sorry, to clarify my statement: the node didn't actually crash immediately before this, but it appears there was a state change in sinfo (for reasons I didn't dig into).
(Written before comment 8) The job_ptr you printed out in the beginning isn't involved in the issue though ;). The bit_ffs is dying on ss[x].tmpjobs->node_bitmap, which is "in theory" pointing to a job that was running on nid03799; I wouldn't expect nid06138 to be involved at this point. You can use
sacct -Nnid03799 -S9:00 -E10:00 --format=jobid,nodelist,state,start,end
to hopefully give you what you need. (Written afterwards) Could you see if the sacct above gives you anything? I am sad about the logs; that would have been the easiest way to figure things out. squeue might have a hung job sitting around, or a note in the slurmctld log after the fact, though it might be hard to figure out. But this does seem to sound like 3211 at this point (the fix for which is in 16.05.6).
Our console logs indicate that nid03799 crashed on 10/30 at 2:39am:
2016-10-30T02:39:41.896526-07:00 c3-2c2s5n3 [<ffffffff8124a3e4>] intel_idle+0xc4/0x150
2016-10-30T02:39:41.896554-07:00 c3-2c2s5n3 [<ffffffff813555a8>] cpuidle_idle_call+0x128/0x220
2016-10-30T02:39:41.896561-07:00 c3-2c2s5n3 [<ffffffff810020f6>] cpu_idle+0x76/0xd0
2016-10-30T02:39:41.896586-07:00 c3-2c2s5n3 [<ffffffff8141304d>] start_secondary+0x1dc/0x26f
2016-10-30T02:39:41.896598-07:00 c3-2c2s5n3 Stop NMI detected on CPU 0
sacct doesn't show anything on the node at that time:
nid01605:~ # sacct --node=nid03799 --start=2016-10-30 -X --format=job,start,end,state
JobID Start End State
------------ ------------------- ------------------- ----------
2161624 2016-10-29T22:46:00 2016-10-30T00:29:02 COMPLETED
2485097 2016-10-30T02:22:06 2016-10-30T02:23:42 FAILED
2518692 2016-10-30T01:00:05 2016-10-30T01:45:43 COMPLETED
2518812 2016-10-30T01:47:19 2016-10-30T02:06:50 COMPLETED
nid01605:~ #
I don't see any evidence the node was allocated to a job after 2485097 before it crashed.
bzgrep 3799 slurmctld.log-20161031.bz2
...
...
[2016-10-30T02:22:31.932] debug: _slurm_rpc_job_alloc_info_lite JobId=2485097 NodeList=nid0[0296-0323,0328-0378,0393-0451,0456-0497,0502-0511,0516-0579,0584-0613,0615-0767,0772-0835,0840-0895,0897-0963,0968-1151,1156-1219,1224-1233,1255-1279,1281-1321,1587-1603,1608-1637,1649-1663,1666-1717,2536-2687,2692-2755,2760-3071,3076-3139,3144-3455,3460-3523,3528-3647,3694-3839,3844-3907,3912-4124,4126-4209,4231-4291,4296-4300,4302-4607,4992-5351,5353-5443,5448-5452,5454-5759,5764-5827,5832-5951,5956-6076] usec=2
[2016-10-30T02:23:05.688] debug: _slurm_rpc_job_alloc_info_lite JobId=2485097 NodeList=nid0[0296-0323,0328-0378,0393-0451,0456-0497,0502-0511,0516-0579,0584-0613,
0615-0767,0772-0835,0840-0895,0897-0963,0968-1151,1156-1219,1224-1233,1255-1279,1281-1321,1587-1603,1608-1637,1649-1663,1666-1717,2536-2687,2692-2755,2760-3071,307
6-3139,3144-3455,3460-3523,3528-3647,3694-3839,3844-3907,3912-4124,4126-4209,4231-4291,4296-4300,4302-4607,4992-5351,5353-5443,5448-5452,5454-5759,5764-5827,5832-5
951,5956-6076] usec=1
[2016-10-30T02:23:25.461] debug: _slurm_rpc_job_alloc_info_lite JobId=2518665 NodeList=nid0[1248-1249] usec=1
[2016-10-30T02:25:12.554] debug: _slurm_rpc_job_alloc_info_lite JobId=2360338 NodeList=nid0[1248-1249] usec=858568
[2016-10-30T02:25:27.573] debug: _slurm_rpc_job_alloc_info_lite JobId=2518951 NodeList=nid00292 usec=641095
[2016-10-30T02:25:39.077] debug: _slurm_rpc_job_alloc_info_lite JobId=2518656 NodeList=nid0[3688-3689] usec=1
[2016-10-30T02:25:41.119] debug: _slurm_rpc_job_alloc_info_lite JobId=2518662 NodeList=nid0[1246-1247] usec=2
[2016-10-30T02:28:36.357] debug: _slurm_rpc_job_alloc_info_lite JobId=2360339 NodeList=nid0[1248-1249] usec=519370
[2016-10-30T02:29:19.824] debug: _slurm_rpc_job_alloc_info_lite JobId=2518656 NodeList=nid0[3688-3689] usec=2
[2016-10-30T02:29:24.127] debug: _slurm_rpc_job_alloc_info_lite JobId=2518662 NodeList=nid0[1246-1247] usec=1
[2016-10-30T02:32:34.602] debug: _slurm_rpc_job_alloc_info_lite JobId=2518952 NodeList=nid00292 usec=607561
[2016-10-30T02:32:58.390] debug: _slurm_rpc_job_alloc_info_lite JobId=2518656 NodeList=nid0[3688-3689] usec=1
[2016-10-30T02:33:02.855] debug: _slurm_rpc_job_alloc_info_lite JobId=2518662 NodeList=nid0[1246-1247] usec=1
[2016-10-30T02:38:05.341] debug: _slurm_rpc_job_alloc_info_lite JobId=2518656 NodeList=nid0[3688-3689] usec=2
[2016-10-30T02:38:12.040] debug: _slurm_rpc_job_alloc_info_lite JobId=2518662 NodeList=nid0[1246-1247] usec=2
[2016-10-30T02:39:32.328] debug: Note large processing time from _slurm_rpc_job_alloc_info_lite: usec=1247485 began=02:39:31.080
[2016-10-30T02:39:32.328] debug: _slurm_rpc_job_alloc_info_lite JobId=2518956 NodeList=nid000[08-23] usec=1247485
[2016-10-30T02:39:32.328] debug: _slurm_rpc_job_alloc_info_lite JobId=2518967 NodeList=nid00024 usec=31591
[2016-10-30T02:39:40.144] debug: _slurm_rpc_job_alloc_info_lite JobId=2518954 NodeList=nid0[0615-0767,0772-0835,0840-0895,0897-0963,0968-1127] usec=1
[2016-10-30T02:39:41.952] debug: _slurm_rpc_job_alloc_info_lite JobId=2518955 NodeList=nid0[5454-5759,5764-5827,5832-5861] usec=2
[2016-10-30T02:40:09.163] debug: _slurm_rpc_job_alloc_info_lite JobId=2518820 NodeList=nid00[296-323,328-383,388-431] usec=3
[2016-10-30T02:41:16.567] debug: _slurm_rpc_job_alloc_info_lite JobId=2518826 NodeList=nid0[5353-5436] usec=2
[2016-10-30T02:41:38.726] debug: _slurm_rpc_job_alloc_info_lite JobId=2517441 NodeList=nid00[498-501] usec=2
[2016-10-30T02:41:39.529] debug: _slurm_rpc_job_alloc_info_lite JobId=2518819 NodeList=nid0[4126-4209] usec=2
[2016-10-30T02:41:55.695] debug: _slurm_rpc_job_alloc_info_lite JobId=2518662 NodeList=nid0[1246-1247] usec=1
[2016-10-30T02:41:57.474] debug: _slurm_rpc_job_alloc_info_lite JobId=2518656 NodeList=nid0[3688-3689] usec=1
[2016-10-30T02:43:01.249] debug: _slurm_rpc_job_alloc_info_lite JobId=2516167 NodeList=nid0[5437-5443,5448] usec=899183
[2016-10-30T02:43:07.913] debug: _slurm_rpc_job_alloc_info_lite JobId=2518966 NodeList=nid0[4552-4607,4992-5335] usec=2
[2016-10-30T02:43:09.749] debug: _slurm_rpc_job_alloc_info_lite JobId=2161632 NodeList=nid0[2536-2687,2692-2755,2760-3071,3076-3139,3144-3238,3319-3455,3460-3523,
3528-3647,5336-5351] usec=2
[2016-10-30T02:43:09.760] debug: _slurm_rpc_job_alloc_info_lite JobId=2161634 NodeList=nid0[1323-1347,1352-1535,1540-1603,1608-1647,3829-3839,3844-3907,3912-4118,
4302-4550,5862-5951,5956-6045] usec=4968
nid02305:/var/tmp/slurm # bzgrep 3799 slurmctld.log-20161031.bz2
[2016-10-30T00:43:23.536] backfill test for JobID=2379803 Prio=73799 Partition=regular
[2016-10-30T00:46:48.402] backfill test for JobID=2379803 Prio=73799 Partition=regular
[2016-10-30T00:50:38.940] debug: reserved ports 63799 for step 2246906.72
[2016-10-30T02:00:56.224] Job 2512459 to start at 2016-10-31T07:05:00, end at 2016-10-31T19:05:00 on nid0[3790-3799]
[2016-10-30T02:34:24.605] error: Node nid03799 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2016-10-30T02:38:46.837] agent/is_node_resp: node:nid03799 RPC:REQUEST_PING : Socket timed out on send/recv operation
[2016-10-30T02:38:48.327] error: Nodes nid0[3239,3318,3799,3828] not responding
[2016-10-30T02:39:17.308] debug: Spawning ping agent for nid0[0008-0063,0072-0127,0136-0191,0200-0244,0246-0255,0264-0291,0293-0323,0328-0383,0388-0451,0456-0511,0516-0579,0584-0613,0615-0767,0772-0835,0840-0895,0897-0963,0968-1151,1156-1219,1224-1233,1235-1279,1281-1321,1323-1347,1352-1535,1540-1591,1593-1603,1608-1663,1666-1717,1719-1792,1794-1919,1924-1987,1992-2047,2050-2303,2308-2371,2376-2499,2504-2687,2692-2755,2760-3071,3076-3139,3144-3238,3240-3317,3319-3455,3460-3523,3528-3670,3672-3795,3799,3814-3827,3829-3831,3872-3907,3912-4223,4228-4229,4231-4291,4296-4300,4302-4607,4992-5351,5353-5443,5448-5452,5454-5759,5764-5827,5832-5951,5956-6104,6106-6114,6116-6121,6123-6129,6132,6134-6139,6141-6142]
[2016-10-30T02:40:17.594] Node nid03799 now responding
[2016-10-30T02:42:37.670] debug: Spawning ping agent for nid0[0008-0023,0025-0063,0072-0127,0136-0191,0200-0244,0246-0255,0264-0291,0293-0323,0328-0383,0388-0451,0456-0497,0502-0511,0516-0579,0584-0613,1124,1128-1151,1156-1219,1224-1233,1235-1279,1281-1321,1323-1347,1352-1535,1540-1591,1593-1603,1608-1663,1666-1717,1719-1792,1794-1919,1924-1987,1992-2047,2050-2303,2308-2371,2376-2499,2504-2687,2692-2755,2760-3012,3014-3071,3076-3139,3144-3238,3240-3317,3319-3455,3460-3523,3528-3670,3672-3795,3799,3814-3827,3829-3831,3854-3871,3912-4223,4228-4229,4231-4291,4296-4300,4302-4607,4992-5351,5353-5443,5448-5452,5454-5759,5764-5827,5832-5951,5956-6104,6106-6114,6116-6121,6123-6129,6132,6134-6139,6141]
...
...
It's a little odd that it keeps getting the "now responding" messages, and then going down again.
It looks like slurmdbd only recorded the initial down event 10 minutes after it crashed (which is about right, given that we allow nodes 600s).
nid02305:/var/tmp/slurm # sacctmgr show event nodes=nid03799 start=2016-10-01
Cluster NodeName TimeStart TimeEnd State Reason User
---------- --------------- ------------------- ------------------- ------ ------------------------------ ----------
edison nid03799 2016-10-30T02:47:37 2016-11-01T09:01:18 DOWN* Not responding root(0)
edison nid03799 2016-11-01T09:29:04 Unknown DOWN* Not responding root(0)
nid02305:/var/tmp/slurm #
So it seems unlikely a job was completing on this node at the time of the slurmctld crash.
Yeah, I don't think the incident at 2:39 was related (and perhaps this node isn't related altogether). According to your sacctmgr output the node came back up at 2016-11-01T09:01:18 and then went down again at 2016-11-01T09:29:04. That last down is the only time of interest. From the logs:
[2016-11-01T09:01:18.134] update_node: node nid03799 state set to IDLE
Something appears to have made that node idle again. But going through the logs I can't seem to find any evidence this node is related to the issue (meaning no jobs were running on it). I am not totally convinced this isn't related to 3211 though. In any case it sounds like things are up and running, and this only happened once. As you are going to put 16.05.6 on tomorrow, the problem might not ever show up again.
Well, I see nid03799 set to IDLE. What's concerning about that is that the node was fully down:
2016-10-30T02:39:41.896598-07:00 c3-2c2s5n3 Stop NMI detected on CPU 0
So it's unclear to me how slurmd could have responded at 09:01 this morning...
This looks like a scontrol thing, not anything from the node.
Odd, I wonder who/what/where is attempting to manipulate the node (cle52 didn't try to adjust the reason code nearly as much as cle6). We updated edison to Slurm 16.05.6 yesterday. I suppose we can wait and see if this issue re-occurs. -Doug
I've updated the title of the bug. The crash reoccurred in Slurm 16.05.6:
(gdb) bt
#0 0x00007fc03cdf2875 in raise () from /lib64/libc.so.6
#1 0x00007fc03cdf3e51 in abort () from /lib64/libc.so.6
#2 0x00007fc03cdeb740 in __assert_fail () from /lib64/libc.so.6
#3 0x0000000000512f14 in bit_ffs (b=0x0) at bitstring.c:484
#4 0x00007fc030e28ecc in _build_row_bitmaps (p_ptr=0x4c62870,
job_ptr=0x5c00ee0) at select_cons_res.c:655
#5 0x00007fc030e2adfa in _rm_job_from_res (part_record_ptr=0x26e59d0,
node_usage=0x5391130, job_ptr=0x5c00ee0, action=0)
at select_cons_res.c:1273
#6 0x00007fc030e2d984 in select_p_job_fini (job_ptr=0x5c00ee0)
at select_cons_res.c:2397
#7 0x00007fc03c7ad11b in other_job_fini (job_ptr=0x5c00ee0)
at other_select.c:386
#8 0x00007fc03c7a89b6 in _job_fini (args=0x5c00ee0) at select_cray.c:1083
#9 0x00007fc03d143806 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fc03ce9e9bd in clone () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()
(gdb) set print pretty
(gdb) up
#1 0x00007fc03cdf3e51 in abort () from /lib64/libc.so.6
(gdb) up
#2 0x00007fc03cdeb740 in __assert_fail () from /lib64/libc.so.6
(gdb) up
#3 0x0000000000512f14 in bit_ffs (b=0x0) at bitstring.c:484
484 bitstring.c: No such file or directory.
(gdb) up
#4 0x00007fc030e28ecc in _build_row_bitmaps (p_ptr=0x4c62870,
job_ptr=0x5c00ee0) at select_cons_res.c:655
655 select_cons_res.c: No such file or directory.
(gdb) up
#5 0x00007fc030e2adfa in _rm_job_from_res (part_record_ptr=0x26e59d0,
node_usage=0x5391130, job_ptr=0x5c00ee0, action=0)
at select_cons_res.c:1273
1273 in select_cons_res.c
(gdb) down
#4 0x00007fc030e28ecc in _build_row_bitmaps (p_ptr=0x4c62870,
job_ptr=0x5c00ee0) at select_cons_res.c:655
655 in select_cons_res.c
(gdb) down
#3 0x0000000000512f14 in bit_ffs (b=0x0) at bitstring.c:484
484 bitstring.c: No such file or directory.
(gdb) up
#4 0x00007fc030e28ecc in _build_row_bitmaps (p_ptr=0x4c62870,
job_ptr=0x5c00ee0) at select_cons_res.c:655
655 select_cons_res.c: No such file or directory.
(gdb) print *p_ptr
$1 = {
next = 0x262f5b0,
num_rows = 25,
part_ptr = 0x240f0c0,
row = 0x4041a60
}
(gdb) print *job_ptr
$2 = {
account = 0x2813690 "m1304",
alias_list = 0x0,
alloc_node = 0x44e1300 "edison06",
alloc_resp_port = 0,
alloc_sid = 37672,
array_job_id = 0,
array_task_id = 4294967294,
array_recs = 0x0,
assoc_id = 7309,
assoc_ptr = 0x16a8a80,
batch_flag = 1,
batch_host = 0x7fbff83da6b0 "nid06111",
bit_flags = 0,
burst_buffer = 0x0,
check_job = 0x0,
ckpt_interval = 0,
ckpt_time = 0,
comment = 0x0,
cpu_cnt = 0,
billable_tres = 2,
cr_enabled = 1,
deadline = 0,
db_index = 13707499,
---Type <return> to continue, or q <return> to quit---
derived_ec = 0,
details = 0x5c01220,
direct_set_prio = 0,
end_time = 1478181082,
end_time_exp = 4294967294,
epilog_running = false,
exit_code = 0,
front_end_ptr = 0x0,
gres = 0x5c01400 "craynetwork:0",
gres_list = 0x0,
gres_alloc = 0x7fbff83da5d0 "craynetwork:4",
gres_req = 0x7fbff83da700 "craynetwork:0",
gres_used = 0x0,
group_id = 62595,
job_id = 2571089,
job_next = 0x0,
job_array_next_j = 0x0,
job_array_next_t = 0x0,
job_resrcs = 0x7fbff8503d50,
job_state = 3,
kill_on_node_fail = 1,
licenses = 0x5c01430 "cscratch1:1,project:1",
license_list = 0x7fc012bcff60,
limit_set = {
---Type <return> to continue, or q <return> to quit---
qos = 0,
time = 0,
tres = 0x3c84680
},
mail_type = 0,
mail_user = 0x0,
magic = 4038539564,
mcs_label = 0x0,
name = 0x39585d0 "qsub_analyse-161103-RUN51186.sh",
network = 0x0,
next_step_id = 0,
nodes = 0x7fbff81e09a0 "nid06111",
node_addr = 0x7fbff83da630,
node_bitmap = 0x7fbff80ac6f0,
node_bitmap_cg = 0x7fbfd4173500,
node_cnt = 0,
node_cnt_wag = 1,
nodes_completing = 0x0,
other_port = 0,
partition = 0x46735c0 "shared",
part_ptr_list = 0x0,
part_nodes_missing = false,
part_ptr = 0x240f0c0,
power_flags = 0 '\000',
---Type <return> to continue, or q <return> to quit---
pre_sus_time = 0,
preempt_time = 0,
preempt_in_progress = false,
priority = 57222,
priority_array = 0x0,
prio_factors = 0x3c84610,
profile = 0,
qos_id = 1,
qos_ptr = 0x8f8520,
reboot = 0 '\000',
restart_cnt = 0,
resize_time = 0,
resv_id = 0,
resv_name = 0x0,
resv_ptr = 0x0,
requid = 4294967295,
resp_host = 0x0,
sched_nodes = 0x0,
select_jobinfo = 0x5c014e0,
spank_job_env = 0x47a82d0,
spank_job_env_size = 1,
start_protocol_ver = 7680,
start_time = 1478181072,
state_desc = 0x0,
---Type <return> to continue, or q <return> to quit---
state_reason = 0,
state_reason_prev = 1,
step_list = 0x456b700,
suspend_time = 0,
time_last_active = 1478181082,
time_limit = 59,
time_min = 0,
tot_sus_time = 0,
total_cpus = 2,
total_nodes = 1,
tres_req_cnt = 0x39fa0f0,
tres_req_str = 0x44e1010 "1=1,2=1300,4=1",
tres_fmt_req_str = 0x5c013a0 "cpu=1,mem=1300,node=1",
tres_alloc_cnt = 0x7fbff857d120,
tres_alloc_str = 0x7fbff857d230 "1=2,2=2600,4=1",
tres_fmt_alloc_str = 0x7fbff818c840 "cpu=2,mem=2600M,node=1",
user_id = 62595,
wait_all_nodes = 0,
warn_flags = 0,
warn_signal = 0,
warn_time = 0,
wckey = 0x0,
req_switch = 0,
wait4switch = 0,
---Type <return> to continue, or q <return> to quit---
best_switch = true,
wait4switch_start = 0
}
(gdb) print *job_ptr->details
$3 = {
acctg_freq = 0x0,
argc = 1,
argv = 0x3bdf730,
begin_time = 1478181026,
ckpt_dir = 0x5c01510 "/var/slurm/checkpoint",
contiguous = 0,
core_spec = 65534,
cpu_bind = 0x0,
cpu_bind_type = 65534,
cpu_freq_min = 4294967294,
cpu_freq_max = 4294967294,
cpu_freq_gov = 4294967294,
cpus_per_task = 1,
depend_list = 0x0,
dependency = 0x0,
orig_dependency = 0x0,
env_cnt = 0,
env_sup = 0x0,
exc_node_bitmap = 0x0,
exc_nodes = 0x0,
expanding_jobid = 0,
feature_list = 0x0,
features = 0x0,
---Type <return> to continue, or q <return> to quit---
magic = 0,
max_cpus = 4294967294,
max_nodes = 1,
mc_ptr = 0x276f5c0,
mem_bind = 0x0,
mem_bind_type = 0,
min_cpus = 1,
min_nodes = 1,
nice = 2147483648,
ntasks_per_node = 0,
num_tasks = 1,
open_mode = 0 '\000',
overcommit = 0 '\000',
plane_size = 0,
pn_min_cpus = 1,
pn_min_memory = 2147484948,
pn_min_tmp_disk = 0,
prolog_running = 0 '\000',
reserved_resources = 0,
req_node_bitmap = 0x0,
req_node_layout = 0x0,
preempt_start_time = 0,
req_nodes = 0x0,
requeue = 0,
---Type <return> to continue, or q <return> to quit---
restart_dir = 0x0,
share_res = 1 '\001',
std_err = 0x0,
std_in = 0x5c01460 "/dev/null",
std_out = 0x0,
submit_time = 1478181026,
task_dist = 8192,
usable_nodes = 0,
whole_node = 0 '\000',
work_dir = 0x5c01490 "/global/u1/e/epuesche/EBLtosummer2016/v480a/scripts/VTS"
}
(gdb)
(gdb) print *ss[x].tmpjobs
$4 = {
core_bitmap = 0x0,
core_bitmap_used = 0x0,
cpu_array_cnt = 1,
cpu_array_value = 0x0,
cpu_array_reps = 0x0,
cpus = 0x0,
cpus_used = 0x0,
cores_per_socket = 0x0,
memory_allocated = 0x0,
memory_used = 0x0,
nhosts = 1,
node_bitmap = 0x0,
node_req = 0,
nodes = 0x0,
ncpus = 1,
sock_core_rep_count = 0x0,
sockets_per_node = 0x0,
whole_node = 0 '\000'
}
(gdb)
Ah, well that is good confirmation this isn't related to bug 3211 then. We will continue to look into it.

I am guessing this is more related to bug 3185, as previously thought. Doug, could you post an updated slurmctld log for the hour before this incident happened?

Created attachment 3675 [details]
slurmctld for today
Sorry, I didn't have time to filter it.
I've been reviewing code and testing all day, but unfortunately don't have a solution for you. What I can tell you is that the root problem is that a data structure in the select/cons_res plugin contains a list of pointers to the active jobs' resource allocations. That structure contains a NULL node bitmap, which triggers the abort. The job in the backtrace is not the job with the bad data, but it triggers the failure due to a pre-existing bad data structure. My best guess is that the underlying issue is some sort of race condition involving NHC and job requeue, but that's just a guess.

We could get you a patch that skips over the record with the NULL pointer, but at some point the pointer will likely be an invalid non-NULL pointer, and that could result in severe and unpredictable consequences, so I'll concentrate on getting to the bottom of the problem and fixing it.

Notes to self:
1. My best guess is that free_job_resources() in cr_job_test() in the job_test.c module is getting called on a job that is still running the NHC/cleaning. When NHC completes, it will try to remove the job's record via other_job_fini(), called from _job_fini() in select_cray.c, but not find it and leave an old/bad pointer.
2. The use of "job_list" in select/cons_res (see _rm_job_from_res() in select_cons_res.c) is fast and efficient, but not robust in the case of failures. It should probably be changed to a job pointer.

I was just able to reproduce this failure with the addition of some well-placed "sleep" calls in the slurm logic. Any time a running job is requeued and the job/allocation NHC runs for longer than the Munge credential lifetime (default 5 minutes), a data structure is cleared and later referenced. There are a lot of moving parts and the failure cannot easily be reproduced, but the requeue/NHC timing issue described above makes this failure possible.

This commit fixes a race condition that I was able to exploit to reproduce this failure.
I don't have sufficient information to be certain this is what caused your aborts, but it was definitely a bug that could cause this failure and I don't see any other possibilities.
https://github.com/SchedMD/slurm/commit/69af50af965d70af5642e156b4186ed6cc13aaa8

OK, this seems promising. I'll test this on alva a bit tonight for basic functionality and deploy to edison tomorrow.

Thanks,
Doug

(In reply to Doug Jacobsen from comment #24)
> OK, this seems promising. I'll test this on alva a bit tonight for
> basic functionality and deploy to edison tomorrow.

Since I haven't heard back, I'll close the ticket based upon the assumption this patch fixed the problem. Please re-open if necessary.

Hi Moe,

Sorry about that. I applied the patch on edison (but not cori) and we haven't had any problems since. The key difference between edison and cori is that cori doesn't run the NHC via the cray/slurm plugin mechanisms any more, though I gather from reading the patch there are other pathways for triggering this bug. I'll roll this patch into cori at the next opportunity.

-Doug