Ticket 1461

Summary: slurmdbd dumps core after update to 14.11.4
Product: Slurm    Reporter: Don Lipari <lipari1>
Component: slurmdbd    Assignee: Brian Christiansen <brian>
Status: RESOLVED FIXED
Severity: 1 - System not usable
Priority: ---    CC: brian, da
Version: 14.11.4
Hardware: IBM BlueGene
OS: Linux
Site: LLNL
Version Fixed: 14.11.5 15.08.0pre3
Attachments: Same change in a patch file
Complete log file

Description Don Lipari 2015-02-17 02:39:25 MST
After updating the slurmdbd on our dedicated host servicing the rzuseq, the slurmdbd fired up ok.  After updating rzuseq to v14.11.4 of slurm and starting the slurmctld, the dbd dumped core:

(gdb) bt full
#0  __strcasecmp_l_sse42 () at ../sysdeps/x86_64/multiarch/strcmp.S:259
No locals.
#1  0x00002aaaabad13a6 in as_mysql_node_down (mysql_conn=0x2aaab80334e0, node_ptr=0x2aaaaeb66bc0, event_time=1424190732, reason=0x0, reason_uid=101)
    at as_mysql_cluster.c:1085
        cpus = 8192
        rc = 0
        query = 0x0
        my_reason = 0x0
        result = 0x2aaab8019a00
        row = 0x2aaab8047828
#2  0x00002aaaabab29c1 in clusteracct_storage_p_node_down (mysql_conn=0x2aaab80334e0, node_ptr=0x2aaaaeb66bc0, event_time=1424190732, reason=0x0, reason_uid=101)
    at accounting_storage_mysql.c:2627
No locals.
#3  0x0000000000524898 in clusteracct_storage_g_node_down (db_conn=0x2aaab80334e0, node_ptr=0x2aaaaeb66bc0, event_time=1424190732, reason=0x0, reason_uid=101)
    at slurm_accounting_storage.c:703
No locals.
#4  0x000000000042ad2e in _node_state (slurmdbd_conn=0x2aaab40008d0, in_buffer=0x2aaab8019be0, out_buffer=0x2aaaaeb66dd8, uid=0x2aaaaeb66e10) at proc_req.c:2792
        node_state_msg = 0x2aaab8033b30
        node_ptr = {magic = 0, name = 0x2aaab8033bc0 "rzuseq0001", node_hostname = 0x0, node_state = 32801, not_responding = false, boot_time = 0, 
          slurmd_start_time = 0, last_response = 0, last_idle = 0, cpus = 8192, boards = 0, sockets = 0, cores = 0, cpu_spec_list = 0x0, core_spec_cnt = 0, 
          threads = 0, real_memory = 0, mem_spec_limit = 0, tmp_disk = 0, up_time = 0, config_ptr = 0x0, part_cnt = 0, part_pptr = 0x0, comm_name = 0x0, port = 0, 
          slurm_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, comp_job_cnt = 0, run_job_cnt = 0, 
          sus_job_cnt = 0, no_share_job_cnt = 0, reason = 0x0, reason_time = 1424190732, reason_uid = 101, features = 0x0, gres = 0x0, gres_list = 0x0, 
          weight = 0, arch = 0x0, os = 0x0, node_next = 0x0, node_rank = 0, energy = 0x0, ext_sensors = 0x0, select_nodeinfo = 0x0, cpu_load = 0, 
          cpu_load_time = 0, protocol_version = 0, version = 0x0, node_spec_bitmap = 0x0}
        rc = 0
        comment = 0x0
#5  0x0000000000424ba4 in proc_req (slurmdbd_conn=0x2aaab40008d0, msg=0x2aaab8033ac0 "\005\230", msg_size=43, first=false, out_buffer=0x2aaaaeb66dd8, 
    uid=0x2aaaaeb66e10) at proc_req.c:387
        rc = 0
        msg_type = 1432
        in_buffer = 0x2aaab8019be0
        comment = 0x0
        __PRETTY_FUNCTION__ = "proc_req"
#6  0x000000000043196a in _service_connection (arg=0x2aaab40008d0) at rpc_mgr.c:232
        conn = 0x2aaab40008d0
        nw_size = 721420288
        msg_size = 43
        uid = 101
        msg = 0x2aaab8033ac0 "\005\230"
        msg_read = 43
        offset = 43
        fini = false
        first = false
        buffer = 0x2aaab8019be0
        rc = 0
#7  0x00002aaaaaed89d1 in start_thread (arg=0x2aaaaeb67700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x2aaaaeb67700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {46912564000512, -3147220132254764684, 46912561892928, 46912564001216, 0, 3, -9149187324838465164, 
                -9149196586091335308}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#8  0x00002aaaab1d79dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
Comment 1 Moe Jette 2015-02-17 02:45:59 MST
It looks like at least one of the arguments to strcasecmp(), namely my_reason, is NULL. This patch should get you going, but I'm not sure why my_reason would be NULL.

diff --git a/src/plugins/accounting_storage/mysql/as_mysql_cluster.c b/src/plugins/accounting_storage/mysql/as_mysql_cluster.c
index a4361b8..3c4b255 100644
--- a/src/plugins/accounting_storage/mysql/as_mysql_cluster.c
+++ b/src/plugins/accounting_storage/mysql/as_mysql_cluster.c
@@ -1082,6 +1082,7 @@ extern int as_mysql_node_down(mysql_conn_t *mysql_conn,
 
        row = mysql_fetch_row(result);
        if (row && (node_ptr->node_state == slurm_atoul(row[0])) &&
+           my_reason && row[1] &&
            !strcasecmp(my_reason, row[1])) {
                debug("as_mysql_node_down: no change needed %u == %s "
                      "and %s == %s",
Comment 2 Moe Jette 2015-02-17 02:46:41 MST
Created attachment 1647
Same change in a patch file
Comment 3 Danny Auble 2015-02-17 02:49:00 MST
Don, in frame 1 could you send the output of

print *node_ptr
Comment 4 Danny Auble 2015-02-17 02:50:39 MST
Never mind, I can see it in frame 4.
Comment 5 Don Lipari 2015-02-17 02:52:14 MST
(In reply to Danny Auble from comment #3)
> Don, in frame 1 could you send the output of
> 
> print *node_ptr

print *node_ptr
$1 = {magic = 0, name = 0x2aaab8033bc0 "rzuseq0001", node_hostname = 0x0, node_state = 32801, not_responding = false, boot_time = 0, slurmd_start_time = 0, last_response = 0, last_idle = 0, cpus = 8192, 
  boards = 0, sockets = 0, cores = 0, cpu_spec_list = 0x0, core_spec_cnt = 0, threads = 0, real_memory = 0, mem_spec_limit = 0, tmp_disk = 0, up_time = 0, config_ptr = 0x0, part_cnt = 0, part_pptr = 0x0, 
  comm_name = 0x0, port = 0, slurm_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, comp_job_cnt = 0, run_job_cnt = 0, sus_job_cnt = 0, 
  no_share_job_cnt = 0, reason = 0x0, reason_time = 1424190732, reason_uid = 101, features = 0x0, gres = 0x0, gres_list = 0x0, weight = 0, arch = 0x0, os = 0x0, node_next = 0x0, node_rank = 0, energy = 0x0, 
  ext_sensors = 0x0, select_nodeinfo = 0x0, cpu_load = 0, cpu_load_time = 0, protocol_version = 0, version = 0x0, node_spec_bitmap = 0x0}

Also, here is what the slurmctld was doing that led up to the crash:

[2015-02-17T08:33:10.139] debug:  bluegene:submit_job: 1497171 mode=513 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=unassigned mps=1-1-1
[2015-02-17T08:33:10.140] debug:  _find_best_block_match RMP14Ja182208697 <rzuseq0000>
[2015-02-17T08:33:10.140] debug:  513(1) can start job 1497171 at 1424190790 on RMP14Ja182208697(rzuseq0000) 0
[2015-02-17T08:33:13.150] debug:  _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 1497235 uid 52649
[2015-02-17T08:33:13.150] _job_signal: of pending JobID=1497235 State=0x4 NodeCnt=0 successful
[2015-02-17T08:33:14.513] debug:  slurmdbd: slurm_open_msg_conn to erzuseqi:6819: Connection refused
[2015-02-17T08:33:14.513] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Comment 6 Danny Auble 2015-02-17 02:57:13 MST
Moe's patch is correct.  We will try to reproduce the issue, though; I wasn't able to get the reason to be NULL in testing.  My guess is that a prolog/epilog failed at the time?  The log you sent doesn't seem to have anything in it about a node going down.

According to the timestamp on the issue it appears it happened at

Tue Feb 17 08:32:12 2015

Could you send the log around that time?
Comment 7 Don Lipari 2015-02-17 03:04:19 MST
(In reply to Danny Auble from comment #6)
> Moe's patch is correct.  We will try to reproduce the issue though.  I
> wasn't able to get the reason to be NULL in testing.  My guess is a
> prolog/epilog failed at the time?  The log you sent doesn't seem to have
> anything in it about a node going down.
> 
> According to the timestamp on the issue it appears it happened at
> 
> Tue Feb 17 08:32:12 2015
> 
> Could you send the log around that time?

I just mailed you the entire log, starting with the slurmctld starting.

I noticed something strange that could be related.  These entries:

[2015-02-17T08:32:10.719] debug:  not the right user 1336 != 2546
[2015-02-17T08:32:10.719] debug:  not the right user 1336 != 47546
[2015-02-17T08:32:52.948] debug:  not the right user 52649 != 55282
Comment 8 Danny Auble 2015-02-17 03:06:03 MST
Created attachment 1648
Complete log file
Comment 9 Danny Auble 2015-02-17 03:07:30 MST
Thanks Don.  Those messages are from the association manager.  I don't think they are related in this situation; the code is just looking for the correct user, not indicating anything is wrong.  Perhaps they should be debug2 messages instead of debug.
Comment 10 Danny Auble 2015-02-17 03:11:32 MST
The state of the node was 

DOWN

With flags

NODE_STATE_RES
NODE_STATE_MAINT

FYI.
Comment 11 Don Lipari 2015-02-17 03:30:22 MST
(In reply to Danny Auble from comment #10)
> The state of the node was 
> 
> DOWN
> 
> With flags
> 
> NODE_STATE_RES
> NODE_STATE_MAINT
> 
> FYI.

Ok.  The patched dbd worked.  It is staying up now.  I'll proceed with the post-install tests of the system.

Thank you!
Don
Comment 12 Brian Christiansen 2015-02-17 04:08:20 MST
Glad you're up and running. Will you also send your slurmdbd logs from when it crashed? What version did you upgrade from?

Thanks,
Brian
Comment 13 Brian Christiansen 2015-02-17 04:12:17 MST
Will you also add your slurm.conf and bluegene.conf?
Comment 14 Don Lipari 2015-02-17 04:19:00 MST
(In reply to Brian Christiansen from comment #12)
> Glad you're up and running. Will you send your slurmdbd logs as well from
> when it crashed? What version did you upgrade from?
> 
> Thanks,
> Brian

We don't have the verbosity set very high, so the log is rather sparse.  One thing that puzzled me: after I installed the patched version and started the slurmdbd, it had to update the job tables again.  Here's the first start, leading up to the point when it died:

[2015-02-17T08:21:51.426] debug:  auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2015-02-17T08:21:52.426] adding column array_task_str after account in table "rzdawndev_job_table"
[2015-02-17T08:21:52.426] adding column array_max_tasks after array_task_str in table "rzdawndev_job_table"
[2015-02-17T08:21:52.426] adding column array_task_pending after array_max_tasks in table "rzdawndev_job_table"
[2015-02-17T08:21:52.426] adding column id_array_job after id_assoc in table "rzdawndev_job_table"
[2015-02-17T08:21:52.440] adding column id_array_task after id_array_job in table "rzdawndev_job_table"
[2015-02-17T08:21:52.440] adding key array_job (id_array_job) to table "rzdawndev_job_table"
[2015-02-17T08:21:52.440] adding key reserv (id_resv) to table "rzdawndev_job_table"
[2015-02-17T08:21:52.440] debug:  Table "rzdawndev_job_table" has changed.  Updating...
[2015-02-17T08:21:57.608] Warning: Note very large processing time from make table current "rzdawndev_job_table": usec=5182283 began=08:21:52.426
[2015-02-17T08:21:57.795] debug:  Table "rzdawndev_step_table" has changed.  Updating...
[2015-02-17T08:21:58.880] adding column array_task_str after account in table "rzuseq_job_table"
[2015-02-17T08:21:58.880] adding column array_max_tasks after array_task_str in table "rzuseq_job_table"
[2015-02-17T08:21:58.880] adding column array_task_pending after array_max_tasks in table "rzuseq_job_table"
[2015-02-17T08:21:58.880] adding column id_array_job after id_assoc in table "rzuseq_job_table"
[2015-02-17T08:21:58.880] adding column id_array_task after id_array_job in table "rzuseq_job_table"
[2015-02-17T08:21:58.880] adding key array_job (id_array_job) to table "rzuseq_job_table"
[2015-02-17T08:21:58.880] adding key reserv (id_resv) to table "rzuseq_job_table"
[2015-02-17T08:21:58.880] debug:  Table "rzuseq_job_table" has changed.  Updating...
[2015-02-17T08:24:47.003] Warning: Note very large processing time from make table current "rzuseq_job_table": usec=168122854 began=08:21:58.880
[2015-02-17T08:24:47.236] debug:  Table "rzuseq_step_table" has changed.  Updating...
[2015-02-17T08:24:47.694] adding column min_cpus_per_job after usage_thres in table qos_table
[2015-02-17T08:24:47.694] debug:  Table qos_table has changed.  Updating...
[2015-02-17T08:24:47.994] Accounting storage MYSQL plugin loaded
[2015-02-17T08:24:48.082] slurmdbd version 14.11.4 started
[2015-02-17T08:25:11.329] debug:  DBD_INIT: CLUSTER:rzuseqi VERSION:7168 UID:0 IP:192.168.10.1 CONN:7
[2015-02-17T08:25:24.385] debug:  DBD_INIT: CLUSTER:rzuseqi VERSION:7168 UID:0 IP:192.168.10.1 CONN:8
[2015-02-17T08:25:33.025] debug:  DBD_INIT: CLUSTER:rzuseqi VERSION:7168 UID:0 IP:192.168.10.1 CONN:7
[2015-02-17T08:27:01.979] debug:  DBD_INIT: CLUSTER:rzuseq VERSION:6912 UID:43665 IP:192.168.10.6 CONN:8
[2015-02-17T08:32:09.504] debug:  DBD_INIT: CLUSTER:rzuseq VERSION:7168 UID:101 IP:192.168.10.2 CONN:7

Here's the log of the start following the install of the patched version:
[2015-02-17T09:20:57.174] debug:  auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2015-02-17T09:20:57.984] debug:  Table "rzdawndev_job_table" has changed.  Updating...
[2015-02-17T09:21:03.201] Warning: Note very large processing time from make table current "rzdawndev_job_table": usec=5217860 began=09:20:57.983
[2015-02-17T09:21:04.334] debug:  Table "rzuseq_job_table" has changed.  Updating...
[2015-02-17T09:23:40.985] Warning: Note very large processing time from make table current "rzuseq_job_table": usec=156652567 began=09:21:04.333
[2015-02-17T09:23:41.722] Accounting storage MYSQL plugin loaded
[2015-02-17T09:23:41.723] pidfile not locked, assuming no running daemon
[2015-02-17T09:23:41.845] slurmdbd version 14.11.4-2 started
[2015-02-17T09:24:45.456] debug:  DBD_INIT: CLUSTER:rzuseqi VERSION:7168 UID:0 IP:192.168.10.1 CONN:7
[2015-02-17T09:24:52.737] debug:  DBD_INIT: CLUSTER:rzuseqi VERSION:7168 UID:0 IP:192.168.10.1 CONN:8
[2015-02-17T09:25:28.512] debug:  DBD_INIT: CLUSTER:rzuseq VERSION:7168 UID:101 IP:192.168.10.2 CONN:7
[2015-02-17T09:25:28.654] DBD_JOB_COMPLETE: cluster not registered
[2015-02-17T09:25:30.272] debug:  as_mysql_node_down: no change needed 4 == 4 and Setting partial node down. == Setting partial node down.
[2015-02-17T09:25:33.512] debug:  Need to reroll usage from Tue Feb 17 08:32:10 2015 Job 1497234 from rzuseq became eligible then and we are just now hearing about it.
[2015-02-17T09:27:01.619] debug:  DBD_INIT: CLUSTER:rzuseq VERSION:7168 UID:43665 IP:192.168.10.6 CONN:8
[2015-02-17T09:29:38.234] debug:  DBD_INIT: CLUSTER:rzuseqi VERSION:7168 UID:0 IP:192.168.10.1 CONN:10
[2015-02-17T10:01:01.769] debug:  DBD_INIT: CLUSTER:rzuseq VERSION:7168 UID:43665 IP:192.168.10.6 CONN:8
[2015-02-17T10:08:02.089] debug:  DBD_INIT: CLUSTER:rzuseq VERSION:7168 UID:36075 IP:192.168.10.6 CONN:10
Comment 15 Don Lipari 2015-02-17 04:27:11 MST
(In reply to Brian Christiansen from comment #13)
> Will you also add your slurm.conf and bluegene.conf?

$ cat slurm.conf

ClusterName=rzuseq
ControlMachine=rzuseqsn
ControlAddr=rzuseqsn-io
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818

AuthType=auth/munge
AuthInfo=/var/run/munge/datmode_normal.socket.2
CacheGroups=0
CryptoType=crypto/munge
Epilog=/usr/sbin/slurm_epilog
FirstJobId=1
JobFileAppend=1
JobRequeue=0
Licenses=lscratchrzb:10000
MpiDefault=none
ProctrackType=proctrack/pgid
Prolog=/usr/sbin/slurm_prolog
ReconfigFlags=KeepPartState
ReturnToService=1
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/home/slurm/state
SwitchType=switch/none

# TIMERS
InactiveLimit=3600
KillWait=330
MessageTimeout=30
MinJobAge=300
SlurmctldTimeout=300
SlurmdTimeout=300
Waittime=0

# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/bluegene
FastSchedule=0
PreemptMode=CANCEL
PreemptType=preempt/qos
PriorityDecayHalfLife=7-0
PriorityFavorSmall=yes
PriorityFlags=DEPTH_OBLIVIOUS
PriorityMaxAge=14-0
PriorityType=priority/multifactor
PriorityWeightAge=50000
PriorityWeightFairshare=700000
PriorityWeightJobSize=250000
PriorityWeightQOS=2000000

# LOGGING
DebugFlags=SelectType
SlurmctldDebug=5
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=5
SlurmdLogFile=/var/log/slurm/slurmd.log

# ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=erzuseqi
AccountingStoragePass=/var/run/munge/moab.socket.2
AccountingStorageEnforce=limits,qos

# COMPUTE NODES
FrontendName=rzuseqlac[3-4] FrontendAddr=erzuseqlac[3-4]
NodeName=DEFAULT Procs=8192 RealMemory=2097152 State=UNKNOWN
NodeName=rzuseq[0000,0010x0011]
NodeName=rzuseq0001 State=DOWN

Include /etc/slurm/slurm.conf.updates

PartitionName=pbatch Nodes=rzuseq0010 Default=No State=UP Shared=FORCE DefaultTime=60 MaxTime=4:00:00 MaxNodes=256 AllowGroups=langer1,leon,bgldev
PartitionName=pall Nodes=rzuseq[0000,0010,0011] Default=No State=Down Shared=FORCE DefaultTime=60 MaxTime=4:00:00 MaxNodes=1536 AllowGroups=bgldev


$ cat /etc/slurm/slurm.conf.updates

PartitionName=pdebug Nodes=rzuseq[0000,0011] Default=YES State=UP Shared=FORCE DefaultTime=60 MaxTime=8:00:00 MaxNodes=64

$ cat /etc/slurm/bluegene.conf

MloaderImage=/bgsys/drivers/ppcfloor/boot/firmware
AllowSubBlockAllocations=YES
IONodesPerMP=4
LayoutMode=DYNAMIC
MaxBlockInError=50
MidplaneNodeCnt=512
NodeCardNodeCnt=32
RebootQOSList=standby
SubMidplaneSystem=NO

BridgeAPILogFile=/var/log/slurm/bridgeapi.log
BridgeAPIVerbose=2
Comment 16 Don Lipari 2015-02-17 04:34:48 MST
(In reply to Brian Christiansen from comment #12)
> What version did you upgrade from?
> 
> Thanks,
> Brian

We upgraded from v14.03.11.
Comment 17 Brian Christiansen 2015-02-17 08:17:58 MST
I can reproduce the issue by either:

1. Setting a node down with no reason two times in a row.
2. Setting a node to down in slurm.conf and then creating a reservation on the node when the controller starts up.

Moe's patch is the correct solution. Let us know if you have any questions.
Comment 18 Don Lipari 2015-02-18 02:26:45 MST
(In reply to Brian Christiansen from comment #17)
> I can reproduce the issue by either:
> 
> 1. Setting a node down with no reason two times in a row.
> 2. Setting a node to down in slurm.conf and then creating a reservation on
> the node when the controller starts up.

The second matches our scenario.  I had a reservation in place before the update.
After starting the updated slurmctld, there could have been a moment when the
frontends had not registered, but the reservation was still in place.
 
> Moe's patch is the correct solution. Let us know if you have any questions.

Thanks!  That worked.