| Summary: | slurmdbd dumps core after update to 14.11.4 | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | slurmdbd | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 1 - System not usable | | |
| Priority: | --- | CC: | brian, da |
| Version: | 14.11.4 | | |
| Hardware: | IBM BlueGene | | |
| OS: | Linux | | |
| Site: | LLNL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 14.11.5 15.08.0pre3 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Same change in a patch file<br>Complete log file | | |
It looks like at least one of the arguments to strcasecmp, namely my_reason, is NULL. This patch should get you going, but I'm not sure why my_reason would be NULL:
```diff
diff --git a/src/plugins/accounting_storage/mysql/as_mysql_cluster.c b/src/plugins/accounting_storage/mysql/as_mysql_cluster.c
index a4361b8..3c4b255 100644
--- a/src/plugins/accounting_storage/mysql/as_mysql_cluster.c
+++ b/src/plugins/accounting_storage/mysql/as_mysql_cluster.c
@@ -1082,6 +1082,7 @@ extern int as_mysql_node_down(mysql_conn_t *mysql_conn,
 	row = mysql_fetch_row(result);
 	if (row && (node_ptr->node_state == slurm_atoul(row[0])) &&
+	    my_reason && row[1] &&
 	    !strcasecmp(my_reason, row[1])) {
 		debug("as_mysql_node_down: no change needed %u == %s "
 		      "and %s == %s",
```
Created attachment 1647 [details]
Same change in a patch file
Don, in frame 1 could you send the output of `print *node_ptr`?

Never mind, I can see it in frame 4.

(In reply to Danny Auble from comment #3)
> Don, in frame 1 could you send the output of
>
> print *node_ptr

```
print *node_ptr
$1 = {magic = 0, name = 0x2aaab8033bc0 "rzuseq0001", node_hostname = 0x0,
  node_state = 32801, not_responding = false, boot_time = 0, slurmd_start_time = 0,
  last_response = 0, last_idle = 0, cpus = 8192, boards = 0, sockets = 0, cores = 0,
  cpu_spec_list = 0x0, core_spec_cnt = 0, threads = 0, real_memory = 0,
  mem_spec_limit = 0, tmp_disk = 0, up_time = 0, config_ptr = 0x0, part_cnt = 0,
  part_pptr = 0x0, comm_name = 0x0, port = 0, slurm_addr = {sin_family = 0,
  sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"},
  comp_job_cnt = 0, run_job_cnt = 0, sus_job_cnt = 0, no_share_job_cnt = 0,
  reason = 0x0, reason_time = 1424190732, reason_uid = 101, features = 0x0,
  gres = 0x0, gres_list = 0x0, weight = 0, arch = 0x0, os = 0x0, node_next = 0x0,
  node_rank = 0, energy = 0x0, ext_sensors = 0x0, select_nodeinfo = 0x0,
  cpu_load = 0, cpu_load_time = 0, protocol_version = 0, version = 0x0,
  node_spec_bitmap = 0x0}
```

Also, here is what the slurmctld was doing that led up to the crash:

```
[2015-02-17T08:33:10.139] debug: bluegene:submit_job: 1497171 mode=513 Connection=N,N,N,N Reboot=no Rotate=yes Geometry=0x0x0x0 Block_ID=unassigned mps=1-1-1
[2015-02-17T08:33:10.140] debug: _find_best_block_match RMP14Ja182208697 <rzuseq0000>
[2015-02-17T08:33:10.140] debug: 513(1) can start job 1497171 at 1424190790 on RMP14Ja182208697(rzuseq0000) 0
[2015-02-17T08:33:13.150] debug: _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 1497235 uid 52649
[2015-02-17T08:33:13.150] _job_signal: of pending JobID=1497235 State=0x4 NodeCnt=0 successful
[2015-02-17T08:33:14.513] debug: slurmdbd: slurm_open_msg_conn to erzuseqi:6819: Connection refused
[2015-02-17T08:33:14.513] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
```

Moe's patch is correct. We will try to reproduce the issue, though. I wasn't able to get the reason to be NULL in testing. My guess is a prolog/epilog failed at the time? The log you sent doesn't seem to have anything in it about a node going down.

According to the timestamp on the issue, it appears it happened at Tue Feb 17 08:32:12 2015. Could you send the log around that time?

(In reply to Danny Auble from comment #6)
> Moe's patch is correct. We will try to reproduce the issue though. I
> wasn't able to get the reason to be NULL in testing. My guess is a
> prolog/epilog failed at the time? The log you sent doesn't seem to have
> anything in it about a node going down.
>
> According to the timestamp on the issue it appears it happened at
>
> Tue Feb 17 08:32:12 2015
>
> Could send the log around that time?

I just mailed you the entire log, starting with the slurmctld starting. I notice something strange that could be related. These entries:

```
[2015-02-17T08:32:10.719] debug: not the right user 1336 != 2546
[2015-02-17T08:32:10.719] debug: not the right user 1336 != 47546
[2015-02-17T08:32:52.948] debug: not the right user 52649 != 55282
```

Created attachment 1648 [details]
Complete log file
Thanks, Don. Those messages are from the association manager. I don't think they are related in this situation; the code is just looking for the correct user, not saying anything is wrong. Perhaps they should be debug2 messages instead of debug.

The state of the node was DOWN, with flags NODE_STATE_RES and NODE_STATE_MAINT, FYI.

(In reply to Danny Auble from comment #10)
> The state of the node was
>
> DOWN
>
> With flags
>
> NODE_STATE_RES
> NODE_STATE_MAINT
>
> FYI.

Ok. The patched dbd worked. It is staying up now. I'll proceed with the post-install tests of the system. Thank you! Don

Glad you're up and running. Will you send your slurmdbd logs as well from when it crashed? What version did you upgrade from?

Thanks,
Brian

Will you also add your slurm.conf and bluegene.conf?

(In reply to Brian Christiansen from comment #12)
> Glad you're up and running. Will you send your slurmdbd logs as well from
> when it crashed? What version did you upgrade from?
>
> Thanks,
> Brian

We don't have the verbosity set very high, so the log is rather sparse. One thing that puzzled me: after I installed the patched version and started the slurmdbd, it had to update the job tables again.
Here's the first start, leading up to the point when it died:

```
[2015-02-17T08:21:51.426] debug: auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2015-02-17T08:21:52.426] adding column array_task_str after account in table "rzdawndev_job_table"
[2015-02-17T08:21:52.426] adding column array_max_tasks after array_task_str in table "rzdawndev_job_table"
[2015-02-17T08:21:52.426] adding column array_task_pending after array_max_tasks in table "rzdawndev_job_table"
[2015-02-17T08:21:52.426] adding column id_array_job after id_assoc in table "rzdawndev_job_table"
[2015-02-17T08:21:52.440] adding column id_array_task after id_array_job in table "rzdawndev_job_table"
[2015-02-17T08:21:52.440] adding key array_job (id_array_job) to table "rzdawndev_job_table"
[2015-02-17T08:21:52.440] adding key reserv (id_resv) to table "rzdawndev_job_table"
[2015-02-17T08:21:52.440] debug: Table "rzdawndev_job_table" has changed. Updating...
[2015-02-17T08:21:57.608] Warning: Note very large processing time from make table current "rzdawndev_job_table": usec=5182283 began=08:21:52.426
[2015-02-17T08:21:57.795] debug: Table "rzdawndev_step_table" has changed. Updating...
[2015-02-17T08:21:58.880] adding column array_task_str after account in table "rzuseq_job_table"
[2015-02-17T08:21:58.880] adding column array_max_tasks after array_task_str in table "rzuseq_job_table"
[2015-02-17T08:21:58.880] adding column array_task_pending after array_max_tasks in table "rzuseq_job_table"
[2015-02-17T08:21:58.880] adding column id_array_job after id_assoc in table "rzuseq_job_table"
[2015-02-17T08:21:58.880] adding column id_array_task after id_array_job in table "rzuseq_job_table"
[2015-02-17T08:21:58.880] adding key array_job (id_array_job) to table "rzuseq_job_table"
[2015-02-17T08:21:58.880] adding key reserv (id_resv) to table "rzuseq_job_table"
[2015-02-17T08:21:58.880] debug: Table "rzuseq_job_table" has changed. Updating...
[2015-02-17T08:24:47.003] Warning: Note very large processing time from make table current "rzuseq_job_table": usec=168122854 began=08:21:58.880
[2015-02-17T08:24:47.236] debug: Table "rzuseq_step_table" has changed. Updating...
[2015-02-17T08:24:47.694] adding column min_cpus_per_job after usage_thres in table qos_table
[2015-02-17T08:24:47.694] debug: Table qos_table has changed. Updating...
[2015-02-17T08:24:47.994] Accounting storage MYSQL plugin loaded
[2015-02-17T08:24:48.082] slurmdbd version 14.11.4 started
[2015-02-17T08:25:11.329] debug: DBD_INIT: CLUSTER:rzuseqi VERSION:7168 UID:0 IP:192.168.10.1 CONN:7
[2015-02-17T08:25:24.385] debug: DBD_INIT: CLUSTER:rzuseqi VERSION:7168 UID:0 IP:192.168.10.1 CONN:8
[2015-02-17T08:25:33.025] debug: DBD_INIT: CLUSTER:rzuseqi VERSION:7168 UID:0 IP:192.168.10.1 CONN:7
[2015-02-17T08:27:01.979] debug: DBD_INIT: CLUSTER:rzuseq VERSION:6912 UID:43665 IP:192.168.10.6 CONN:8
[2015-02-17T08:32:09.504] debug: DBD_INIT: CLUSTER:rzuseq VERSION:7168 UID:101 IP:192.168.10.2 CONN:7
```

Here's the log of the start following the install of the patched version:

```
[2015-02-17T09:20:57.174] debug: auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2015-02-17T09:20:57.984] debug: Table "rzdawndev_job_table" has changed. Updating...
[2015-02-17T09:21:03.201] Warning: Note very large processing time from make table current "rzdawndev_job_table": usec=5217860 began=09:20:57.983
[2015-02-17T09:21:04.334] debug: Table "rzuseq_job_table" has changed. Updating...
[2015-02-17T09:23:40.985] Warning: Note very large processing time from make table current "rzuseq_job_table": usec=156652567 began=09:21:04.333
[2015-02-17T09:23:41.722] Accounting storage MYSQL plugin loaded
[2015-02-17T09:23:41.723] pidfile not locked, assuming no running daemon
[2015-02-17T09:23:41.845] slurmdbd version 14.11.4-2 started
[2015-02-17T09:24:45.456] debug: DBD_INIT: CLUSTER:rzuseqi VERSION:7168 UID:0 IP:192.168.10.1 CONN:7
[2015-02-17T09:24:52.737] debug: DBD_INIT: CLUSTER:rzuseqi VERSION:7168 UID:0 IP:192.168.10.1 CONN:8
[2015-02-17T09:25:28.512] debug: DBD_INIT: CLUSTER:rzuseq VERSION:7168 UID:101 IP:192.168.10.2 CONN:7
[2015-02-17T09:25:28.654] DBD_JOB_COMPLETE: cluster not registered
[2015-02-17T09:25:30.272] debug: as_mysql_node_down: no change needed 4 == 4 and Setting partial node down. == Setting partial node down.
[2015-02-17T09:25:33.512] debug: Need to reroll usage from Tue Feb 17 08:32:10 2015 Job 1497234 from rzuseq became eligible then and we are just now hearing about it.
[2015-02-17T09:27:01.619] debug: DBD_INIT: CLUSTER:rzuseq VERSION:7168 UID:43665 IP:192.168.10.6 CONN:8
[2015-02-17T09:29:38.234] debug: DBD_INIT: CLUSTER:rzuseqi VERSION:7168 UID:0 IP:192.168.10.1 CONN:10
[2015-02-17T10:01:01.769] debug: DBD_INIT: CLUSTER:rzuseq VERSION:7168 UID:43665 IP:192.168.10.6 CONN:8
[2015-02-17T10:08:02.089] debug: DBD_INIT: CLUSTER:rzuseq VERSION:7168 UID:36075 IP:192.168.10.6 CONN:10
```

(In reply to Brian Christiansen from comment #13)
> Will you also add your slurm.conf and bluegene.conf?
```
$ cat slurm.conf
ClusterName=rzuseq
ControlMachine=rzuseqsn
ControlAddr=rzuseqsn-io
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
AuthInfo=/var/run/munge/datmode_normal.socket.2
CacheGroups=0
CryptoType=crypto/munge
Epilog=/usr/sbin/slurm_epilog
FirstJobId=1
JobFileAppend=1
JobRequeue=0
Licenses=lscratchrzb:10000
MpiDefault=none
ProctrackType=proctrack/pgid
Prolog=/usr/sbin/slurm_prolog
ReconfigFlags=KeepPartState
ReturnToService=1
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/home/slurm/state
SwitchType=switch/none
# TIMERS
InactiveLimit=3600
KillWait=330
MessageTimeout=30
MinJobAge=300
SlurmctldTimeout=300
SlurmdTimeout=300
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/bluegene
FastSchedule=0
PreemptMode=CANCEL
PreemptType=preempt/qos
PriorityDecayHalfLife=7-0
PriorityFavorSmall=yes
PriorityFlags=DEPTH_OBLIVIOUS
PriorityMaxAge=14-0
PriorityType=priority/multifactor
PriorityWeightAge=50000
PriorityWeightFairshare=700000
PriorityWeightJobSize=250000
PriorityWeightQOS=2000000
# LOGGING
DebugFlags=SelectType
SlurmctldDebug=5
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=5
SlurmdLogFile=/var/log/slurm/slurmd.log
# ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=erzuseqi
AccountingStoragePass=/var/run/munge/moab.socket.2
AccountingStorageEnforce=limits,qos
# COMPUTE NODES
FrontendName=rzuseqlac[3-4] FrontendAddr=erzuseqlac[3-4]
NodeName=DEFAULT Procs=8192 RealMemory=2097152 State=UNKNOWN
NodeName=rzuseq[0000,0010x0011]
NodeName=rzuseq0001 State=DOWN
Include /etc/slurm/slurm.conf.updates
PartitionName=pbatch Nodes=rzuseq0010 Default=No State=UP Shared=FORCE DefaultTime=60 MaxTime=4:00:00 MaxNodes=256 AllowGroups=langer1,leon,bgldev
PartitionName=pall Nodes=rzuseq[0000,0010,0011] Default=No State=Down Shared=FORCE DefaultTime=60 MaxTime=4:00:00 MaxNodes=1536 AllowGroups=bgldev
```

```
$ cat /etc/slurm/slurm.conf.updates
PartitionName=pdebug Nodes=rzuseq[0000,0011] Default=YES State=UP Shared=FORCE DefaultTime=60 MaxTime=8:00:00 MaxNodes=64
```

```
$ cat /etc/slurm/bluegene.conf
MloaderImage=/bgsys/drivers/ppcfloor/boot/firmware
AllowSubBlockAllocations=YES
IONodesPerMP=4
LayoutMode=DYNAMIC
MaxBlockInError=50
MidplaneNodeCnt=512
NodeCardNodeCnt=32
RebootQOSList=standby
SubMidplaneSystem=NO
BridgeAPILogFile=/var/log/slurm/bridgeapi.log
BridgeAPIVerbose=2
```

(In reply to Brian Christiansen from comment #12)
> What version did you upgrade from?
>
> Thanks,
> Brian

We upgraded from v14.03.11.

I can reproduce the issue by either:

1. Setting a node down with no reason two times in a row.
2. Setting a node to down in slurm.conf and then creating a reservation on the node when the controller starts up.

Moe's patch is the correct solution. Let us know if you have any questions.

(In reply to Brian Christiansen from comment #17)
> I can reproduce the issue by either:
>
> 1. Setting a node down with no reason two times in a row.
> 2. Setting a node to down in slurm.conf and then creating a reservation on
> the node when the controller starts up.

The second matches our scenario. I had a reservation in place before the update. After starting the updated slurmctld, there could have been a moment when the frontends had not registered, but the reservation was still in place.

> Moe's patch is the correct solution. Let us know if you have any questions.

Thanks! That worked.
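For illustration, the two reproduction scenarios could be driven roughly like this. This is a sketch, not a verified recipe: the node and reservation names are hypothetical, and it assumes a disposable test cluster running the unpatched 14.11.4 slurmdbd.

```shell
# Scenario 1: set a node DOWN with no reason, twice in a row. On the
# second update both my_reason and the stored reason row are NULL, and
# the unpatched slurmdbd dies in strcasecmp(). (14.11 allowed setting a
# node DOWN without a Reason; newer releases may require one.)
scontrol update NodeName=tstnode01 State=DOWN
scontrol update NodeName=tstnode01 State=DOWN

# Scenario 2: mark the node DOWN in slurm.conf...
#   NodeName=tstnode01 State=DOWN
# ...create a reservation covering it, then restart slurmctld so the
# node-down event is reported while the reservation is in place.
scontrol create reservation ReservationName=maint Users=root \
    StartTime=now Duration=60 Nodes=tstnode01
```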
After updating the slurmdbd on our dedicated host servicing the rzuseq, the slurmdbd fired up OK. After updating rzuseq to v14.11.4 of Slurm and starting the slurmctld, the dbd dumped core:

```
(gdb) bt full
#0  __strcasecmp_l_sse42 () at ../sysdeps/x86_64/multiarch/strcmp.S:259
No locals.
#1  0x00002aaaabad13a6 in as_mysql_node_down (mysql_conn=0x2aaab80334e0,
    node_ptr=0x2aaaaeb66bc0, event_time=1424190732, reason=0x0, reason_uid=101)
    at as_mysql_cluster.c:1085
        cpus = 8192
        rc = 0
        query = 0x0
        my_reason = 0x0
        result = 0x2aaab8019a00
        row = 0x2aaab8047828
#2  0x00002aaaabab29c1 in clusteracct_storage_p_node_down (mysql_conn=0x2aaab80334e0,
    node_ptr=0x2aaaaeb66bc0, event_time=1424190732, reason=0x0, reason_uid=101)
    at accounting_storage_mysql.c:2627
No locals.
#3  0x0000000000524898 in clusteracct_storage_g_node_down (db_conn=0x2aaab80334e0,
    node_ptr=0x2aaaaeb66bc0, event_time=1424190732, reason=0x0, reason_uid=101)
    at slurm_accounting_storage.c:703
No locals.
#4  0x000000000042ad2e in _node_state (slurmdbd_conn=0x2aaab40008d0,
    in_buffer=0x2aaab8019be0, out_buffer=0x2aaaaeb66dd8, uid=0x2aaaaeb66e10)
    at proc_req.c:2792
        node_state_msg = 0x2aaab8033b30
        node_ptr = {magic = 0, name = 0x2aaab8033bc0 "rzuseq0001", node_hostname = 0x0,
          node_state = 32801, not_responding = false, boot_time = 0,
          slurmd_start_time = 0, last_response = 0, last_idle = 0, cpus = 8192,
          boards = 0, sockets = 0, cores = 0, cpu_spec_list = 0x0, core_spec_cnt = 0,
          threads = 0, real_memory = 0, mem_spec_limit = 0, tmp_disk = 0, up_time = 0,
          config_ptr = 0x0, part_cnt = 0, part_pptr = 0x0, comm_name = 0x0, port = 0,
          slurm_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0},
          sin_zero = "\000\000\000\000\000\000\000"}, comp_job_cnt = 0,
          run_job_cnt = 0, sus_job_cnt = 0, no_share_job_cnt = 0, reason = 0x0,
          reason_time = 1424190732, reason_uid = 101, features = 0x0, gres = 0x0,
          gres_list = 0x0, weight = 0, arch = 0x0, os = 0x0, node_next = 0x0,
          node_rank = 0, energy = 0x0, ext_sensors = 0x0, select_nodeinfo = 0x0,
          cpu_load = 0, cpu_load_time = 0, protocol_version = 0, version = 0x0,
          node_spec_bitmap = 0x0}
        rc = 0
        comment = 0x0
#5  0x0000000000424ba4 in proc_req (slurmdbd_conn=0x2aaab40008d0,
    msg=0x2aaab8033ac0 "\005\230", msg_size=43, first=false,
    out_buffer=0x2aaaaeb66dd8, uid=0x2aaaaeb66e10) at proc_req.c:387
        rc = 0
        msg_type = 1432
        in_buffer = 0x2aaab8019be0
        comment = 0x0
        __PRETTY_FUNCTION__ = "proc_req"
#6  0x000000000043196a in _service_connection (arg=0x2aaab40008d0) at rpc_mgr.c:232
        conn = 0x2aaab40008d0
        nw_size = 721420288
        msg_size = 43
        uid = 101
        msg = 0x2aaab8033ac0 "\005\230"
        msg_read = 43
        offset = 43
        fini = false
        first = false
        buffer = 0x2aaab8019be0
        rc = 0
#7  0x00002aaaaaed89d1 in start_thread (arg=0x2aaaaeb67700) at pthread_create.c:301
        __res = <value optimized out>
        pd = 0x2aaaaeb67700
        now = <value optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {46912564000512,
          -3147220132254764684, 46912561892928, 46912564001216, 0, 3,
          -9149187324838465164, -9149196586091335308}, mask_was_saved = 0}},
          priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0,
          canceltype = 0}}}
        not_first_call = <value optimized out>
        pagesize_m1 = <value optimized out>
        sp = <value optimized out>
        freesize = <value optimized out>
#8  0x00002aaaab1d79dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
No locals.
```