| Summary: | slurmctld core dump | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | slurmctld | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | da |
| Version: | 14.03.0 | | |
| Hardware: | IBM BlueGene | | |
| OS: | Linux | | |
| Site: | LLNL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 14.03.1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Don, could you send any logs about that job? Is it possible to get access to the core file?

Never mind, I can easily see the issue. This is fixed in commit e4eb6a6c86c8199b780721829c18671b2cd6fd3d.
Slurm 14.03.0, built from the last commit on April 7. Not a critical problem, in that the slurmctld was restarted and has stayed up.

From the slurmctld.log:

```
[2014-04-16T08:45:15.997] debug:  Processing RPC: REQUEST_STEP_COMPLETE for 344543.0 nodes 0-0 rc=256 uid=41557
[2014-04-16T08:45:15.997] debug:  step completion 344543.0 was received after job allocation is already completing, no cleanup needed
[2014-04-16T08:45:15.997] sched: _slurm_rpc_step_complete StepId=344543.0 usec=352
[2014-04-16T08:45:16.001] error: Orphan job 344528.4294967294 reported on rzuseqlac4
```

From gdb:

```
(gdb) bt full
#0  0x00000000100791dc in abort_job_on_node (job_id=344528, job_ptr=0x0,
    node_name=0x10373ac8 "rzuseqlac4") at job_mgr.c:9569
        agent_info = 0x40060000a98
        kill_req = 0x40060003048
#1  0x0000000010098fec in validate_nodes_via_front_end (reg_msg=0x40060002998,
    protocol_version=6912, newly_up=0x4002f7fe623) at node_mgr.c:2285
        error_code = 0
        i = 0
        j = 0
        rc = 1610624608
        update_node_state = false
        job_ptr = 0x0
        config_ptr = 0x100000000000000
        node_ptr = 0x4004c01e468
        now = 1397663115
        job_iterator = 0x660002998
        reg_hostlist = 0x0
        host_str = 0x0
        reason_down = 0x0
        node_flags = 12159
        front_end_ptr = 0x1072eca8
#2  0x00000000100bbbac in _slurm_rpc_node_registration (msg=0x40060011618)
    at proc_req.c:2249
        tv1 = {tv_sec = 1397663115, tv_usec = 999390}
        tv2 = {tv_sec = 4398843422304, tv_usec = 4399657135720}
        tv_str = '\000' <repeats 19 times>
        delta_t = 4399657136616
        error_code = 0
        newly_up = false
        node_reg_stat_msg = 0x40060002998
        job_write_lock = {config = READ_LOCK, job = WRITE_LOCK,
            node = WRITE_LOCK, partition = NO_LOCK}
        uid = 0
#3  0x00000000100b442c in slurmctld_req (msg=0x40060011618) at proc_req.c:282
No locals.
#4  0x00000000100386b8 in _service_connection (arg=0x40010003678)
    at controller.c:1023
        conn = 0x40010003678
        return_code = 0x0
        msg = 0x40060011618
#5  0x000004000012c21c in .start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#6  0x000004000028a57c in .__clone () from /lib64/libc.so.6
No symbol table info available.
(gdb) up
#1  0x0000000010098fec in validate_nodes_via_front_end (reg_msg=0x40060002998,
    protocol_version=6912, newly_up=0x4002f7fe623) at node_mgr.c:2285
2285            abort_job_on_node(reg_msg->job_id[i],
(gdb) p *reg_msg
$1 = {arch = 0x40060002eb8 "ppc64", cores = 1, cpus = 16, cpu_load = 116,
  energy = 0x40060009a58, gres_info = 0x40060000a48, hash_val = 490134518,
  job_count = 3, job_id = 0x40060009aa8,
  node_name = 0x40060002728 "rzuseqlac4", boards = 1,
  os = 0x4006000e1a8 "Linux", real_memory = 30776,
  slurmd_start_time = 1396969446, status = 0, startup = 1,
  step_id = 0x4006000e1f8, sockets = 4, switch_nodeinfo = 0x0, threads = 4,
  timestamp = 1397663115, tmp_disk = 12095, up_time = 1893154, version = 0x0}
(gdb) p job_ptr
$2 = (struct job_record *) 0x0
(gdb) p reg_msg->job_id[i]
$3 = 344528
```