| Summary: | slurmctld core dump | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | slurmctld | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | da |
| Version: | 14.03.0 | | |
| Hardware: | IBM BlueGene | | |
| OS: | Linux | | |
| Site: | LLNL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 14.03.1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Don, could you send any logs about that job? Is it possible to get access to the core file?

Never mind, I can easily see the issue. This is fixed in commit e4eb6a6c86c8199b780721829c18671b2cd6fd3d.
Slurm 14.03.0, built from the last commit on April 7. Not a critical problem, in that the slurmctld was restarted and has stayed up.

From the slurmctld.log:

```
[2014-04-16T08:45:15.997] debug:  Processing RPC: REQUEST_STEP_COMPLETE for 344543.0 nodes 0-0 rc=256 uid=41557
[2014-04-16T08:45:15.997] debug:  step completion 344543.0 was received after job allocation is already completing, no cleanup needed
[2014-04-16T08:45:15.997] sched: _slurm_rpc_step_complete StepId=344543.0 usec=352
[2014-04-16T08:45:16.001] error: Orphan job 344528.4294967294 reported on rzuseqlac4
```

From gdb:

```
(gdb) bt full
#0  0x00000000100791dc in abort_job_on_node (job_id=344528, job_ptr=0x0,
    node_name=0x10373ac8 "rzuseqlac4") at job_mgr.c:9569
        agent_info = 0x40060000a98
        kill_req = 0x40060003048
#1  0x0000000010098fec in validate_nodes_via_front_end (reg_msg=0x40060002998,
    protocol_version=6912, newly_up=0x4002f7fe623) at node_mgr.c:2285
        error_code = 0
        i = 0
        j = 0
        rc = 1610624608
        update_node_state = false
        job_ptr = 0x0
        config_ptr = 0x100000000000000
        node_ptr = 0x4004c01e468
        now = 1397663115
        job_iterator = 0x660002998
        reg_hostlist = 0x0
        host_str = 0x0
        reason_down = 0x0
        node_flags = 12159
        front_end_ptr = 0x1072eca8
#2  0x00000000100bbbac in _slurm_rpc_node_registration (msg=0x40060011618)
    at proc_req.c:2249
        tv1 = {tv_sec = 1397663115, tv_usec = 999390}
        tv2 = {tv_sec = 4398843422304, tv_usec = 4399657135720}
        tv_str = '\000' <repeats 19 times>
        delta_t = 4399657136616
        error_code = 0
        newly_up = false
        node_reg_stat_msg = 0x40060002998
        job_write_lock = {config = READ_LOCK, job = WRITE_LOCK,
            node = WRITE_LOCK, partition = NO_LOCK}
        uid = 0
#3  0x00000000100b442c in slurmctld_req (msg=0x40060011618) at proc_req.c:282
No locals.
#4  0x00000000100386b8 in _service_connection (arg=0x40010003678)
    at controller.c:1023
        conn = 0x40010003678
        return_code = 0x0
        msg = 0x40060011618
#5  0x000004000012c21c in .start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#6  0x000004000028a57c in .__clone () from /lib64/libc.so.6
No symbol table info available.
(gdb) up
#1  0x0000000010098fec in validate_nodes_via_front_end (reg_msg=0x40060002998,
    protocol_version=6912, newly_up=0x4002f7fe623) at node_mgr.c:2285
2285            abort_job_on_node(reg_msg->job_id[i],
(gdb) p *reg_msg
$1 = {arch = 0x40060002eb8 "ppc64", cores = 1, cpus = 16, cpu_load = 116,
  energy = 0x40060009a58, gres_info = 0x40060000a48, hash_val = 490134518,
  job_count = 3, job_id = 0x40060009aa8,
  node_name = 0x40060002728 "rzuseqlac4", boards = 1,
  os = 0x4006000e1a8 "Linux", real_memory = 30776,
  slurmd_start_time = 1396969446, status = 0, startup = 1,
  step_id = 0x4006000e1f8, sockets = 4, switch_nodeinfo = 0x0, threads = 4,
  timestamp = 1397663115, tmp_disk = 12095, up_time = 1893154, version = 0x0}
(gdb) p job_ptr
$2 = (struct job_record *) 0x0
(gdb) p reg_msg->job_id[i]
$3 = 344528
```