Ticket 10178

Summary:	slurmctld nonresponsive from RPCs
Product:	Slurm	Reporter:	Zach Weidner <zweidner>
Component:	slurmctld	Assignee:	Dominik Bartkiewicz <bart>
Status:	RESOLVED DUPLICATE	QA Contact:
Severity:	4 - Minor Issue
Priority:	---	CC:	bart, colbykd
Version:	19.05.2
Hardware:	Linux
OS:	Linux
Site:	Purdue	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---
Attachments:	slurmctld logs slurm.conf

Description Zach Weidner 2020-11-08 23:49:20 MST

Created attachment 16559 [details]
slurmctld logs

Around 4:00 PM yesterday afternoon our scheduler began responding slowly and eventually has stopped responding to commands altogether.

In the logs for slurmctld we have a lot of:
[2020-11-08T16:11:57.075] error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=6725835, this should never happen
[2020-11-08T16:11:57.075] error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=6725256, this should never happen
[2020-11-08T16:11:57.076] error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=6723822, this should never happen
[2020-11-08T16:11:57.076] error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=6725268, this should never happen
[2020-11-08T16:11:57.076] error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=6725257, this should never happen

I ensured that munge was restarted on all nodes, checked that their clocks had not drifted, restarted slurmctld, slurmdbd, mariadbd, munged, and finally the whole system.

I then found and followed many of the steps I found on Bug #8978 by creating a core dump, attaching gdb, finding the only thread that isn't in a waiting condition and checking the contents of detail_ptr in _pack_default_job_details function:

(gdb) info thread
  267  Thread 0x7fb231820740 (LWP 107401) 0x00007fb230e0f35e in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
  266  Thread 0x7fb0163e3700 (LWP 172823) 0x00007fb230e0f9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  265  Thread 0x7faf208c8700 (LWP 172801) 0x00007fb230e0f9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  264  Thread 0x7faf209c9700 (LWP 172800) 0x00007fb230e0f9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  263  Thread 0x7faf20aca700 (LWP 172799) 0x00007fb2313a2013 in _makespace (str=str@entry=0x7faf20ac98c0, str_len=str_len@entry=-1, needed=needed@entry=1)
    at xstring.c:110
  262  Thread 0x7faf20bcb700 (LWP 172797) 0x00007fb230e0f9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  261  Thread 0x7faf20ccc700 (LWP 172796) 0x00007fb230e0f9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
....

(gdb)  t 263
[Switching to thread 263 (Thread 0x7faf20aca700 (LWP 172799))]
#0  0x00007fb2313a2013 in _makespace (str=str@entry=0x7faf20ac98c0, str_len=str_len@entry=-1, needed=needed@entry=1) at xstring.c:110
110     xstring.c: No such file or directory.

(gdb) f 2
#2  0x000000000044033b in _pack_default_job_details (job_ptr=job_ptr@entry=0x2a4cd30, buffer=buffer@entry=0x7fb034002920,
    protocol_version=protocol_version@entry=8704) at job_mgr.c:10613
10613   job_mgr.c: No such file or directory.

(gdb) p *detail_ptr
$2 = {magic = 0, acctg_freq = 0x0, accrue_time = 1604867949, argc = 41473, argv = 0x2a4d7a0, begin_time = 1604867949,
  ckpt_dir = 0x2a4d730 "/scratch/brown/jain163/pfp/nile_rat", cluster_features = 0x0, contiguous = 0, core_spec = 65534, cpu_bind = 0x0,
  cpu_bind_type = 65534, cpu_freq_min = 4294967294, cpu_freq_max = 4294967294, cpu_freq_gov = 4294967294, cpus_per_task = 65534,
  orig_cpus_per_task = 65534, depend_list = 0x0, dependency = 0x0, orig_dependency = 0x0, env_cnt = 0, env_sup = 0x0, exc_node_bitmap = 0x0,
  exc_nodes = 0x0, expanding_jobid = 0, extra = 0x0, feature_list = 0x0, features = 0x0, max_cpus = 4294967294, orig_max_cpus = 4294967294, max_nodes = 1,
  mc_ptr = 0x2a4d770, mem_bind = 0x0, mem_bind_type = 0, min_cpus = 1, orig_min_cpus = 1, min_gres_cpu = 0, min_nodes = 1, nice = 2147483648,
  ntasks_per_node = 0, num_tasks = 0, open_mode = 0 '\000', overcommit = 0 '\000', plane_size = 0, pn_min_cpus = 1, orig_pn_min_cpus = 1,
  pn_min_memory = 9223372036854778880, orig_pn_min_memory = 9223372036854778880, pn_min_tmp_disk = 0, prolog_running = 0 '\000', reserved_resources = 0,
  req_node_bitmap = 0x0, preempt_start_time = 0, req_nodes = 0x0, requeue = 1, restart_dir = 0x0, share_res = 254 '\376', std_err = 0x0,
  std_in = 0x2a4d660 "/dev/null", std_out = 0x0, submit_time = 1604867949, task_dist = 8192, usable_nodes = 0, whole_node = 0 '\000',
  work_dir = 0x2a4d6f0 "/scratch/brown/jain163/pfp/nile_rat", x11 = 0, x11_magic_cookie = 0x0, x11_target = 0x0, x11_target_port = 0}

(gdb) p *job_ptr
$1 = {magic = 4038539564, account = 0x2a4d3a0 "standby", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x2a4d360 "brown-fe01.rcac.purdue.edu",
  alloc_resp_port = 0, alloc_sid = 141292, array_job_id = 0, array_task_id = 4294967294, array_recs = 0x0, assoc_id = 3502, assoc_ptr = 0x13d33c0,
  batch_features = 0x0, batch_flag = 1, batch_host = 0x32770d0 "brown-a015", billable_tres = 1, bit_flags = 503580672, burst_buffer = 0x0,
  burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 1, cpus_per_tres = 0x0,
  cr_enabled = 0, db_flags = 8, db_index = 25253441, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x2a4d140, direct_set_prio = 0,
  end_time = 1604869917, end_time_exp = 4294967294, epilog_running = false, exit_code = 0, fed_details = 0x0, front_end_ptr = 0x0, gres_list = 0x0,
  gres_alloc = 0x2a4d3c0 "", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x2a4d3e0 "", gres_used = 0x0, group_id = 132, job_id = 6727262,
  job_next = 0x0, job_array_next_j = 0x0, job_array_next_t = 0x0, job_resrcs = 0x2a4d460, job_state = 32771, kill_on_node_fail = 1,
  last_sched_eval = 1604868940, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x2a4cd00}, mail_type = 0, mail_user = 0x0,
  mem_per_tres = 0x0, mcs_label = 0x0, name = 0x2a4d310 "XP_034339501_sub", network = 0x0, next_step_id = 0, nodes = 0x2a4cc30 "brown-a015",
  node_addr = 0x2a485c0, node_bitmap = 0x32771e0, node_bitmap_cg = 0x3277170, node_cnt = 1, node_cnt_wag = 0, nodes_completing = 0x2a4cc00 "brown-a015",
  origin_cluster = 0x0, other_port = 0, pack_details = 0x0, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0,
  partition = 0x2a4cc60 "brown-standby", part_ptr_list = 0x0, part_nodes_missing = false, part_ptr = 0x1deb7a0, power_flags = 0 '\000', pre_sus_time = 0,
  preempt_time = 0, preempt_in_progress = false, priority = 385, priority_array = 0x0, prio_factors = 0x2a4cb60, profile = 0, qos_id = 1, qos_ptr = 0x0,
  qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 0, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295,
  resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x2a4d430, site_factor = 2147483648, spank_job_env = 0x0, spank_job_env_size = 0,
  start_protocol_ver = 8704, start_time = 1604868940, state_desc = 0x0, state_reason = 35, state_reason_prev = 35, state_reason_prev_db = 35,
  step_list = 0x1d48fd0, suspend_time = 0, system_comment = 0x0, time_last_active = 1604897540, time_limit = 60, time_min = 0, tot_sus_time = 0,
  total_cpus = 1, total_nodes = 1, tres_bind = 0x0, tres_freq = 0x0, tres_per_job = 0x0, tres_per_node = 0x0, tres_per_socket = 0x0, tres_per_task = 0x0,
  tres_req_cnt = 0x2e6eb90, tres_req_str = 0x2e6a8c0 "1=1,2=3072,4=1,5=1", tres_fmt_req_str = 0x2e6a8f0 "cpu=1,mem=3G,node=1,billing=1",
  tres_alloc_cnt = 0x2e6a930, tres_alloc_str = 0x2e6a830 "1=1,2=3072,3=18446744073709551614,4=1,5=1",
  tres_fmt_alloc_str = 0x2e6a880 "cpu=1,mem=3G,node=1,billing=1", user_id = 544857, user_name = 0x2a4d340 "jain163", wait_all_nodes = 0, warn_flags = 0,
  warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}


As with Bug 8978, we have a job with an incredibly high number of arguments (41473) that appears to be causing all the other threads to stall while attempting to pack the details of that job.  Unfortunately, unlike the problem in 8978 this job has completed and we have no way of cancelling it.  Is there a way for us to clear this job to get the scheduler back up and running? Any help is appreciated here, let me know if I've gone completely the wrong direction and if you need other troubleshooting steps.

Thanks,
Zach

Comment 1 Zach Weidner 2020-11-08 23:49:51 MST

Created attachment 16560 [details]
slurm.conf

Comment 2 Kevin Colby 2020-11-08 23:56:19 MST

Zach Weidner is a sysadmin on our team here at Purdue who has not posted here before.  Please let me know if there is anything official required to list him with our staff in your systems.  Thanks!

Comment 3 Dominik Bartkiewicz 2020-11-09 02:19:39 MST

Hi

You can try to connect with gdb to slurmctld process and null one of the elements in argv.

eg.:
in _pack_default_job_details() frame 

set detail_ptr->argv[i+10]=0
set detail_ptr->argv[10]=0



Dominik

Comment 4 Zach Weidner 2020-11-09 06:21:42 MST

Thanks Dominik, that appears to have done the trick after a couple rounds of it.  My fuzzy recollection from last night is that pack_all_jobs() would run need to run after a certain period of time had elapsed... if so that makes sense to need to do it a couple of times.  Either way, slurmctld appears to have caught up and is serving requests normally for us now.

Out of curiousity, how long does it take for a job like this to end up rotating out of the slurmctld saved state, and is there anything we can do to protect ourselves from a situation like this in the future?

Comment 5 Dominik Bartkiewicz 2020-11-09 07:17:21 MST

Hi

This should be permanent and destructive for modified jobs.

The best way to protect from this issue is to update to the current 20.02 version.
Other solution can be updated to 19.5.7 and locally applying https://github.com/SchedMD/slurm/commit/f17cf91ccc56ccf87

Dominik

Comment 6 Zach Weidner 2020-11-09 07:22:54 MST

Excellent... just noticed further down the thread in 8978 that you said exactly that, sorry.  We're scheduling updates for our next regular downtime and we can close this issue. 

Thanks again,
Zach

Comment 7 Dominik Bartkiewicz 2020-11-09 07:51:02 MST

Hi

I'm glad this helped.
Let us know if there's anything else we can do to help.

Dominik

*** This ticket has been marked as a duplicate of ticket 8978 ***