Hello,

This morning we've had several controller segfaults in free()/malloc() operations, pointing to some kind of nasty memory corruption. Two of the traces were in job submissions for the same user, and match this thread:

```
(gdb) bt
#0  0x00007f1efc669f67 in raise () from /lib64/libc.so.6
#1  0x00007f1efc66b33a in abort () from /lib64/libc.so.6
#2  0x00007f1efc6a90f4 in __libc_message () from /lib64/libc.so.6
#3  0x00007f1efc6ae646 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f1efc6af393 in _int_free () from /lib64/libc.so.6
#5  0x00007f1efcf58e1e in slurm_xfree (item=item@entry=0x7f1e74003460, file=file@entry=0x7f1eebdfa981 "burst_buffer_cray.c", line=line@entry=2735, func=func@entry=0x7f1eebdfc853 <__func__.20441> "_xlate_batch") at xmalloc.c:244
#6  0x00007f1eebdf3f9f in _xlate_batch (job_desc=0x7f1e74003400) at burst_buffer_cray.c:2735
#7  _parse_bb_opts (submit_uid=78672, bb_size=<synthetic pointer>, job_desc=0x7f1e74003400) at burst_buffer_cray.c:2553
#8  bb_p_job_validate (job_desc=0x7f1e74003400, submit_uid=78672) at burst_buffer_cray.c:3259
#9  0x0000000000427ed3 in bb_g_job_validate (job_desc=job_desc@entry=0x7f1e74003400, submit_uid=submit_uid@entry=78672) at burst_buffer.c:373
#10 0x0000000000454485 in _job_create (protocol_version=8448, err_msg=0x7f1ee8e08b90, submit_uid=78672, job_pptr=<synthetic pointer>, will_run=0, allocate=0, job_desc=0x7f1e74003400) at job_mgr.c:6879
#11 job_allocate (job_specs=job_specs@entry=0x7f1e74003400, immediate=0, will_run=will_run@entry=0, resp=resp@entry=0x0, allocate=allocate@entry=0, submit_uid=submit_uid@entry=78672, job_pptr=job_pptr@entry=0x7f1ee8e08b88, err_msg=err_msg@entry=0x7f1ee8e08b90, protocol_version=8448) at job_mgr.c:4828
#12 0x0000000000496349 in _slurm_rpc_submit_batch_job (msg=0x7f1ee8e08ec0) at proc_req.c:4111
#13 slurmctld_req (msg=msg@entry=0x7f1ee8e08ec0, arg=arg@entry=0x7f1ecc000ac0) at proc_req.c:440
#14 0x000000000042ad61 in _service_connection (arg=0x7f1ecc000ac0) at controller.c:1282
#15 0x00007f1efcbe6724 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f1efc721e8d in clone () from /lib64/libc.so.6
(gdb)
```

I've confirmed this on gerty (misusing Chris's account - he handily had a persistent reservation there), and can easily segfault the controller like so:

```
gerty:~ # su - csamuel
csamuel@gerty:~> vi /tmp/test.dmj.sh
csamuel@gerty:~> sbatch -C haswell --bb '#DW persistentdw name=csamueltest' /tmp/test.dmj.sh
sbatch: error: Batch job submission failed: Zero Bytes were transmitted or received
csamuel@gerty:~> logout
gerty:~ #
```

(get the sequence from above). I strongly suspect the memory corruption is occurring in the _add_bb_to_script() function of burst_buffer_cray.c (I may be wrong):

```c
static void _add_bb_to_script(char **script_body, char *burst_buffer_file)
{
	char *orig_script = *script_body;
	char *new_script, *sep, save_char;
	int i;

	if (!burst_buffer_file || (burst_buffer_file[0] == '\0'))
		return;	/* No burst buffer file or empty file */

	if (!orig_script) {
		*script_body = xstrdup(burst_buffer_file);
		return;
	}

	i = strlen(burst_buffer_file) - 1;
	if (burst_buffer_file[i] != '\n')	/* Append new line as needed */
		xstrcat(burst_buffer_file, "\n");

	if (orig_script[0] != '#') {	/* Prepend burst buffer file */
		new_script = xstrdup(burst_buffer_file);
		xstrcat(new_script, orig_script);
		*script_body = new_script;
		return;
	}

	sep = strchr(orig_script, '\n');
	if (sep) {
		save_char = sep[1];
		sep[1] = '\0';
		new_script = xstrdup(orig_script);
		xstrcat(new_script, burst_buffer_file);
		sep[1] = save_char;
		xstrcat(new_script, sep + 1);
		*script_body = new_script;
		return;
	} else {
		new_script = xstrdup(orig_script);
		xstrcat(new_script, "\n");
		xstrcat(new_script, burst_buffer_file);
		*script_body = new_script;
		return;
	}
}
```

At minimum the original value of *script_body is being leaked. But if something else still holds a reference to it, things could be going weird somewhere down the line.

Thanks,
Doug
Created attachment 9815 [details]
Valgrind log from gerty when accepting a test job

I'm attaching a valgrind log from gerty in case it's useful:

```
valgrind /usr/sbin/slurmctld -D 2>&1 | tee /tmp/valgrind.log
```

I ran it that way so you can see the valgrind warnings in the context of the slurmctld log messages. I then submitted a job that would normally fail, shown below (failing):

```
csamuel@gert01:~> sbatch -C haswell --bb '#DW persistentdw name=csamueltest' --wrap hostname
sbatch: error: Batch job submission failed: Zero Bytes were transmitted or received
```
Looking through the valgrind log, there are probably a few bugs -- some may be minor, but let's work through the list.

src/plugins/node_features/knl_cray/node_features_knl_cray.c line 912:

```c
	_free_script_argv(script_argv);
	if (!WIFEXITED(status) || (WEXITSTATUS(status) != 0)) {
		error("%s: %s %s %s status:%u response:%s",
		      __func__, script_argv[0], script_argv[1],
		      script_argv[2], status, resp_msg);
	}
```

script_argv is freed, then dereferenced in the error() call.
There is a similar use-after-free in _load_numa_type() (line 973).
Found the bug in _add_bb_to_script() -- it should not be updating burst_buffer_file: the xstrcat() realloc()s the string, which causes the original buffer to be xfree()'d while the caller still holds a pointer to it. We're building a patch for both the knl_cray (minor issue) and bb_cray (major issue) problems presently.
Created attachment 9816 [details]
Patch to correct multiple issues with burst_buffer/cray and node_features/knl_cray

This patch corrects a double free in burst_buffer/cray, wherein an input argument was appended to with xstrcat() (which may free the original buffer). It also fixes a memory leak of the script content, and a minor use-after-free in node_features/knl_cray.
Hi,

Thanks for reporting this and for the patch. We added the fix in the following commits:

```
f1e4527decb22fe98ff6ee9bad72fd6a81317336 burst_buffer/cray - fix slurmctld SIGABRT due to illegal read/writes.
f21be235fc81c542af76163cd6771684f21b3d7b burst_buffer/cray - fix memory leak due to unfreed job script content.
97c833b3041ec37b34022b0f2070893bd73a0c7f node_features/knl_cray - fix script_argv use-after-free.
81b9d7bd10ca99d99ba253cf321e812bce5b3b2c (HEAD -> slurm-18.08) burst_buffer/cray - fix script_argv use-after-free.
```

We also noticed some issues with the --bbf option in srun/sbatch/salloc and are now working on fixing those. I will drop the severity to 4 now that the fix is in the repo.

Dominik