Ticket 6807 - using `--bb` flag for persistent dw requests segfaults controller
Summary: using `--bb` flag for persistent dw requests segfaults controller
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld (show other tickets)
Version: 18.08.6
Hardware: Linux Linux
: 4 - Minor Issue
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-04-06 12:41 MDT by Doug Jacobsen
Modified: 2020-02-06 01:42 MST (History)
4 users (show)

See Also:
Site: NERSC
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 18.08.7
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Valgrind log from gerty when accepting a test job (46.01 KB, text/x-log)
2019-04-06 15:06 MDT, Chris Samuel (NERSC)
Details
patch to correct multiple issues with burst_buffer/cray and node_features/knl_cray (3.50 KB, patch)
2019-04-06 16:26 MDT, Doug Jacobsen
Details | Diff

Note You need to log in before you can comment on or make changes to this ticket.
Description Doug Jacobsen 2019-04-06 12:41:02 MDT
Hello,

This morning we've had a several different controller segfaults in free() or malloc() operations pointing to some kind of nasty memory corruption.

Two of the traces were in job submissions for the same user, and match this thread:

(gdb) bt
#0  0x00007f1efc669f67 in raise () from /lib64/libc.so.6
#1  0x00007f1efc66b33a in abort () from /lib64/libc.so.6
#2  0x00007f1efc6a90f4 in __libc_message () from /lib64/libc.so.6
#3  0x00007f1efc6ae646 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f1efc6af393 in _int_free () from /lib64/libc.so.6
#5  0x00007f1efcf58e1e in slurm_xfree (item=item@entry=0x7f1e74003460,
    file=file@entry=0x7f1eebdfa981 "burst_buffer_cray.c", line=line@entry=2735,
    func=func@entry=0x7f1eebdfc853 <__func__.20441> "_xlate_batch") at xmalloc.c:244
#6  0x00007f1eebdf3f9f in _xlate_batch (job_desc=0x7f1e74003400) at burst_buffer_cray.c:2735
#7  _parse_bb_opts (submit_uid=78672, bb_size=<synthetic pointer>, job_desc=0x7f1e74003400)
    at burst_buffer_cray.c:2553
#8  bb_p_job_validate (job_desc=0x7f1e74003400, submit_uid=78672) at burst_buffer_cray.c:3259
#9  0x0000000000427ed3 in bb_g_job_validate (job_desc=job_desc@entry=0x7f1e74003400,
    submit_uid=submit_uid@entry=78672) at burst_buffer.c:373
#10 0x0000000000454485 in _job_create (protocol_version=8448, err_msg=0x7f1ee8e08b90, submit_uid=78672,
    job_pptr=<synthetic pointer>, will_run=0, allocate=0, job_desc=0x7f1e74003400) at job_mgr.c:6879
#11 job_allocate (job_specs=job_specs@entry=0x7f1e74003400, immediate=0, will_run=will_run@entry=0,
    resp=resp@entry=0x0, allocate=allocate@entry=0, submit_uid=submit_uid@entry=78672,
    job_pptr=job_pptr@entry=0x7f1ee8e08b88, err_msg=err_msg@entry=0x7f1ee8e08b90, protocol_version=8448)
    at job_mgr.c:4828
#12 0x0000000000496349 in _slurm_rpc_submit_batch_job (msg=0x7f1ee8e08ec0) at proc_req.c:4111
#13 slurmctld_req (msg=msg@entry=0x7f1ee8e08ec0, arg=arg@entry=0x7f1ecc000ac0) at proc_req.c:440
#14 0x000000000042ad61 in _service_connection (arg=0x7f1ecc000ac0) at controller.c:1282
#15 0x00007f1efcbe6724 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f1efc721e8d in clone () from /lib64/libc.so.6
(gdb)


I've confirmed this (misusing Chris's account - who handily had a persistent reservation handy on gerty) on gerty, and can easily segfault the controller like so:

gerty:~ # su - csamuel
csamuel@gerty:~> vi /tmp/test.dmj.sh
csamuel@gerty:~> sbatch -C haswell --bb '#DW persistentdw name=csamueltest' /tmp/test.dmj.sh
sbatch: error: Batch job submission failed: Zero Bytes were transmitted or received
csamuel@gerty:~> logout
gerty:~ #


(get the sequence from above).


I strongly suspect that the memory corruption is occurring in the _add_bb_to_script() function of burst_buffer_cray.c (may be wrong):

static void  _add_bb_to_script(char **script_body, char *burst_buffer_file)
{
        char *orig_script = *script_body;
        char *new_script, *sep, save_char;
        int i;

        if (!burst_buffer_file || (burst_buffer_file[0] == '\0'))
                return; /* No burst buffer file or empty file */

        if (!orig_script) {
                *script_body = xstrdup(burst_buffer_file);
                return;
        }

        i = strlen(burst_buffer_file) - 1;
        if (burst_buffer_file[i] != '\n')       /* Append new line as needed */
                xstrcat(burst_buffer_file, "\n");

        if (orig_script[0] != '#') {
                /* Prepend burst buffer file */
                new_script = xstrdup(burst_buffer_file);
                xstrcat(new_script, orig_script);
                *script_body = new_script;
                return;
        }
        sep = strchr(orig_script, '\n');
        if (sep) {
                save_char = sep[1];
                sep[1] = '\0';
                new_script = xstrdup(orig_script);
                xstrcat(new_script, burst_buffer_file);
                sep[1] = save_char;
                xstrcat(new_script, sep + 1);
                *script_body = new_script;
                return;
        } else {
                new_script = xstrdup(orig_script);
                xstrcat(new_script, "\n");
                xstrcat(new_script, burst_buffer_file);
                *script_body = new_script;
                return;
        }
}




At minimum the original value of *script_body is being leaked.  But if something else has a reference to it, it might be getting weird somewhere down the line.

Thanks,
Doug
Comment 1 Chris Samuel (NERSC) 2019-04-06 15:06:47 MDT
Created attachment 9815 [details]
Valgrind log from gerty when accepting a test job

I'm attaching a valgrind log from Gerty in case it's useful.

valgrind  /usr/sbin/slurmctld -D 2>&1 | tee /tmp/valgrind.log

I did it that way so you can see the valgrind warnings in context of the slurmctld log messages.

Submitted a job that would normally fail, shown below (failing):

csamuel@gert01:~> sbatch -C haswell --bb '#DW persistentdw name=csamueltest' --wrap hostname
sbatch: error: Batch job submission failed: Zero Bytes were transmitted or received
Comment 2 Doug Jacobsen 2019-04-06 15:21:43 MDT
looking through the valgrind, there are probably a few bugs -- maybe some minor, but let's work through the list.

src/plugins/node_features/knl_cray/node_features_knl_cray.c` line 912

        ```_free_script_argv(script_argv);
        if (!WIFEXITED(status) || (WEXITSTATUS(status) != 0)) {
                error("%s: %s %s %s status:%u response:%s", __func__,
                      script_argv[0], script_argv[1], script_argv[2],
                      status, resp_msg);
        }```

script_argv is freed, then reported in the error()
Comment 3 Doug Jacobsen 2019-04-06 15:23:59 MDT
there is a similar bug in _load_numa_type() (line 973)
Comment 4 Doug Jacobsen 2019-04-06 15:54:15 MDT
found the bug in _add_bb_to_script() -- it should not be updating burst_buffer_file - the xstrcat realloc()s the string which causes it to be xfree'd.

we're building a patch for both the knl_cray (minor issue) and the bb_cray (major issue) problems presently.
Comment 5 Doug Jacobsen 2019-04-06 16:26:03 MDT
Created attachment 9816 [details]
patch to correct multiple issues with burst_buffer/cray and node_features/knl_cray

This patch corrects a double-free in burst_buffer/cray wherein an input argument was augmented with xstrcat (which may free it).  It also fixes a memory leak of the script content.

It also fixes a minor use-after-free issue with node_features/knl_cray
Comment 15 Dominik Bartkiewicz 2019-04-10 07:48:06 MDT
Hi

Thanks for reporting this and patch.

We add this fix in following commits:
f1e4527decb22fe98ff6ee9bad72fd6a81317336
burst_buffer/cray - fix slurmctld SIGABRT due to illegal read/writes.

f21be235fc81c542af76163cd6771684f21b3d7b
burst_buffer/cray - fix memory leak due to unfreed job script content.

97c833b3041ec37b34022b0f2070893bd73a0c7f
node_features/knl_cray - fix script_argv use-after-free.

81b9d7bd10ca99d99ba253cf321e812bce5b3b2c (HEAD -> slurm-18.08,
burst_buffer/cray - fix script_argv use-after-free.

We also noticed some issues with --bbf option in srun/sbatch/salloc and now we are working on fixing it. I will drop severity to 4 now when the fix will be in the repo.

Dominik