Ticket 4434

Summary: slurmstepd abort in _send_launch_failure
Product: Slurm
Reporter: David Gloe <david.gloe>
Component: slurmstepd
Assignee: Danny Auble <da>
Status: RESOLVED FIXED
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
Version: 17.02.9
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=3176
Site: CRAY
Cray Sites: Cray Internal
Version Fixed: 17.02.10 17.11.0 18.08.0-0pre1
Attachments: slurmd log showing issue; Fix segfault and cancel job when extern step fails.

Description David Gloe 2017-11-27 10:02:04 MST
I have a 17.02.9 slurmstepd core file from a divide-by-zero (SIGFPE) in _send_launch_failure. It looks like msg->num_resp_port is 0, which causes the fault.

Core was generated by `/opt/slurm/17.02.9/sbin/slurmstepd'.
Program terminated with signal SIGFPE, Arithmetic exception.
...
(gdb) bt full
#0  0x0000000000436890 in _send_launch_failure (msg=0x86d9d0, cli=0x867eb0, rc=4006, protocol_version=7936) at mgr.c:2316
        resp_msg = {address = {sin_family = 2, sin_port = 41678, sin_addr = {s_addr = 4261380874}, 
            sin_zero = "\000\000\000\000\000\000\000"}, auth_cred = 0x0, buffer = 0x0, conn = 0x0, conn_fd = -1, data = 0x0, 
          data_size = 0, flags = 0, msg_index = 0, msg_type = 65534, protocol_version = 65534, forward = {cnt = 0, init = 65534, 
            nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0, orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {
              s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
        resp = {return_code = 0, node_name = 0xff00000000000000 <error: Cannot access memory at address 0xff00000000000000>, 
          srun_node_id = 1, count_of_pids = 0, local_pids = 0xba9813bf546c2800, task_ids = 0x7fffffffec10}
        nodeid = 0
        name = 0x8745e0 "nid00032"
        __func__ = "_send_launch_failure"
#1  0x0000000000431f7d in mgr_launch_tasks_setup (msg=0x86d9d0, cli=0x867eb0, self=0x867ee0, protocol_version=7936) at mgr.c:231
        fail = 4006
        job = 0x0
#2  0x0000000000431cd4 in _step_setup (cli=0x867eb0, self=0x867ee0, msg=0x86d930) at slurmstepd.c:587
        job = 0x0
#3  0x000000000042f4d6 in main (argc=1, argv=0x7fffffffeda8) at slurmstepd.c:128
        cli = 0x867eb0
        self = 0x867ee0
        msg = 0x86d930
        job = 0x0
        ngids = 1
        gids = 0x84ae60
        rc = 0
        launch_params = 0x7fffffffeda0 "\001"
        __func__ = "main"
(gdb) print *msg
$6 = {job_id = 587415, job_step_id = 4294967295, mpi_jobid = 0, mpi_nnodes = 0, mpi_ntasks = 0, mpi_stepfnodeid = 0, 
  mpi_stepftaskid = 0, mpi_stepid = 0, nnodes = 2, ntasks = 2, ntasks_per_board = 0, ntasks_per_core = 0, ntasks_per_socket = 0, 
  packjobid = 0, packstepid = 0, uid = 26523, user_name = 0x874280 "tstusr02", gid = 11121, job_mem_lim = 26740, step_mem_lim = 26740, 
  tasks_to_launch = 0x84ac20, envc = 0, argc = 0, node_cpus = 0, cpus_per_task = 1, env = 0x0, argv = 0x0, 
  cwd = 0x86e010 "/lus/scratch/ostest.vers/alsorun.20171110210548.21737.tiger/SNL_pcmd1.9.pgrLZx.1510371444/pcmd1_01.2631", 
  cpu_bind_type = 0, cpu_bind = 0x0, mem_bind_type = 0, mem_bind = 0x0, accel_bind_type = 0, num_resp_port = 0, resp_port = 0x0, 
  task_dist = 0, flags = 0, global_task_ids = 0x86df80, orig_addr = {sin_family = 2, sin_port = 0, sin_addr = {s_addr = 0}, 
    sin_zero = "\000\000\000\000\000\000\000"}, open_mode = 0 '\000', acctg_freq = 0x0, cpu_freq_min = 0, cpu_freq_max = 0, 
  cpu_freq_gov = 0, job_core_spec = 0, ofname = 0x86e090 "/dev/null", efname = 0x86e0c0 "/dev/null", ifname = 0x86e0f0 "/dev/null", 
  num_io_port = 0, io_port = 0x0, profile = 0, task_prolog = 0x0, task_epilog = 0x0, slurmd_debug = 0, cred = 0x86dba0, 
  switch_job = 0x86e120, options = 0x86e170, complete_nodelist = 0x86e1a0 "nid000[32-33]", ckpt_dir = 0x0, pelog_env = 0x0, 
  pelog_env_size = 0, restart_dir = 0x0, spank_job_env = 0x86dfb0, spank_job_env_size = 1, select_jobinfo = 0x86e1d0, 
  alias_list = 0x0, partition = 0x84a860 "workq"}
Comment 1 David Gloe 2017-11-27 10:41:38 MST
We've run into this same abort several times; I've seen around 6 core files with the same signature.
Comment 2 David Gloe 2017-11-27 11:01:07 MST
Created attachment 5625 [details]
slurmd log showing issue

Here's a slurmd log showing the problem. I think the trouble starts around:

[2017-11-10T21:38:46.781] error: uid 26523 is not a member of gid 11121
[2017-11-10T21:38:46.781] debug:  sending launch failure message: Group ID not found on host
...
[2017-11-10T21:38:46.800] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
[2017-11-10T21:38:46.800] error: _remove_starting_step: step 587415.4294967295 not found
[2017-11-10T21:38:46.800] error: Error cleaning up starting_step list
Comment 3 Danny Auble 2017-11-27 11:15:46 MST
Thanks David, I can see how this can happen. As you guessed, the extern step isn't set up to handle this error path. I should be able to reproduce it fairly easily. Avoiding the segfault should be easy enough; I'm not sure yet what will happen to the allocation. I'll report back when I know more, and I don't think I need anything else from you right now.
Comment 4 Danny Auble 2017-11-27 11:45:22 MST
Created attachment 5627 [details]
Fix segfault and cancel job when extern step fails.

David, here is a patch that fixes the segfault and also cancels the allocation if/when this happens. That seemed like the best reaction; otherwise you would be running the allocation without the extern step. In your particular case it makes even more sense, since the node doesn't have the gid the job needs.

Note, this also drains the node, which I think is the right call as well.
Comment 5 Danny Auble 2017-11-28 09:37:44 MST
David, this has been committed to 17.02 commit 919854087d56e.  Please reopen this if the failure still happens.