| Summary: | slurmstepd abort in _send_launch_failure | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | David Gloe <david.gloe> |
| Component: | slurmstepd | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 17.02.9 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=3176 | | |
| Site: | CRAY | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | Cray Internal |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 17.02.10 17.11.0 18.08.0-0pre1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmd log showing issue | | |
| | Fix segfault and cancel job when extern step fails. | | |
We've run into this same abort several times; I've seen around six core files with the same signature.

Created attachment 5625 [details]
slurmd log showing issue

Here's a slurmd log showing the problem. I think the trouble starts around:
[2017-11-10T21:38:46.781] error: uid 26523 is not a member of gid 11121
[2017-11-10T21:38:46.781] debug: sending launch failure message: Group ID not found on host
...
[2017-11-10T21:38:46.800] error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: No error
[2017-11-10T21:38:46.800] error: _remove_starting_step: step 587415.4294967295 not found
[2017-11-10T21:38:46.800] error: Error cleaning up starting_step list
Thanks David, I can see how this can happen. As you guessed, it appears the extern step isn't set up to handle this error path. I think I'll be able to reproduce it fairly easily and will report back with what I find. It should be easy enough not to segfault here; I'm just not sure yet what will happen to the allocation. I don't think I need anything more from you on this right now.

Created attachment 5627 [details]
Fix segfault and cancel job when extern step fails.
David, here is a patch that fixes the segfault and also cancels the allocation if/when this happens. That seemed like the best reaction; otherwise you would be running the allocation without the extern step. In your particular case it makes even more sense, since the node doesn't have the gid the job needs.
Note that this also drains the node, which I think is the right call as well.
David, this has been committed to 17.02 as commit 919854087d56e. Please reopen this bug if the failure still happens.
I have a 17.02.9 slurmstepd core file due to a divide by zero error in _send_launch_failure. It looks like msg->num_resp_port is 0, causing the issue.

Core was generated by `/opt/slurm/17.02.9/sbin/slurmstepd'.
Program terminated with signal SIGFPE, Arithmetic exception.
...
(gdb) bt full
#0  0x0000000000436890 in _send_launch_failure (msg=0x86d9d0, cli=0x867eb0, rc=4006, protocol_version=7936) at mgr.c:2316
        resp_msg = {address = {sin_family = 2, sin_port = 41678, sin_addr = {s_addr = 4261380874}, sin_zero = "\000\000\000\000\000\000\000"},
          auth_cred = 0x0, buffer = 0x0, conn = 0x0, conn_fd = -1, data = 0x0, data_size = 0, flags = 0, msg_index = 0, msg_type = 65534,
          protocol_version = 65534, forward = {cnt = 0, init = 65534, nodelist = 0x0, timeout = 0, tree_width = 0}, forward_struct = 0x0,
          orig_addr = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, ret_list = 0x0}
        resp = {return_code = 0, node_name = 0xff00000000000000 <error: Cannot access memory at address 0xff00000000000000>, srun_node_id = 1,
          count_of_pids = 0, local_pids = 0xba9813bf546c2800, task_ids = 0x7fffffffec10}
        nodeid = 0
        name = 0x8745e0 "nid00032"
        __func__ = "_send_launch_failure"
#1  0x0000000000431f7d in mgr_launch_tasks_setup (msg=0x86d9d0, cli=0x867eb0, self=0x867ee0, protocol_version=7936) at mgr.c:231
        fail = 4006
        job = 0x0
#2  0x0000000000431cd4 in _step_setup (cli=0x867eb0, self=0x867ee0, msg=0x86d930) at slurmstepd.c:587
        job = 0x0
#3  0x000000000042f4d6 in main (argc=1, argv=0x7fffffffeda8) at slurmstepd.c:128
        cli = 0x867eb0
        self = 0x867ee0
        msg = 0x86d930
        job = 0x0
        ngids = 1
        gids = 0x84ae60
        rc = 0
        launch_params = 0x7fffffffeda0 "\001"
        __func__ = "main"
(gdb) print *msg
$6 = {job_id = 587415, job_step_id = 4294967295, mpi_jobid = 0, mpi_nnodes = 0, mpi_ntasks = 0, mpi_stepfnodeid = 0, mpi_stepftaskid = 0,
  mpi_stepid = 0, nnodes = 2, ntasks = 2, ntasks_per_board = 0, ntasks_per_core = 0, ntasks_per_socket = 0, packjobid = 0, packstepid = 0,
  uid = 26523, user_name = 0x874280 "tstusr02", gid = 11121, job_mem_lim = 26740, step_mem_lim = 26740, tasks_to_launch = 0x84ac20,
  envc = 0, argc = 0, node_cpus = 0, cpus_per_task = 1, env = 0x0, argv = 0x0,
  cwd = 0x86e010 "/lus/scratch/ostest.vers/alsorun.20171110210548.21737.tiger/SNL_pcmd1.9.pgrLZx.1510371444/pcmd1_01.2631",
  cpu_bind_type = 0, cpu_bind = 0x0, mem_bind_type = 0, mem_bind = 0x0, accel_bind_type = 0, num_resp_port = 0, resp_port = 0x0,
  task_dist = 0, flags = 0, global_task_ids = 0x86df80, orig_addr = {sin_family = 2, sin_port = 0, sin_addr = {s_addr = 0},
    sin_zero = "\000\000\000\000\000\000\000"}, open_mode = 0 '\000', acctg_freq = 0x0, cpu_freq_min = 0, cpu_freq_max = 0,
  cpu_freq_gov = 0, job_core_spec = 0, ofname = 0x86e090 "/dev/null", efname = 0x86e0c0 "/dev/null", ifname = 0x86e0f0 "/dev/null",
  num_io_port = 0, io_port = 0x0, profile = 0, task_prolog = 0x0, task_epilog = 0x0, slurmd_debug = 0, cred = 0x86dba0,
  switch_job = 0x86e120, options = 0x86e170, complete_nodelist = 0x86e1a0 "nid000[32-33]", ckpt_dir = 0x0, pelog_env = 0x0,
  pelog_env_size = 0, restart_dir = 0x0, spank_job_env = 0x86dfb0, spank_job_env_size = 1, select_jobinfo = 0x86e1d0,
  alias_list = 0x0, partition = 0x84a860 "workq"}