Ticket 3176 - slurmstepd trap divide error
Summary: slurmstepd trap divide error
Status: RESOLVED TIMEDOUT
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 16.05.5
Hardware: Cray XC Linux
Severity: 4 - Minor Issue
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2016-10-14 06:49 MDT by Doug Jacobsen
Modified: 2018-07-31 08:44 MDT

See Also:
Site: NERSC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Doug Jacobsen 2016-10-14 06:49:45 MDT
Hello,

I've been noticing a lot of trap divide errors for slurmstepd:

nid11777:/var/spool/slurmd # dmesg | tail
...
[239323.393144] traps: slurmstepd[6367] trap divide error ip:435650 sp:7fffffffe9f0 error:0 in slurmstepd[400000+1fd000]traps:
...


We've had plenty of this error:
corismw1:/var/opt/cray/log/p0-current # grep slurmstepd console-2016101* | grep "trap divide" | wc -l
35103
corismw1:/var/opt/cray/log/p0-current #


nid11777:/var/spool/slurmd # addr2line -e /usr/sbin/slurmstepd 435650
/usr/src/packages/BUILD/slurm-16.05.5-20161001055654priomgmt/src/slurmd/slurmstepd/mgr.c:2300
nid11777:/var/spool/slurmd #

It's in _send_launch_failure(), apparently msg->num_resp_port must be zero (divide by zero error).

The root cause of all these launch failures is probably an ongoing ldap issue we're having.

Thanks,
-Doug
Comment 2 Alejandro Sanchez 2016-10-19 08:05:47 MDT
Doug - I believe I see what's going on here. _send_launch_failure() is called when stepd_step_rec_create() fails, i.e. the step record can't be created. Among other things, stepd_step_rec_create() validates the uid and gid, and thus queries with getpwuid_r(). So if you are experiencing ongoing LDAP issues, maybe some steps can't be created for that reason.

Now, the problem is how we end up doing this division by zero:

slurm_set_addr(&resp_msg.address, msg->resp_port[nodeid % msg->num_resp_port], NULL);

and thus msg->num_resp_port = 0. num_resp_port is calculated in this function:

static inline int
_estimate_nports(int nclients, int cli_per_port)
{
        div_t d;
        d = div(nclients, cli_per_port);
        return d.rem > 0 ? d.quot + 1 : d.quot;
}

If I'm not wrong, since cli_per_port is a constant of 48, msg->num_resp_port can only be 0 if the nclients param is 0 too. nclients is calculated when creating the slurm step context (slurm_step_ctx_create). We could always add a check so that if msg->num_resp_port is 0 we skip the division and raise an error, but I want to get to the root cause and understand which cases lead to creating a slurm step context with nclients being 0.
Comment 6 Alejandro Sanchez 2016-10-24 11:24:59 MDT
Doug - do you have a solid reproducer for this? It'd be very helpful to know a client request example which triggers the trap.
Comment 7 Alejandro Sanchez 2016-10-27 10:47:27 MDT
Doug - besides a reproducer command, can you also look for this message:

"sending launch failure message:"

in any of the slurmd.logs where you think the trap occurred? Maybe we'll get a hint on why the step context could not be created.
Comment 8 Doug Jacobsen 2016-10-27 10:50:06 MDT
I don't think I have a reproducer; I haven't tried to reproduce it
manually since it seemed to be occurring often enough on its own. I'll
see if I can come up with something once the system is up.

-Doug
Comment 9 Doug Jacobsen 2016-10-27 10:51:22 MDT
OK, I should be able to preserve slurmd logs starting this weekend (up
until now they've just been destroyed), so I expect we'll be able to
increase the debug levels and get more information.
Comment 11 Alejandro Sanchez 2016-11-08 10:19:26 MST
(In reply to Doug Jacobsen from comment #9)
> OK, I should be able to preserve slurmd logs starting this weekend (up 
> until now they've just been destroyed), so I expect that we should be 
> able to possibly increase debug levels and get more information.

Doug, is this still happening? Do you have updated slurmd.log files? Thanks.
Comment 13 Alejandro Sanchez 2016-11-21 14:12:07 MST
Doug - maybe after SC you'll have some time this week to check whether this is still happening. If so, could you increase the debug levels to get more information, and also double-check whether any SPANK plugin could affect the value of ctx->step_resp->step_layout->node_cnt, which leads to the division by zero? Thanks!
Comment 14 Alejandro Sanchez 2016-11-29 07:12:45 MST
ping
Comment 15 Doug Jacobsen 2016-11-29 07:14:32 MST
I'll check the slurmd logs today. We've been booted long enough that I
expect we might have some hits now.

Comment 16 Alejandro Sanchez 2016-12-14 09:54:37 MST
(In reply to Doug Jacobsen from comment #15)
> I'll check the slurmd logs today.  We've been booted long enough I 
> expect we might have some hits now.

Doug - any updated slurmd logs which could help further troubleshoot this?
Comment 17 Alejandro Sanchez 2016-12-20 06:19:39 MST
Haven't seen an updated slurmd.log with increased debug levels, nor any feedback, in the last three weeks; marking this as resolved/timedout. Please reopen if this is still an active concern.