| Summary: | slurmstepd trap divide error | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | slurmstepd | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | alex, brian.gilmer, miguel.gila, sathishkumar.ranganathan |
| Version: | 16.05.5 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=4434 | ||
| Site: | NERSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
|
Description
Doug Jacobsen
2016-10-14 06:49:45 MDT
Doug - I believe I see what's going on here. _send_launch_failure() is called if stepd_step_rec_create() reports that the step record can't be created. Among other things, stepd_step_rec_create() validates the uid and gid, which involves a getpwuid_r query. So if you are experiencing ongoing LDAP issues, some steps may fail to be created for this reason.
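To illustrate how a uid lookup of this kind can fail, here is a minimal sketch of a getpwuid_r call that separates "no such user" from "the lookup itself failed". This is not Slurm's code: lookup_user_name() and its return convention are hypothetical, and how an LDAP outage actually surfaces (an error code versus an empty result) depends on the NSS configuration.

```c
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of Slurm).
 * Returns  0 if the uid was found (login name copied into name_out),
 *          1 if the lookup succeeded but no entry exists for the uid,
 *         -1 if getpwuid_r itself reported an error.
 */
static int lookup_user_name(uid_t uid, char *name_out, size_t name_len)
{
    struct passwd pwd;
    struct passwd *result = NULL;
    char buf[16384]; /* fixed scratch buffer; a larger entry yields ERANGE */

    int rc = getpwuid_r(uid, &pwd, buf, sizeof(buf), &result);
    if (rc != 0)
        return -1;   /* backend or buffer error */
    if (result == NULL)
        return 1;    /* no matching passwd entry */

    snprintf(name_out, name_len, "%s", result->pw_name);
    return 0;
}
```

A caller that treats both non-zero results as "cannot create the step" would end up on the launch-failure path in either case.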
Now, the problem is how we end up doing this division by zero:
slurm_set_addr(&resp_msg.address, msg->resp_port[nodeid % msg->num_resp_port], NULL);
The modulo here divides by zero, so msg->num_resp_port must be 0. num_resp_port is calculated in this function:
static inline int
_estimate_nports(int nclients, int cli_per_port)
{
    div_t d;
    d = div(nclients, cli_per_port);
    return d.rem > 0 ? d.quot + 1 : d.quot;
}
If I'm not wrong, since cli_per_port is a constant of 48, msg->num_resp_port can only be 0 if nclients param is 0 too. nclients is calculated when creating the slurm step context (slurm_step_ctx_create). We can always put a check so that if msg->num_resp_port is 0 then don't do the division and raise an error, but I wanna go to the root cause and understand which cases lead to a slurm step context creation with nclients being 0.
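The reasoning above can be checked with a small sketch. estimate_nports() repeats the arithmetic of the quoted _estimate_nports(). pick_resp_port() is a hypothetical guard, not existing Slurm code, showing the kind of check mentioned above; the plain int and unsigned short types are assumptions made for illustration.

```c
#include <stdlib.h>

/* Same arithmetic as the quoted _estimate_nports():
 * nclients / cli_per_port, rounded up when there is a remainder. */
static int estimate_nports(int nclients, int cli_per_port)
{
    div_t d = div(nclients, cli_per_port);
    return d.rem > 0 ? d.quot + 1 : d.quot;
}

/*
 * Hypothetical guard (not Slurm code): choose the response port for a node,
 * returning -1 instead of evaluating nodeid % 0 when no ports are available.
 */
static int pick_resp_port(const unsigned short *resp_port,
                          int num_resp_port, int nodeid)
{
    if (resp_port == NULL || num_resp_port <= 0 || nodeid < 0)
        return -1;
    return resp_port[nodeid % num_resp_port];
}
```

With cli_per_port fixed at 48, estimate_nports() returns 0 for nclients == 0 and at least 1 for any positive nclients. Because div() truncates toward zero, a small negative nclients would also yield 0, so a guard such as pick_resp_port() only avoids the trap; it does not explain how the step context ended up with no clients.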
Doug - do you have a solid reproducer for this? It'd be very helpful to know a client request example which triggers the trap.
Doug - besides a reproducer command, can you also look for this message: "sending launch failure message:" in any of the slurmd.logs where you think the trap occurred? Maybe we'll get a hint on why the step context could not be created.
I don't think I have a reproducer, I haven't tried to reproduce it manually, seemed to be occurring enough on its own. I'll see if I can come up with something once the system is up. -Doug
OK, I should be able to preserve slurmd logs starting this weekend (up until now they've just been destroyed), so I expect that we should be able to possibly increase debug levels and get more information.
(In reply to Doug Jacobsen from comment #9)
> OK, I should be able to preserve slurmd logs starting this weekend (up until now they've just been destroyed), so I expect that we should be able to possibly increase debug levels and get more information.
Doug, is this still happening? Do you have updated slurmd.log files? Thanks.
Doug - maybe after SC you've some time this week to check if this is still happening, if so to possibly increase the debug levels and get more information and maybe also double check if any SPANK plugin could affect the value of ctx->step_resp->step_layout->node_cnt which leads to the division by zero. Thanks!
ping
I'll check the slurmd logs today. We've been booted long enough I expect we might have some hits now.
On 11/29/16 6:12 AM, bugs@schedmd.com wrote:
> Comment #14 <https://bugs.schedmd.com/show_bug.cgi?id=3176#c14> on bug 3176 <https://bugs.schedmd.com/show_bug.cgi?id=3176> from Alejandro Sanchez <mailto:alex@schedmd.com>
> ping
(In reply to Doug Jacobsen from comment #15)
> I'll check the slurmd logs today. We've been booted long enough I expect we might have some hits now.
Doug - any updated slurmd logs which could help further troubleshoot this?
Haven't seen updated slurmd.log with increased debug levels nor feedback since 3 weeks ago; marking this as resolved/timedout. Please reopen if this is still an active concern. |